Unsupervised Learning Explained (+ Clustering, Manifold Learning, ...)
This video is brought to you by,
and thanks to, Brilliant,
a problem-solving website that teaches you skills
essential to have in this age of automation.
In the last video in the series,
we began on a quest to clear up the misconceptions
between artificial intelligence and machine learning,
beginning with discussing supervised learning,
an essential foundational building block
in understanding the modern field of machine learning.
The focus of this video, then,
will continue right where the last one left off,
so sit back, relax, and join me once again
in an exploration into the field of machine learning.
As a quick recap, the field of machine learning
is a subset of the grander field of artificial intelligence,
and takes place in the intersection
between big data and data science,
with data science composed of the fields
of statistics, mathematics, et cetera,
with the goal to make sense out of and structure data.
The intersection of data science and artificial intelligence
is where a particular subset
of machine learning takes place.
Supervised learning is a type of learning
where we have both the inputs and outputs of our data,
in other words, labeled, structured data,
and we have to train our model
to maximize its predictive accuracy.
Supervised learning is then further divided
into two primary modes of learning models:
regression and classification.
Regression is for predicting continuous outputs.
In other words, outputs that lie
on a line of best fit of our model.
Classification on the other hand,
is for predicting discrete outputs.
In other words, mapping input variables
into discrete categories.
To add to this, many classification models
implement regression algorithms as well.
Essentially, supervised learning, for the most part,
is glorified statistical mathematics for
pattern recognition problems, rebranded as machine learning
because the algorithms are applied in a way
in which we iterate through them,
in other words, train the models,
to increase their predictive accuracy.
As a side note,
I highly recommend you watch the previous video
in the series,
to get a deeper understanding of supervised learning
as we walked through quite an intensive example.
Additionally, there is one important bit of terminology
I wanna discuss
that we skipped over in the previous video.
In machine learning, variables are referred to as features.
Variables, attributes, properties, features:
they all mean the same thing,
but for the sake of keeping our terminology consistent
with industry standard,
I will use features going forward.
Coming back on topic, with this recap out of the way,
we can now move on to the next subset of machine learning.
Unsupervised learning.
Whereas supervised learning is best suited for data
that is labeled and structured,
unsupervised learning is for data that is unlabeled
and unstructured.
In other words, we have various input features
but don't know what their corresponding outputs will be.
In some cases we don't even know
what the input features mean.
Unsupervised learning is more representative
of most real world problems we have to solve
and primarily takes place in the crossover
between big data and the field of AI
where these unsupervised algorithms are given the task
of deriving structure from unstructured data.
Unsupervised learning, like supervised learning,
is additionally subdivided into two primary types
of learning models it covers: association and clustering.
As in the case of supervised learning,
where regression is for predicting continuous data
and classification for discrete,
in unsupervised learning, association is for continuous data
and clustering for discrete.
To begin with, we will delve deeper into clustering.
Whereas in classification we have predefined labels
and are trying to fit new data into the correct category
based on our decision boundaries,
in clustering, these labels must be derived by viewing
the relationships between many data points.
One of the most well known clustering algorithms
is K-means clustering.
This algorithm's job is to analyze a decision space
consisting of a number of data points denoted by N
and divide them into a number of discrete categories
denoted by K.
This number K can be predefined
or the algorithm can determine the best number
through the use of an error function.
Let's do a brief example.
Say we have data points consisting of the features
of watch time and engagement for various videos,
with the goal of determining if and when
they will be recommended or not.
This example is similar to the last video
except now this YouTube data is unlabeled and unstructured.
Now first off, we have to decide
the number of clusters, K, that our data will be divided into.
This could be predefined,
but for our case let's use an error function.
In K-means clustering,
the sum of squared error function is often used
to find an optimal K value.
As you can see, while increasing K will give less error,
after a certain point, known as the graph's elbow,
increasing K yields diminishing returns,
requiring more computing power
and increasing the risk of overfitting,
a concept we will discuss shortly.
The elbow of the error plot, for example, is four.
Therefore, we will divide our decision space
into four clusters.
This is done by first adding four centroids,
defining the centers of their respective clusters.
Now the initial centroid locations
are found by choosing areas
with a high density of points
with similar feature conditions.
Once the initial cluster points are chosen,
the algorithm then assigns the data points
to their respective clusters.
We then update the centroids
and once again reassign data points to their clusters.
These steps are iterated upon
until the centroids stop moving
or points stop switching clusters.
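To make those two steps a little more concrete, here is a bare-bones, hypothetical version of the assign-and-update loop in Python; real implementations such as scikit-learn's add smarter initialization and convergence checks on top of this.

```python
import numpy as np

def kmeans_step(X, centroids):
    """One K-means iteration: assign points to clusters, then update centroids."""
    # Step 1: assign each point to its nearest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 2: move each centroid to the mean of the points assigned to it
    # (leaving it in place if no points were assigned).
    new_centroids = np.array([
        X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(len(centroids))
    ])
    return labels, new_centroids

# Calling kmeans_step repeatedly until the centroids stop moving
# (or points stop switching clusters) completes the algorithm.
```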
At the end of our example,
we now have four discrete clusters,
with red defining not recommended,
blue as recommended within one day of upload,
yellow within one week,
and purple within one month.
Now these labels, once the clusters are defined,
would be given by the data scientists
and machine learning engineers analyzing the results
after the decision space has been divided.
However, as you can see,
this unsupervised learning algorithm did its job
and derived structure from unstructured data
which allowed human scientists and engineers
to be able to decipher and utilize the data.
Now before continuing, keep in mind,
this was just for our two-dimensional,
in other words two-feature, example.
As seen in the last video,
with a more real world representative example
with many features,
this will get increasingly complicated.
As we go on to higher dimensional spaces,
we'll see how this issue is resolved
as we cover the next field
in unsupervised learning.
Association. To understand this concept a bit better,
think of it like this.
A clustering problem is where we try to group customers
based on their purchase behavior.
Whereas an association problem
is one where we would wanna see
if a customer that bought product X
would also tend to buy product Y,
in other words, the correlations
between features of a data set.
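To put a rough number on that idea, here is a small, hypothetical sketch that computes the support and confidence of the rule "bought X implies bought Y" from a boolean purchase matrix; the customers and products are invented purely for illustration, and algorithms like Apriori essentially search over many such rules efficiently.

```python
import numpy as np

# Hypothetical purchase matrix: one row per customer, one column per
# product, True where the product was bought.
purchases = np.array([
    [True,  True,  False],   # bought X and Y
    [True,  True,  True],    # bought X, Y and Z
    [True,  False, False],   # bought only X
    [False, True,  True],    # bought Y and Z
])
X_col, Y_col = 0, 1

support_x = purchases[:, X_col].mean()                           # P(bought X)
support_xy = (purchases[:, X_col] & purchases[:, Y_col]).mean()  # P(bought X and Y)
confidence = support_xy / support_x                              # P(bought Y | bought X)

print(f"support(X) = {support_x:.2f}, confidence(X -> Y) = {confidence:.2f}")
# Apriori-style algorithms keep only the rules whose support and
# confidence exceed chosen thresholds.
```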
Viewing this in a different format,
a matrix where the columns each specify a feature
and the rows each correspond to a data point,
in clustering algorithms
like the example we recently went through,
the goal is to reduce complexity in the rows,
that is, clustering various similar data points together.
Going a step further then,
in order for association algorithms
like Apriori to derive meaningful associations
between features, also referred to as association rules,
the complexity in the columns must be reduced.
Another word for this column complexity reduction
is dimensionality reduction.
The dimensionality of data
is the number of features needed to uniquely represent
a single point of data.
As you saw in the previous video in the series,
when our example had two features,
we could represent it in two dimensions.
With three features we ended up in three dimensions,
and so the trend continues.
Every form of data has to be converted into a feature set
before it can be analyzed.
This process is called feature extraction
and there are many trade-offs to be made
in the selection of the number of features.
If you wanna keep the feature set simple,
in other words low dimensionality,
then you run the risk of not being able
to uniquely identify every point of data in your data set,
meaning your algorithm of choice
will not be able to derive patterns from the data,
in other words, underfitting.
On the other hand, if your feature set is complex,
high dimensionality,
then we run into what is called
the curse of dimensionality.
This is when, as more dimensions are added to a data point,
the data set becomes too sparse
to find any meaningful patterns.
In other words,
adding additional dimensions
has made the data too spread out over the decision space.
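As a rough, made-up illustration of that spreading out, the following sketch keeps the number of points fixed and measures how far each point sits from its nearest neighbour as dimensions are added; the exact numbers are arbitrary, but the trend is the point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed number of points, increasing number of dimensions.
for dims in (2, 10, 100):
    points = rng.random((200, dims))  # 200 points in a unit hypercube
    # Pairwise distances, ignoring each point's distance to itself.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    avg_nn = dists.min(axis=1).mean()
    print(f"{dims:>3} dimensions -> average nearest-neighbour distance {avg_nn:.2f}")

# The same 200 points sit further and further apart as dimensions are added,
# i.e. the data becomes too sparse for patterns to be found.
```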
Additionally, another issue that arises
from high dimensionality is overfitting,
when the model becomes too rigid to adapt to new data.
This is because the algorithm you used
to analyze the decision space
has made correlations and associations between features
that actually have no intrinsic meaning.
Sparse and rigid data are a big reason
why expert systems failed to materialize promised results
back in the day,
and why high dimensionality is a much harder issue
to solve than low dimensionality.
Hence bringing us back to our starting point,
the need for dimensionality reduction
in order for association algorithms
to be able to extract meaningful correlations
from data.
A technique rising in popularity for dimensionality reduction
revolves around what is referred
to as the manifold hypothesis.
The manifold hypothesis states
that high-dimensional data
actually lies on low-dimensional manifolds
embedded in high-dimensional space,
with a manifold, in layman's terms,
being the surface of any shape.
Simply put, the manifold hypothesis states
that high-dimensional data can be represented
by the shape its low-dimensional data produces
after transformations are applied.
These transformations the data undergoes
must be homeomorphic,
meaning that the data must be able
to be inversely transformed back into its original form
and not destroyed in the transformation.
This low-dimensional representation of the original data set
then contains the reduced feature set needed
to represent the problem at hand
and still produce meaningful results
and associations.
There are multiple algorithms
for manifold learning to derive
these low-dimensional shapes.
To list two of the many:
one, Principal Component Analysis (PCA),
for linear manifolds, in other words planes,
and two, Isomap, for nonlinear manifolds,
meaning any curved surface.
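For a concrete feel of what those two look like in practice, here is a minimal sketch using scikit-learn's PCA and Isomap on synthetic data; the data, the neighbour count, and the choice of two output dimensions are all illustrative assumptions rather than anything from the video.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))  # fake high-dimensional data: 300 points, 10 features

# Linear manifold (a plane): project onto the top two principal components.
X_pca = PCA(n_components=2).fit_transform(X)

# Nonlinear (curved) manifold: Isomap preserves distances measured along
# the manifold while reducing to two dimensions.
X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

print(X_pca.shape, X_iso.shape)  # both (300, 2): a reduced feature set
```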
This process of dimensionality reduction,
feature selection and extraction
is an entire subfield in machine learning
referred to as Feature Engineering
and something that'll be touched on much more heavily
in this channel's upcoming deep learning series.
Now I wanna once again stress that, for the sake of time
and explanation,
many generalizations are made in this video
with the goal of simplifying what are in reality
very complex topics that have a lot of overlap.
As stated in the disclaimer
at the start of all these AI videos,
my goal here is to give introductory overviews
to core concepts.
After which point you can satisfy your curiosity
to learn more by watching other amazing creators
on this platform and resources on the web.
One such resource I use and I highly recommend
is Brilliant.
If you wanna learn more about machine learning,
and I mean really learn how these algorithms work,
from supervised methodologies such as regression
and classification to unsupervised learning
and more, then brilliant.org is a place for you to go.
For instance, their course on machine learning
goes through many of the concepts we have discussed
in these past videos.
Now what I love about how these topics
and these courses are presented
is that first an intuitive explanation is given
and then you're taken through the related problems.
If you get a problem wrong,
you see an intuitive explanation of where you went wrong
and how to rectify that flaw.
My primary goal with this channel
is to inspire and educate about the various technologies
and innovations that are changing the world.
But to do so on a higher level
requires going a step beyond these videos
and actually learning the mathematics
and science behind the concepts I discuss.
Brilliant does this by making math
and science learning exciting and cultivates curiosity
by showing the interconnectedness
between a variety of different topics.
Additionally, now with offline mode
on Brilliant's mobile apps,
you can learn on the go
with the ability to download any of their interactive courses.
To support Futurology
and learn more about Brilliant,
go to brilliant.org/futurology and sign up for free.
Additionally, the first 200 people that go to that link
will get 20% off their annual premium subscription.
(reggae music)
At this point, the video has concluded.
We'd like to thank you for taking the time to watch it.
If you enjoyed it,
consider supporting us on Patreon or YouTube membership
to keep this brand growing.
And if you have any topic suggestions,
please leave them in the comments below.
Consider subscribing for more content
and check out our website and our parent company, Earth-one
for more information.
This has been Honqua,
you've been watching futurology
and we'll see you again soon.
