Unsupervised Learning Explained (+ Clustering, Manifold Learning, ...)


 

This video is brought to you

thanks to Brilliant,

a problem-solving website that teaches you skills 

essential to have in this age of automation. 

In the last video in the series, 

we began on a quest to clear up the misconceptions 

between artificial intelligence and machine learning, 

beginning with discussing supervised learning, 

an essential foundational building block 

in understanding the modern field of machine learning. 

The focus of this video, then, 

will continue right where the last one left off, 

so sit back, relax, and join me once again 

in an exploration of the field of machine learning. 

As a quick recap, the field of machine learning 

is a subset of the grander field of artificial intelligence, 

and takes place in the intersection 

between big data and data science, 

with data science composed of the fields 

of statistics, mathematics, et cetera, 

with the goal of making sense of and structuring data. 

The intersection of data science and artificial intelligence 

is where a particular subset 

of machine learning takes place. 

Supervised learning is

a type of learning

where we have both the input and output of our data,

in other words, labeled structured data, 

and we have to train our model 

to maximize its predictive accuracy. 

Supervised learning is then further subdivided

into two primary types of learning models,

regression and classification. 

Regression is for predicting continuous outputs. 

In other words, outputs that lie 

on a line of best fit of our model. 

Classification on the other hand, 

is for predicting discrete outputs. 

In other words, mapping input variables 

into discrete categories. 

To add to this, many classification models 

implement regression algorithms as well. 
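
To make that regression-versus-classification distinction concrete, here is a minimal sketch in Python. It assumes scikit-learn is installed, and the tiny daylight-hours and temperature data set is hypothetical, invented purely for illustration.

```python
# Minimal sketch, not from the video: contrasts a regression model with a
# classification model (logistic regression is itself a classifier built on
# a regression algorithm). The tiny data set below is hypothetical.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # one feature: hours of daylight

# Regression: predict a continuous output (a made-up temperature).
y_continuous = np.array([10.2, 14.8, 19.9, 25.1, 30.3])
reg = LinearRegression().fit(X, y_continuous)
print(reg.predict([[3.5]]))    # a value on the fitted line of best fit

# Classification: predict a discrete category (0 = cold day, 1 = warm day).
y_discrete = np.array([0, 0, 0, 1, 1])
clf = LogisticRegression().fit(X, y_discrete)
print(clf.predict([[3.5]]))    # one of the discrete labels
```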

Essentially, supervised learning for the most part 

is glorified statistical mathematics for 

pattern recognition problems rebranded as machine learning 

because these methods are applied in a way

in which we iterate through them,

in other words, train the models,

to increase their predictive accuracy. 

As a side note, 

I highly recommend you watch the previous video 

in the series, 

to get a deeper understanding of supervised learning 

as we walked through quite an intensive example. 

Additionally, there is one important bit of terminology 

I wanna discuss 

that we skipped over in the previous video. 

In machine learning, variables are referred to as features.

Variables, attributes, properties, features:

they all mean the same thing,

but for the sake of keeping our terminology consistent 

with industry standard, 

I will use features going forward. 

Coming back on topic with this recap out of the way, 

we can now move on to the next subset of machine learning. 

Unsupervised learning. 

Whereas supervised learning is best suited for data

that is labeled and structured,

unsupervised learning is for data that is unlabeled

and unstructured. 

In other words, we have various input features 

but don't know what their corresponding outputs will be. 

In some cases we don't even know 

what the input features mean. 

Unsupervised learning is more representative 

of most real world problems we have to solve 

and primarily takes place in the crossover 

between big data and the field of AI 

where these unsupervised algorithms are given the task 

of deriving structure from unstructured data. 

Unsupervised learning, like supervised learning,

is additionally subdivided into two primary types

of learning models: association and clustering. 

As in the case of supervised learning,

where regression is for predicting continuous data

and classification for discrete,

in unsupervised learning, association is for continuous data

and clustering for discrete. 

To begin with, we will delve into more about clustering. 

Whereas in classification we have predefined labels 

and are trying to fit new data into the correct category 

based on our decision boundaries, 

in clustering, these labels must be derived by viewing 

the relationships between many data points. 

One of the most well known clustering algorithms 

is K-means clustering. 

This algorithm's job is to analyze a decision space 

consisting of a number of data points denoted by N 

and divide them into a number of discrete categories 

denoted by K. 

This number K can be predefined 

or the algorithm can determine the best number 

through the use of an error function. 

Let's do a brief example, 

say we have data points consisting of the features of watch time

and engagement for various videos,

with the goal of determining a way to decide if and when

they will be recommended or not. 

This example is similar to the last video 

except now this YouTube data is unlabeled and unstructured. 

Now first off we have to decide 

the number of clusters, K, that our data will be divided into. 

This could be predefined, 

but for our case let's use an error function. 

In K-means clustering, 

the sum of squared error function is often used 

to find an optimal K value. 

As you can see, while increasing K will give less error, 

after a certain point, known as the graph's elbow,

increasing K yields diminishing returns,

requiring more computing power

and increasing the risk of overfitting,

a concept we will discuss shortly. 

The elbow of the error plot, in our example, is four. 

Therefore, we will divide our decision space 

into four clusters. 
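
As a rough illustration of how that K could be chosen in code, here is a minimal sketch assuming scikit-learn; the `videos` array of watch-time and engagement values is hypothetical stand-in data, not the actual data from this example.

```python
# A minimal sketch of the elbow method, assuming scikit-learn is installed
# and using a hypothetical two-feature array (watch time, engagement).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
videos = rng.random((200, 2))        # stand-in for the unlabeled YouTube data

sse = []
for k in range(1, 10):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(videos)
    sse.append(model.inertia_)       # sum of squared distances to the centroids

# Plot k against sse (or just inspect the drops): past the "elbow", adding
# clusters only buys marginal error reduction at extra compute cost.
for k, err in enumerate(sse, start=1):
    print(k, round(err, 2))
```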

This is done by first adding four centroids, 

defining the centers of their respective clusters. 

Now the initial centroid locations 

are found by choosing areas 

with a high density of points 

with similar feature conditions. 

Once the initial cluster points are chosen, 

the algorithm then reassigns the data points 

to their new respective clusters. 

We then update the centroids 

and once again reassign data points to their clusters. 

These steps are iterated upon 

until the centroids stop moving 

or points stop switching clusters. 
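
For readers who want to see the assign-and-update loop itself, here is a bare-bones NumPy sketch of those steps. The `points` data is hypothetical, and for simplicity the initial centroids are drawn at random from the data rather than from high-density areas as described above.

```python
# A bare-bones sketch of the k-means loop (NumPy only), to make the
# assign/update iteration concrete; the 2-D data here is hypothetical.
import numpy as np

rng = np.random.default_rng(1)
points = rng.random((200, 2))                  # watch time, engagement
k = 4
centroids = points[rng.choice(len(points), k, replace=False)]  # simplified initialization

for _ in range(100):
    # Assignment step: each point joins the cluster of its nearest centroid.
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: each centroid moves to the mean of its assigned points.
    # (Empty-cluster handling is skipped for brevity.)
    new_centroids = np.array([points[labels == i].mean(axis=0) for i in range(k)])
    if np.allclose(new_centroids, centroids):  # centroids stopped moving
        break
    centroids = new_centroids

print(centroids)                               # centers of the four clusters
```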

At the end of our example, 

we now have four discrete clusters 

with red defining not recommended, 

blue as recommended within one day of upload, 

yellow within one week,

and purple within one month. 

Now these labels,

once the clusters are defined,

would be assigned by data scientists

and machine learning engineers analyzing the results 

after the decision space has been divided. 

However, as you can see, 

this unsupervised learning algorithm did its job 

and derived structure from unstructured data 

which allowed human scientists and engineers 

to be able to decipher and utilize the data. 

Now before continuing, keep in mind, 

this was just for our two-dimensional,

in other words two-feature, example. 

As seen in the last video, 

with a more real world representative example 

with many features, 

this will get increasingly complicated. 

As we go on to higher dimensional spaces, 

we'll see how this issue is resolved 

as we cover the next field 

in unsupervised learning. 

Association. To understand this concept a bit better, 

think of it like this. 

A clustering problem is where we try to group customers 

based on their purchase behavior. 

Whereas an association problem 

is one where we wanna see 

if a customer that bought product X 

would also tend to buy product Y. 

In other words the correlations 

between features of a data set. 
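
A minimal sketch of that "bought X, also bought Y" idea: the support and confidence of a rule are the basic quantities Apriori-style algorithms are built on. The small transaction matrix below is hypothetical.

```python
# Minimal sketch of an association rule's support and confidence, computed
# from a hypothetical transaction matrix with pure NumPy.
import numpy as np

# Rows = customers, columns = products (1 = bought). Columns: X, Y, Z.
transactions = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 0],
])

bought_x = transactions[:, 0] == 1
bought_both = bought_x & (transactions[:, 1] == 1)

support = bought_both.mean()                     # P(bought X and Y)
confidence = bought_both.sum() / bought_x.sum()  # P(bought Y | bought X)
print(f"support={support:.2f}, confidence={confidence:.2f}")
```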

Viewing this in a different format, 

a matrix where the columns each specify a feature 

and the rows each correspond to a data point, 

in clustering algorithms 

like the example we recently went through, 

the goal is to reduce complexity in the rows 

that is, clustering various similar data points together. 

Going a step further then, 

in order for association algorithms 

like Apriori to derive meaningful associations 

between features also referred to as association rules, 

the complexity in the columns must be reduced. 

Another word for this column complexity reduction 

is dimensionality reduction. 

The dimensionality of data 

is the number of features needed to uniquely represent 

a single point of data. 

As you saw in the previous video in the series, 

when our example had two features, 

we can represent it in two dimensions. 

With three features we ended up with three dimensions 

and so the trend continues. 

Every form of data has to be converted into a feature set 

before it can be analyzed. 

This process is called feature extraction 

and there are many trade-offs to be made

in the selection of the number of features. 
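
As one small, hedged example of feature extraction, here is how raw text might be converted into a numeric feature set, assuming a recent version of scikit-learn; the sentences are made up.

```python
# Minimal sketch of feature extraction: raw text becomes a bag-of-words
# count matrix before any learning algorithm can analyze it.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "great video on machine learning",
    "machine learning tutorial video",
    "cooking recipe video",
]

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(docs)    # rows = documents, columns = words

print(vectorizer.get_feature_names_out())    # the extracted feature names
print(features.toarray())                    # each document as a feature vector
```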

If you wanna keep the feature set simple, 

in other words low dimensionality, 

then you run the risk of not being able 

to uniquely identify every point of data in your data set, 

meaning your algorithm of choice 

will not be able to derive patterns from the data 

in other words underfitting. 

On the other hand, if your feature set is complex, 

high dimensionality, 

then we run into what is called 

the curse of dimensionality. 

This is when, as more dimensions are added to a data point,

the data set becomes too sparse 

to find any meaningful patterns. 

In other words, 

adding additional dimensions 

has made the data too spread out over the decision space. 
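
A quick numerical sketch of that sparsity effect, using nothing but NumPy and random toy data: as the number of features grows, the nearest and farthest points from any given point become almost equally far away, so distance-based structure washes out.

```python
# Minimal sketch of the curse of dimensionality with hypothetical random data:
# the ratio of nearest to farthest distance creeps toward 1.0 as dimensions grow.
import numpy as np

rng = np.random.default_rng(2)
for dims in (2, 10, 100, 1000):
    points = rng.random((500, dims))
    dists = np.linalg.norm(points - points[0], axis=1)[1:]   # distances from one point
    print(dims, round(dists.min() / dists.max(), 3))          # approaches 1.0
```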

Additionally, another issue that arises

from high dimensionality is overfitting,

when the model becomes too rigid to adapt to new data. 

This is because the algorithm you used 

to analyze the decision space 

has made correlations and associations between features 

that actually have no intrinsic meaning. 

Sparse and rigid data are a big reason 

why expert systems failed to materialize promised results 

back in the day 

and why high dimensionality is a much harder issue 

to solve than low dimensionality. 

Hence bringing us back to our starting point, 

the need for dimensionality reduction 

in order for association algorithms 

to be able to extract meaningful correlations 

from data. 

A technique for dimensionality reduction that is rising in popularity

revolves around what is referred

to as the manifold hypothesis. 

The manifold hypothesis states

that high dimensional data

actually lies on low dimensional manifolds

embedded in high dimensional space,

with a manifold, in layman's terms, 

being the surface of any shape. 

Simply put, the manifold hypothesis states

that high dimensional data can be represented

as the low dimensional shape the data produces

after transformations are applied. 

These transformations the data undergoes 

must be homeomorphic,

meaning that the data must be able 

to be inversely transformed back into its original self 

and not destroyed in the transformation. 

This low dimensional representation of the original data set 

then contains the reduced feature set needed 

to represent the problem at hand 

and still produce meaningful results 

and associations. There are multiple manifold learning algorithms

that can derive 

these low dimensional shapes. 

To list two of the many:

one, Principal Component Analysis, or PCA,

for linear manifolds, in other words planes,

and two, Isomap, for nonlinear manifolds,

meaning any curved surface. 
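
Here is a minimal sketch of both, assuming scikit-learn, run on the classic swiss-roll toy data set (a curved two-dimensional surface embedded in three dimensions); it is an illustration of the idea, not the exact workflow described here.

```python
# Minimal sketch of manifold-based dimensionality reduction, assuming
# scikit-learn: PCA for a linear projection, Isomap for a nonlinear manifold.
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

X, _ = make_swiss_roll(n_samples=1000, random_state=0)    # 3-D points on a curved surface

X_pca = PCA(n_components=2).fit_transform(X)              # linear manifold (a plane)
X_iso = Isomap(n_components=2).fit_transform(X)           # nonlinear manifold ("unrolled")

print(X.shape, X_pca.shape, X_iso.shape)                  # (1000, 3) -> (1000, 2) twice
```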

This process of dimensionality reduction, 

feature selection and extraction 

is an entire subfield in machine learning 

referred to as Feature Engineering 

and something that'll be touched on much more heavily 

in this channel's upcoming deep learning series. 

Now I wanna once again stress that for the sake of time 

and explanation, 

many generalizations are made in this video 

with the goal of simplifying what are in reality very complex topics 

that have a lot of overlap. 

As stated in the disclaimer 

at the start of all these AI videos, 

my goal here is to give introductory overviews 

to core concepts. 

After which point you can satisfy your curiosity 

to learn more by watching other amazing creators 

on this platform and resources on the web. 

One such resource I use and I highly recommend 

is Brilliant. 

If you wanna learn more about machine learning 

and I mean really learn how these algorithms work 

from supervised methodologies such as regression 

and classification to unsupervised learning 

and more, then brilliant.org is the place for you to go. 

For instance, Brilliant's course on machine learning

goes through many of the concepts we have discussed 

in these past videos. 

Now what I love about how these topics 

and these courses are presented, 

is that first an intuitive explanation is given 

and then you're taken through the related problems. 

If you get a problem wrong, 

you see an intuitive explanation for where you went wrong 

and how to rectify that flaw. 

My primary goal with this channel 

is to inspire and educate about the various technologies 

and innovations that are changing the world. 

But to do so on a higher level 

requires going a step beyond these videos 

and actually learning the mathematics 

and science behind the concepts I discuss. 

Brilliant does this by making math 

and science learning exciting and cultivates curiosity 

by showing the interconnectedness 

between a variety of different topics. 

Additionally, now with offline mode 

on Brilliant's mobile apps,

you can learn on the go

with the ability to download any of their interactive courses.

To support Futurology

and learn more about Brilliant,

go to brilliant.org/futurology and sign up for free. 

Additionally, the first 200 people that go to that link 

will get 20% off their annual premium subscription. 

(reggae music) 

At this point, the video has concluded. 

We'd like to thank you for taking the time to watch it. 

If you enjoyed it, 

consider supporting us on Patreon or YouTube membership 

to keep this brand growing. 

And if you have any topic suggestions, 

please leave them in the comments below. 

Consider subscribing for more content 

and check out our website and our parent company, Earth-one 

for more information. 

This has been Honqua, 

you've been watching Futurology 

and we'll see you again soon. 
