Unsupervised Learning Explained (+ Clustering, Manifold Learning, ...)
This video is brought to you by,
and thanks to, Brilliant,
a problem-solving website that teaches you skills
essential to have in this age of automation.
In the last video in the series,
we began on a quest to clear up the misconceptions
between artificial intelligence and machine learning,
beginning with discussing supervised learning,
an essential foundational building block
in understanding the modern field of machine learning.
The focus of this video, then,
will continue right where the last one left off,
so sit back, relax, and join me once again
in an exploration into the field of machine learning.
As a quick recap, the field of machine learning
is a subset of the grander field of artificial intelligence,
and takes place in the intersection
between big data and data science,
with data science composed of the fields
of statistics, mathematics, et cetera,
with the goal to make sense out of and structure data.
The intersection of data science and artificial intelligence
is where a particular subset
of machine learning takes place.
Supervised learning is a type of learning
where we have both the inputs and outputs of our data,
in other words, labeled, structured data,
and we have to train our model
to maximize its predictive accuracy.
Supervised learning is then further divided
into two primary modes of learning models:
regression and classification.
Regression is for predicting continuous outputs.
In other words, outputs that lie
on a line of best fit of our model.
Classification on the other hand,
is for predicting discrete outputs.
In other words, mapping input variables
into discrete categories.
To add to this, many classification models
implement regression algorithms as well.
Essentially, supervised learning, for the most part,
is glorified statistical mathematics for
pattern recognition problems, rebranded as machine learning
because the algorithms are applied in a way
in which we iterate through them,
in other words, train the models,
to increase their predictive accuracy.
As a side note,
I highly recommend you watch the previous video
in the series,
to get a deeper understanding of supervised learning
as we walked through quite an intensive example.
Additionally, there is one important bit of terminology
I wanna discuss
that we skipped over in the previous video.
In machine learning, variables are referred to as features.
Variables, attributes, properties, features:
they all mean the same thing,
but for the sake of keeping our terminology consistent
with industry standard,
I will use features going forward.
Coming back on topic, with this recap out of the way,
we can now move on to the next subset of machine learning.
Unsupervised learning.
Whereas supervised learning is best suited for data
that is labeled and structured,
unsupervised learning is for data that is unlabeled
and unstructured.
In other words, we have various input features
but don't know what their corresponding outputs will be.
In some cases we don't even know
what the input features mean.
Unsupervised learning is more representative
of most real world problems we have to solve
and primarily takes place in the crossover
between big data and the field of AI
where these unsupervised algorithms are given the task
of deriving structure from unstructured data.
Unsupervised learning, like supervised learning,
is additionally subdivided into two primary types
of learning models it covers: association and clustering.
As in the case of supervised learning,
where regression is for predicting continuous data
and classification for discrete,
in unsupervised learning, association is for continuous data
and clustering for discrete.
To begin with, we will delve deeper into clustering.
Whereas in classification we have predefined labels
and are trying to fit new data into the correct category
based on our decision boundaries,
in clustering, these labels must be derived by viewing
the relationships between many data points.
One of the most well known clustering algorithms
is K-means clustering.
This algorithm's job is to analyze a decision space
consisting of a number of data points denoted by N
and divide them into a number of discrete categories
denoted by K.
This number K can be predefined
or the algorithm can determine the best number
through the use of an error function.
Let's do a brief example.
Say we have data points consisting of the features
of watch time and engagement for various videos,
with the goal of determining if and when
they will be recommended or not.
This example is similar to the last video
except now this YouTube data is unlabeled and unstructured.
Now first off, we have to decide
the number of clusters, K, that our data will be divided into.
This could be predefined,
but for our case let's use an error function.
In K-means clustering,
the sum of squared error function is often used
to find an optimal K value.
As you can see, while increasing K will give less error,
after a certain point, known as the graph's elbow,
increasing K yields diminishing returns,
requiring more computing power
and increasing the risk of overfitting,
a concept we will discuss shortly.
The elbow of the error plot, for example, is four.
Therefore, we will divide our decision space
into four clusters.
This is done by first adding four centroids,
defining the centers of their respective clusters.
Now the initial centroid locations
are found by choosing areas
with a high density of points
with similar feature conditions.
Once the initial cluster points are chosen,
the algorithm then assigns the data points
to their respective clusters.
We then update the centroids
and once again reassign data points to their clusters.
These steps are iterated upon
until the centroids stop moving
or points stop switching clusters.
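To make those two steps a little more concrete, here is a bare-bones, hypothetical version of the assign-and-update loop in Python; real implementations such as scikit-learn's add smarter initialization and convergence checks on top of this.

```python
import numpy as np

def kmeans_step(X, centroids):
    """One K-means iteration: assign points to clusters, then update centroids."""
    # Step 1: assign each point to its nearest centroid.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 2: move each centroid to the mean of the points assigned to it
    # (leaving it in place if no points were assigned).
    new_centroids = np.array([
        X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(len(centroids))
    ])
    return labels, new_centroids

# Calling kmeans_step repeatedly until the centroids stop moving
# (or points stop switching clusters) completes the algorithm.
```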
At the end of our example,
we now have four discrete clusters,
with red defining not recommended,
blue as recommended within one day of upload,
yellow within one week,
and purple within one month.
Now these labels, once the clusters are defined,
would be given by the data scientists
and machine learning engineers analyzing the results
after the decision space has been divided.
However, as you can see,
this unsupervised learning algorithm did its job
and derived structure from unstructured data
which allowed human scientists and engineers
to be able to decipher and utilize the data.
Now before continuing, keep in mind,
this was just for our two-dimensional,
in other words two-feature, example.
As seen in the last video,
with a more real world representative example
with many features,
this will get increasingly complicated.
As we go on to higher dimensional spaces,
we'll see how this issue is resolved
as we cover the next field
in unsupervised learning.
Association. To understand this concept a bit better,
think of it like this.
A clustering problem is where we try to group customers
based on their purchase behavior.
Whereas an association problem
is one where we would wanna see
if a customer that bought product X
would also tend to buy product Y,
in other words, the correlations
between features of a data set.
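To put a rough number on that idea, here is a small, hypothetical sketch that computes the support and confidence of the rule "bought X implies bought Y" from a boolean purchase matrix; the customers and products are invented purely for illustration, and algorithms like Apriori essentially search over many such rules efficiently.

```python
import numpy as np

# Hypothetical purchase matrix: one row per customer, one column per
# product, True where the product was bought.
purchases = np.array([
    [True,  True,  False],   # bought X and Y
    [True,  True,  True],    # bought X, Y and Z
    [True,  False, False],   # bought only X
    [False, True,  True],    # bought Y and Z
])
X_col, Y_col = 0, 1

support_x = purchases[:, X_col].mean()                           # P(bought X)
support_xy = (purchases[:, X_col] & purchases[:, Y_col]).mean()  # P(bought X and Y)
confidence = support_xy / support_x                              # P(bought Y | bought X)

print(f"support(X) = {support_x:.2f}, confidence(X -> Y) = {confidence:.2f}")
# Apriori-style algorithms keep only the rules whose support and
# confidence exceed chosen thresholds.
```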
Viewing this in a different format,
a matrix where the columns each specify a feature
and the rows each correspond to a data point,
in clustering algorithms
like the example we recently went through,
the goal is to reduce complexity in the rows,
that is, clustering various similar data points together.
Going a step further then,
in order for association algorithms
like Apriori to derive meaningful associations
between features, also referred to as association rules,
the complexity in the columns must be reduced.
Another word for this column complexity reduction
is dimensionality reduction.
The dimensionality of data
is the number of features needed to uniquely represent
a single point of data.
As you saw in the previous video in the series,
when our example had two features,
we could represent it in two dimensions.
With three features we ended up in three dimensions,
and so the trend continues.
Every form of data has to be converted into a feature set
before it can be analyzed.
This process is called feature extraction
and there are many trade-offs to be made
in the selection of the number of features.
If you wanna keep the feature set simple,
in other words low dimensionality,
then you run the risk of not being able
to uniquely identify every point of data in your data set,
meaning your algorithm of choice
will not be able to derive patterns from the data,
in other words, underfitting.
On the other hand, if your feature set is complex,
high dimensionality,
then we run into what is called
the curse of dimensionality.
This is when, as more dimensions are added to a data point,
the data set becomes too sparse
to find any meaningful patterns.
In other words,
adding additional dimensions
has made the data too spread out over the decision space.
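As a rough, made-up illustration of that spreading out, the following sketch keeps the number of points fixed and measures how far each point sits from its nearest neighbour as dimensions are added; the exact numbers are arbitrary, but the trend is the point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed number of points, increasing number of dimensions.
for dims in (2, 10, 100):
    points = rng.random((200, dims))  # 200 points in a unit hypercube
    # Pairwise distances, ignoring each point's distance to itself.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    avg_nn = dists.min(axis=1).mean()
    print(f"{dims:>3} dimensions -> average nearest-neighbour distance {avg_nn:.2f}")

# The same 200 points sit further and further apart as dimensions are added,
# i.e. the data becomes too sparse for patterns to be found.
```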
Additionally, another issue that arises
from high dimensionality is overfitting,
when the model becomes too rigid to adapt to new data.
This is because the algorithm you used
to analyze the decision space
has made correlations and associations between features
that actually have no intrinsic meaning.
Sparse and rigid data are a big reason
why expert systems failed to materialize promised results
back in the day,
and why high dimensionality is a much harder issue
to solve than low dimensionality.
Hence bringing us back to our starting point,
the need for dimensionality reduction
in order for association algorithms
to be able to extract meaningful correlations
from data.
A technique rising in popularity for dimensionality reduction
revolves around what is referred
to as the manifold hypothesis.
The manifold hypothesis states
that high-dimensional data
actually lies on low-dimensional manifolds
embedded in high-dimensional space,
with a manifold, in layman's terms,
being the surface of any shape.
Simply put, the manifold hypothesis states
that high-dimensional data can be represented
by the shape its low-dimensional data produces
after transformations are applied.
These transformations the data undergoes
must be homeomorphic,
meaning that the data must be able
to be inversely transformed back into its original form
and not destroyed in the transformation.
This low-dimensional representation of the original data set
then contains the reduced feature set needed
to represent the problem at hand
and still produce meaningful results
and associations.
There are multiple algorithms
for manifold learning to derive
these low-dimensional shapes.
To list two of the many:
one, Principal Component Analysis (PCA),
for linear manifolds, in other words planes,
and two, Isomap, for nonlinear manifolds,
meaning any curved surface.
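For a concrete feel of what those two look like in practice, here is a minimal sketch using scikit-learn's PCA and Isomap on synthetic data; the data, the neighbour count, and the choice of two output dimensions are all illustrative assumptions rather than anything from the video.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))  # fake high-dimensional data: 300 points, 10 features

# Linear manifold (a plane): project onto the top two principal components.
X_pca = PCA(n_components=2).fit_transform(X)

# Nonlinear (curved) manifold: Isomap preserves distances measured along
# the manifold while reducing to two dimensions.
X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

print(X_pca.shape, X_iso.shape)  # both (300, 2): a reduced feature set
```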
This process of dimensionality reduction,
feature selection and extraction
is an entire subfield in machine learning
referred to as Feature Engineering
and something that'll be touched on much more heavily
in this channel's upcoming deep learning series.
Now I wanna once again stress that, for the sake of time
and explanation,
many generalizations are made in this video
with the goal of simplifying what are in reality
very complex topics that have a lot of overlap.
As stated in the disclaimer
at the start of all these AI videos,
my goal here is to give introductory overviews
to core concepts.
After which point you can satisfy your curiosity
to learn more by watching other amazing creators
on this platform and resources on the web.
One such resource I use and I highly recommend
is Brilliant.
If you wanna learn more about machine learning,
and I mean really learn how these algorithms work,
from supervised methodologies such as regression
and classification to unsupervised learning
and more, then brilliant.org is a place for you to go.
For instance, their course on machine learning
goes through many of the concepts we have discussed
in these past videos.
Now what I love about how these topics
and these courses are presented
is that first an intuitive explanation is given
and then you're taken through the related problems.
If you get a problem wrong,
you see an intuitive explanation of where you went wrong
and how to rectify that flaw.
My primary goal with this channel
is to inspire and educate about the various technologies
and innovations that are changing the world.
But to do so on a higher level
requires going a step beyond these videos
and actually learning the mathematics
and science behind the concepts I discuss.
Brilliant does this by making math
and science learning exciting and cultivates curiosity
by showing the interconnectedness
between a variety of different topics.
Additionally, now with offline mode
on Brilliant's mobile apps,
you can learn on the go
with the ability to download any of their interactive courses.
To support Futurology
and learn more about Brilliant,
go to brilliant.org/futurology and sign up for free.
Additionally, the first 200 people that go to that link
will get 20% off their annual premium subscription.
(reggae music)
At this point, the video has concluded.
We'd like to thank you for taking the time to watch it.
If you enjoyed it,
consider supporting us on Patreon or YouTube membership
to keep this brand growing.
And if you have any topic suggestions,
please leave them in the comments below.
Consider subscribing for more content
and check out our website and our parent company, Earth-one
for more information.
This has been Honqua,
you've been watching futurology
and we'll see you again soon.
