# UNSUPERVISED LEARNING


Machine Learning for Financial Data

March 2021

Contents

◦ Introduction

◦ Cluster Analysis

◦ K-Means Clustering

◦ K-Modes Clustering

◦ Density-based Clustering

Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 2

Unsupervised Learning

Introduction

Machine learning focuses primarily on supervised learning but the vast majority of the available data is unlabelled!

▪ Most of the applications of ML today are based on supervised learning

▪ The vast majority of the available data is unlabeled

▪ Having the input features X but not the labels y

▪ To develop a regular binary classifier to predict whether an item shown in a picture is defective or not, you will need to label every single picture as “defective” or “normal”

▪ Labelling generally requires human experts to manually go through all the pictures

▪ A long, costly, and tedious task, so usually done on only a small subset of the available pictures

▪ The labeled dataset will be quite small and the classifier’s performance will be disappointing

▪ Every time any change is made to the system, the labelling process will need to be repeated


Unsupervised Learning

Unsupervised learning refers to the use of ML algorithms to identify patterns in datasets containing data points that are neither classified nor labeled. The algorithms are thus allowed to classify, label and/or group the data points contained within the datasets without having any external guidance in performing that task. The ML algorithms will group data points according to similarities and differences even though there are no categories provided.


Unsupervised learning algorithms can only learn from samples themselves as there are no data labels to learn from

▪ In unsupervised learning, there is no hidden teacher, so the main goals cannot be related to minimizing the prediction error with respect to the ground truth

▪ Unsupervised learning algorithms have to learn some pieces of information without any formal indication

▪ The only option is to learn from the samples themselves

▪ An unsupervised algorithm is usually aimed at discovering the similarities and patterns among samples or reproducing an input distribution given a set of features drawn from it


Unsupervised learning can be more unpredictable than supervised learning, such as creating clutter instead of order

▪ Unsupervised learning can be more unpredictable than a supervised learning model

▪ An unsupervised learning system might, for example, figure out on its own how to sort cats from dogs

▪ Such an unsupervised learning system might also add unforeseen and undesired categories to deal with unusual breeds, creating clutter instead of order

▪ ML systems capable of unsupervised learning are often associated with generative learning models

▪ Chatbots, self-driving cars, facial recognition programs, expert systems and robots are among the systems that may use either supervised or unsupervised learning approaches, or both


Cluster Analysis / Clustering

Clustering

The task of identifying like with like and assigning instances to clusters, or groups of similar instances. Just like in classification, each instance gets assigned to a group. However, unlike classification, clustering is an unsupervised task. Also, clustering has no notion of correctness.


Classification uses labelled data whereas clustering uses unlabelled data

Samples with labels

Samples without labels

Are you able to identify the clusters?


Clustering algorithms can identify the 3 clusters fairly well making only 5 mistakes out of 150 samples!

Data preparation use cases

▪ Data analysis

▪ When you analyze a new dataset, it can be helpful to run a clustering algorithm, and then analyze each cluster separately

▪ Dimensionality reduction

▪ Once a dataset has been clustered, it is usually possible to measure each instance’s affinity with each cluster (affinity is any measure of how well an instance fits into a cluster)

▪ Each instance’s feature vector x can then be replaced with the vector of its cluster affinities

▪ If there are k clusters, then this vector is k-dimensional

▪ This vector is typically much lower-dimensional than the original feature vector, but it can preserve enough information for further processing
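A minimal sketch of replacing a feature vector with its cluster affinities. It assumes that affinity is measured as the Euclidean distance to each cluster centroid, which is only one possible choice since the slide leaves the measure open. The function name `cluster_affinities` and the sample centroids are illustrative.

```python
import math

def cluster_affinities(x, centroids):
    """Map a feature vector to a k-dimensional vector of distances to k centroids."""
    return [math.dist(x, c) for c in centroids]

# Hypothetical centroids of k = 2 clusters in a 4-dimensional feature space
centroids = [(0.0, 0.0, 0.0, 0.0), (3.0, 4.0, 0.0, 0.0)]
x = (3.0, 0.0, 0.0, 0.0)

reduced = cluster_affinities(x, centroids)
print(reduced)  # one distance per cluster: a 2-dimensional representation of x
```

Here a 4-dimensional vector becomes a 2-dimensional one because there are k = 2 clusters.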


▪ Semi-supervised learning

▪ If you only have a few labels, you could perform clustering and propagate the labels to all the instances in the same cluster

▪ This technique can greatly increase the number of labels available for a subsequent supervised learning algorithm, and thus improve its performance

▪ Anomaly detection (outlier detection)

▪ Any instance that has a low affinity to all the clusters is likely to be an anomaly

▪ For example, if you have clustered the users of your website based on their behavior, you can detect users with unusual behavior, such as an unusual number of requests per second

▪ Anomaly detection is particularly useful in detecting defects in manufacturing, or for fraud detection
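The anomaly detection idea above can be sketched as follows. This is an illustration under stated assumptions: "low affinity to all the clusters" is read as "far from the nearest centroid", and the centroids, the sample points, and the threshold of 2.0 are all made up for the example.

```python
import math

def find_anomalies(points, centroids, threshold):
    """Return the points whose distance to the nearest centroid exceeds the threshold."""
    anomalies = []
    for p in points:
        nearest = min(math.dist(p, c) for c in centroids)
        if nearest > threshold:
            anomalies.append(p)
    return anomalies

centroids = [(1.0, 1.0), (8.0, 8.0)]            # hypothetical cluster centres
points = [(1.2, 0.9), (7.5, 8.1), (4.5, 15.0)]  # the last point fits neither cluster
anomalies = find_anomalies(points, centroids, threshold=2.0)
print(anomalies)
```

In practice the threshold would have to be chosen from the data, for example from the distribution of distances within each cluster.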


Customer segmentation, recommendation system & image segmentation use cases

▪ Customer segmentation

▪ For marketing campaigns and recommender systems

▪ Search engines

▪ Some search engines let you search for images that are similar to a reference image

▪ To build such a system, you would first apply a clustering algorithm to all the images in your database; similar images would end up in the same cluster

▪ Then when a user provides a reference image, all you need to do is use the trained clustering model to find this image’s cluster, and you can then simply return all the images from this cluster

▪ Segment an image

▪ By clustering pixels according to their color, then replacing each pixel’s color with the mean color of its cluster, it is possible to considerably reduce the number of different colors in the image

▪ Image segmentation is used in many object detection and tracking systems, as it makes it easier to detect the contour of each object


Clustering algorithms group samples according to their similarities, which capture the distances between samples

$$d_{sim}(\bar{x}_i, \bar{x}_j) = \frac{1}{\delta(\bar{x}_i, \bar{x}_j) + \epsilon}$$

$d_{sim}$ measures the similarity between 2 vectors

$X = \{\bar{x}_1, \bar{x}_2, \cdots, \bar{x}_N\}$ is the dataset to be clustered

$N$ is the number of data points in the dataset

$\epsilon$ is a constant introduced to avoid division by 0

$$\delta(\bar{x}_i, \bar{x}_j) = \sqrt{\sum_{l=1}^{m} (x_{il} - x_{jl})^2}$$

$\delta$ measures the Euclidean distance between 2 vectors

$m$ is the number of features in a vector

$\bar{x}_i = (x_{i1}, x_{i2}, \cdots, x_{im})$ is a sample vector

$$C_i = \{\bar{x}_j : d_{sim}(\bar{x}_j, \bar{\mu}_i) > d_{sim}(\bar{x}_j, \bar{\mu}_k)\}, \quad k \in \{1, 2, \cdots, i-1, i+1, \cdots, K\}$$

$C_i$, $C_k$ are clusters generated by the clustering algorithm

$\bar{\mu}_i$ is a representative vector of $C_i$

$\bar{\mu}_k$ is a representative vector of $C_k$

$K$ is the number of clusters
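A small sketch of the similarity measure and the cluster membership rule defined on this slide. The value chosen for the constant ε and the helper name `assign` are assumptions made for illustration.

```python
import math

EPSILON = 1e-9  # small constant to avoid division by zero (the value is an assumption)

def delta(x_i, x_j):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))

def d_sim(x_i, x_j):
    """Similarity as the inverse of the distance, so closer vectors score higher."""
    return 1.0 / (delta(x_i, x_j) + EPSILON)

def assign(x, representatives):
    """Index of the cluster whose representative vector is most similar to x."""
    return max(range(len(representatives)),
               key=lambda i: d_sim(x, representatives[i]))

# Two hypothetical representative vectors (for example, cluster centroids)
representatives = [(0.0, 0.0), (10.0, 10.0)]
print(assign((1.0, 2.0), representatives))
print(assign((9.0, 8.0), representatives))
```

Maximizing the similarity is equivalent to minimizing the Euclidean distance, which is why implementations usually compare distances directly.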


Cluster algorithms produce different types of clustering results

▪ Hierarchy vs. Partitions

▪ Exclusive Membership vs. Multiple Membership

▪ Hard Assignment vs. Soft Assignment

▪ Complete vs. Incomplete


K-Means Clustering

Linear Space

The challenge is to get a computer to identify the same three clusters that are relatively obvious to the naked eye


Select the number of clusters (K=3) to identify in the dataset and randomly select 3 data points as cluster centroids

Centroids of 3 new clusters


For each data point, find the closest centroid and assign the data point to the corresponding cluster

Distance from the 1st data point to the blue cluster

Assign the 1st data point to the blue cluster


For each cluster, calculate the new centroid using the cluster’s data points


For each data point, re-cluster it to the cluster corresponding to the closest centroid

No change in this case

The clustering algorithm has converged!

Is that the end? No, not when working in a linear space!
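The assign-and-update loop described on the preceding slides can be sketched for data on a line as follows. This is a minimal illustration and not the implementation behind the slides; the data values and the function name `kmeans_1d` are assumptions.

```python
def kmeans_1d(points, initial_centroids, max_iter=100):
    """Assign/update iteration on 1-D data, starting from the given centroids."""
    centroids = list(initial_centroids)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the closest centroid for every point
        new_assignment = [
            min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster: the algorithm has converged
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its cluster's points
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = sum(members) / len(members)
    return centroids, assignment

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 20.0, 21.0, 22.0]

# Three data points picked as initial centroids, all from the left group
poor_centroids, poor_labels = kmeans_1d(points, [1.0, 2.0, 3.0])
print(poor_centroids, poor_labels)

# Initial centroids spread over the three visible groups
good_centroids, good_labels = kmeans_1d(points, [2.0, 11.0, 21.0])
print(good_centroids, good_labels)
```

Both runs converge, but the first one stops at centroids 1.0, 2.5 and 16.0 and lumps the two right-hand groups together. Convergence alone therefore does not mean the clustering is the best one available.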


Quality of the clustering can be assessed by adding up the variation within each cluster

total variation within the clusters

As far as the algorithm goes, it cannot tell whether this clustering is the best one and therefore the one to predict.

The algorithm can only repeat the process with different initial centroids and rate the quality of each result using the total variation within the clusters.


Calculate the total variation resulting from using the 3 randomly picked new centroids

total variation of another set of clusters


Iterate the clustering with new centroids and record the corresponding total variation

The algorithm will do a few iterations of clustering (it will do as many as you tell it to do) and suggest the one with the least total variation.

[Figure labels: total variation of the 1st clustering; total variation of the 2nd clustering; total variation of the 3rd clustering]
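Repeating the clustering from different random starting centroids and keeping the run with the least total variation could look like the sketch below. Total variation is taken here as the sum of squared distances from each point to its centroid, which is an assumption about how the slides measure it. The function names, the data, and the seed are illustrative.

```python
import math
import random

def kmeans(points, k, rng, max_iter=100):
    """One K-Means run starting from k randomly chosen data points."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

def total_variation(points, centroids, labels):
    """Sum of squared distances from every point to its assigned centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

def best_of(points, k, n_runs, seed=0):
    """Run K-Means n_runs times and return (best run, all runs)."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_runs):
        centroids, labels = kmeans(points, k, rng)
        runs.append((total_variation(points, centroids, labels), centroids, labels))
    return min(runs, key=lambda run: run[0]), runs

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (2, 9), (1, 9)]
best, runs = best_of(points, k=3, n_runs=5)
print([round(run[0], 2) for run in runs])  # total variation of each run
print(round(best[0], 2))                   # the smallest one is kept
```

More runs raise the chance of finding a good clustering but never guarantee the global optimum.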


Multi-dimensional Space

In the same fashion, initial centroids are selected in the multi-dimensional space


The Euclidean distances of each data point from the three clusters are then measured to decide the clustering

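$$d = \sqrt{x^2 + y^2}$$

where $x$ and $y$ are the horizontal and vertical offsets between the data point and a centroid

The data point is assigned to the closest cluster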


The centre of each cluster is then calculated and all data points will be re-clustered using the new centres


Repeat the process until the centroid values converge or the maximum iteration limit has been reached


Recalculating the centroids effectively converges to a locally optimal clustering, but it may not be globally optimal


Hyperparameter Tuning

Unsupervised learning has no ground truth to evaluate model performance

▪ Understanding the performance of unsupervised learning methods is inherently much more difficult than that of supervised learning methods because there is no ground truth available

▪ Moreover, K-means explicitly requires the number of clusters as a hyperparameter

▪ K-means performance can be evaluated for different values of K

▪ We can also use the elbow method or the silhouette coefficient to find the optimal number of clusters K for the unsupervised learning model


It is genuinely ambiguous how many clusters there are in a dataset and there is no way to decide this automatically

4 clusters?

OR

2 clusters?


Sometimes the number of clusters to use is imposed by external constraints (e.g. later or downstream processing)

[Figure: two scatter plots of HEIGHT and WEIGHT, one grouped into three T-SHIRT SIZE clusters (S, M, L) and one into five (XS, S, M, L, XL)]

Elbow Method

◦ The elbow method is used to select the optimal number of clusters by examining the visualization of the data

◦ Inertia is used as the cost function

$$\sum_{i=1}^{N} (x_i - \mu_i)^2$$

$\mu_i$ is the centroid closest to the data point $x_i$

$N$ is the number of data points in the dataset

◦ The elbow method requires drawing a line plot using the cost function against the number of clusters

◦ The elbow point is a point of the plot after which the plot starts to flatten out
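A sketch of the two ingredients of the elbow method: the inertia for a given set of centroids, and a way to locate the bend in the inertia curve. The slides read the elbow off the plot by eye. Picking the K with the largest second difference is a simple heuristic assumed here for illustration, and the inertia values are made up.

```python
import math

def inertia(points, centroids):
    """Sum over all points of the squared distance to the closest centroid."""
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)

def elbow(inertias):
    """Return the K at which the inertia curve bends the most.

    `inertias` maps consecutive values of K to their inertia. The bend at K is
    the drop gained by moving to K minus the drop gained by moving past K.
    """
    ks = sorted(inertias)
    best_k, best_bend = None, float("-inf")
    for prev, k, nxt in zip(ks, ks[1:], ks[2:]):
        bend = (inertias[prev] - inertias[k]) - (inertias[k] - inertias[nxt])
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

# Inertia of two points around a single centroid
print(inertia([(0.0, 0.0), (2.0, 0.0)], [(1.0, 0.0)]))

# Hypothetical inertia values for K = 1..5; the curve flattens after K = 3
curve = {1: 1000.0, 2: 600.0, 3: 150.0, 4: 120.0, 5: 100.0}
print(elbow(curve))
```

Inertia always decreases as K grows, so the minimum itself is not informative; only the point where the gains level off is.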

Ideally, the average intra-cluster distance should be much less than the inter-cluster distance to the nearest neighbour cluster

$a(x_i)$ = average intra-cluster distance for $x_i$ (the distance between $x_i$ and other points in the same cluster)

$b(x_i)$ = average inter-cluster distance for $x_i$ to its nearest neighbor cluster (the distance between $x_i$ and other points in the nearest other cluster)

▪ Objectives

▪ Points in the same cluster should be as similar as possible

▪ Points in different clusters should be as dissimilar as possible

▪ When $a(x_i) > b(x_i)$, it is likely that the data point $x_i$ has been misclassified

Silhouette Coefficient

◦ Evaluates the quality of clustering

◦ The coefficient ranges from -1 to 1

◦ Ideally, 𝑎(𝑥𝑖) = 0 and 𝑏(𝑥𝑖) = ∞, therefore S(𝑥𝑖) = 1, suggesting dense and well-separated clusters

◦ In the worst case scenario, 𝑎 𝑥𝑖 = ∞ and 𝑏 𝑥𝑖 = 0 giving S(𝑥𝑖)=−1 suggesting wrong clustering

◦ S(𝑥𝑖) near 0 suggests overlapping clusters with data points very close to the cluster boundary of the nearest neighbor cluster

◦ The coefficient is calculated for each data point in the dataset

◦ Plotting the data points against their silhouette coefficients provides the silhouette plot

$$S(x_i) = \frac{b(x_i) - a(x_i)}{\max\{a(x_i),\, b(x_i)\}}$$

Silhouette score is calculated for each data point in the dataset – that is for all data points in all clusters

$$S(x_i) = \frac{b(x_i) - a(x_i)}{\max\{a(x_i),\, b(x_i)\}}$$

1 means the data point is far away from the neighboring clusters meaning minimal confusion and good clustering (positive means the data point is closer to the assigned cluster than it is to neighboring clusters)

0 means the data point lies on the boundary between the assigned cluster and the next closest cluster

-1 means the data point is assigned to an incorrect cluster and the data point in fact likely belongs to a neighboring cluster

𝑎 = Mean Intra−cluster Distance

Mean distance between a data point and all other data points in the same cluster

𝑏 = Mean Nearest−cluster Distance

Mean distance between a data point and all other data points of the nearest neighbour cluster
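The definitions of a, b and S above translate into the following sketch. Scoring a point that is alone in its cluster as 0 is a common convention assumed here, since the slides do not cover that case. The sample points are illustrative.

```python
import math

def silhouette(points, labels):
    """Silhouette coefficient S(x_i) for every point, given its cluster label."""
    scores = []
    for i, (p, label_p) in enumerate(zip(points, labels)):
        # Collect distances from p to every other point, grouped by cluster
        dists = {}
        for j, (q, label_q) in enumerate(zip(points, labels)):
            if i != j:
                dists.setdefault(label_q, []).append(math.dist(p, q))
        own = dists.pop(label_p, None)
        if not own or not dists:
            scores.append(0.0)  # singleton cluster or only one cluster (assumed convention)
            continue
        a = sum(own) / len(own)                            # mean intra-cluster distance
        b = min(sum(d) / len(d) for d in dists.values())   # mean nearest-cluster distance
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return scores

points = [(0, 0), (0, 1), (10, 0), (10, 1)]

good = silhouette(points, [0, 0, 1, 1])   # two tight, well-separated clusters
bad = silhouette(points, [0, 1, 1, 1])    # (0, 1) is grouped with the far cluster
print([round(s, 2) for s in good])
print([round(s, 2) for s in bad])
```

Averaging the per-point scores gives the average silhouette coefficient used to compare different values of K.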


The Silhouette plot shows two clusters that are dense and well-separated


The Silhouette plot shows three clusters that are dense except for one cluster


The Silhouette plot shows four clusters that are also dense and well-separated


The Silhouette plot shows five clusters that are not so dense


The Silhouette plot shows six clusters that are not dense


Misclassified data points are shown on the left of the Silhouette Plot

SILHOUETTE PLOT (average silhouette coefficient: 0.16)

▪ CLUSTER 1: 16 data points, average Silhouette coefficient 0.03

▪ CLUSTER 2: 11 data points, average Silhouette coefficient 0.26

▪ CLUSTER 3: 20 data points, average Silhouette coefficient 0.21

Outlier points are those with the Silhouette coefficient value less than 0


The optimal K is chosen based on the number of outliers and the average Silhouette coefficients

▪ 2 CLUSTERS: average Silhouette coefficient 0.705

▪ 3 CLUSTERS: average Silhouette coefficient 0.588

▪ 4 CLUSTERS: average Silhouette coefficient 0.651

▪ 5 CLUSTERS: average Silhouette coefficient 0.564

▪ 6 CLUSTERS: average Silhouette coefficient 0.450
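As a worked example, the averages quoted on this slide can be ranked directly. The sketch uses only the average coefficients, because the number of outliers per clustering is shown in the plots and is not available as text.

```python
# Average Silhouette coefficients per number of clusters, as quoted on the slide
avg_silhouette = {2: 0.705, 3: 0.588, 4: 0.651, 5: 0.564, 6: 0.450}

# Rank the candidate values of K from the highest average coefficient to the lowest
ranked = sorted(avg_silhouette, key=avg_silhouette.get, reverse=True)
print(ranked)  # [2, 4, 3, 5, 6]
```

On the averages alone K = 2 ranks first with K = 4 second, so the outlier counts would decide between them.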


K-Means in a Nutshell

1. Feature Data Types: Numerical.

2. Target Data Types: Categorical.

3. Key Principles: Likeness is described as a function of Euclidean distance. The goal is to find K centroids (therefore clusters) that minimize the within-cluster Euclidean distances. Will group together all data points in the space until no points are left.

4. Hyperparameters: Number of clusters (K).

5. Data Assumptions: Distance metric assumes clusters are spheres. Features are uncorrelated. Normalized.

6. Performance: Fast. Very scalable due to linear time and memory complexity. Even cluster size.

7. Accuracy: Will always converge. Converges to local optimum. May not produce meaningful clusters in a sparse feature space with outliers. Intuition fails in high dimensions and dimensionality reduction is therefore advised as part of the pre-processing.

8. Explainability:


K-Modes Clustering

K-Modes Clustering

K-Modes clustering extends K-Means clustering by replacing cluster means with cluster modes. Modes are updated based on frequency. It is widely used for grouping categorical data. It defines clusters based on the number of matching categories between data points using a simple similarity measure.


The algorithm is essentially the same as K-Means except the cost function is based on equality over categories

$$P(W, Q) = \sum_{l=1}^{K} \sum_{i=1}^{N} w_{il} \cdot d_{sim}(x_i, q_l)$$

$P$ is the cost function for the clustering

$W$ is an $N \times K$ matrix of either 0 or 1 representing cluster membership

$N$ is the number of data points in the dataset

$K$ is the number of clusters

$Q$ is the set of cluster centroid vectors

$X$ is the dataset to be clustered

$$d_{sim}(x_i, q_l) = \sum_{j=1}^{m} \delta(x_{ij}, q_{lj})$$

$d_{sim}$ measures the similarity between 2 vectors

$$\delta(x_{ij}, q_{lj}) = \begin{cases} 1 & \text{if } x_{ij} = q_{lj} \\ 0 & \text{if } x_{ij} \neq q_{lj} \end{cases}$$

$\delta$ measures the similarity between 2 features
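One assignment step and one mode update of K-Modes, following the matching measure on this slide where δ counts agreeing categories. Each record is therefore assigned to the mode it matches on the most features, which is the same as choosing the mode with the fewest mismatches. The records and function names are made up for illustration.

```python
from collections import Counter

def d_sim(x, q):
    """Number of features on which record x and mode q have the same category."""
    return sum(1 for a, b in zip(x, q) if a == b)

def assign(records, modes):
    """Assign each record to the mode it matches on the most features."""
    return [max(range(len(modes)), key=lambda l: d_sim(x, modes[l]))
            for x in records]

def update_modes(records, labels, modes):
    """Recompute each mode as the most frequent category of every feature."""
    new_modes = []
    for l, old_mode in enumerate(modes):
        members = [x for x, lab in zip(records, labels) if lab == l]
        if not members:
            new_modes.append(old_mode)  # an empty cluster keeps its previous mode
            continue
        new_modes.append(tuple(Counter(column).most_common(1)[0][0]
                               for column in zip(*members)))
    return new_modes

records = [
    ("red", "small", "round"), ("red", "small", "square"), ("red", "large", "round"),
    ("blue", "large", "square"), ("blue", "large", "round"), ("blue", "small", "square"),
]
modes = [records[0], records[3]]            # two records picked as initial modes
labels = assign(records, modes)
modes = update_modes(records, labels, modes)
print(labels)
print(modes)
```

As with K-Means, the two steps would be repeated until the assignments stop changing.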


Density-based Clustering

Density-based Spatial Clustering of Applications with Noise (DBSCAN)

Shortcomings of Simple Clustering

▪ Clustering algorithms discussed so far are suitable for finding spherical-shaped clusters or convex clusters

▪ In other words, they work well only for compact and well-separated clusters

▪ Moreover, they are also severely affected by the presence of noise and outliers in the dataset

▪ Unfortunately, real life data may exhibit arbitrary shapes and properties (including multiple shapes)


K-Means runs into problems with clusters of different sizes

[Figure panels: GROUND TRUTH; THREE CLUSTERS FROM K-MEANS]

K-Means runs into problems with clusters of different densities

[Figure panels: GROUND TRUTH; THREE CLUSTERS FROM K-MEANS]

K-Means runs into problems with clusters of non-spherical or non-convex shapes

[Figure panels: GROUND TRUTH; TWO CLUSTERS FROM K-MEANS]

Shortcomings of K-Means with cluster size can be dealt with by first using more clusters and then merging them

[Figure panels: GROUND TRUTH; CLUSTERS FROM K-MEANS]

Shortcomings of K-Means with cluster densities can be dealt with by first using more clusters and then merging them

[Figure panels: GROUND TRUTH; CLUSTERS FROM K-MEANS]

Shortcomings of K-Means with cluster shapes can be dealt with by first using more clusters and then merging them

[Figure panels: GROUND TRUTH; CLUSTERS FROM K-MEANS]

DBSCAN

DBSCAN is a density-based clustering algorithm. Given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors) and marks as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away).


DBSCAN provides a more flexible and direct solution to address the shape and size issues with K-Means

[Figure panels: ORIGINAL DATA; CLUSTERS & NOISE POINTS FROM DBSCAN]

DBSCAN

◦ From each unvisited data point, measure the distance to every other point in the dataset

◦ All points that fall within the radius of neighborhood will be considered as neighbors

◦ If the number of neighbors reaches the minimum neighbor point threshold, the points are grouped together as a new cluster

◦ Data points not reachable from any cluster will be considered as noise

◦ Repeat the process until all data points are categorized in clusters or marked as noise
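The steps above can be sketched in plain Python as follows. This is a minimal illustration and not a reference implementation: it compares every pair of points, and it assumes that a point counts towards its own neighborhood when checking the threshold. The names `dbscan`, `eps`, and `min_pts` and the sample data are illustrative.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    def neighbors(i):
        # Indices of all points within eps of point i (including i itself)
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)      # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # may still be claimed later as a border point
            continue
        cluster += 1                   # point i is dense enough: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a dense point
            if labels[j] is not None:
                continue               # already assigned, do not expand again
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is dense too: keep growing the cluster
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point (5, 5) has no neighbors within the radius and ends up marked as noise, while the number of clusters is discovered rather than specified in advance.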


Unlike other clustering algorithms, not all data points are classified – unclassified data points are considered noise

[Figure: a core point, a border point, and a noise point with their 𝜀-neighborhoods, MinPts = 6]

▪ Core Point

▪ Has at least a minimum number of data points (MinPts) within its radius of neighborhood (𝜀-neighborhood)

▪ All core points within the 𝜀-neighborhood of a core point are grouped as a cluster

▪ Border Point

▪ Lies within the 𝜀-neighborhood of a core point but is not a core point itself because it has fewer than MinPts points in its 𝜀-neighborhood

▪ Will be grouped in the cluster of its nearest core point

▪ Noise Point

▪ Not reachable from any cluster

▪ Has fewer than MinPts points in its neighborhood and is not associated with a core point

▪ Excluded from clustering
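The three DBSCAN point types can be told apart with a short sketch like the one below. It assumes that a point counts towards its own 𝜀-neighborhood, and the sample points, `eps`, and `min_pts` values are made up for illustration.

```python
import math

def classify_points(points, eps, min_pts):
    """Tag every point as 'core', 'border' or 'noise'."""
    def neighbors(i):
        # Indices of all points within eps of point i (including i itself)
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    neighborhoods = [neighbors(i) for i in range(len(points))]
    is_core = [len(n) >= min_pts for n in neighborhoods]
    kinds = []
    for i, n in enumerate(neighborhoods):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in n):
            kinds.append("border")   # not dense itself, but next to a core point
        else:
            kinds.append("noise")    # neither dense nor near any core point
    return kinds

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2.2, 1), (6, 6)]
kinds = classify_points(points, eps=1.5, min_pts=4)
print(kinds)
```

The point (2.2, 1) has only two points in its neighborhood but lies within reach of the core point (1, 1), so it is a border point, while (6, 6) is noise.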

Hyperparameters can be tricky to tune

▪ Radius of Neighborhood (𝜀 / epsilon)

▪ The maximum distance between two points for them to count as neighbors, and so the maximum distance from a point to the nearest cluster it can join

▪ The greater the value, the fewer clusters are found because clusters eventually merge into other clusters

▪ Minimum Neighbor Points

▪ Required to produce a new cluster

▪ A larger value assures a more robust cluster but may exclude some smaller clusters as it attempts to merge them in a larger one

▪ Increases with the size of the dataset

▪ A smaller value may extract many clusters with possible inclusion of noise

DBSCAN moves through all data points to form clusters based on neighbourhood and density


DBSCAN in a Nutshell

1. Feature Data Types: Numerical. Should be scaled.

2. Target Data Types: Categorical.

3. Key Principles: Expands the distance metric with the notion of density and clusters are therefore high density areas. Cluster membership is based on neighbourhood radius and the number of data points in the neighbourhood. Identify core, boundary, and noise points. Noise points are excluded from clustering. Therefore less prone to the distortion caused by outliers.

4. Hyperparameters: K is not required. Neighbourhood radius (epsilon). Minimum data points per neighbourhood.

5. Data Assumptions: Will find clusters of arbitrary shapes and sizes including highly complex data.

6. Performance: It will often immensely outperform K-means (in practice, this often happens with highly intertwined, yet still discrete, data, such as a feature space containing two half-moons). Parameter tuning can be challenging. Finds non-convex and non-linearly separable clusters.

7. Accuracy: Difficulties with clusters of varying density and high-dimensional data.

8. Explainability:


THANK YOU