# CS代考 APS1070 – cscodehelp代写

APS1070

Foundations of Data Analytics and Machine Learning

Fall 2021

Week 4:

• Clustering

• Probability Theory

• SummaryStatistics

• Gaussian Distribution

• Performance Metrics

Prof.

Slide Attribution

These slides contain materials from various sources. Special thanks to the following authors:

• Sinisa Colic

•

•

• Zadeh

•

Quick review!

Algorithm

➢For a single nearest neighbour

+

4

How do we measure distance?

Dimension of x or y

5

Decision Boundary

➢Can generate arbitrary test points on the plane and apply kNN

➢The boundary between regions of input space assigned to different categories.

6

Tradeoffs in choosing k?

➢Small k

➢Good at capturing fine-grained patterns ➢May overfit, i.e. be sensitive to random noise

Excellent for training data, not that good for new data, too complex

➢Large k

➢Makes stable predictions by averaging over lots of

examples

➢May underfit, i.e. fail to capture important regularities

Not that good for training data, not good for new data, too simple

7

Constructing a Decision Tree

➢Decision trees make predictions by recursively splitting on different attributes according to a tree structure

width (cm)

8

Part 1 K-Means Clustering (Unsupervised Learning)

Clustering

➢Clustering algorithms group samples/instances based on similarities in features

➢Input: set of samples/instances described by features

➢Output: assigned cluster (group) for each sample/instance

➢Clustering are unsupervised techniques and do not have access to the sample/instance labels

10

Clustering Strategies

➢k-Means Clustering

➢Assigns each point to the nearest cluster center

➢Agglomerative clustering

➢Assumes each point is a cluster and iteratively merges the closest clusters

K-Means

➢Most well-known clustering method ➢Distance-based unsupervised learning algorithm,

NOT to be confused with k-NN.

➢ Algorithm:

1. Assign each sample/instance to its closest mean

2. Update the means based on the assignment

3. Repeat until convergence

➢ Requires:

—Selection of the number of clusters ‘k’ (hyperparameter) —Centre of each cluster is randomly initialized at the start

12

K-Means Example

• K is number of clusters

• STEP 1: Guess center locations

• STEP 2: Map out what data point is closest to what center

• STEP 3: Center moves to the centroid of all points it “owns”

• REPEAT!

13

K-Means Example

• K is number of clusters

• STEP 1: Guess center locations

• STEP 2: Map out what data point is closest to what center

• STEP 3: Center moves to the centroid of all points it “owns”

• REPEAT!

14

K-Means Example

• K is number of clusters

• STEP 1: Guess center locations

• STEP 2: Map out what data point is closest to what center

• STEP 3: Center moves to the centroid of all points it “owns”

• REPEAT!

15

K-Means Example

• K is number of clusters

• STEP 1: Guess center locations

• STEP 2: Map out what data point is closest to what center

• STEP 3: Center moves to the centroid of all points it “owns”

• REPEAT!

16

K-Means Example

• K is number of clusters

• STEP 1: Guess center locations

• STEP 2: Map out what data point is closest to what center

• STEP 3: Center moves to the centroid of all points it “owns”

• REPEAT!

17

K-means

Example

• K is number of clusters

• STEP 1: Guess center locations

• STEP 2: Map out what data point is closest to what center

• STEP 3: Center moves to the centroid of all points it “owns”

• REPEAT!

18

K-means

Example

• K is number of clusters

• STEP 1: Guess center locations

• STEP 2: Map out what data point is closest to what center

• STEP 3: Center moves to the centroid of all points it “owns”

• REPEAT!

19

K-means

Example

• K is number of clusters

• STEP 1: Guess center locations

• STEP 2: Map out what data point is closest to what center

• STEP 3: Center moves to the centroid of all points it “owns”

• REPEAT!

20

K-means

Example

• K is number of clusters

• STEP 1: Guess center locations

• STEP 2: Map out what data point is closest to what center

• STEP 3: Center moves to the centroid of all points it “owns”

• REPEAT!

21

K-means

Example

• K is number of clusters

• STEP 1: Guess center locations

• STEP 2: Map out what data point is closest to what center

• STEP 3: Center moves to the centroid of all points it “owns”

• REPEAT!

22

K-means

Example

• K is number of clusters

• STEP 1: Guess center locations

• STEP 2: Map out what data point is closest to what center

• STEP 3: Center moves to the centroid of all points it “owns”

• REPEAT!

23

K-means

Example

• K is number of clusters

• STEP 1: Guess center locations

• STEP 2: Map out what data point is closest to what center

• STEP 3: Center moves to the centroid of all points it “owns”

• REPEAT!

24

K-means

Example

• K is number of clusters

• STEP 1: Guess center locations

• STEP 2: Map out what data point is closest to what center

• STEP 3: Center moves to the centroid of all points it “owns”

• REPEAT!

25

K-means

Example

• K is number of clusters

• STEP 1: Guess center locations

• STEP 2: Map out what data point is closest to what center

• STEP 3: Center moves to the centroid of all points it “owns”

• REPEAT!

26

K-means

Example

• K is number of clusters

• STEP 1: Guess center locations

• STEP 2: Map out what data point is closest to what center

• STEP 3: Center moves to the centroid of all points it “owns”

• REPEAT!

27

K-means

Example

• K is number of clusters

• STEP 1: Guess center locations

• STEP 2: Map out what data point is closest to what center

• STEP 3: Center moves to the centroid of all points it “owns”

• REPEAT!

28

K-means

Example

• K is number of clusters

• STEP 1: Guess center locations

• STEP 2: Map out what data point is closest to what center

• STEP 3: Center moves to the centroid of all points it “owns”

• REPEAT!

29

K-means

Example

• K is number of clusters

• STEP 1: Guess center locations

• STEP 2: Map out what data point is closest to what center

• STEP 3: Center moves to the centroid of all points it “owns”

• REPEAT!

30

K-means

Example

• K is number of clusters

• STEP 1: Guess center locations

• STEP 2: Map out what data point is closest to what center

• STEP 3: Center moves to the centroid of all points it “owns”

• REPEAT!

31

K-means

Example

• K is number of clusters

• STEP 1: Guess center locations

• STEP 2: Map out what data point is closest to what center

• STEP 3: Center moves to the centroid of all points it “owns”

• REPEAT!

32

Doesn’t always work… can get stuck!

33

What are we trying to do?

Note: N points, K clusters Source: Ethan Fetaya, , Emad Andrews

34

Slide credit: Ethan Fetaya, , Emad Andrews

35

k-Means Convergence

➢Whenever an assignment is changed, the sum squared distances, J, of data points from their assigned cluster centers is reduced

➢Whenever a cluster is moved, J is reduced

➢Test for convergence: if the assignments do not change in the assignment step, we have converged (to at least a local minimum).

credit: Ethan Fetaya, , Emad Andrews

36

Local Minima

➢There is nothing to prevent k-means from getting stuck at a local minimum

Options for avoiding local minimum:

➢we could try many random starting points ➢split a big cluster into two

➢merge to nearby clusters

credit: Ethan Fetaya, , Emad Andrews

37

How to choose number of clusters (k)?

➢As k increase the sum of squared distance goes towards zero

➢If the plot looks like an arm, then the elbow of the arm is the optimal k

➢e.g., the elbow is at k=3 indicating the optimal number of clusters is 3

sum of squared distance

“k” number of clusters

38

Shape of K-Means Clusters

➢K-means split the space according on the closest mean:

Source: scikit-learn

39

➢Note that the clusters form convex regions.

Convex Region

➢A region is convex if any line between two points in the region remains in the region.

convex

nonconvex

convex

40

K-Means with Non-Convex Clusters

K-Means (K = 2)

Source:

K-means is unable to separate non-convex clusters

41

Agglomerative Clustering

➢A type of Hierarchical Clustering

➢ Algorithm:

1. Starts with each point in its

own cluster

2. Each step merges the two “closest” clusters

Source: MachineLearningStories

42

Example: Agglomerative Clustering

Dendrogram

bd ae

Distance Threshold

# of Clusters

abcdef

g

g

f

c

Comparison of Clustering Methods

➢ There are many clustering

algorithms to chose from

➢ Performance will depend on your data

Source: scikit-learn

44

Applications in Computer Vision

➢Replace samples with the mean of their cluster (vector quantization)

➢Visual example:

➢Inputs: color pixels represented as RGB values

➢Outputs: cluster (average RGB) value obtained using k-means

➢Q: How can this be applied for compression?

➢Q: How can this be applied for image segmentation?

Source: PRML Bishop 2006 45

K-Means Code Example (Google Colab)

46

Part 2 Probability Theory

Readings:

• Chapter 6.1-5 MML Textbook

Overview

➢Compared several machine learning algorithms

➢Supervised Learning ➢k-Nearest Neighbours ➢Decision Trees

➢Unsupervised Learning ➢k-Means Clustering

➢Q: How did we measure performance?

48

Measuring Uncertainty

➢In machine learning it is important to understand how confident we can be about the decisions (or predictions) being made by our models/algorithms.

➢Q: A model/algorithm is tested on two new samples and classifies both correctly. What is the accuracy and how confident are you with that result?

49

Agenda for parts 2 and 3 of Lecture 4

➢Probability Theory ➢Examples

➢Summary Statistics

➢Gaussians ➢Mixture of Gaussians

➢Anomaly Detection ➢Performance Metrics

➢Precision and Recall ➢Confusion Matrix ➢ROC and AUC

Theme:

Measuring Uncertainty

50

Example: Random Variables (discrete)

P(X=3|Y=2) =5/20 Conditional Distribution

2

Y

1

P(X=5) Marginal Distribution

=(4+5)/35

P(X=2,Y=1) Joint Distribution

=3/35

51

12

34567

X

Example: Marginal Distribution of X

2

Y

1

P(X,Y) – Joint Distribution P(X) – Marginal Distribution

1234567 1234567

XX 8/35

8/35

1/35

1/35

6/35

6/35

3/35

X

52

Example: Marginal Distribution of Y

P(X,Y) – Joint Distribution 22

YY 11

P(Y) – Marginal Distribution

12

34567

X

Y

20/35 15/35

53

Example: Conditional Distribution X|Y

P(X,Y) – Joint Distribution

P(X|Y=1) – Conditional Distribution

2

Y

1

2

Y

1

34567 12

12

34567

X

X

2/15 2/15

3/15 0/15

4/15

X

54

2/15

1/15

Example: From 2 to 3 Random Variables

P(X,Y,Z)

Z=2 Z=1

P(X,Y)

2

Y

1

2

Y

1

2

1234567 Z11234567

XX

P(X=3,Y=2)

=4/35

P(X=3,Y=2) Marginal Distribution

=4/35

P(X=5,Y=2,Z=1) Joint Distribution

=3/35

55

Example: Continuous Distributions

p(x=2.75|y=2.20) Conditional Distribution

y

2

1

p(y=2.20)

Marginal Distribution

➢Probability at a point is

meaningless ➢Needs to add up

to 1 ➢Consider areas

12

34567

p(x=3.75,y=1.40) Joint Distribution

p(x=5.2)

Marginal Distribution

x

56

Probability with Real Data

pd.crosstab(df[‘Pclass’],df[‘Sex’])

Counts

Sex

0

1

All

1

80

136

216

2

97

87

184

3

372

119

491

All

549

342

891

Probabilities

Sex

0

1

All

1

0.090

0.153

0.242

2

0.109

0.098

0.207

3

0.418

0.134

0.551

All

0.616

0.384

1.000

pd.crosstab(df[‘Pclass’],df[‘Sex’])/891

57

Class

Class

Probability with Real Data

Counts

Sex

Probabilities

1 2 3 All

0.090 0.153 0.242 0.109 0.098 0.207 0.418 0.134 0.551 0.616 0.384 1.000

0

1 All

Q: What is the probability of a random passenger from the dataset belonging to the 3rd class?

Given that there were 70 first-class male passengers who survived and 10 first-class male passengers who did not, what is the probability of survival for first- class male passengers?

Q: What is the probability of there being a first-class male passenger that survived?

58

1 80 136 216 10 dead

70 survived

2 97 87 184

3 372 119 491 All 549 342 891

Sex

0 1 All

Class Class

Example: Permutations and Combinations

Permutation ➢arrangement of items in

which order matters Combination

➢selection of items in which order does not matter

n – number of items in a set

r – number of items selected from the set

Example: Permutations (n=5, r=3)

Q: How many ways can we fill 3 positions in a company using a pool of 5 applicants?

Q: How many ways can we fill 3 positions in a start-up using a pool of 5 applicants if each person can potentially be given more than one position?

60

Example: Combinations (n=5, r=3)

Q: How many ways can we select 3 people from a pool of 5 applicants to give them a tour of the company?

Q: How many ways can we distribute three identical coins among 5 individuals?

61

Let us Summarize…

62

Why Probability?

➢Probability theory is a mathematical framework for quantifying our uncertainty about the world.

➢It allows us (and our software) to reason effectively in situations where being certain is impossible.

➢Probability theory is at the foundation of many machine learning algorithms.

63

Perspectives on Probability

➢Objectivist perspective: randomness is fundamental to the universe.

➢They would say that the probability of a fair coin coming up heads is 0.5, because that’s the nature of fair coins.

➢Subjectivist perspective: probabilities represent our degree of belief that an event will occur.

➢If we knew the initial position of the coin and how the force was applied, then we could determine with certainty if it would come up heads or tails.

➢Under this perspective, probability is a measure of our ignorance (like not knowing how the force is applied to the coin).

64

Frequentists

➢Frequentist’s position: estimations come from experiments and experiments only.

➢e.g., if we want to estimate how likely a six-sided die is to roll a 4, we should roll the die many times and observe how frequently 4 appears.

➢This method works well when we have a large amount of data, but with fewer examples we can’t be confident in our estimates.

➢If we haven’t seen a 4 after five rolls, does that mean a 4 is impossible?

➢The other issue is that we can’t inject any of our prior knowledge about dice into our estimates. If we knew the die was fair, not seeing a 4 in the first five rolls is completely understandable.

➢Bayesian perspective allows us to combine our prior beliefs with our observations

65

Bayesian vs Frequentist

➢For example: Imagine that a coin we believe to be fair is flipped three times and results in three heads.

➢Frequentist calculation would suggest the coin is loaded (although with low confidence)

➢Bayesian our prior knowledge that the coin is fair allows us to maintain some degree of belief that a tails is still possible.

➢The actual mechanics of how we combine our prior belief relies on something called Bayes’ rule, which will be covered later.

66

Mathematical Framework

➢Probability theory is a mathematical framework.

➢As with any mathematical framework there is some vocabulary and important rules needed to fully leverage the theory as a tool for machine learning.

67

Probability Spaces

➢Probability is all about the possibility of various outcomes. The set of all possible outcomes is called the sample space.

➢e.g., sample space for coin flip is {heads, tails}.

➢e.g., the sample space for the temperature of water is all

values between the freezing and boiling point.

➢Only one outcome in the sample space is possible at a time, and the sample space must contain all possible values.

68

Random Variables

➢Are variables which randomly takes on values (discrete or continuous) from a sample space.

➢Probability of any event has to be between 0 (impossible) and 1 (certain), and the sum of the probabilities of all events should be 1.

0≤𝑝𝑥≤1

𝑝𝑥=1 𝑥

69

Discrete Probabilities

➢Discrete random variables are described with a probability mass function (PMF).

➢PMF maps each value in the variable’s sample space to a probability.

➢e.g., PMF for a loaded die and how does it compare with a normal die

PMF Loaded Die

X

PMF Regular Die

X

70

P(X) P(X)

Probability Distributions

➢Common discrete distribution is the Bernoulli.

—A Bernoulli distribution specifies the probability for a random variable which can take on one of two values.

—e.g., heads or tails

—We can specify the entire distribution with a single parameter p, the probability of the positive outcome.

—e.g., for a fair coin we have p = 0.5,

—e.g., given the probability of rain is p = 0.2, we can infer the probability of no rain is 0.8.

➢Other common discrete distributions are the binomial (e.g., handles multiple tosses of a coin) and multinomial distributions (e.g., rolling a die), and Poisson (events occurring in fixed interval of time).

71

Types of Probabilities

➢Joint Probability

➢a joint distribution over two random variables x, y specifies the probability of any setting of the random variables.

➢Marginal Probability

➢called the marginal probability distribution, since we’ve “marginalized” away the random variable y (uses the sum rule).

➢Conditional Probability

➢the probability of an event given that another event has already been observed.

72

Bayes’ Theorem

➢Product Rule:

P(x, y) = P(x|y) ⋅ P(y).

➢We can write the product rule for two variables in two equivalent ways:

P(x, y) = P(y|x) ⋅ P(x)

➢By setting both equations equal and divide by P(y), we get Bayes’ rule:

Note Bayes’s rule is crucially

important to much of statistics and machine learning. Driving force behind Bayesian statistics (Bayesian perspective).

This simple rule allows us to update our beliefs about quantities as we gather more observations from data.

73

Bayes’ Theorem Example (made-up numbers)

If a random person has a fever, what is the chance that it is COVID?

We know:

1. P(fever | COVID) = 25%: 25% of infected people have fever

2. P (COVID) = 2% : Fraction of world population having COVID

3. P (fever) = 1% : 1 in every 100 persons has fever

74

Bayes’ Theorem Example (made-up numbers)

If a random person feel tired, what is the chance that it is COVID?

We know:

1. P(tired | COVID) = 25%: 25% of infected people feel tired

2. P (COVID) = 2% : Fraction of world population having COVID

3. P (tired) = 25% : 1 in every 4 persons feels tired

Find the answer and discuss over Piazza, if you notice anything interesting…

75

Independence

➢Two variables x and y are said to be independent if P(x, y) = P(x) ⋅ P(y)

Q: Can you think of an example where this would happen?

76

Conditional Independence

➢Two variables x and y are called conditionally independent given another variable z if

P(x, y|z) = P(x|z) ⋅ P(y|z)

➢Q: Can you think of an example where this would happen?

77

Continuous Probabilities

➢Continuous random variables are described by probability density functions (PDF) which can be a bit more difficult to understand.

➢PDFs map an infinite sample space to relative likelihood values.

➢To understand this, let’s look at an example with one of the most famous continuous distributions, the Gaussian (aka Normal) distribution.

y

x

78

Gaussian Distribution

➢The Gaussian distribution is parameterized by two values: the mean μ (mu) and variance σ2 (sigma squared).

➢The mean specifies the center of the distribution, and the variance specifies the width of the distribution.

➢You may have also heard about the standard deviation σ, which is just the square root of the variance.

PDF of Gaussian

variance

mean

x

79

density

Relative Likelihood

➢The value of the PDF is not the actual probability of x.

➢Remember, the total probability for every possible value needs to sum to 1.

➢Q: How can we sum over infinite number of values?

➢A: Need to calculate the area under the PDF to obtain the probability

Since we are interested in the area, it is often more useful to work with a continuous random variable’s cumulative density function (CDF).

PDF of Gaussian

Area

1

p(0x 1)=f (x)dx

0

80

Relative Likelihood Con’t

➢We just determined that the area corresponds to the probability, so F(x) gives us P(X≤x).

➢Use the CDF to determine the probability of any given range [a,b] using P(a≤X≤b) = F(b)-F(a).

CDF

Cumulative distribution function

PDF

Probability density function

➢Note that asking for P(X=x) is equivalent to asking P(x≤X≤x) = F(x)-F(x) = 0

81

Functions of Random Variables

➢Often useful to create functions which take random variables as input.

➢e.g., it costs $2 to play the game, “guess a number between 1 and 10”. Correct guess = $10, Incorrect guess = $0, but it costs $2 to play. Let x be a random variable indicating whether you guessed correctly. We can write a function:

h(x) = {$8 if x = 1, and -$2 if x = 0}

➢You may be interested in knowing in advance what the expected outcome will be.

82

Expectation

➢The expected value, or expectation, of a function h(x) on a random variable x ~ P(x) is the average value of h(x) weighted by P(x). For a discrete x, we write this as:

➢The expectation acts as a weighted average over h(x), where the weights are the probabilities of each x.

e.g., 𝔼[h(x)] = P(winning) ⋅ h(winning) + P(loosing) ⋅ h(loosing) =(1/10) ⋅ $8 + (9/10) ⋅ (-$2) = $0.80 + (-$1.80)

= -$1

If x had been continuous, we would replace the summation with an integral

On average, we’ll lose $1 every time we play!

83

Expectation

➢A nice property of expectations is that they’re linear.

➢Let’s assume h and g are functions of x, and α and β are constants. Then we have:

84

Variance

➢We saw variance with respect to a Gaussian distribution when we were talking about continuous random variables. In general, variance is a measure of how much random values vary from their mean.

➢Similarly, for functions of random variables, the variance is a measure of the variability of the function’s output from its expected value.

85

Describing a Gaussian Distribution

86

Multivariate Data

➢Multiple measurements (e.g., different sensors) ➢d features/attributes (e.g., number of sensors) ➢N instances/observations/examples

columns => features for a single instance

rows => instances

87

Describing Multivariate Datasets

➢What do all these graphs have in common?

Covariance

➢Variances along each axis remain constant, but properties of the dataset change

➢Variances insufficient to characterize the relationship/correlation of two random variables -> we need cross-variance!

89

Covariance Matrix

➢Expectation (mean): centre of the dataset

➢Covariance: “variance” of a d-dimensional random variable is given by a covariance matrix.

Multivariate Gaussian Distribution

f

Determinant of Multivariate Covariance Matrix Multivariate Covariance Matrix Sample Mean

Covariance

➢When the absolute value of covariance is high, the two variables tend to vary far from their means at the same time.

➢When the sign of the covariance is positive, the two functions map to higher values together.

➢When the sign of the covariance is negative, the one function maps to higher values, the other maps to lower values (or vice versa)

Positive Covariance

Negative Covariance

92

Bivariate Normal

93

Bivariate Normal

94

Bivariate Normal

95

Bivariate Normal

96

97

Confidence

f

d = 1 (univariate)

d = 2 (bivariate)

98

Example: Calculate the Covariance Matrix

Joint Probability P(X1,X2)

X2

P(X1)

X1

P(X2)

01

-1 0.24 0.06 0.3

0 0.16 0.14 0.3 1 0.40 0 0.4

compute the mean

0.8 0.2 1

(-1)(0.3) + (0)(0.3) + (1)(0.4) = 0.1

(0)(0.8) + (1)(0.2) = 0.2

Example: Calculate the Covariance Matrix

Joint Probability P(X1,X2)

X2

P(X1)

X1

01

-1 0.24 0.06 0.3

0 0.16 0.14 0.3 1 0.40 0 0.4

P(X2) 0.8 compute the covariance

0.2 1

(-1-0.1)2(0.3) + (0-0.1)2(0.3) + (1-0.1)2(0.4) = 0.69

(0-0.2)2(0.8) + (1-0.2)2(0.2) = 0.16

(-1-0.1)(0-0.2)(0.24) + (-1-0.1)(1-0.2)(0.06) + (0-0.1)(0-0.2)(0.16)

+ (0-0.1)(1-0.2)(0.14) + (1-0.1)(0-0.2)(0.40) + (1-0.1)(1-0.2)(0) = -0.08

Example: Calculate the Covariance Matrix

Joint Probability P(X1,X2)

X2

P(X1)

X1

01

-1 0.24 0.06 0.3

0 0.16 0.14 0.3

1 0.40 0 0.4 P(X2) 0.8 0.2 1

putting it all together

Example: Calculate the Covariance Matrix

Joint Probability P(X1,X2)

X2

P(X1)

0

1

X1

-1

0.24

0.06

0.3

0

0.16

0.14

0.3

1

0.40

0

0.4

P(X2)

0.8

0.2

1

➢Normalizing the covariance matrix will give you the population correlation coefficient, a measure of how correlated two variables are.

Example: Visualization

f

Multivariate Mean

Bivariate Gaussian Example (See Sample Code)

X2

Covariance Matrix

Determinant of Covariance Matrix

0.104

Determinant Calculation (bivariate example)

(0.69 * 0.16) – (-0.08*-0.08)

Multivariate Sample

Determinant will be covered in week 7

X1

Gaussian Mixture Models (GMM)

104

Gaussian Mixture Models (GMM)

Predicted Loss (Error)

Source: Scikit-learn

➢Can be extended to multivariate data (e.g., bivariate GMM)

105

Anomaly Detection (Semi-Supervised)

What is the goal?

Time of day

4 AM 9 PM

9 AM

1$ 1M$

GMM

Transaction $

107

Outlier

108

Google Colab (Code Example)

Part 3 Performance Metrics

Why Performance?

➢Q: Why do we care about performance?

➢Identifying how well our models are doing is not trivial. Easy to fall into the trap of believing (or wanting to believe) our models are working based on weak assessment.

111

How to measure the performance of a model?

➢Assume a case where we want to detect outliers and we know: ➢Dataset has 100 points

➢98 are non-outlier ➢2 are outliers

➢If we detect all the points as non-outliers:

➢98 True predictions/100 = 98% accuracy for a model that is not working

➢Q: How can we improve our performance measurements?

112

Fraud Detection System

Model

Transaction Features

Positive: Fraud

Negative: Valid

113

Fraud Detection System

(Positive = Fraud)

➢If transaction is Valid:

➢Prediction : Valid (True Negative) ➢Prediction : Fraud (False Positive)

➢If transaction is Fraud:

➢Prediction : Fraud (True Positive) ➢Prediction : Valid (False Negative)

114

Precision and Recall

➢If transaction is Valid:

➢Prediction : Valid (True Negative) OK! ➢Prediction : Fraud (False Positive) Not that bad!

➢If transaction is Fraud:

➢Prediction : Fraud (True Positive) GOOD!

➢Prediction : Valid (False Negative) Super BAD!

P:Valid

P:Fraud

TP TP

(MISS)

(MISTAKE)

Fraud

TP+FP

TP+FN

Valid

115

Confusion Matrix

Actual Value

(as confirmed by experiments)

➢Table used to describe prediction performance on a set of test data

positives

TP

True Positive

FN

False Negative

negatives

FP

False Positive

TN

True Negative

Predicted Value

(predicted by the test)

negatives positives

116

F1 score

𝐹1 = 2 ∗ 𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 ∗ 𝑟𝑒𝑐𝑎𝑙𝑙 𝑝𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 + 𝑟𝑒𝑐𝑎𝑙𝑙

➢A balanced measure of accuracy giving equal importance to recall and precision

➢The highest possible value of F1 is 1, and lowest possible value is 0

117

Confusion Matrix (by heart)

Recall

False positive rate

Precision

https://en.wikipedia.org/wiki/Confusion_matrix

118

Confusion Matrix

https://en.wikipedia.org/wiki/Confusion_matrix

119

Precision and Recall – tug of war

➢ Application – spam detection

➢ Improving precision usually reduces recall, vice-versa

You’re saying everything on this side IS NOT spam (N) You’re saying everything on this side IS spam (P)

Model output

What is the precision (emails flagged as spam)? tp/(tp+fp) = 8/(8+2) = 0.8

What is the recall (actual spam correctly classified)? tp/(tp+fn) = 8/(8+3) = 0.73 Source: ML Crash Course

120

Precision and Recall – tug of war

➢ Application – spam detection

➢ Improving precision usually reduces recall, vice-versa

You’re saying everything on this side IS NOT spam (N) You’re saying everything on this side IS spam (P)

Model output

What is the precision (emails flagged as spam)? tp/(tp+fp) = 7/(7+1) = 0.88 What is the recall (actual spam correctly classified)? tp/(tp+fn) = 7/(7+4) = 0.64

Source: ML Crash Course

121

how do we choose a threshold?

ROC (Receiver Operating Characteristic) Curve

You want this to be high

You want this to be low

Source: Wikipedia 122

ROC Curve

Closest point to the top left = best accuracy

You want this to be high

ROC:

Receiver Operating Characteristic (plot for binary classifiers)

Q: What would be a perfect classifier?

Random classifier

You want this to be low 123

ROC Curve

Negative Positive

TPR=1 FPR=0.95

FP TN

Negative

Positive

124

ROC Curve

TPR=1 FPR=0.8

Negative Positive

TN

FP

125

ROC Curve

Negative Positive

TPR=1 FPR=0.05

TN

FP

126

ROC Curve

Negative

Positive

TPR=1 FPR=0

127

ROC Curve

Negative

Positive

TPR=0.7 FPR=0

TN

FN

128

AUC (Area Under the Curve)

Source:

129

Data Selection

➢How we select our data is another aspect that is often overlooked.

➢Can have serious effects on the prediction’s performance.

➢Q: What are some issues with arbitrary data selection?

130

Splitting the Dataset

➢Training, Validation and Testing

60-80% 10-20% 10-20%

➢More training data => better model

➢More testing data => more confidence in assessment ➢Cross-Validation is an attempt to have both…

131

Independent and Identically Distributed

➢It is important that the generative process of data is the same for all data and the process has no memory of past generated samples.

➢Sampling of data needs to be independent

132

Limited Data

➢There are often situations where it is difficult to obtain sufficient data to train a machine learning algorithm.

➢For example: A common practice with medical data is to increase the sample size by obtaining multiple samples from the same patient.

➢Q: What are some concerns resulting from this?

133

Generalization

➢We have seen previously that correctly predicting on new data is what we’re interested in.

➢Golden Rule: No model selection decisions should come from the test data!

134

Next Time

➢Week 4 Q/A Support Session on 30 Sep ➢Project Questions

➢Project 1 is due on 1 Oct. at 21:00 ➢Reading assignment 4 (due 4 Oct at 21:00):

➢ -Read pages 33-41 (page numbers from the pdf file) Section 2.3 in Chapter 2 of “Mathematics for Machine Learning” by . Deisenroth et al., 2020 link

➢Week 5 Lecture – Data Processing ➢Linear Algebra

➢Analytical Geometry ➢Data Augmentation

135