# CS373 Data Mining and Machine Learning

Lecture 9

Jean Honorio

Purdue University


Goal of machine learning

• Use algorithms that will perform well on unseen data

• How do we measure performance?

• How do we use unseen data?

• How do we account for variability?

• By-product: a way to set hyper-parameters

- e.g., C for SVMs, λ for kernel ridge regression, etc.

Measures of Performance: Regression

• Assume that for a point x, we predict g(x)

• Mean square error: MSE(g) = (1/n) ∑_{i=1}^n (g(x_i) − y_i)²

• Root mean square error: RMSE(g) = √MSE(g)

• Mean absolute error: MAE(g) = (1/n) ∑_{i=1}^n |g(x_i) − y_i|
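A minimal NumPy sketch of the three measures; the toy predictions below are made up for illustration:

```python
import numpy as np

def mse(y_pred, y_true):
    """Mean square error: average squared residual."""
    return np.mean((y_pred - y_true) ** 2)

def rmse(y_pred, y_true):
    """Root mean square error: square root of the MSE."""
    return np.sqrt(mse(y_pred, y_true))

def mae(y_pred, y_true):
    """Mean absolute error: average absolute residual."""
    return np.mean(np.abs(y_pred - y_true))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])
print(mse(y_pred, y_true))   # 0.125
print(rmse(y_pred, y_true))
print(mae(y_pred, y_true))   # 0.25
```

Note that RMSE is in the same units as y, which often makes it easier to interpret than MSE.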

Measures of Performance: Classification

• True Positive (TP), True Negative (TN), False Positive (FP), False Negative (FN)

Confusion matrix (rows: predicted label, columns: true label):

                 True +1   True -1
Predicted +1       TP        FP
Predicted -1       FN        TN

• Accuracy = (TP + TN) / (TP + FP + FN + TN)

• Error = (FP + FN) / (TP + FP + FN + TN)

• Recall / Sensitivity = TP / (TP + FN)

• Precision = TP / (TP + FP)

• Specificity = TN / (TN + FP)

• Use jointly: (Precision, Recall) or (Sensitivity, Specificity)
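The counts and the derived measures can be sketched as follows; the labels below are a made-up toy example:

```python
import numpy as np

def confusion_counts(y_pred, y_true):
    """TP, TN, FP, FN for binary labels in {+1, -1}."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == -1) & (y_true == -1))
    fp = np.sum((y_pred == 1) & (y_true == -1))
    fn = np.sum((y_pred == -1) & (y_true == 1))
    return tp, tn, fp, fn

y_true = np.array([1, 1, 1, -1, -1, -1])
y_pred = np.array([1, 1, -1, -1, -1, 1])
tp, tn, fp, fn = confusion_counts(y_pred, y_true)

accuracy    = (tp + tn) / (tp + fp + fn + tn)
error       = (fp + fn) / (tp + fp + fn + tn)
recall      = tp / (tp + fn)      # a.k.a. sensitivity
precision   = tp / (tp + fp)
specificity = tn / (tn + fp)
```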

Precision and Recall

• Idea comes from information retrieval

[Figure: precision and recall illustration, from Wikipedia]

Sensitivity and Specificity

• Idea comes from signal detection theory

• Assume Gaussian class-conditional distributions:

p(x | y = +1) = N(µ1, σ²)
p(x | y = −1) = N(µ2, σ²)

• Classifier (assuming µ1 < µ2):

g(x) = +1 if x < θ0, −1 if x ≥ θ0

• By sliding the offset θ0 we get different (TP, FP, TN, FN), and thus different sensitivity and specificity

Pattern Classification, Duda et al., 2nd Ed., Chapter 2
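Under the Gaussian assumption, sensitivity and specificity have closed forms via the normal CDF. A sketch of the trade-off as θ0 slides; the values µ1 = 0, µ2 = 2, σ = 1 are made up for illustration:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Illustrative class-conditional Gaussians (mu1 < mu2, shared sigma)
mu1, mu2, sigma = 0.0, 2.0, 1.0

def sensitivity(theta0):
    # P(predict +1 | y = +1) = P(x < theta0) under N(mu1, sigma^2)
    return phi((theta0 - mu1) / sigma)

def specificity(theta0):
    # P(predict -1 | y = -1) = P(x >= theta0) under N(mu2, sigma^2)
    return 1.0 - phi((theta0 - mu2) / sigma)

# Sliding theta0 trades sensitivity against specificity
for theta0 in (0.5, 1.0, 1.5):
    print(theta0, sensitivity(theta0), specificity(theta0))
```

At θ0 halfway between the means, sensitivity and specificity are equal; moving θ0 raises one at the expense of the other.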
Receiver Operating Characteristic (ROC)

• By varying the hyper-parameter of a classifier (e.g., C for SVM with linear kernel, β for SVM with RBF kernels) we can get different:

- Sensitivity
- Specificity

• Summarized with an Area Under the Curve (AUC)

- Random: 0.5
- Perfect classifier: 1

[Figure: ROC curves for an SVM with linear kernel, an SVM with RBF kernels, and logistic regression, from The Elements of Statistical Learning, Hastie et al.]
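An ROC curve can also be traced by sweeping a decision threshold over real-valued scores (rather than retraining with different hyper-parameters, as the slide describes). A sketch with made-up scores, computing AUC by the trapezoidal rule:

```python
import numpy as np

def roc_auc(scores, y_true):
    """Empirical ROC by sweeping a decision threshold over the scores
    (higher score = more confident the label is +1); AUC by the
    trapezoidal rule.  y_true takes values in {+1, -1}."""
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == -1)
    tpr, fpr = [1.0], [1.0]                  # threshold below all scores
    for thr in np.sort(np.unique(scores)):   # raising thr shrinks both rates
        pred = np.where(scores > thr, 1, -1)
        tpr.append(np.sum((pred == 1) & (y_true == 1)) / n_pos)
        fpr.append(np.sum((pred == 1) & (y_true == -1)) / n_neg)
    xs, ys = fpr[::-1], tpr[::-1]            # reverse so fpr is increasing
    return sum((xs[i + 1] - xs[i]) * (ys[i + 1] + ys[i]) / 2
               for i in range(len(xs) - 1))

scores = np.array([0.9, 0.8, 0.3, 0.1])
labels = np.array([1, 1, -1, -1])
print(roc_auc(scores, labels))  # 1.0: positives perfectly separated
```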
Other Loss Functions

• Let +1 mean “diseased patient” and -1 mean “healthy patient”

0-1 cost matrix (rows: predicted label, columns: true label):

                 True +1   True -1
Predicted +1        0         1
Predicted -1        1         0

Average 0-1 loss: (1/n) ∑_{i=1}^n 1[g(x_i) ≠ y_i]

Asymmetric cost matrix, where missing a diseased patient costs 10:

                 True +1   True -1
Predicted +1        0         1
Predicted -1       10         0

Average cost: (1/n) ∑_{i=1}^n Cost(g(x_i), y_i)
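The average cost under a given cost matrix can be sketched as follows; the predictions below are made up, and the asymmetric matrix charges 10 for predicting −1 when the true label is +1 (a missed diseased patient):

```python
import numpy as np

# Cost matrices indexed by (predicted, true) label
zero_one_cost   = {(1, 1): 0, (-1, -1): 0, (1, -1): 1, (-1, 1): 1}
asymmetric_cost = {(1, 1): 0, (-1, -1): 0, (1, -1): 1, (-1, 1): 10}

def empirical_risk(y_pred, y_true, cost):
    """(1/n) * sum of Cost(g(x_i), y_i)."""
    return np.mean([cost[(p, t)] for p, t in zip(y_pred, y_true)])

y_true = [1, 1, -1, -1]
y_pred = [1, -1, -1, 1]   # one false negative, one false positive
print(empirical_risk(y_pred, y_true, zero_one_cost))    # 0.5
print(empirical_risk(y_pred, y_true, asymmetric_cost))  # 2.75
```

The same mistakes cost (1 + 1)/4 = 0.5 under 0-1 loss but (10 + 1)/4 = 2.75 under the asymmetric matrix, since the false negative dominates.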
2) Using “Unseen” Data

• Overfitting:

- Bigger model class, better fit on the training data (linear to quadratic to cubic)

- Hyper-parameters can be tuned to better fit the training data

- Usually poor performance on unseen data

• To prevent overfitting, how can we “see” unseen data?

- Simulate it!

Pattern Classification, Duda et al., 2nd Ed., Chapter 9
Training, Validation, Testing

• Three data sets:

- Training set: fit the models
- Validation set: try different algorithms and different hyper-parameters
- Test set: report measures
k-Fold Cross Validation
• Split training data D into k disjoint sets S1,…,Sk
- Either randomly, or in a fixed fashion
- If D has n samples, then each fold has approximately n / k samples
- Popular choices: k=5, k=10, k=n (leave-one-out)
• For i = 1…k:
train with sets S1,…,Si–1, Si+1,…,Sk
test on set Si
let Mi be the test measure (e.g., accuracy, MSE)
• Mean and variance are:

µ̂ = (1/k) ∑_{i=1}^k M_i        σ̂² = (1/k) ∑_{i=1}^k (M_i − µ̂)²
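The loop above can be sketched generically; the `train_and_eval` callback and the predict-the-training-mean toy model below are made up for illustration:

```python
import numpy as np

def k_fold_cv(X, y, train_and_eval, k=5, seed=0):
    """Generic k-fold cross validation.  train_and_eval(X_tr, y_tr, X_te, y_te)
    must return a test measure M_i (e.g., accuracy or MSE)."""
    n = len(y)
    idx = np.random.RandomState(seed).permutation(n)
    folds = np.array_split(idx, k)           # k disjoint sets S_1..S_k
    measures = []
    for i in range(k):
        test_idx = folds[i]                  # test on S_i
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        measures.append(train_and_eval(X[train_idx], y[train_idx],
                                       X[test_idx], y[test_idx]))
    measures = np.array(measures)
    return measures.mean(), measures.var()   # mu_hat, sigma_hat^2

# Toy "model": predict the training mean, report test MSE
X = np.arange(20, dtype=float).reshape(-1, 1)
y = np.arange(20, dtype=float)
mean_mse, var_mse = k_fold_cv(
    X, y, lambda Xtr, ytr, Xte, yte: np.mean((ytr.mean() - yte) ** 2), k=5)
```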
0.632 Bootstrapping

• Let B > 0, and n be the number of training samples in D

• For i = 1…B:

Pick n samples from D with replacement, call it Si
(Si might contain the same sample more than once)
train with set Si
test on the remaining samples (D – Si)
let Mi be the test measure (e.g., accuracy, MSE)

• Mean and variance are:

µ̂ = (1/B) ∑_{i=1}^B M_i        σ̂² = (1/B) ∑_{i=1}^B (M_i − µ̂)²
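A sketch of the resampling loop, again with a made-up `train_and_eval` callback (the toy model predicts the training mean and reports held-out MSE):

```python
import numpy as np

def bootstrap_resample(X, y, train_and_eval, B=100, seed=0):
    """Bootstrap resampling of a test measure: train on n samples drawn
    with replacement, test on the samples left out (D - S_i)."""
    rng = np.random.RandomState(seed)
    n = len(y)
    measures = []
    for _ in range(B):
        boot = rng.randint(0, n, size=n)         # S_i: n draws with replacement
        out = np.setdiff1d(np.arange(n), boot)   # held-out samples D - S_i
        if len(out) == 0:                        # extremely unlikely for large n
            continue
        measures.append(train_and_eval(X[boot], y[boot], X[out], y[out]))
    measures = np.array(measures)
    return measures.mean(), measures.var()

X = np.arange(30, dtype=float).reshape(-1, 1)
y = np.arange(30, dtype=float)
mu_hat, var_hat = bootstrap_resample(
    X, y, lambda Xtr, ytr, Xte, yte: np.mean((ytr.mean() - yte) ** 2))
```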

0.632 Bootstrapping

• Why 0.632?

• Recall that:

- We pick n items with replacement out of n items

- We choose uniformly at random

• The probability of:

- not picking one particular item in 1 draw is 1 − 1/n

- not picking one particular item in n draws is (1 − 1/n)ⁿ

- picking one particular item in n draws is 1 − (1 − 1/n)ⁿ

• Finally: lim_{n→∞} 1 − (1 − 1/n)ⁿ = 1 − 1/e ≈ 0.632
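The limit is easy to check numerically, so each bootstrap sample contains roughly 63.2% of the distinct training points:

```python
from math import e

# Probability that a particular item appears among n draws with replacement:
# 1 - (1 - 1/n)^n, which approaches 1 - 1/e as n grows
for n in (10, 100, 10000):
    print(n, 1 - (1 - 1 / n) ** n)

print(1 - 1 / e)  # the limit, approximately 0.6321
```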

3) Variability

• How to compare two algorithms?

- Not only means, but also variances!

• Statistical hypothesis testing

Statistical Hypothesis Testing

• How to compare two algorithms?

- Not only means, but also variances!

• Let µ̂1, σ̂1² and µ̂2, σ̂2² be the mean and variance of algorithms 1 and 2.

• When to reject the null hypothesis µ1 = µ2 in favor of µ1 > µ2?

• Let:

x = (µ̂1 − µ̂2) √n / √(σ̂1² + σ̂2²)

ν = (σ̂1² + σ̂2²)² (n − 1) / (σ̂1⁴ + σ̂2⁴)

where ν is the number of degrees of freedom of Student's t-distribution

Statistical Hypothesis Testing

• Student's t-distribution: [Figure: probability density function (PDF) and cumulative distribution function (CDF), from Wikipedia]

• For significance level α and degrees of freedom ν:

- Find the value x_{1−α,ν} for which the CDF equals 1 − α

- Python: from scipy.stats import t
  t.ppf(1 - alpha, v)

• If x > x_{1−α,ν}, reject the null hypothesis µ1 = µ2 in favor of µ1 > µ2
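Putting the two slides together, the whole test can be sketched in a few lines; the per-fold means and variances below are made-up numbers, not real results:

```python
import numpy as np
from scipy.stats import t

def compare_algorithms(mu1, var1, mu2, var2, n, alpha=0.05):
    """Test H0: mu1 = mu2 against H1: mu1 > mu2, given the estimated
    means/variances of two algorithms over n folds (formulas from the
    slides above)."""
    x = (mu1 - mu2) * np.sqrt(n) / np.sqrt(var1 + var2)
    v = (var1 + var2) ** 2 * (n - 1) / (var1 ** 2 + var2 ** 2)
    critical = t.ppf(1 - alpha, v)   # x_{1-alpha, v}
    return x, v, x > critical

# Made-up accuracies: algorithm 1 looks better, but is it significant?
x, v, reject = compare_algorithms(0.85, 0.01, 0.80, 0.01, n=10)
print(x, v, reject)
```

With these numbers the statistic falls below the critical value, so the apparent 5-point gap is not significant at the 0.05 level, which is exactly why variances matter and not just means.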

What is a Sample?

• In this lecture we assume that each sample is a different “unit of interest” for the experimenter

• Never sample the same “unit of interest” several times

- In a medical application, we might be interested in knowing the accuracy (and variance) with respect to patients.

- Taking two visits of the same patient as two different samples would be incorrect.

• Collect more data, if necessary

• Never duplicate (copy & paste) data.