# ECE 657A: Classification – Lecture 5: Classification, Training, Validation and Estimation

Mark Crowley

January 24, 2016

Mark Crowley ECE 657A : Lecture 5 January 24, 2016 1 / 34

Class Admin Announcements

Today’s Class

Announcements

Supervised Learning / Classification

Project Pitch Session

Break

Naive Bayes and MAP estimation

Training, validation, error estimation


Class Admin Announcements

Announcements

Today (Feb 1): Project Pitch Session.

Assignment 1 due : Due Feb 10 at 11:59pm (do in groups)

Next week (Feb 8):

I’ll be travelling Feb 3-9

11:30am TA help session

Guest lecture on Data Analytics tools for R.

Oxt week (Feb 15): TA help session for the assignment or project (11:30-12:50) and guest lecture on Data Analytics tools for R by one of Prof. Fischmeister’s students (1-2:2).


http://www.urbandictionary.com/define.php?term=oxt

Class Admin Pitch Session

Pitch Session : Structure

Your group will be matched with two other groups – see table.

Your group will have 5-10 minutes to pitch your idea to the two other groups.

You can use any materials you want: I’ll have flip chart paper, some can use the chalkboard or whiteboard, you could use a laptop, or you could just talk.

Then 5-10 minutes of discussion and feedback.

Then move on to the next group.


Class Admin Pitch Session

Pitch Session : Grading

You will not be graded on your pitch quality or your topic.

You will be graded on your feedback for others (1.5% of the 4% peer review grade).

Peer review forms:

Explain the project idea: you need to provide a concise, fairly accurate explanation of their idea (shows you listened).

Do you think the project is too easy, too hard or just right?

Give some substantial, constructive suggestion for their project.

Explain how your suggestion would improve the project’s ability to teach the class about something new.

Submit form online : https://goo.gl/forms/YpMlLC2qAsC5YXJW2


Supervised vs. Unsupervised Learning

Outline

Class Admin

Announcements

Pitch Session

Supervised vs. Unsupervised Learning

Classification

Training, Validation and Error Estimation

Dividing the Data

Confusion Matrix

ROC and AUC Curves

Other Performance Measures

Hypothesis Testing


Supervised vs. Unsupervised Learning

Descriptive versus Inferential Analysis

We have data (samples). This data is a sample of a population (more than just the measured or observed sample).

Descriptive Analysis is the type of analysis and measures that seek to describe and summarize the data, the available samples. We cannot, in general, use it to interpret unobserved data.

Inferential Analysis (predictive) is the type of analysis that can describe measures over the population of data, that is, observed and unobserved.


Supervised vs. Unsupervised Learning

Another way to look at it: Machine Learning [Murphy, 2012]

Supervised Learning Also called: predictive, inferential

Given inputs and outputs

Model based: model or distribution is known

Parameter Estimation: distribution known, fitting parameters (e.g. linear regression, least squares estimation)

output is continuous: regression, prediction

output is categorical (label): classification

Unsupervised Learning Also called: descriptive, knowledge discovery

Given inputs only

output is categorical: clustering

Reinforcement Learning Input and output given only when an action (prediction, choice, activity) is provided.

Given feedback about the utility of the choice made.

No model or labels given ahead of time.

Learning is interactive.

Supervised vs. Unsupervised Learning

Clustering vs. Classification

Clustering

Unsupervised

Uses unlabeled data

Organize patterns w.r.t. an optimization criterion

Notion of similarity

Hard to evaluate

Examples: K-means, Fuzzy C-means, Hierarchical

Classification

Supervised

Uses labeled data

Requires training phase

Domain sensitive

Easy to evaluate

Examples: Bayesian, KNN, SVM, Decision Trees


Classification

Classification Definition

Definition:

Classification is a learning method that uses training samples (with known class labels) to learn how to assign samples to their proper classes. The task of the classifier is to use the feature vector provided by the feature extractor to assign the object (datapoint) to a category.

This learning can then be used to label test samples (with unknown labels).

Classification can be facilitated if the features (or a subset of them) of samples in the same class share characteristics that discriminate them from other classes.

It can be seen as the discrete form of prediction: you map the input values to a discrete set of output values rather than to a continuous number.


Classification

Classification

How do we design a classifier that can deal with the variability in feature values?

Designing or selecting a classifier for the data or application is not an obvious task.


Classification

Definition

Given a dataset X = {x1, x2, . . . , xn} where each datapoint xi ∈ X contains F features denoted xi,f,

and given a set of classes C = {C1, . . . , CK},

the classification problem is to define a mapping δ(xi) : X → C where each xi is assigned to one class.

A class, Cj, contains precisely those points mapped to it, no more and no less.

Note: the classes are predefined, nonoverlapping, and partition the entire set of datapoints.


Classification

Considerations for Designing a Classification Approach

1. Data: labelled, one class, multiclass, overlap, missing values

2. Training and Error Estimation procedures

3. Classification methods

4. Performance measures


Classification

Classifier Methods

Similarity Based Approach

Template matching

Nearest Mean classifier

Neural Network classifiers

Deep Learning

Other

Decision Trees

Random Forests

Probabilistic Approach

Bayesian Classification

Logistic regression

Decision Boundary Approach

Geometrical

Neural Networks

SVM


Training, Validation and Error Estimation Dividing the Data

Training and Error Estimation

Resubstitution

Holdout Method

Leave-one-out

K-fold Cross Validation

Bootstrap


Training, Validation and Error Estimation Dividing the Data

Error Estimation and Training Methods

Resubstitution Method: Uses all the available data for training and then the same data is used for testing.

Optimistically biased estimate, especially when the ratio of sample size to dimensionality is small.

Holdout Method: Uses half the data for training and the other half for testing; training and test sets are independent.

Pessimistically biased estimate.

Different partitionings may give different results.

Leave-one-out Method: A classifier is designed using (n-1) samples and evaluated on the remaining one sample; this is repeated n times with different training sets of size (n-1).

The estimate is unbiased but has a large variance.

Large computational requirement because n different classifiers have to be designed.


Training, Validation and Error Estimation Dividing the Data

Error Estimation Methods

K-fold cross validation (Rotation): divide the available samples into K disjoint subsets; use K-1 subsets for training and the remaining subset for testing. Can repeat the process with different partitionings.

The estimate has lower bias than the holdout method and is cheaper to compute than the leave-one-out method.

Bootstrap Method: Generate many bootstrap sets of size n by sampling with replacement.

Bootstrap estimates can have lower variance than the leave-one-out method.

Computationally more demanding.

Useful in small sample size situations.
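These splitting schemes can be sketched in a few lines; a minimal illustration using only the standard library (the function names here are my own, not from any particular package):

```python
import random

def k_fold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k disjoint folds for K-fold cross validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    # strided slices of the shuffled indices: disjoint and jointly cover all n
    return [idx[i::k] for i in range(k)]

def bootstrap_sample(n, seed=0):
    """Draw one bootstrap set: n indices sampled with replacement."""
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(n)]

folds = k_fold_indices(10, 5)
assert sorted(i for fold in folds for i in fold) == list(range(10))
boot = bootstrap_sample(10)
print(len(folds), len(boot))  # 5 10
```

Each index appears in exactly one fold (so each point is tested exactly once per partitioning), while a bootstrap set typically repeats some indices and omits others.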


Training, Validation and Error Estimation Dividing the Data

Validation

Sometimes we need to tune certain parameters of the classifier, such as k in the K Nearest Neighbours classifier or parameters that control the architecture of a neural network.

This can be done using a validation set, leaving the test set to test the optimal choice.

So the data is divided into 3 subsets: Training, Validation and Testing.
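A minimal sketch of the three-way split; the function name and the 60/20/20 fractions are illustrative choices, not prescribed by the course:

```python
import random

def train_val_test_split(data, frac_train=0.6, frac_val=0.2, seed=0):
    """Shuffle and split data into training, validation and test subsets."""
    d = list(data)
    random.Random(seed).shuffle(d)
    n = len(d)
    n_tr = int(frac_train * n)
    n_va = int(frac_val * n)
    # remaining items after training and validation become the test set
    return d[:n_tr], d[n_tr:n_tr + n_va], d[n_tr + n_va:]

tr, va, te = train_val_test_split(range(100))
print(len(tr), len(va), len(te))  # 60 20 20
```

Hyperparameters (e.g. k for KNN) are chosen on the validation set; the test set is touched only once, to report the final error estimate.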


Training, Validation and Error Estimation Confusion Matrix

Classification Confidence

For a binary classification problem (i.e. C = {C0, C1}) our decision rule could be

δ(x) = I(f(x) > τ)

where:

f(x) is a measure of confidence that x ∈ C1. This could be a probability, a distance or another function.

τ (“tau”) is a threshold parameter that is used to make a decision on which class to assign to each x.

For different values of τ we will get a different number of xi ∈ C1.
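The decision rule translates directly to code. Here `f` is a hypothetical stand-in confidence function (the identity, pretending the input is already a probability of membership in C1):

```python
def delta(x, f, tau):
    """Decision rule: return 1 (class C1) iff the confidence f(x) exceeds tau."""
    return 1 if f(x) > tau else 0

# hypothetical confidence function: the score is just x itself,
# standing in for a probability that x belongs to C1
f = lambda x: x

print(delta(0.8, f, 0.5))  # 1 (assigned to C1)
print(delta(0.3, f, 0.5))  # 0 (assigned to C0)
```

Raising `tau` makes the classifier more conservative: fewer points are assigned to C1.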


Performance Measures: Example

Example: 100 students, 30 Tall (P = 30) and 70 Not Tall (N = 70); P + N = 100.

The classifier labels 65 as Tall (TP = 20 true tall, FP = 45 false positives) and 35 as Not Tall (TN = 25 true not tall, FN = 10 false negatives).

True Value

Tall Not Tall

Classified Tall TP = 20 FP = 45 (65 total)

Classified Not Tall FN = 10 TN = 25 (35 total)

In information-retrieval terms: TP = retrieved and relevant; FN = relevant but not retrieved; FP = retrieved but not relevant; TN = not retrieved and not relevant.

Training, Validation and Error Estimation Confusion Matrix

Using a Confusion Matrix

For a particular value of τ we can build a confusion matrix:

True Value

1 0

Estimated

1 TP FP

0 FN TN

Sum N+ = TP + FN N− = FP + TN

TP: True Positive

FP: False Positive (False alarm)

TN: True Negative

FN: False Negative (Missed detection)


Training, Validation and Error Estimation Confusion Matrix

Using a Confusion Matrix

For a particular value of τ we can build a confusion matrix:

True Value

1 0

Estimated

1 TP FP

0 FN TN

Sum N+ = TP + FN N− = FP + TN

N = TP + FP + FN + TN

N̂+ = TP + FP

N̂− = FN + TN


Training, Validation and Error Estimation Confusion Matrix

Using a Confusion Matrix

For a particular value of τ we can build a confusion matrix:

True Value

1 0

Estimated

1 TP FP

0 FN TN

Sum N+ = TP + FN N− = FP + TN

Notes:

For more than one class you could build this for each class, whether

the point is in the class or not.


Training, Validation and Error Estimation Confusion Matrix

The False Positive vs False Negative Tradeoff

From this table we can also compute various success and error rates:

True Value

1 0

Estimated

1 TPR= TP/N+ FPR= FP/N−

0 FNR= FN/N+ TNR= TN/N−

TPR: True Positive Rate (Sensitivity, Recall, Hit rate)

FPR: False Positive Rate (False alarm rate, Type I Error)

TNR: True Negative Rate (Specificity)

FNR: False Negative Rate (Miss rate, Type II Error)

Remember, this depends on τ , so how do we find the best value of τ?


Performance Measures

Using the example counts (TP = 20, FN = 10, FP = 45, TN = 25, P = 30, N = 70):

Accuracy = (TP + TN) / (P + N) = (20 + 25) / 100 = 45% (overall recognition rate)

Error = (FN + FP) / (P + N) = 55% (misclassification rate)

Sensitivity or Recall (tp rate) = TP / P = TP / (TP + FN) = 20/30 = 66.6%: how many of the true +ve are classified +ve (completeness)

fp rate = FP / N = 45/70 = 64% (false alarm rate)

Specificity (tn rate) = TN / N = 25/70 = 36% = 1 − fp rate

fn rate = FN / P

Precision (positive predictive value) = TP / (TP + FP) = 20/65 = 31%: how many of the classified +ve are true positive (exactness)
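Plugging the example counts into these formulas gives a quick sanity check (a sketch; the variable names are mine):

```python
# the tall-students example: 100 students, 30 tall, 70 not tall
TP, FN, FP, TN = 20, 10, 45, 25
P, N = TP + FN, FP + TN      # 30 positives, 70 negatives

accuracy    = (TP + TN) / (P + N)   # 45/100 = 0.45
error       = (FN + FP) / (P + N)   # 55/100 = 0.55
recall      = TP / P                # 20/30, about 0.667 (sensitivity)
specificity = TN / N                # 25/70, about 0.357
fp_rate     = FP / N                # 45/70, about 0.643 (false alarm rate)
precision   = TP / (TP + FP)        # 20/65, about 0.308

print(round(accuracy, 2), round(recall, 3), round(precision, 3))
```

Note that `specificity + fp_rate == 1`, matching the identity above.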

Performance Measures: F-measure

F = 2 · Precision · Recall / (Precision + Recall) = 2 / (1/Precision + 1/Recall)

The F-measure is the harmonic mean of Precision and Recall. Close to 1 is better.

Performance Measures: Multiclass Averaging

For multiclass problems with C classes, there are 2 ways to calculate an overall average F-measure:

(a) Micro-averaged F-measure: calculate the global F-measure over all classes by pooling the counts:

Precision_micro = Σ_i TP_i / Σ_i (TP_i + FP_i), summed over i = 1..C

Recall_micro = Σ_i TP_i / Σ_i (TP_i + FN_i), summed over i = 1..C

F_micro_averaged = 2 · Precision_micro · Recall_micro / (Precision_micro + Recall_micro)

(b) Macro-averaged F-measure: compute the F-measure locally over each class and then average over all classes:

Precision_i = TP_i / (TP_i + FP_i), Recall_i = TP_i / (TP_i + FN_i)

F_i = 2 · Precision_i · Recall_i / (Precision_i + Recall_i)

F_macro_averaged = (1/C) Σ_i F_i, where C is the number of classes
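Both averages can be sketched in a few lines; the helper `micro_macro_f` and its input format (a list of per-class (TP, FP, FN) counts) are my own illustration:

```python
def f1(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

def micro_macro_f(per_class):
    """per_class: list of (TP, FP, FN) tuples, one per class."""
    # micro: pool counts over all classes, then compute one F-measure
    TP = sum(t for t, _, _ in per_class)
    FP = sum(fp for _, fp, _ in per_class)
    FN = sum(fn for _, _, fn in per_class)
    micro = f1(TP / (TP + FP), TP / (TP + FN))
    # macro: F-measure per class, then the plain average over classes
    fs = [f1(t / (t + fp), t / (t + fn)) for t, fp, fn in per_class]
    macro = sum(fs) / len(fs)
    return micro, macro

# two-class toy counts (each class treated one-vs-rest)
micro, macro = micro_macro_f([(20, 45, 10), (25, 10, 45)])
print(round(micro, 3), round(macro, 3))
```

Micro-averaging weights every instance equally (large classes dominate), while macro-averaging weights every class equally (small classes count as much as large ones).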

Training, Validation and Error Estimation ROC and AUC Curves

Receiver Operating Characteristic (ROC) Curve

ROC was originally conceived during WW II to assess the performance of radar systems.

If we apply our decision rule δ(x) for a range of τ then we can draw a curve of any of the success/error rates.

If we plot TPR vs FPR we get the ROC Curve.

Say we have two classifiers or distributions, one for C0 and another for C1.


Training, Validation and Error Estimation ROC and AUC Curves

ROC vs PR Curves

Line A is better than line B in both curves.

[Figure: ROC curve (tp rate vs fp rate) and precision-recall curve (precision vs recall), showing lines A and B.]

ROC Curve (Receiver Operating Characteristic)

T. Fawcett, An Introduction to ROC Analysis, Pattern Recognition Letters, 27, 2006, 861-874.

[Figure: ROC space plotting tp rate (TP/P) against fp rate (FP/N), with classifiers A-E. The lower left holds conservative classifiers, the upper right liberal classifiers (high tp but high fp), points below the diagonal are worse than random, and points near the diagonal are underperforming.]

A discrete classifier (output only a class label) produces a single point in ROC space.

A is more conservative than B.

C is random guessing (tp = fp): the whole diagonal line.

ROC Curve (Receiver Operating Characteristic)

We can create a curve from a scoring classifier by sweeping a threshold τ from extreme conservative to extreme liberal: if the classifier's output score ≥ τ, classify the instance as p, otherwise as n.

Example with 10 instances:

Instance  True Class  Score
1         p           0.9
2         n           0.7
3         p           0.55
4         p           0.54
5         n           0.52
6         p           0.51
7         p           0.38
8         n           0.35
9         n           0.33
10        n           0.1

At τ = 0.9: tp rate = 1/5, fp rate = 0.
At τ = 0.7: tp rate = 1/5, fp rate = 1/5.

[Figure: the resulting ROC curve, with the points achieving 70% and 60% accuracy marked.]
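Sweeping the threshold over the 10 scored instances reproduces these ROC points; a small sketch:

```python
# the 10-instance example from the slide: (true class, score)
data = [('p', 0.9), ('n', 0.7), ('p', 0.55), ('p', 0.54), ('n', 0.52),
        ('p', 0.51), ('p', 0.38), ('n', 0.35), ('n', 0.33), ('n', 0.1)]
P = sum(1 for c, _ in data if c == 'p')   # 5 positives
N = len(data) - P                          # 5 negatives

def roc_point(tau):
    """tp rate and fp rate when scores >= tau are classified positive."""
    tp = sum(1 for c, s in data if s >= tau and c == 'p')
    fp = sum(1 for c, s in data if s >= tau and c == 'n')
    return tp / P, fp / N

print(roc_point(0.9))   # (0.2, 0.0)
print(roc_point(0.7))   # (0.2, 0.2)
```

Evaluating `roc_point` at every distinct score, from highest (most conservative) to lowest (most liberal), traces out the full ROC curve.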

False Acceptance Rate vs False Rejection Rate

[Figure: tradeoff between the False Rejection Rate (FRR, based on FN) and the False Acceptance Rate (FAR, based on FP); high accuracy might mean rejecting more.]

Training, Validation and Error Estimation ROC and AUC Curves

Another Confusion Matrix

We can also create a confusion matrix using our estimated counts of positives and negatives:

True Value

1 0

Estimated

1 TP/N̂+ = PPV = precision FP/N̂+ = FDP

0 FN/N̂− TN/N̂− = NPV

Precision measures what fraction of our detections are actually positive. Recall measures what fraction of all the positives we actually detected.


Training, Validation and Error Estimation ROC and AUC Curves

Precision-Recall Curve

If we plot precision P vs recall R as we vary τ then we get a precision-recall curve, which can often show different performance behaviour from the ROC curve.

The P-R curve uses only statistics based on TP (it ignores TN), so it is useful when there is a very small number of positive cases or when the number of negatives could scale with some parameter.

The curve bends the other way.



Training, Validation and Error Estimation Other Performance Measures

Summary

Naive Bayes Classification

Cross Validation

Evaluation Metrics

Confusion Matrices and ROC curves

