# Lecture 8: Feature Selection and Analysis

Lecture 8: Feature Selection and Analysis

Introduction to Machine Learning Semester 1, 2022

Copyright © University of Melbourne 2021. All rights reserved. No part of this publication may be reproduced in any form by print, photoprint, microfilm or any other means without written permission from the author.



Features in Machine Learning

Machine Learning Workflow

[Workflow diagram: the evaluation task and data feed into a model; which model works best depends on the features.]

Data Preparation vs Feature Selection

GIGO: Garbage In, Garbage Out

Data Preparation and Cleaning (discussed before)

• Data Cleaning

• Data Aggregation

• Dealing with missing values

• Transformation (e.g., log transform)

• Binarization

• Scaling or Normalization

Feature Selection (this lecture)

• Wrapper methods (aka recursive elimination)

• Filtering (aka univariate filtering)

• Glance into some other common approaches


Data Preparation vs Feature Selection

Our job as Machine Learning experts:

• Inspect / clean the data

• Choose a model suitable for classifying the data according to the attributes

• Choose attributes suitable for classifying the data according to the model

• Inspection

• Intuition

• Neither possible in practice

Feature Selection

What makes features good?

Lead to better models

• Better performance according to some evaluation metric

Side-goal 1

• Seeing important features can suggest other important features

• Tell us interesting things about the problem

Side-goal 2

• Fewer features → smaller models → faster answers

• But note: a more accurate answer usually matters more than a faster one

Iterative feature selection: Wrappers

Choosing a good feature set

“Wrapper” methods

• Choose the subset of attributes that gives the best performance on the development data

• For example, for the Weather data set:

• Train and evaluate a model on {Outlook}

• Train and evaluate a model on {Temperature}

• Train and evaluate a model on {Outlook, Temperature}

• Train and evaluate a model on {Outlook, Temperature, Humidity}

• Train and evaluate a model on {Outlook, Temperature, Humidity, Windy}

• ... and so on, for every subset

• Best performance on the development data → best feature set

Choosing a good feature set

“Wrapper” methods

• Choose the subset of attributes that gives the best performance on the development data

• Advantages:

• Feature set with optimal performance on development data

• Disadvantages:

• Takes a long time

Aside: how long does the full wrapper method take?

Assume we have a fast method (e.g. Naive Bayes) over a data set of non-trivial size (∼10K instances):

• Assume: a train–evaluate cycle takes 10 sec to complete

How many cycles? For m features:

• 2^m subsets, so 2^m × 10 sec ≈ (2^m / 6) minutes

• m = 10 → ≈ 3 hours

• m = 60 → heat death of the universe

Only practical for very small data sets.
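The exhaustive wrapper can be sketched in a few lines. Here `evaluate_subset` is a hypothetical stand-in for a real train-and-evaluate cycle (it simply assumes {Outlook, Humidity} is the ideal subset); the point is the 2^m − 1 loop, not the scoring:

```python
from itertools import combinations

def evaluate_subset(features):
    # Hypothetical score standing in for "train, then evaluate on dev data":
    # reward features in the assumed-ideal set, penalise the rest.
    target = {"Outlook", "Humidity"}
    return len(target & set(features)) - 0.1 * len(set(features) - target)

def exhaustive_wrapper(all_features):
    """Score every non-empty subset and return the best one (2^m - 1 cycles)."""
    best_score, best_subset, cycles = float("-inf"), None, 0
    for k in range(1, len(all_features) + 1):
        for subset in combinations(all_features, k):
            cycles += 1
            score = evaluate_subset(subset)
            if score > best_score:
                best_score, best_subset = score, subset
    return set(best_subset), cycles

features = ["Outlook", "Temperature", "Humidity", "Windy"]
best, cycles = exhaustive_wrapper(features)
print(best, cycles)  # 2^4 - 1 = 15 cycles for 4 features
```

With m = 4 the loop runs 15 times; replacing each call with a real 10-second train–evaluate cycle is what makes the exhaustive version intractable.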

More practical wrapper methods: Greedy search

Greedy approach

• Train and evaluate model on each single attribute

• Choose best attribute

• Until convergence:

• Train and evaluate the model on the best attribute(s) so far, plus each remaining single attribute

• Choose the best attribute out of the remaining set

• Iterate until performance (e.g. accuracy) stops increasing
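The greedy loop above can be sketched as follows; `toy_score` is a hypothetical scoring function standing in for a real train-and-evaluate cycle (it assumes {Outlook, Humidity} is the best set):

```python
def toy_score(features):
    # Hypothetical stand-in for "train, then evaluate accuracy on dev data".
    target = {"Outlook", "Humidity"}
    return len(target & set(features)) - 0.1 * len(set(features) - target)

def greedy_forward_selection(all_features, score):
    """Add the single best attribute each round; stop when the score stops improving."""
    selected, best_score = [], float("-inf")
    remaining = list(all_features)
    while remaining:
        # Try each remaining attribute on top of the current best set.
        trial_scores = {f: score(selected + [f]) for f in remaining}
        best_f = max(trial_scores, key=trial_scores.get)
        if trial_scores[best_f] <= best_score:
            break  # converged: no remaining attribute improves performance
        selected.append(best_f)
        best_score = trial_scores[best_f]
        remaining.remove(best_f)
    return selected, best_score

selected, score = greedy_forward_selection(
    ["Outlook", "Temperature", "Humidity", "Windy"], toy_score)
print(selected)  # ['Outlook', 'Humidity']
```

Each round evaluates at most m candidate sets, which is where the roughly ½m² total comes from.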

More practical wrapper methods: Greedy search

Greedy approach

• Bad news:

• Takes roughly ½m² cycles, for m attributes

• In theory, 386 attributes → days

• Good news:

• In practice, converges much more quickly than this

• Bad news again:

• Converges to a sub-optimal (and often very bad) solution

More practical wrapper methods: Ablation

“Ablation” approach

• Start with all attributes

• Remove one attribute, train and evaluate model

• Until divergence:

• From remaining attributes, remove each attribute, train and evaluate model

• Remove attribute that causes least performance degradation

• Termination condition usually: performance (e.g. accuracy) starts to degrade by more than ε
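The ablation (backward elimination) loop can be sketched the same way; `toy_score` is again a hypothetical stand-in for a real train-and-evaluate cycle, not the lecture's implementation:

```python
def toy_score(features):
    # Hypothetical scoring function; {Outlook, Humidity} is assumed ideal.
    target = {"Outlook", "Humidity"}
    return len(target & set(features)) - 0.1 * len(set(features) - target)

def ablation(all_features, score, epsilon=0.01):
    """Repeatedly drop the attribute whose removal hurts performance least."""
    current = list(all_features)
    current_score = score(current)
    while len(current) > 1:
        # Score each candidate set with one attribute removed.
        trials = {f: score([g for g in current if g != f]) for f in current}
        victim = max(trials, key=trials.get)
        if trials[victim] < current_score - epsilon:
            break  # removing anything else degrades performance by more than epsilon
        current.remove(victim)
        current_score = trials[victim]
    return current, current_score

kept, kept_score = ablation(["Outlook", "Temperature", "Humidity", "Windy"], toy_score)
print(kept)  # ['Outlook', 'Humidity']
```

Note the loop scores each removal against the current set only, which is exactly the attribute-independence assumption mentioned below.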

More practical wrapper methods: Ablation

“Ablation” approach

• Good news:

• Mostly removes irrelevant attributes (at the start)

• Bad news:

• Assumes independence of attributes

(Actually, both approaches do this)

• Takes O(m²) time; cycles are slower with more attributes

• Not feasible on non-trivial data sets

Feature Filtering

Feature filtering

Intuition: Evaluate the “goodness” of each feature, separate from other features

• Consider each feature separately: linear time in the number of attributes

• Possible (but difficult) to control for inter-dependence of features

• Typically most popular strategy

Feature “goodness”

What makes a single feature good?

Toy example

| a1 | a2 | c |
|----|----|---|
| Y  | Y  | Y |
| Y  | N  | Y |
| N  | Y  | N |
| N  | N  | N |

Which of a1, a2 is good?

Pointwise Mutual Information

Discrepancy between the observed joint probability of two random variables A and C and the expected joint probability if A and C were independent.

Recall independence: P(C|A) = P(C), i.e. P(A, C) = P(A)P(C)

PMI is defined as

PMI(A, C) = log2[ P(A, C) / (P(A)P(C)) ]

We want to find attributes that are not independent of the class.

• If PMI >> 0, the attribute and the class occur together much more often than randomly.

• If PMI ≈ 0, the attribute and the class occur together about as often as we would expect from random chance.

• If PMI << 0, the attribute and the class are negatively correlated. (More on that later!)

Attributes with the greatest PMI: best attributes
Toy example, revisited

| a1 | a2 | c |
|----|----|---|
| Y  | Y  | Y |
| Y  | N  | Y |
| N  | Y  | N |
| N  | N  | N |

Calculate the PMI of a1, a2 with respect to c:

P(a1) = 2/4;  P(c) = 2/4;  P(a1, c) = 2/4

PMI(a1, c) = log2[ (2/4) / ((2/4)(2/4)) ] = log2(2) = 1

P(a2) = 2/4;  P(c) = 2/4;  P(a2, c) = 1/4

PMI(a2, c) = log2[ (1/4) / ((2/4)(2/4)) ] = log2(1) = 0
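The toy-example calculation can be checked directly from the definition; the probabilities below are the ones read off the toy table:

```python
import math

def pmi(p_joint, p_attr, p_class):
    """PMI(a, c) = log2( P(a, c) / (P(a) P(c)) )."""
    return math.log2(p_joint / (p_attr * p_class))

# Toy example: P(a1) = P(c) = 2/4, P(a1, c) = 2/4; P(a2, c) = 1/4
print(pmi(2/4, 2/4, 2/4))  # 1.0 -> a1 co-occurs with c more than chance
print(pmi(1/4, 2/4, 2/4))  # 0.0 -> a2 is independent of c
```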
Feature “goodness”, revisited
What makes a single feature good?
• Well correlated with class
• Knowing a lets us predict c with more confidence
• Reverse correlated with the class

• Knowing ā lets us predict c with more confidence

• Well correlated (or reverse correlated) with the complement of the class

• Knowing a lets us predict c̄ with more confidence

• Usually not quite as good, but still useful
Mutual Information
• Expected value of PMI over all possible events
• For our example: combine the PMI of all possible combinations: a, ā, c, c̄
Aside: Contingency tables

Contingency tables: a compact representation of these frequency counts

|       | a        | ā        | Total |
|-------|----------|----------|-------|
| c     | σ(a, c)  | σ(ā, c)  | σ(c)  |
| c̄    | σ(a, c̄) | σ(ā, c̄) | σ(c̄) |
| Total | σ(a)     | σ(ā)     | N     |

P(a, c) = σ(a, c) / N, etc.

Contingency tables for the toy example:

| a1    | a=Y | a=N | Total |
|-------|-----|-----|-------|
| c=Y   | 2   | 0   | 2     |
| c=N   | 0   | 2   | 2     |
| Total | 2   | 2   | 4     |

| a2    | a=Y | a=N | Total |
|-------|-----|-----|-------|
| c=Y   | 1   | 1   | 2     |
| c=N   | 1   | 1   | 2     |
| Total | 2   | 2   | 4     |
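Contingency counts are easy to build from raw instances; a minimal sketch over the toy data set (rows of (a1, a2, c)):

```python
from collections import Counter

# Toy data set from the slides: rows of (a1, a2, c)
rows = [("Y", "Y", "Y"), ("Y", "N", "Y"), ("N", "Y", "N"), ("N", "N", "N")]

def contingency(rows, attr_index):
    """Count sigma(attribute value, class) pairs for one attribute."""
    return Counter((r[attr_index], r[-1]) for r in rows)

table_a1 = contingency(rows, 0)
table_a2 = contingency(rows, 1)
print(table_a1[("Y", "Y")], table_a1[("Y", "N")])  # 2 0
```

Dividing each count by N = len(rows) gives the probabilities used in PMI and MI.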
Mutual Information

Combine the PMI of all possible combinations: a, ā, c, c̄

MI(A, C) = P(a, c) PMI(a, c) + P(ā, c) PMI(ā, c) + P(a, c̄) PMI(a, c̄) + P(ā, c̄) PMI(ā, c̄)

MI(A, C) = P(a, c) log2[ P(a, c) / (P(a)P(c)) ] + P(ā, c) log2[ P(ā, c) / (P(ā)P(c)) ]
         + P(a, c̄) log2[ P(a, c̄) / (P(a)P(c̄)) ] + P(ā, c̄) log2[ P(ā, c̄) / (P(ā)P(c̄)) ]

Often written more compactly as:

MI(A, C) = Σ_{i ∈ {a, ā}} Σ_{j ∈ {c, c̄}} P(i, j) log2[ P(i, j) / (P(i)P(j)) ]

We define 0 log 0 ≡ 0.
Mutual Information Example

Contingency table for attribute a1:

| a1    | a=Y | a=N | Total |
|-------|-----|-----|-------|
| c=Y   | 2   | 0   | 2     |
| c=N   | 0   | 2   | 2     |
| Total | 2   | 2   | 4     |

P(a, c) = 2/4;  P(a) = 2/4;  P(c) = 2/4;  P(a, c̄) = 0
P(ā, c̄) = 2/4;  P(ā) = 2/4;  P(c̄) = 2/4;  P(ā, c) = 0

MI(A1, C) = P(a, c) log2[ P(a, c) / (P(a)P(c)) ] + P(ā, c) log2[ P(ā, c) / (P(ā)P(c)) ]
          + P(a, c̄) log2[ P(a, c̄) / (P(a)P(c̄)) ] + P(ā, c̄) log2[ P(ā, c̄) / (P(ā)P(c̄)) ]

          = (1/2) log2[ (1/2) / ((1/2)(1/2)) ] + 0 + 0 + (1/2) log2[ (1/2) / ((1/2)(1/2)) ]

          = (1/2)(1) + 0 + 0 + (1/2)(1) = 1
Mutual Information Example continued

Contingency table for attribute a2:

| a2    | a=Y | a=N | Total |
|-------|-----|-----|-------|
| c=Y   | 1   | 1   | 2     |
| c=N   | 1   | 1   | 2     |
| Total | 2   | 2   | 4     |

P(a) = 2/4;  P(c) = 2/4;  P(ā, c) = 1/4
P(ā) = 2/4;  P(c̄) = 2/4;  P(a, c̄) = 1/4
P(a, c) = 1/4;  P(ā, c̄) = 1/4

MI(A2, C) = P(a, c) log2[ P(a, c) / (P(a)P(c)) ] + P(ā, c) log2[ P(ā, c) / (P(ā)P(c)) ]
          + P(a, c̄) log2[ P(a, c̄) / (P(a)P(c̄)) ] + P(ā, c̄) log2[ P(ā, c̄) / (P(ā)P(c̄)) ]

          = (1/4) log2(1) + (1/4) log2(1) + (1/4) log2(1) + (1/4) log2(1)

          = (1/4)(0) + (1/4)(0) + (1/4)(0) + (1/4)(0) = 0
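The two worked examples can be reproduced with a small MI function over the joint and marginal probabilities (using the 0 log 0 = 0 convention); a minimal sketch:

```python
import math

def mutual_information(joint, p_a, p_c):
    """MI(A, C) = sum over {a, not-a} x {c, not-c} of P(i,j) log2(P(i,j)/(P(i)P(j))).

    joint maps (i, j) -> P(i, j); p_a and p_c hold the marginals.
    Terms with P(i, j) = 0 contribute 0 (the 0 log 0 = 0 convention).
    """
    total = 0.0
    for (i, j), p_ij in joint.items():
        if p_ij > 0:
            total += p_ij * math.log2(p_ij / (p_a[i] * p_c[j]))
    return total

# Attribute a1: perfectly aligned with the class -> MI = 1 bit
mi_a1 = mutual_information(
    {("a", "c"): 0.5, ("a", "~c"): 0.0, ("~a", "c"): 0.0, ("~a", "~c"): 0.5},
    {"a": 0.5, "~a": 0.5}, {"c": 0.5, "~c": 0.5})

# Attribute a2: independent of the class -> MI = 0 bits
mi_a2 = mutual_information(
    {("a", "c"): 0.25, ("a", "~c"): 0.25, ("~a", "c"): 0.25, ("~a", "~c"): 0.25},
    {"a": 0.5, "~a": 0.5}, {"c": 0.5, "~c": 0.5})
print(mi_a1, mi_a2)  # 1.0 0.0
```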
χ² (“Chi-square”)

Similar idea, different solution. Contingency table (shorthand):

|       | a     | b̄: ā | Total         |
|-------|-------|-------|---------------|
| c     | W     | X     | W + X         |
| c̄    | Y     | Z     | Y + Z         |
| Total | W + Y | X + Z | N = W+X+Y+Z   |

If a, c were independent (uncorrelated), what value would you expect in W? Denote the expected value as E(W).

If a, c were independent, then P(a, c) = P(a)P(c):

σ(a, c) / N = [σ(a) / N] × [σ(c) / N]

σ(a, c) = σ(a)σ(c) / N

E(W) = (W + Y)(W + X) / (W + X + Y + Z)
χ2 (“Chi-square”)
Compare the value we actually observed O(W) with the expected value E(W):
• If the observed value is much greater than the expected value, a occurs more often with c than we would expect at random — predictive
• If the observed value is much smaller than the expected value, a occurs less often with c than we would expect at random — predictive
• If the observed value is close to the expected value, a occurs as often with c as we would expect randomly — not predictive
Similarly with X, Y, Z
χ2 (“Chi-square”)
Actual calculation (to fit a chi-square distribution):

χ² = (O(W) − E(W))² / E(W) + (O(X) − E(X))² / E(X)
   + (O(Y) − E(Y))² / E(Y) + (O(Z) − E(Z))² / E(Z)

In general:

χ² = Σ_i Σ_j (O_ij − E_ij)² / E_ij

• i sums over the rows and j sums over the columns.

• Because the differences are squared, χ² becomes much greater when |O − E| is large, even if E is also large.
Chi-square Example

Contingency table for the toy example (observed values):

| a1    | a=Y | a=N | Total |
|-------|-----|-----|-------|
| c=Y   | 2   | 0   | 2     |
| c=N   | 0   | 2   | 2     |
| Total | 2   | 2   | 4     |

Contingency table for the toy example (expected values):

| a1    | a=Y | a=N | Total |
|-------|-----|-----|-------|
| c=Y   | 1   | 1   | 2     |
| c=N   | 1   | 1   | 2     |
| Total | 2   | 2   | 4     |

χ²(A1, C) = (O(a,c) − E(a,c))² / E(a,c) + (O(ā,c) − E(ā,c))² / E(ā,c)
          + (O(a,c̄) − E(a,c̄))² / E(a,c̄) + (O(ā,c̄) − E(ā,c̄))² / E(ā,c̄)

          = (2−1)²/1 + (0−1)²/1 + (0−1)²/1 + (2−1)²/1

          = 1 + 1 + 1 + 1 = 4

χ²(A2, C) is obviously 0, because all observed values are equal to the expected values.
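The same calculation in code, for a 2×2 table of observed counts; a minimal sketch that derives the expected counts from the row and column totals:

```python
def chi_square(observed):
    """Chi-square statistic for a 2x2 contingency table given as [[W, X], [Y, Z]]."""
    (W, X), (Y, Z) = observed
    n = W + X + Y + Z
    row_totals, col_totals = (W + X, Y + Z), (W + Y, X + Z)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n  # E = (row total)(col total) / N
            stat += (obs - exp) ** 2 / exp
    return stat

chi_a1 = chi_square([[2, 0], [0, 2]])  # observed table for a1
chi_a2 = chi_square([[1, 1], [1, 1]])  # observed table for a2
print(chi_a1, chi_a2)  # 4.0 0.0
```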
Common Issues
Types of Attribute
So far, we’ve only looked at binary (Y/N) attributes. We also need to handle:

• Nominal attributes

• Continuous attributes

• Ordinal attributes
Types of Attributes: Nominal
Two common strategies
1. Treat as multiple binary attributes:
• e.g. sunny=Y, overcast=N, rainy=N, etc.
• Can just use the formulae as given
• Results sometimes difficult to interpret
• For example, Outlook=sunny is useful, but Outlook=overcast and Outlook=rainy are not useful... Should we use Outlook?
2. Modify contingency tables (and formulae)
| O   | s | o | r |
|-----|---|---|---|
| c=Y | U | V | W |
| c=N | X | Y | Z |
Types of Attributes: Nominal
Modified MI:

MI(O, C) = Σ_{i ∈ {s, o, r}} Σ_{j ∈ {c, c̄}} P(i, j) log2[ P(i, j) / (P(i)P(j)) ]

         = P(s, c) log2[ P(s, c) / (P(s)P(c)) ] + P(s, c̄) log2[ P(s, c̄) / (P(s)P(c̄)) ]
         + P(o, c) log2[ P(o, c) / (P(o)P(c)) ] + P(o, c̄) log2[ P(o, c̄) / (P(o)P(c̄)) ]
         + P(r, c) log2[ P(r, c) / (P(r)P(c)) ] + P(r, c̄) log2[ P(r, c̄) / (P(r)P(c̄)) ]
• Biased towards attributes with many values.
Types of Attributes: Nominal
Chi-square can be used as normal, with 6 observed/expected values.
• To control for score inflation, we need to consider “number of degrees of freedom”, and then use the significance test explicitly (beyond the scope of this subject)
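The modified MI generalises directly to any number of attribute values; a minimal sketch that works from a {(value, class): count} table. The Outlook counts below are invented for illustration, not taken from the lecture's data:

```python
import math

def mi_nominal(counts):
    """MI for a nominal attribute from a {(value, class): count} table.

    Works for any number of attribute values; terms with count 0
    contribute 0 (the 0 log 0 = 0 convention).
    """
    n = sum(counts.values())
    p_val, p_cls = {}, {}
    for (v, c), k in counts.items():
        p_val[v] = p_val.get(v, 0) + k / n
        p_cls[c] = p_cls.get(c, 0) + k / n
    mi = 0.0
    for (v, c), k in counts.items():
        if k > 0:
            p_vc = k / n
            mi += p_vc * math.log2(p_vc / (p_val[v] * p_cls[c]))
    return mi

# Hypothetical Outlook counts over 8 instances (invented for illustration)
counts = {("sunny", "c"): 2, ("sunny", "~c"): 1,
          ("overcast", "c"): 2, ("overcast", "~c"): 0,
          ("rainy", "c"): 0, ("rainy", "~c"): 3}
print(round(mi_nominal(counts), 4))
```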
Types of Attributes: Continuous
Continuous attributes
• Usually dealt with by estimating probability based on a Gaussian (normal) distribution
• With a large number of values, most random variables are normally distributed due to the Central Limit Theorem
• For small data sets or pathological features, we may need to use binomial/multinomial distributions
All of this is beyond the scope of this subject
Types of Attributes: Ordinal
Three possibilities, roughly in order of popularity:
1. Treat as binary

• Particularly appropriate for frequency counts where events are low-frequency (e.g. words in tweets)

2. Treat as continuous

• The fact that we haven’t seen any intermediate values is usually not important

• Does have all of the technical downsides of continuous attributes, however

3. Treat as nominal (i.e. throw away the ordering)
Multi-class problems
So far, we’ve only looked at binary (Y/N) classification tasks.
Multiclass (e.g. LA, NY, C, At, SF) classification tasks are usually much more difficult.
What makes a single feature good?
• Highly correlated with class
• Highly reverse correlated with class
• Highly correlated (or reverse correlated) with not class
... What if there are many classes?
What makes a feature bad?
• Irrelevant
• Correlated with other features
• Good at only predicting one class (but is this truly bad?)
Multi-class problems
• PMI, MI, χ2 are all calculated per-class
• (Some other feature selection metrics, e.g. Information Gain, work for all classes at once)
• Need to make a point of selecting (hopefully uncorrelated) features for each class to give our classifier the best chance of predicting everything correctly.
Multi-class problems
Actual example (MI):

[Table: top words by MI for each city class (LA, NY, C, At, SF); the extracted columns are scrambled, but the terms include hollywood, la, bears, chicago, cubs, chi, atlanta, atl, georgia, ga, yankees, sf, san, lol, save, and several URL fragments.]
Multi-class problems
Intuitive features:

[Table: intuitively city-related high-MI words for each class, e.g. la, angeles, los, hollywood, lakers (LA); nyc, york, ny, yankees (NY); chicago, bears, chi, cubs, il (C); atlanta, atl, georgia, ga (At); sf, san, francisco (SF). Column boundaries were lost in extraction.]
Multi-class problems
Features for predicting not class:

[Table: the same word list, highlighting words that are predictive of a tweet not belonging to a class (reverse-correlated features). Column boundaries were lost in extraction.]
Multi-class problems
Unintuitive features:

[Table: the remaining, unintuitive high-MI words, e.g. lol, save, u, and URL fragments such as httpbitlyczmk and httpdealnaycom. Column boundaries were lost in extraction.]
What’s going on with MI?
Mutual Information is biased toward rare, uninformative features
• All probabilities: no notion of the raw frequency of events
• If a feature is seen rarely, but always with a given class, it will be seen as “good”
• Best features in the Twitter dataset only had MI of about 0.01 bits; 100th best for a given class had MI of about 0.002 bits
Glance into a few other common approaches to feature selection
A common (unsupervised) alternative

Term Frequency–Inverse Document Frequency (TFIDF)

• Detect important words (Natural Language Processing)

• Find words that are relevant to a document in a given document collection

• To be relevant, a word should be

• Frequent enough in the corpus (TF). A word that occurs only 5 times in a corpus of 5,000,000 words is probably not too interesting

• Special enough (IDF). A very common word (“the”, “you”, ...) that occurs in (almost) every document is probably not too interesting

tfidf(t, d, D) = tf × idf

tf = log(1 + freq(t, d))

idf = log( |D| / count(d ∈ D : t ∈ d) )

d = document, t = term, D = document collection; |D| = number of documents in D
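The two formulas combine multiplicatively (the standard tf·idf product); a minimal sketch over a tiny invented corpus:

```python
import math

def tfidf(term, doc, docs):
    """tf-idf with the log-scaled variants from the slide."""
    tf = math.log(1 + doc.count(term))          # tf = log(1 + freq(t, d))
    df = sum(1 for d in docs if term in d)      # number of documents containing t
    idf = math.log(len(docs) / df)              # idf = log(|D| / df)
    return tf * idf

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "purred"]]
# "the" occurs in every document -> idf = log(1) = 0, so its tf-idf is 0
print(tfidf("the", docs[0], docs))  # 0.0
# "dog" is rare in the collection -> positive score in the document containing it
print(tfidf("dog", docs[1], docs) > 0)  # True
```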
Embedded Methods
Some ML models include feature selection inherently:

1. Decision trees: generalization of 1-R

2. Regression models with regularization

house price = β0 + β1 × size + β2 × location + β3 × age

Regularization (or ‘penalty’) nudges the weight β of unimportant features towards zero
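An L1 (lasso) penalty in particular can drive unimportant weights exactly to zero, which is why it acts as embedded feature selection. A minimal coordinate-descent sketch on an invented, pre-standardised data set (not the lecture's example) where y depends on feature 0 only:

```python
def lasso_coordinate_descent(X, y, lam, iters=50):
    """Minimise ||y - Xw||^2 / (2n) + lam * ||w||_1 by coordinate descent.

    Assumes each column of X already has mean 0 and unit variance.
    """
    n, m = len(X), len(X[0])
    w = [0.0] * m
    for _ in range(iters):
        for j in range(m):
            # Correlation of feature j with the residual that excludes feature j
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * w[k] for k in range(m) if k != j))
                for i in range(n)) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            # Soft-thresholding: weights with |rho| <= lam are set exactly to 0
            if rho > lam:
                w[j] = (rho - lam) / z
            elif rho < -lam:
                w[j] = (rho + lam) / z
            else:
                w[j] = 0.0
    return w

X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]  # two orthogonal, standardised features
y = [2, 2, -2, -2]                        # depends on feature 0 only
w = lasso_coordinate_descent(X, y, lam=0.1)
print(w)  # feature 1's weight is exactly 0; feature 0 is kept (shrunk to ~1.9)
```

The soft-thresholding step is the mechanism behind the "nudges the weight towards zero" intuition: the penalty also slightly shrinks the useful weight (1.9 instead of the true 2).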