Theory and Applications of Boosting
Rob Schapire
Princeton University
Example: “How May I Help You?”
[Gorin et al.]
• goal: automatically categorize type of call requested by phone customer (Collect, CallingCard, PersonToPerson, etc.)
• yes I’d like to place a collect call long distance please (Collect)
• operator I need to make a call but I need to bill it to my office (ThirdNumber)
• yes I’d like to place a call on my master card please (CallingCard)
• I just called a number in sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off of my bill (BillingCredit)
• observation:
• easy to find “rules of thumb” that are “often” correct
• e.g.: “IF ‘card’ occurs in utterance THEN predict ‘CallingCard’ ”
• hard to find single highly accurate prediction rule
The Boosting Approach
• devise computer program for deriving rough rules of thumb
• apply procedure to subset of examples
  • obtain rule of thumb
• apply to 2nd subset of examples
  • obtain 2nd rule of thumb
• repeat T times
Details
• how to choose examples on each round?
  • concentrate on “hardest” examples (those most often misclassified by previous rules of thumb)
• how to combine rules of thumb into single prediction rule?
  • take (weighted) majority vote of rules of thumb
Boosting
• boosting = general method of converting rough rules of
thumb into highly accurate prediction rule
• technically:
• assume given “weak” learning algorithm that can
consistently find classifiers (“rules of thumb”) at least slightly better than random, say, accuracy ≥ 55%
(in two-class setting)
• given sufficient data, a boosting algorithm can provably construct single classifier with very high accuracy, say, 99%
Outline of Tutorial
• brief background
• basic algorithm and core theory
• other ways of understanding boosting
• experiments, applications and extensions
Brief Background
Strong and Weak Learnability
• boosting’s roots are in “PAC” (Valiant) learning model
• get random examples from unknown, arbitrary distribution
• strong PAC learning algorithm:
  • for any distribution
  • with high probability
  • given polynomially many examples (and polynomial time)
  • can find classifier with arbitrarily small generalization error
• weak PAC learning algorithm
  • same, but generalization error only needs to be slightly better than random guessing (≤ 1/2 − γ)
• [Kearns & Valiant ’88]:
• does weak learnability imply strong learnability?
Early Boosting Algorithms
• [Schapire ’89]:
• first provable boosting algorithm
• [Freund ’90]:
• “optimal” algorithm that “boosts by majority”
• [Drucker, Schapire & Simard ’92]:
• first experiments using boosting
• limited by practical drawbacks
AdaBoost
• [Freund & Schapire ’95]:
• introduced “AdaBoost” algorithm
• strong practical advantages over previous boosting
algorithms
• experiments and applications using AdaBoost:
[Drucker & Cortes ’96] [Jackson & Craven ’96] [Freund & Schapire ’96] [Quinlan ’96]
[Breiman ’96]
[Maclin & Opitz ’97]
[Bauer & Kohavi ’97]
[Schwenk & Bengio ’98] [Schapire, Singer & Singhal ’98]
[Abney, Schapire & Singer ’99] [Haruno, Shirai & Ooyama ’99] [Cohen & Singer ’99] [Dietterich ’00]
[Schapire & Singer ’00]
[Collins ’00]
[Escudero, Màrquez & Rigau ’00] [Iyer, Lewis, Schapire et al. ’00] [Onoda, Rätsch & Müller ’00]
[Tieu & Viola ’00]
[Walker, Rambow & Rogati ’01] [Rochery, Schapire, Rahim & Gupta ’01] [Merler, Furlanello, Larcher & Sboner ’01] [Di Fabbrizio, Dutton, Gupta et al. ’02] [Qu, Adam, Yasui et al. ’02]
[Tur, Schapire & Hakkani-Tür ’03]
[Viola & Jones ’04]
[Middendorf, Kundaje, Wiggins et al. ’04]
. . .
• continuing development of theory and algorithms:
[Breiman ’98, ’99]
[Schapire, Freund, Bartlett & Lee ’98] [Grove & Schuurmans ’98]
[Mason, Bartlett & Baxter ’98] [Schapire & Singer ’99]
[Cohen & Singer ’99]
[Freund & Mason ’99]
[Domingo & Watanabe ’99]
[Mason, Baxter, Bartlett & Frean ’99]
[Duffy & Helmbold ’99, ’02]
[Freund & Mason ’99]
[Ridgeway, Madigan & Richardson ’99] [Kivinen & Warmuth ’99]
[Friedman, Hastie & Tibshirani ’00] [Rätsch, Onoda & Müller ’00] [Rätsch, Warmuth, Mika et al. ’00] [Allwein, Schapire & Singer ’00] [Friedman ’01]
[Koltchinskii, Panchenko & Lozano ’01] [Collins, Schapire & Singer ’02] [Demiriz, Bennett & Shawe-Taylor ’02] [Lebanon & Lafferty ’02]
[Wyner ’02]
[Rudin, Daubechies & Schapire ’03] [Jiang ’04]
[Lugosi & Vayatis ’04]
[Zhang ’04]
. . .
Basic Algorithm and Core Theory
• introduction to AdaBoost
• analysis of training error
• analysis of test error based on margins theory
A Formal Description of Boosting
• given training set (x1, y1), . . . , (xm, ym)
  • yi ∈ {−1, +1} correct label of instance xi ∈ X
• for t = 1, . . . , T:
  • construct distribution Dt on {1, . . . , m}
  • find weak classifier (“rule of thumb”) ht : X → {−1, +1} with small error εt on Dt:
      εt = Pr_{i∼Dt}[ht(xi) ≠ yi]
• output final classifier Hfinal
AdaBoost
[with Freund]
• constructing Dt:
  • D1(i) = 1/m
  • given Dt and ht:
      Dt+1(i) = (Dt(i)/Zt) × { e^(−αt)  if yi = ht(xi)
                               e^(αt)   if yi ≠ ht(xi) }
              = Dt(i) exp(−αt yi ht(xi)) / Zt
    where Zt = normalization constant
          αt = (1/2) ln((1 − εt)/εt) > 0
• final classifier:
  • Hfinal(x) = sign( Σt αt ht(x) )
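The update above can be written as a short program. Below is a minimal Python sketch (the weak_learn interface, the predict method, and all names are illustrative assumptions, not from the slides):

```python
import numpy as np

def adaboost(X, y, weak_learn, T):
    """Minimal AdaBoost sketch for labels y in {-1, +1}.

    weak_learn(X, y, D) is assumed to return a classifier h whose
    h.predict(X) is in {-1, +1} and has small weighted error on D.
    Assumes 0 < eps_t < 1/2 on every round (weak learning assumption).
    """
    m = len(y)
    D = np.full(m, 1.0 / m)                 # D1(i) = 1/m
    hs, alphas = [], []
    for _ in range(T):
        h = weak_learn(X, y, D)             # weak classifier ("rule of thumb")
        pred = h.predict(X)
        eps = D[pred != y].sum()            # weighted error eps_t on D_t
        alpha = 0.5 * np.log((1 - eps) / eps)
        D = D * np.exp(-alpha * y * pred)   # up-weight misclassified examples
        D /= D.sum()                        # normalize (divide by Z_t)
        hs.append(h)
        alphas.append(alpha)

    def H_final(Xnew):
        votes = sum(a * h.predict(Xnew) for a, h in zip(alphas, hs))
        return np.sign(votes)               # weighted majority vote
    return H_final
```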
Toy Example
D1
weak classifiers = vertical or horizontal half-planes
Round 1
h1
D2
ε1 = 0.30, α1 = 0.42
Round 2
h2 D3
ε2 = 0.21, α2 = 0.65
Round 3
h3
ε3 = 0.14, α3 = 0.92
Final Classifier
Hfinal = sign( 0.42 h1 + 0.65 h2 + 0.92 h3 )
[figure: the resulting combined classifier formed from the three weighted half-planes]
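As a quick check of the rule αt = (1/2) ln((1 − εt)/εt) against the rounded errors shown in the toy example (the slide's 0.65 and 0.92 presumably come from unrounded values of εt):

```python
import numpy as np

# alpha_t = 1/2 * ln((1 - eps_t) / eps_t) for the three toy rounds
for eps in (0.30, 0.21, 0.14):
    print(round(0.5 * np.log((1 - eps) / eps), 2))
# prints 0.42, 0.66, 0.91; the slide's 0.65 and 0.92 reflect unrounded errors
```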
Analyzing the training error
• Theorem:
  • write εt as 1/2 − γt
  • then
      training error(Hfinal) ≤ Πt [ 2√(εt(1 − εt)) ]
                             = Πt √(1 − 4γt²)
                             ≤ exp( −2 Σt γt² )
• so: if ∀t: γt ≥ γ > 0
  then training error(Hfinal) ≤ e^(−2γ²T)
• AdaBoost is adaptive:
  • does not need to know γ or T a priori
  • can exploit γt ≫ γ
Proof
• let f(x) = Σt αt ht(x)  ⇒  Hfinal(x) = sign(f(x))
• Step 1: unwrapping recurrence:
      Dfinal(i) = (1/m) · exp(−yi Σt αt ht(xi)) / Πt Zt
                = (1/m) · exp(−yi f(xi)) / Πt Zt
Proof (cont.)
• Step 2: training error(Hfinal) ≤ Πt Zt
• Proof:
      training error(Hfinal) = (1/m) Σi { 1 if yi ≠ Hfinal(xi), 0 else }
                             = (1/m) Σi { 1 if yi f(xi) ≤ 0, 0 else }
                             ≤ (1/m) Σi exp(−yi f(xi))
                             = Σi Dfinal(i) Πt Zt
                             = Πt Zt
Proof (cont.)
• Step 3: Zt = 2√(εt(1 − εt))
• Proof:
      Zt = Σi Dt(i) exp(−αt yi ht(xi))
         = Σ_{i: yi ≠ ht(xi)} Dt(i) e^(αt) + Σ_{i: yi = ht(xi)} Dt(i) e^(−αt)
         = εt e^(αt) + (1 − εt) e^(−αt)
         = 2√(εt(1 − εt))
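The last equality is routine algebra that the slide leaves implicit; substituting αt = (1/2) ln((1 − εt)/εt):

```latex
e^{\alpha_t} = \sqrt{\frac{1-\varepsilon_t}{\varepsilon_t}}, \qquad
e^{-\alpha_t} = \sqrt{\frac{\varepsilon_t}{1-\varepsilon_t}}
\;\Longrightarrow\;
\varepsilon_t e^{\alpha_t} + (1-\varepsilon_t) e^{-\alpha_t}
  = \sqrt{\varepsilon_t(1-\varepsilon_t)} + \sqrt{\varepsilon_t(1-\varepsilon_t)}
  = 2\sqrt{\varepsilon_t(1-\varepsilon_t)} .
```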
How Will Test Error Behave? (A First Guess)
[figure: expected behavior of training and test error vs. # of rounds (T)]
expect:
• training error to continue to drop (or reach zero)
• test error to increase when Hfinal becomes “too complex”
  • “Occam’s razor”
• overfitting
• hard to know when to stop training
Actual Typical Run
[figure: training and test error of boosted C4.5 vs. # of rounds (T), with C4.5's test error shown for reference]
(boosting C4.5 on “letter” dataset)
• test error does not increase, even after 1000 rounds
  • (total size > 2,000,000 nodes)
• test error continues to drop even after training error is zero!
                 # rounds
               5      100    1000
  train error  0.0    0.0    0.0
  test error   8.4    3.3    3.1
• Occam’s razor wrongly predicts “simpler” rule is better
A Better Story: The Margins Explanation
[with Freund, Bartlett & Lee]
• key idea:
  • training error only measures whether classifications are right or wrong
  • should also consider confidence of classifications
• recall: Hfinal is weighted majority vote of weak classifiers
• measure confidence by margin = strength of the vote
  = (fraction voting correctly) − (fraction voting incorrectly)
[figure: margin scale from −1 (Hfinal incorrect, high confidence) through 0 (low confidence) to +1 (Hfinal correct, high confidence)]
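For concreteness, the margin of each training example could be computed from a boosting run as follows (a sketch; the alphas/hs interface matches the earlier AdaBoost sketch and is an assumption):

```python
import numpy as np

def margins(alphas, hs, X, y):
    """Normalized margins y_i * f(x_i) / sum_t |alpha_t|, in [-1, +1].

    hs are weak classifiers with predict() in {-1, +1}; a margin of +1
    means a unanimous correct vote, -1 a unanimous incorrect vote.
    """
    votes = sum(a * h.predict(X) for a, h in zip(alphas, hs))
    return y * votes / np.sum(np.abs(alphas))
```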
Empirical Evidence: The Margin Distribution
• margin distribution = cumulative distribution of margins of training examples
[figures: test/train error vs. # of rounds (T), and cumulative margin distributions after 5, 100 and 1000 rounds]

                        # rounds
                      5      100     1000
  train error        0.0     0.0     0.0
  test error         8.4     3.3     3.1
  % margins ≤ 0.5    7.7     0.0     0.0
  minimum margin     0.14    0.52    0.55
Theoretical Evidence: Analyzing Boosting Using Margins
• Theorem: large margins ⇒ better bound on generalization error (independent of number of rounds)
• proof idea: if all margins are large, then can approximate final classifier by a much smaller classifier (just as polls can predict not-too-close election)
• Theorem: boosting tends to increase margins of training examples (given weak learning assumption)
• proof idea: similar to training error proof
• so:
although final classifier is getting larger,
margins are likely to be increasing,
so final classifier actually getting close to a simpler classifier, driving down the test error
More Technically…
• with high probability, ∀θ > 0:
      generalization error ≤ P̂r[margin ≤ θ] + Õ( (1/θ)·√(d/m) )
  (P̂r[·] = empirical probability)
• bound depends on
• m = # training examples
• d = “complexity” of weak classifiers
• entire distribution of margins of training examples
• P̂r[margin ≤ θ] → 0 exponentially fast (in T) if (error of ht on Dt) < 1/2 − θ (∀t)
• so: if weak learning assumption holds, then all examples will quickly have “large” margins
Other Ways of Understanding AdaBoost
• game theory
• loss minimization
• estimating conditional probabilities
Game Theory
• game defined by matrix M:
                Rock   Paper   Scissors
    Rock         1/2     1        0
    Paper         0     1/2       1
    Scissors      1      0       1/2
• row player chooses row i
• column player chooses column j (simultaneously)
• row player’s goal: minimize loss M(i, j)
• usually allow randomized play:
  • players choose distributions P and Q over rows and columns
  • learner’s (expected) loss = Σ_{i,j} P(i) M(i, j) Q(j) = PᵀMQ ≡ M(P, Q)
The Minmax Theorem
• von Neumann’s minmax theorem:
      min_P max_Q M(P, Q) = max_Q min_P M(P, Q) = v = “value” of game M
• in words:
  • v = min max means:
    • row player has strategy P∗ such that ∀ column strategy Q, loss M(P∗, Q) ≤ v
  • v = max min means:
    • this is optimal in sense that column player has strategy Q∗ such that ∀ row strategy P, loss M(P, Q∗) ≥ v
The Boosting Game
• let {g1, . . . , gN} = space of all weak classifiers
• row player ↔ booster
• column player ↔ weak learner
• matrix M:
  • row ↔ example (xi, yi)
  • column ↔ weak classifier gj
  • M(i, j) = { 1 if yi = gj(xi);  0 else }
[figure: matrix M with rows indexed by examples (x1,y1), . . . , (xm,ym) for the booster and columns indexed by weak classifiers g1, . . . , gN for the weak learner]
Boosting and the Minmax Theorem
• if:
  • ∀ distributions over examples, ∃h with accuracy ≥ 1/2 + γ
• then:
  • min_P max_j M(P, j) ≥ 1/2 + γ
• by minmax theorem:
  • max_Q min_i M(i, Q) ≥ 1/2 + γ > 1/2
• which means:
  • ∃ weighted majority of classifiers which correctly classifies all examples with positive margin (2γ)
• optimal margin ↔ “value” of game
AdaBoost and Game Theory
• AdaBoost is special case of general algorithm for solving games through repeated play
• can show
• distribution over examples converges to (approximate) minmax strategy for boosting game
• weights on weak classifiers converge to (approximate) maxmin strategy
• different instantiation of game-playing algorithm gives on-line learning algorithms (such as weighted majority algorithm)
[with Freund]
AdaBoost and Exponential Loss
• many (most?) learning algorithms minimize a “loss” function
  • e.g. least squares regression
• training error proof shows AdaBoost actually minimizes
      Πt Zt = (1/m) Σi exp(−yi f(xi))
  where f(x) = Σt αt ht(x)
• on each round, AdaBoost greedily chooses αt and ht to minimize loss
• exponential loss is an upper bound on 0-1 (classification) loss
• AdaBoost provably minimizes exponential loss
[figure: exponential loss and 0-1 loss plotted against the margin y f(x)]
Coordinate Descent
[Breiman]
• {g1, . . . , gN} = space of all weak classifiers
• want to find λ1, . . . , λN to minimize
      L(λ1, . . . , λN) = Σi exp( −yi Σj λj gj(xi) )
• AdaBoost is actually doing coordinate descent on this optimization problem:
  • initially, all λj = 0
  • each round: choose one coordinate λj (corresponding to ht) and update (increment by αt)
  • choose update causing biggest decrease in loss
• powerful technique for minimizing over huge space of functions
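A sketch of this coordinate-descent view for a fixed, finite set of weak classifiers (the preds array, the greedy choice by smallest weighted error, and all names are illustrative assumptions):

```python
import numpy as np

def coordinate_descent_boost(preds, y, T):
    """Coordinate descent on the exponential loss.

    preds is an (N, m) array with preds[j, i] = g_j(x_i) in {-1, +1} for a
    finite set of weak classifiers; y has labels in {-1, +1}.
    Assumes the chosen coordinate always has weighted error in (0, 1/2).
    """
    N, m = preds.shape
    lam = np.zeros(N)                       # initially all lambda_j = 0
    f = np.zeros(m)                         # current f(x_i) = sum_j lam_j g_j(x_i)
    for _ in range(T):
        w = np.exp(-y * f)                  # unnormalized example weights
        D = w / w.sum()
        eps = 0.5 - 0.5 * preds @ (D * y)   # weighted error of each g_j under D
        j = int(np.argmin(eps))             # smallest error = biggest loss decrease
        alpha = 0.5 * np.log((1 - eps[j]) / eps[j])
        lam[j] += alpha                     # increment the chosen coordinate by alpha
        f += alpha * preds[j]
    return lam
```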
Functional Gradient Descent
[Friedman] [Mason et al.]
• want to minimize
      L(f) = L(f(x1), . . . , f(xm)) = Σi exp(−yi f(xi))
• say have current estimate f and want to improve
• to do gradient descent, would like update
      f ← f − α ∇f L(f)
• but update restricted in class of weak classifiers
      f ← f + α ht
• so choose ht “closest” to −∇f L(f)
• equivalent to AdaBoost
Benefits of Model Fitting View
• immediate generalization to other loss functions
• e.g. squared error for regression
• e.g. logistic regression (by only changing one line of
AdaBoost)
• sensible approach for converting output of boosting into
conditional probability estimates
• caveat: wrong to view AdaBoost as just an algorithm for minimizing exponential loss
• other algorithms for minimizing same loss will (provably) give very poor performance
• thus, this loss function cannot explain why AdaBoost “works”
Estimating Conditional Probabilities
[Friedman, Hastie & Tibshirani]
• often want to estimate probability that y = +1 given x
• AdaBoost minimizes (empirical version of):
      E_{x,y}[ e^(−y f(x)) ] = E_x[ P[y = +1|x] e^(−f(x)) + P[y = −1|x] e^(f(x)) ]
  where x, y random from true distribution
• over all f, minimized when
      f(x) = (1/2) · ln( P[y = +1|x] / P[y = −1|x] )
  or
      P[y = +1|x] = 1 / (1 + e^(−2f(x)))
• so, to convert f output by AdaBoost to probability estimate, use same formula
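In code, the conversion on the last line is just a logistic transform of the boosting score (a sketch):

```python
import numpy as np

def prob_pos(f_x):
    """Estimate P[y = +1 | x] from a boosting score f(x)
    using the formula above: 1 / (1 + exp(-2 f(x)))."""
    return 1.0 / (1.0 + np.exp(-2.0 * f_x))
```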
Calibration Curve
[figure: calibration curves for train and test data; x-axis = estimated probability, y-axis = observed fraction of positives]
• order examples by f value output by AdaBoost
• break into bins of size r
• for each bin, plot a point:
  • x-value: average estimated probability of examples in bin
  • y-value: actual fraction of positive examples in bin
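A sketch of this binning procedure (function and argument names are illustrative):

```python
import numpy as np

def calibration_curve(f_vals, y, bin_size):
    """Return one (avg estimated probability, fraction positive) point per bin.

    f_vals: numpy array of boosting scores f(x); y: labels in {-1, +1};
    bin_size plays the role of r above.
    """
    order = np.argsort(f_vals)                       # order examples by f value
    p_hat = 1.0 / (1.0 + np.exp(-2.0 * f_vals[order]))
    pos = (y[order] == 1).astype(float)
    points = []
    for start in range(0, len(order), bin_size):     # break into bins of size r
        sl = slice(start, start + bin_size)
        points.append((p_hat[sl].mean(), pos[sl].mean()))
    return points
```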
Other Ways to Think about AdaBoost
• dynamical systems
• statistical consistency
• maximum entropy
Experiments, Applications and Extensions
• basic experiments
• multiclass classification
• confidence-rated predictions
• text categorization / spoken-dialogue systems
• incorporating prior knowledge
• active learning
• face detection
Practical Advantages of AdaBoost
• fast
• simple and easy to program
• no parameters to tune (except T)
• flexible — can combine with any learning algorithm
• no prior knowledge needed about weak learner
• provably effective, provided can consistently find rough rules of thumb
→ shift in mind set — goal now is merely to find classifiers barely better than random guessing
• versatile
• can use with data that is textual, numeric, discrete, etc.
• has been extended to learning problems well beyond binary classification
Caveats
• performance of AdaBoost depends on data and weak learner
• consistent with theory, AdaBoost can fail if
  • weak classifiers too complex
→ overfitting
• weak classifiers too weak (γt → 0 too quickly)
→ underfitting
→ low margins → overfitting
• empirically, AdaBoost seems especially susceptible to uniform noise
UCI Experiments
• tested AdaBoost on UCI benchmarks
[with Freund]
• used:
• C4.5 (Quinlan’s decision tree algorithm)
• “decision stumps”: very simple rules of thumb that test
on single attributes
[figure: two example decision stumps, “eye color = brown?” and “height > 5 feet?”, each predicting +1 or −1 according to the yes/no answer]
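A weighted decision-stump learner of the kind described above might look like this (a sketch; the exhaustive threshold search and the interface are assumptions):

```python
import numpy as np

def best_stump(X, y, D):
    """Find the stump (feature j, threshold, sign) minimizing weighted error.

    The stump predicts sign if X[:, j] > thresh, else -sign.
    X: (m, n) array, y: labels in {-1, +1}, D: distribution summing to 1.
    """
    m, n = X.shape
    best, best_err = (0, X[0, 0], 1), np.inf
    for j in range(n):
        for thresh in np.unique(X[:, j]):
            pred = np.where(X[:, j] > thresh, 1, -1)
            err = D[pred != y].sum()
            # flipping the sign flips the weighted error to 1 - err
            for s, e in ((1, err), (-1, 1.0 - err)):
                if e < best_err:
                    best_err, best = e, (j, thresh, s)
    return best

def stump_predict(stump, X):
    j, thresh, s = stump
    return s * np.where(X[:, j] > thresh, 1, -1)
```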
UCI Results
[figure: scatter plots of test error on UCI benchmarks, C4.5 vs. boosting stumps and C4.5 vs. boosting C4.5]
Multiclass Problems
[with Freund]
• say y ∈ Y = {1, . . . , k}
• direct approach (AdaBoost.M1):
      ht : X → Y
      Dt+1(i) = (Dt(i)/Zt) × { e^(−αt)  if yi = ht(xi)
                               e^(αt)   if yi ≠ ht(xi) }
      Hfinal(x) = arg max_{y∈Y} Σ_{t: ht(x)=y} αt
• can prove same bound on error if ∀t: εt ≤ 1/2
  • in practice, not usually a problem for “strong” weak learners (e.g., C4.5)
  • significant problem for “weak” weak learners (e.g., decision stumps)
• instead, reduce to binary
Reducing Multiclass to Binary
[with Singer]
• say possible labels are {a, b, c, d, e}
• each training example replaced by five {−1, +1}-labeled examples:
               (x, a), −1
               (x, b), −1
    x, c  →    (x, c), +1
               (x, d), −1
               (x, e), −1
• predict with label receiving most (weighted) votes
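A sketch of this reduction, using the slide's toy label set wrapped in illustrative Python:

```python
def to_binary_examples(x, y, labels=("a", "b", "c", "d", "e")):
    """Replace one multiclass example (x, y) by one {-1, +1}-labeled
    example per possible label, as described above."""
    return [((x, label), +1 if label == y else -1) for label in labels]

# e.g. to_binary_examples("some utterance", "c") yields
# ((x, a), -1), ((x, b), -1), ((x, c), +1), ((x, d), -1), ((x, e), -1)
```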
AdaBoost.MH
• can prove:
      training error(Hfinal) ≤ (k/2) · Πt Zt
• reflects fact that small number of errors in binary predictors can cause overall prediction to be incorrect
• extends immediately to multi-label case (more than one correct label per example)
Using Output Codes
[with Allwein & Singer] [Dietterich & Bakiri]
• alternative: choose “code word” for each label
         π1   π2   π3   π4
    a    −    +    −    +
    b    −    +    +    −
    c    +    −    −    +
    d    +    −    +    +
    e    −    +    −    −
• each training example mapped to one example per column
              (x, π1), +1
    x, c  →   (x, π2), −1
              (x, π3), −1
              (x, π4), +1
• to classify new example x:
  • evaluate classifier on (x, π1), . . . , (x, π4)
  • choose label “most consistent” with results
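Choosing the label “most consistent” with the four binary predictions could be done by Hamming-style agreement with the code words above (a sketch, not the slides' exact rule):

```python
import numpy as np

# code words from the slide: one row per label, one entry per binary problem
CODE = {"a": [-1, +1, -1, +1],
        "b": [-1, +1, +1, -1],
        "c": [+1, -1, -1, +1],
        "d": [+1, -1, +1, +1],
        "e": [-1, +1, -1, -1]}

def decode(binary_preds):
    """Return the label whose code word agrees with the most predictions."""
    preds = np.asarray(binary_preds)
    return max(CODE, key=lambda lab: int(np.sum(preds == np.asarray(CODE[lab]))))
```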
Output Codes (cont.)
• training error bounds independent of # of classes
• overall prediction robust to large number of errors in binary
predictors
• but: binary problems may be harder
Ranking Problems
• other problems can also be handled by reducing to binary
• e.g.: want to learn to rank objects (say, movies) from examples
• can reduce to multiple binary questions of form:
“is or is not object A preferred to object B?”
• now apply (binary) AdaBoost
[with Freund, Iyer & Singer]
“Hard” Predictions Can Slow Learning
[figure: data separated by a line L]
• ideally, want weak classifier that says:
      h(x) = { +1 if x above L;  “don’t know” else }
• problem: cannot express using “hard” predictions
• if must predict ±1 below L, will introduce many “bad” predictions
  • need to “clean up” on later rounds
  • dramatically increases time to convergence
Confidence-rated Predictions
[with Singer]
• useful to allow weak classifiers to assign confidences to predictions
• formally, allow ht : X → R
      sign(ht(x)) = prediction
      |ht(x)| = “confidence”
• use identical update:
      Dt+1(i) = Dt(i) · exp(−αt yi ht(xi)) / Zt
  and identical rule for combining weak classifiers
• question: how to choose αt and ht on each round
Confidence-rated Predictions (cont.)
• saw earlier:
      training error(Hfinal) ≤ Πt Zt = (1/m) Σi exp( −yi Σt αt ht(xi) )
• therefore, on each round t, should choose αt ht to minimize:
      Zt = Σi Dt(i) exp(−αt yi ht(xi))
• in many cases (e.g., decision stumps), best confidence-rated weak classifier has simple form that can be found efficiently
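For weak classifiers that are constant on each block of a partition of X (e.g. the two leaves of a stump), Schapire & Singer '99 give a simple closed form; stated here for reference, since the slide does not spell it out:

```latex
% h(x) = c_j on block X_j, with weights of D_t falling in each block:
%   W_+^j = \sum_{i:\, x_i \in X_j,\ y_i = +1} D_t(i), \qquad W_-^j \text{ analogous}
Z_t = \sum_j \left( W_+^j e^{-c_j} + W_-^j e^{c_j} \right)
\ \text{is minimized at}\ 
c_j = \tfrac{1}{2}\ln\frac{W_+^j}{W_-^j},
\ \text{giving}\ 
Z_t = 2\sum_j \sqrt{W_+^j W_-^j}.
```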
Confidence-rated Predictions Help a Lot
[figure: test and train error vs. number of rounds, with and without confidence-rated predictions]

               round first reached
  % error     conf.      no conf.    speedup
    40          268        16,938       63.2
    35          598        65,292      109.2
    30        1,888       >80,000        –
Application: Boosting for Text Categorization
[with Singer]
• weak classifiers: very simple weak classifiers that test on simple patterns, namely, (sparse) n-grams
• find parameter αt and rule ht of given form which minimize Zt
• use efficiently implemented exhaustive search
• “How may I help you” data:
• 7844 training examples
• 1000 test examples
• categories: AreaCode, AttService, BillingCredit, CallingCard,
Collect, Competitor, DialForMe, Directory, HowToDial, PersonToPerson, Rate, ThirdNumber, Time, TimeCharge, Other.
Weak Classifiers
[figure: for each round, the term tested and its weighted vote for each category (AC, AS, BC, CC, CO, CM, DM, DI, HO, PP, RA, 3N, TI, TC, OT)]
  rnd  term
   1   collect
   2   card
   3   my home
   4   person ? person
   5   code
   6   I
More Weak Classifiers
[figure: weighted votes per category, as above]
  rnd  term
   7   time
   8   wrong number
   9   how
  10   call
  11   seven
  12   trying to
  13   and
More Weak Classifiers
[figure: weighted votes per category, as above]
  rnd  term
  14   third
  15   to
  16   for
  17   charges
  18   dial
  19   just
Finding Outliers
examples with most weight are often outliers (mislabeled and/or ambiguous)
• I’m trying to make a credit card call (Collect)
• hello (Rate)
• yes I’d like to make a long distance collect call please (CallingCard)
• calling card please (Collect)
• yeah I’d like to use my calling card number (Collect)
• can I get a collect call (CallingCard)
• yes I would like to make a long distant telephone call
and have the charges billed to another number
(CallingCard DialForMe)
• yeah I can not stand it this morning I did oversea
call is so bad (BillingCredit)
• yeah special offers going on for long distance
(AttService Rate)
• mister allen please william allen (PersonToPerson)
• yes ma’am I I’m trying to make a long distance call to
a non dialable point in san miguel philippines
(AttService Other)
. . .
Application: Human-computer Spoken Dialogue
[with Rahim, Di Fabbrizio, Dutton, Gupta, Hollister & Riccardi]
• application: automatic “store front” or “help desk” for AT&T Labs’ Natural Voices business
• caller can request demo, pricing information, technical support, sales agent, etc.
• interactive dialogue
How It Works
[diagram: the human's speech goes to an automatic speech recognizer, which passes raw text to the natural language understanding component; its predicted category goes to the dialogue manager, whose text response is returned to the human via text-to-speech]
• NLU’s job: classify caller utterances into 24 categories (demo, sales rep, pricing info, yes, no, etc.)
• weak classifiers: test for presence of word or phrase
Need for Prior, Human Knowledge
[with Rochery, Rahim & Gupta]
• building NLU: standard text categorization problem
• need lots of data, but for cheap, rapid deployment, can’t wait
for it
• bootstrapping problem:
• need labeled data to deploy
• need to deploy to get labeled data
• idea: use human knowledge to compensate for insufficient data
• modify loss function to balance fit to data against fit to prior model
Results: AP-Titles
[figure: % error rate vs. # training examples for “data only”, “knowledge only” and “data + knowledge”]
Results: Helpdesk
[figure: classification accuracy vs. # training examples for “data”, “knowledge” and “data + knowledge”]
Problem: Labels are Expensive
• for spoken-dialogue task
  • getting examples is cheap
  • getting labels is expensive
    • must be annotated by humans
• how to reduce number of labels needed?
Active Learning
• idea:
  • use selective sampling to choose which examples to label
  • focus on least confident examples [Lewis & Gale]
• for boosting, use (absolute) margin |f(x)| as natural confidence measure [Abe & Mamitsuka]
Labeling Scheme
• start with pool of unlabeled examples
• choose (say) 500 examples at random for labeling
• run boosting on all labeled examples
  • get combined classifier f
• pick (say) 250 additional examples from pool for labeling
  • choose examples with minimum |f(x)|
• repeat
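A sketch of this selective-sampling loop (the train_boost / oracle_label interfaces and the parameter defaults are illustrative assumptions):

```python
import numpy as np

def active_boosting(pool_X, oracle_label, train_boost, rounds,
                    init=500, batch=250):
    """Label a random seed set, then repeatedly boost and request labels
    for the examples with smallest |f(x)| (least confident)."""
    idx = list(np.random.choice(len(pool_X), init, replace=False))
    y = np.array([oracle_label(i) for i in idx])
    for _ in range(rounds):
        f = train_boost(pool_X[idx], y)          # returns real-valued score f(x)
        rest = np.setdiff1d(np.arange(len(pool_X)), idx)
        conf = np.abs(f(pool_X[rest]))           # |f(x)| = confidence
        new = rest[np.argsort(conf)[:batch]]     # least confident examples
        idx.extend(new.tolist())
        y = np.array([oracle_label(i) for i in idx])
    return train_boost(pool_X[idx], y)
```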
Results: How-May-I-Help-You?
[figure: % error rate vs. # labeled examples for random vs. active selection]

               first reached
  % error    random      active    % label savings
    28       11,000       5,500          50
    26       22,000       9,500          57
    25       40,000      13,000          68
Results: Letter
[figure: % error rate vs. # labeled examples for random vs. active selection]

               first reached
  % error    random      active    % label savings
    10        3,500       1,500          57
     5        9,000       2,750          69
     4       13,000       3,500          73
Application: Detecting Faces
• problem: find faces in photograph or movie
[Viola & Jones]
• weak classifiers: detect light/dark rectangles in image
• many clever tricks to make extremely fast and accurate
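One of those tricks is the integral image, which makes any rectangle sum cost four array lookups; a sketch of the idea (not Viola & Jones' actual code):

```python
import numpy as np

def integral_image(img):
    """Cumulative sums over rows and columns of a grayscale image."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom, left:right] using the integral image ii."""
    total = ii[bottom - 1, right - 1]
    if top > 0:
        total -= ii[top - 1, right - 1]
    if left > 0:
        total -= ii[bottom - 1, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

# a light/dark two-rectangle feature is then just the difference of two rect_sum calls
```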
Conclusions
• boosting is a practical tool for classification and other learning problems
• grounded in rich theory
• performs well experimentally
• often (but not always!) resistant to overfitting
• many applications and extensions
• many ways to think about boosting
• none is entirely satisfactory by itself,
but each useful in its own way
• considerable room for further theoretical and
experimental work
References
• Ron Meir and Gunnar Rätsch.
An Introduction to Boosting and Leveraging.
In Advanced Lectures on Machine Learning (LNAI2600), 2003. http://www.boosting.org/papers/MeiRae03.pdf
• Robert E. Schapire.
The boosting approach to machine learning: An overview.
In MSRI Workshop on Nonlinear Estimation and Classification, 2002. http://www.cs.princeton.edu/∼schapire/boost.html