Supervised Learning

Linear Regression

Define the set- up of Supervised Learning

Copyright By cscodehelp代写 加微信 cscodehelp

Discuss basic regression models

Supervised Learning

|The set-up: the given training data consist of

|Consider two types of problems: -Regression: y continuous

-Classification: y is discrete, e.g., class labels.

The Task of Regression

|Given: A training set of n samples

|To learn a model for predicting y for any new sample x.

|A simple model is linear regression: modeling the relation between y and x via a linear function.

y ≈ 𝑤 + 𝑤 𝑥 + … + 𝑤 𝑥 = 𝐰tx 011𝑑𝑑

(Note: x is augmented by adding a dimension of constant 1 to the original sample.)

Linear Regression

|We can introduce an error term to capture the residual 𝐲 = 𝐰tx + e

|Applying this to all n samples, we have: 𝐲 = 𝐗𝐰+𝐞

|Learning in this case is to figure out a good w.

Linear Regression (cont’d)

|Find an optimal w by minimizing the squared error

||e||2 = ||y − 𝑋 w||2 |The solution can be found to be:

wෝ = 𝑋 𝑡 𝑋 − 1 𝑋 𝑡 y

|In practice, some iterative approaches may be used (e.g., gradient descent search).

A simple example

Generalizing Linear Regression

|Introducing some basis functions 𝝓𝒋(𝐱):

𝑦=𝑤+𝑤𝜙 𝐱+…+𝑤 𝜙

0 1 1 𝑀−1 𝑀−1

➢ Blue: Linear Regression

➢Red: With 𝜙𝑗(𝑥) = 𝑥𝑗

Regularized Least Squares

|E.g., use a new error function: 𝑬𝑫 𝒘 -𝜆 is the regularization coefficient

-𝐸 𝐰 is the data-dependent error 𝐷

-𝐸 𝐰 is the 𝑟𝑒𝑔𝑢𝑙𝑎𝑟𝑖𝑧𝑎𝑡𝑖𝑜𝑛 𝑡𝑒𝑟𝑚, e.g., 𝑬 𝐖𝐖

|Help to alleviate overfitting.

Supervised Learning

Density Estimation in Supervised Learning

Illustrate classification in Supervised Learning

Discuss basic density estimation techniques

Supervised Learning

|The set-up: the given training data consist of

|Consider two types of problems: -Regression: y continuous

-Classification: y is discrete, e.g., class labels.

Examples of Image Classification

|The MNIST training |The Extended Yale B images of hand-written Face Images

How do we model the training images?

|Parametric: each class of images (the feature vectors) may be modeled by a density function pθ(x) with parameter θ.

-To emphasize the density is for images from class/label y, we may write pθ(x|y).

-We may also use the notation p(x|θ), if the discussion is true for any y.

➔ How to estimate θ from the training images?

|Note: We may also consider non-parametric approaches.

MLE for Density Estimation (1/3)

x xxxxxxx xx

|Given some training data; Assuming a parametric model p(x|θ); What specific θ will fit/explain the data best?

-E.g., Consider a simple 1-D normal density with only a parameter μ (assuming the variance is known)

|Given a sample xi, p(xi | μ) gives an indication of how likely xi is from p(xi |μ)

➔the concept of the likelihood function.

MLE for Density Estimation (2/3)

|The likelihood function: the density function p(x|θ) evaluated at the given data sample xi, and viewed as a function of the parameter θ.

-Assessing how likely the parameter θ (defining the corresponding p(x|θ)) gives arise to the sample xi.

-We often use L(θ) to denote the likelihood function, and l(θ) = log(L(θ)) is called the log-likelihood.

|Maximum Likelihood Estimation (MLE): Finding the parameter that maximizes the likelihood

𝛉 = argmax𝛉p(x|θ)

MLE for Density Estimation (3/3)

|How to consider all the given samples D={xi, i=1,…,n} ?

|The concept of i.i.d. samples: the samples are assumed to be independent and identically distributed

|So, the data likelihood is given by 𝐿𝛉=𝑃𝐷𝛉=

MLE Example 1

|Tossing a coin for n times, observing n1 times for head.

-Estimate the probability θ for head

|The likelihood function is:

𝐿𝜃 =𝑃𝐷𝜃 =𝜃𝑛1 1−𝜃𝑛−𝑛1

MLE Example 1 (cont’d)

|We want to find what θ maximizes this likelihood, or equivalently, the log-likelihood

𝑙𝜃 =log𝑃𝐷𝜃 =log𝜃𝑛1 1−𝜃𝑛−𝑛1 =⋯

|Take the derivative and set to 0: 𝑑𝑙𝜃=0

|This will give us:

MLE Example 2

|Given n i.i.d. samples {xi} from the 1-D normal distribution N(𝝁, σ2), find the MLE for 𝝁 𝐚𝐧𝐝 σ2

|The likelihood function is:

1 𝑛 𝑛 −𝑥𝑖−𝜇2

𝐿𝜇,𝜎=𝑝𝐷𝜇,𝜎=

|The log-likelihood is: 𝑙𝜇,𝜎 =log𝑃𝐷𝜇,𝜎

= log 1 𝑛 ς𝑛

ෑ 𝑒 2𝜎2 𝑖=1

=−𝑛log 𝜎 2𝜋 −σ𝑛 𝑥𝑖−𝜇 2 𝑖=1 2𝜎2

𝑒− 𝑥𝑖−𝜇 2 2𝜎2

MLE Example 2 (cont’d)

|The MLE solution for 𝝁 𝜇Ƹ = argmax𝜇𝑙 𝜇,𝜎

=argmax {−𝑛log 𝜎 2𝜋 −σ𝑛 𝑥𝑖−𝜇 2} 𝜇 𝑖=1 2𝜎2

| Set the derivative to 0: 𝜕𝑙𝜇,𝜎 =0

| The solution is:

σ𝑛 𝑥𝑖 𝜇Ƹ= 𝑖=1

MLE Example 2 (cont’d)

|The MLE solution for 𝝁 𝜎ො=argmax𝜎𝑙 𝜇,𝜎

=argmax {−𝑛log 𝜎 2𝜋 −σ𝑛 𝑥𝑖−𝜇 2} 𝜎 𝑖=1 2𝜎2

| Set the derivative to 0: 𝜕𝑙𝜇,𝜎 =0

| The solution is:

σ𝑛 𝑥𝑖 − 𝜇 2 𝑖=1

Supervised Learning

Generative vs Discriminative Models in Supervised Learning

Differentiate between generative and discriminative models of supervised learning

Discuss challenges in Bayesian learning

Supervised Learning

|The set-up: the given training data consist of

-E.g., Given n pairs

|Equivalently, to find P(y|x)

Two Types of Models

|Generative Model -P(y|x) ∝ P(y) p(x|y)’ -➔To learn P(y) and p(x|y).

|Discriminative Model -Directly learn P(y|x)

-No assumption made on p(x|y)

Two Types of Models

|Generative Model

-P(y|x) ∝ P(y) p(x|y)’

-➔To learn P(y) and p(x|y).

-➔Bayesian learning, Bayes classifiers.

-Example: Naïve Bayes Classifier

|Discriminative Model

-Directly learn P(y|x)

-No assumption made on p(x|y)

-Example: Logistic Regression

Practical Difficulty of Bayesian Learning

|Consider doing Bayesian learning without making simplifying assumptions.

-Given n training pairs

-We need to learn P(y) and p(x|y)

➔ p(x|y) can be very difficult to estimate:

➔Consider a very simple case: binary features, and y is also binary. How many probabilities do we need to estimate?

Supervised Learning

Naïve Bayes Classifier

Implement the fundamental learning algorithm Naive Bayes

Naïve Bayesian Classifier

|The “naive” conditional independence assumption: each feature is (conditionally) independent of every other feature, given the label, i.e., p(xi| {xj for any j≠ 𝒊},y) = p(xi|y)

|How does this assumption simplify the problem? -Consider the previous example again: d-dimensional

binary features, and y is also binary.

-How many probabilities do we need to estimate now?

p(x|y) = p(x1, x2, …, xd |y) = …

Naïve Bayesian Classifier (cont’d)

|The naïve Bayes classifier: the predicted label is given by

𝑦ො = argmax P(y) ෑ 𝑝(x𝑖 |𝑦) 𝑦 𝑖=1

|“Parameters” of the classifier:

-p(xi |y) for all i, y

Naïve Bayesian Classifier (cont’d)

|E.g., estimating the “parameters” of the classifier – P(y) & p(xi |y) for all i, y –

for the following familiar example

Discrete Feature Example

|x =

|In this case, the “parameters” of the classifier are

-P(xi =vk|y), for all i, k, and y

|Given: A training set of n labelled samples

➔How to estimate the model parameters?

Discrete Feature Example (cont’d)

|Given: A training set of n labelled samples

➔How to estimate the model parameters? P(y) =

P(xi =vk|y)=

|These are in fact the MLE solutions for the corresponding parameters.

Supervised Learning

Logistic Regression

Implement the fundamental learning algorithm Logistic Regression

Discriminative Model: Example

|Again, we are given a training set of n labelled samples

|Why not directly model/learn P(y|x)? -Discriminative model

|Further assume P(y|x) takes the form of a logistic sigmoid function

→Logistic Regression

Logistic Regression

|Logistic regression: use the logistic function for modeling P(y|x), considering only the case of 𝐲 ∈

𝑃 𝑦 = 0|𝐱 = 𝑃𝑦=1|𝐱=

1+exp𝑤0+σ𝑑 𝑤𝑖𝑥𝑖 𝑖=1

exp𝑤+σ𝑑 𝑤𝑥 0 𝑖=1 𝑖𝑖

1+exp𝑤0+σ𝑑 𝑤𝑖𝑥𝑖 𝑖=1

|The logistic function

𝜎𝑡=1 =𝑒𝑡 1+𝑒 −𝑡 1+𝑒 𝑡

Logistic Regression→Linear Classifier

|Given a sample x, we classify it as 0 (i.e., predicting y=0) if

P(y=0|x) ≥ P(y=1|x)

➔This is a linear classifier.

The Parameters of the Model

|What are the model parameters in logistic regression?

|Given a parameter w, we have P(y|x) =

|Suppose we have two different sets of parameters, w(1) and w(2), whichever giving a larger P(y|x) should be a better parameter.

The Conditional Likelihood

|Given n training samples,

➔For a given w, the probability of getting all those y(1), y(2) …,y(n) from the corresponding data x(i), i=1,…,n, is

➔Call this L(w), the (conditional) likelihood.

The Conditional Log Likelihood

Maximizing Conditional Log Likelihood

|Optimal parameters 𝐰∗ = argmax𝐰𝑙 𝐰

= argmax σ𝑛 [𝑦(𝑖)𝐰𝑡𝐱(𝑖) − log 1 + exp 𝐰𝑡𝐱(𝑖) ] 𝐰 𝑖=1

|We cannot really solve for w* analytically (no closed-form solution)

– We can use a commonly-used optimization technique, gradient descent/ascent, to find a solution.

Finding the Gradient of l(w)

Gradient Ascent Algorithm

The algorithm

Iterate until converge

𝐰(𝑘+1) =𝐰(𝑘) +𝜂𝛻 𝑘 𝑙 𝐰 𝐰

𝜂 > 0 is a constant called the learning rate.

程序代写 CS代考 加微信: cscodehelp QQ: 2235208643 Email: kyit630461@163.com