# CS作业代写 Supervised Learning – cscodehelp代写

Supervised Learning
Linear Regression

Define the set- up of Supervised Learning

Discuss basic regression models

Supervised Learning
|The set-up: the given training data consist of pairs, or (x, y); the objective of learning is to figure out a way to predict label y for any new sample x.
|Consider two types of problems: -Regression: y continuous
-Classification: y is discrete, e.g., class labels.

|Given: A training set of n samples where y(i) is a continuous “label” (or target value) for x(i)
|To learn a model for predicting y for any new sample x.
|A simple model is linear regression: modeling the relation between y and x via a linear function.
y ≈ 𝑤 + 𝑤 𝑥 + … + 𝑤 𝑥 = 𝐰tx 011𝑑𝑑
(Note: x is augmented by adding a dimension of constant 1 to the original sample.)

Linear Regression
|We can introduce an error term to capture the residual 𝐲 = 𝐰tx + e
|Applying this to all n samples, we have: 𝐲 = 𝐗𝐰+𝐞
|Learning in this case is to figure out a good w.

Linear Regression (cont’d)
|Find an optimal w by minimizing the squared error
||e||2 = ||y − 𝑋 w||2 |The solution can be found to be:
wෝ = 𝑋 𝑡 𝑋 − 1 𝑋 𝑡 y
|In practice, some iterative approaches may be used (e.g., gradient descent search).

A simple example

Generalizing Linear Regression
|Introducing some basis functions 𝝓𝒋(𝐱):
𝑦=𝑤+𝑤𝜙 𝐱+…+𝑤 𝜙
0 1 1 𝑀−1 𝑀−1
➢ Blue: Linear Regression
➢Red: With 𝜙𝑗(𝑥) = 𝑥𝑗

Regularized Least Squares
|E.g., use a new error function: 𝑬𝑫 𝒘 -𝜆 is the regularization coefficient
-𝐸 𝐰 is the data-dependent error 𝐷
-𝐸 𝐰 is the 𝑟𝑒𝑔𝑢𝑙𝑎𝑟𝑖𝑧𝑎𝑡𝑖𝑜𝑛 𝑡𝑒𝑟𝑚, e.g., 𝑬 𝐖𝐖
|Help to alleviate overfitting.

Supervised Learning
Density Estimation in Supervised Learning

Illustrate classification in Supervised Learning
Discuss basic density estimation techniques

Supervised Learning
|The set-up: the given training data consist of pairs, or (x, y); the objective of learning is to figure out a way to predict label y for any new sample x.
|Consider two types of problems: -Regression: y continuous
-Classification: y is discrete, e.g., class labels.

Examples of Image Classification
|The MNIST training |The Extended Yale B images of hand-written Face Images

How do we model the training images?
|Parametric: each class of images (the feature vectors) may be modeled by a density function pθ(x) with parameter θ.
-To emphasize the density is for images from class/label y, we may write pθ(x|y).
-We may also use the notation p(x|θ), if the discussion is true for any y.
➔ How to estimate θ from the training images?
|Note: We may also consider non-parametric approaches.

MLE for Density Estimation (1/3)
x xxxxxxx xx
|Given some training data; Assuming a parametric model p(x|θ); What specific θ will fit/explain the data best?
-E.g., Consider a simple 1-D normal density with only a parameter μ (assuming the variance is known)
|Given a sample xi, p(xi | μ) gives an indication of how likely xi is from p(xi |μ)
➔the concept of the likelihood function.

MLE for Density Estimation (2/3)
|The likelihood function: the density function p(x|θ) evaluated at the given data sample xi, and viewed as a function of the parameter θ.
-Assessing how likely the parameter θ (defining the corresponding p(x|θ)) gives arise to the sample xi.
-We often use L(θ) to denote the likelihood function, and l(θ) = log(L(θ)) is called the log-likelihood.
|Maximum Likelihood Estimation (MLE): Finding the parameter that maximizes the likelihood
𝛉 = argmax𝛉p(x|θ)

MLE for Density Estimation (3/3)
|How to consider all the given samples D={xi, i=1,…,n} ?
|The concept of i.i.d. samples: the samples are assumed to be independent and identically distributed
|So, the data likelihood is given by 𝐿𝛉=𝑃𝐷𝛉=

MLE Example 1
|Tossing a coin for n times, observing n1 times for head.
-Estimate the probability θ for head
|The likelihood function is:
𝐿𝜃 =𝑃𝐷𝜃 =𝜃𝑛1 1−𝜃𝑛−𝑛1

MLE Example 1 (cont’d)
|We want to find what θ maximizes this likelihood, or equivalently, the log-likelihood
𝑙𝜃 =log𝑃𝐷𝜃 =log𝜃𝑛1 1−𝜃𝑛−𝑛1 =⋯
|Take the derivative and set to 0: 𝑑𝑙𝜃=0
|This will give us:

MLE Example 2
|Given n i.i.d. samples {xi} from the 1-D normal distribution N(𝝁, σ2), find the MLE for 𝝁 𝐚𝐧𝐝 σ2
|The likelihood function is:
1 𝑛 𝑛 −𝑥𝑖−𝜇2
𝐿𝜇,𝜎=𝑝𝐷𝜇,𝜎=
|The log-likelihood is: 𝑙𝜇,𝜎 =log𝑃𝐷𝜇,𝜎
= log 1 𝑛 ς𝑛
ෑ 𝑒 2𝜎2 𝑖=1
=−𝑛log 𝜎 2𝜋 −σ𝑛 𝑥𝑖−𝜇 2 𝑖=1 2𝜎2
𝑒− 𝑥𝑖−𝜇 2 2𝜎2

MLE Example 2 (cont’d)
|The MLE solution for 𝝁 𝜇Ƹ = argmax𝜇𝑙 𝜇,𝜎
=argmax {−𝑛log 𝜎 2𝜋 −σ𝑛 𝑥𝑖−𝜇 2} 𝜇 𝑖=1 2𝜎2
| Set the derivative to 0: 𝜕𝑙𝜇,𝜎 =0
| The solution is:
σ𝑛 𝑥𝑖 𝜇Ƹ= 𝑖=1

MLE Example 2 (cont’d)
|The MLE solution for 𝝁 𝜎ො=argmax𝜎𝑙 𝜇,𝜎
=argmax {−𝑛log 𝜎 2𝜋 −σ𝑛 𝑥𝑖−𝜇 2} 𝜎 𝑖=1 2𝜎2
| Set the derivative to 0: 𝜕𝑙𝜇,𝜎 =0
| The solution is:
σ𝑛 𝑥𝑖 − 𝜇 2 𝑖=1

Supervised Learning
Generative vs Discriminative Models in Supervised Learning

Differentiate between generative and discriminative models of supervised learning
Discuss challenges in Bayesian learning

Supervised Learning
|The set-up: the given training data consist of pairs, or (x, y); the objective of learning is to figure out a way to predict label y for any new sample x.
-E.g., Given n pairs , i=1, …, n; x(i) : i-th sample represented as d-dimensional vectors; y(i) : corresponding labels.
|Equivalently, to find P(y|x)

Two Types of Models
|Generative Model -P(y|x) ∝ P(y) p(x|y)’ -➔To learn P(y) and p(x|y).
|Discriminative Model -Directly learn P(y|x)

Two Types of Models
|Generative Model
-P(y|x) ∝ P(y) p(x|y)’
-➔To learn P(y) and p(x|y).
-➔Bayesian learning, Bayes classifiers.
-Example: Naïve Bayes Classifier
|Discriminative Model
-Directly learn P(y|x)
-Example: Logistic Regression

Practical Difficulty of Bayesian Learning
|Consider doing Bayesian learning without making simplifying assumptions.
-Given n training pairs , i=1, …, n. Each x(i) is d- dimensional.
-We need to learn P(y) and p(x|y)
➔ p(x|y) can be very difficult to estimate:
➔Consider a very simple case: binary features, and y is also binary. How many probabilities do we need to estimate?

Supervised Learning
Naïve Bayes Classifier

Implement the fundamental learning algorithm Naive Bayes

Naïve Bayesian Classifier
|The “naive” conditional independence assumption: each feature is (conditionally) independent of every other feature, given the label, i.e., p(xi| {xj for any j≠ 𝒊},y) = p(xi|y)
|How does this assumption simplify the problem? -Consider the previous example again: d-dimensional
binary features, and y is also binary.
-How many probabilities do we need to estimate now?
p(x|y) = p(x1, x2, …, xd |y) = …

Naïve Bayesian Classifier (cont’d)
|The naïve Bayes classifier: the predicted label is given by
𝑦ො = argmax P(y) ෑ 𝑝(x𝑖 |𝑦) 𝑦 𝑖=1
|“Parameters” of the classifier:
-p(xi |y) for all i, y

Naïve Bayesian Classifier (cont’d)
|E.g., estimating the “parameters” of the classifier – P(y) & p(xi |y) for all i, y –
for the following familiar example

Discrete Feature Example
|x = where each xi can take only a finite number of values from {v1, v2, …, vm}:
|In this case, the “parameters” of the classifier are
-P(xi =vk|y), for all i, k, and y
|Given: A training set of n labelled samples , i=1, …, n
➔How to estimate the model parameters?

Discrete Feature Example (cont’d)
|Given: A training set of n labelled samples , i=1, …, n
➔How to estimate the model parameters? P(y) =
P(xi =vk|y)=
|These are in fact the MLE solutions for the corresponding parameters.

Supervised Learning
Logistic Regression

Implement the fundamental learning algorithm Logistic Regression

Discriminative Model: Example
|Again, we are given a training set of n labelled samples
|Why not directly model/learn P(y|x)? -Discriminative model
|Further assume P(y|x) takes the form of a logistic sigmoid function
→Logistic Regression

Logistic Regression
|Logistic regression: use the logistic function for modeling P(y|x), considering only the case of 𝐲 ∈
𝑃 𝑦 = 0|𝐱 = 𝑃𝑦=1|𝐱=
1+exp𝑤0+σ𝑑 𝑤𝑖𝑥𝑖 𝑖=1
exp𝑤+σ𝑑 𝑤𝑥 0 𝑖=1 𝑖𝑖
1+exp𝑤0+σ𝑑 𝑤𝑖𝑥𝑖 𝑖=1
|The logistic function
𝜎𝑡=1 =𝑒𝑡 1+𝑒 −𝑡 1+𝑒 𝑡

Logistic Regression→Linear Classifier
|Given a sample x, we classify it as 0 (i.e., predicting y=0) if
P(y=0|x) ≥ P(y=1|x)
➔This is a linear classifier.

The Parameters of the Model
|What are the model parameters in logistic regression?
|Given a parameter w, we have P(y|x) =
|Suppose we have two different sets of parameters, w(1) and w(2), whichever giving a larger P(y|x) should be a better parameter.

The Conditional Likelihood
|Given n training samples, , i=1,…,n, how can we use them to estimate the parameters?
➔For a given w, the probability of getting all those y(1), y(2) …,y(n) from the corresponding data x(i), i=1,…,n, is
➔Call this L(w), the (conditional) likelihood.

The Conditional Log Likelihood

Maximizing Conditional Log Likelihood
|Optimal parameters 𝐰∗ = argmax𝐰𝑙 𝐰
= argmax σ𝑛 [𝑦(𝑖)𝐰𝑡𝐱(𝑖) − log 1 + exp 𝐰𝑡𝐱(𝑖) ] 𝐰 𝑖=1
|We cannot really solve for w* analytically (no closed-form solution)
– We can use a commonly-used optimization technique, gradient descent/ascent, to find a solution.