# 程序代写 COMP9417 Machine Learning and Data Mining Term 2, 2022 – cscodehelp代写

Regression (2)

COMP9417 Machine Learning and Data Mining Term 2, 2022

COMP9417 ML & DM Regression (2) Term 2, 2022 1 / 40

Acknowledgements

Copyright By cscodehelp代写 加微信 cscodehelp

Material derived from slides for the book

“Elements of Statistical Learning (2nd Ed.)” by T. Hastie, R. Tibshirani & J. Friedman. Springer (2009) http://statweb.stanford.edu/~tibs/ElemStatLearn/

Material derived from slides for the book

“Machine Learning: A Probabilistic Perspective” by P. IT Press (2012)

http://www.cs.ubc.ca/~murphyk/MLbook

Material derived from slides for the book “Machine Learning” by P. Flach Cambridge University Press (2012) http://cs.bris.ac.uk/~flach/mlbook

Material derived from slides for the book

“Bayesian Reasoning and Machine Learning” by D. University Press (2012) http://www.cs.ucl.ac.uk/staff/d.barber/brml

Material derived from slides for the book “Machine Learning” by T. Graw-Hill (1997) http://www-2.cs.cmu.edu/~tom/mlbook.html

Material derived from slides for the course “Machine Learning” by A. Srinivasan BITS Pilani, Goa, India (2016)

COMP9417 ML & DM

Regression (2)

Term 2, 2022

Optimization by gradient descent

Optimization by gradient descent

COMP9417 ML & DM Regression (2) Term 2, 2022 3 / 40

Optimization by gradient descent

Optimization

Basic problem1:

minimize f (x) x

subject to x 2 X

• The problem is to find some x which is called a design point, which is

a p-dimensional vector of values for p design variables.

• These values are manipulated by an optimization algorithm to find a

minimizer, a solution x∗ that minimizes the objective function f.

• A solution must be an element of the feasible set X, and may be

subject to additional constraints, called constrained optimization.

• Optimization is widely studied and applied in many fields such as

engineering, science, economics, . . .

1See: Kochenderfer and Wheeler (2019).

COMP9417 ML & DM Regression (2) Term 2, 2022 4 / 40

design points

Optimization by gradient descent

Optimization

The formulation is general, for constrained or unconstrained optimization, since we can always replace a problem where we want to maximize the objective function f:

maximize f(x) subject to x

by an equivalent minimization expression:

minimize f(x) subject to x

COMP9417 ML & DM

Regression (2)

Term 2, 2022

P” ” t.MY#flectiion

Min . 3¥ at a

first- order FONC second – order SONC

minimum if d) f’ t

(but not ,

” “⇔≥o}÷÷÷

Optimization by gradient descent

Optimization

Usually, we would like an optimization algorithm t-o quickly reach an answer that is close to being the right one.

There are many possible approaches to optimization. In machine learning methods based on finding the derivative, or gradient, are widely used.

• typically, need to minimize a function • e.g.,

• optimization is known as gradient descent or steepest descent • sometimes, need to maximize a function

• optimization is known as gradient ascent or steepest ascent

Requires function to be differentiable.

COMP9417 ML & DM Regression (2) Term 2, 2022 6 / 40

error or loss

probability or likelihood

Optimization by gradient descent

What is an optimization algorithm ?

• many kinds of optimization algorithm

• depends on objective function, constraints, data

• approaches based on derivatives or gradients are widely used • derivatives provide the direction of the search for a minimizer • may be obtained analytically or estimated numerically

• can also used automatic differentiation

• not all approaches require derivatives . . .

COMP9417 ML & DM Regression (2) Term 2, 2022 7 / 40

Optimization by gradient descent

Optimization

A general iterative algorithm 2 to optimise some function f:

1 start with initial point x = x0

2 select a search direction g, usually to decrease f(x)

3 select a step length ⌘ learning rate

5 setx=x+s

6 go to step 2, unless convergence criteria are met

For example, could minimize a real-valued function f : Rn ! R Note: convergence criteria will be problem-specific.

2See: Ripley (1996).

COMP9417 ML & DM Regression (2) Term 2, 2022 8 / 40

Optimization by gradient descent

Least-Squares as Loss Minimization

• we saw before that the minimum value can be obtained analytically

by the usual process of differentiating and equating to 0

• an alternative is to apply an iterative gradient descent approach

• with MSE as a loss function, given data (x1, y1), . . . , (xn, yn)

Loss(✓) = n

• notethatLossisafunctionof✓,andyˆ=f(x),sohere✓isthe

• consider MSE as a loss function

• this is what we want to minimize in OLS regression

• finding the least-squPares solution is in effect finding the value of a and

b that minimizes 1 n (y yˆ)2, where n i=1 i i

parameters a and b = atbsci COMP9417 ML & DM Regression (2)

Term 2, 2022

(yi fθ(xi))2 iθi

yˆ = a + b x ii

Optimization by gradient descent

Least-Squares as Loss Minimization

• the gradient descent alternative to the analytical approach is to take (small) steps that decreases the value of the function to be

minimised, stopping when we reach a minimum

• recall that at a point the gradient vector points in the direction of greatest increase of a function. So, the opposite direction to the gradient vector gives the direction of greatest decrease

• for univariate linear regression, suppose gb, ga give the gradient, then the iterative steps are:

•bi+1=bi-⌘⇥gb 0, . .. , i-7, i, it1, … • ai+1 =ai -⌘⇥ga

• Stop when bi+1 ⇡bi and ai+1 ⇡ai

are the steps in iteration

COMP9417 ML & DM Regression (2) Term 2, 2022 10 / 40

-¥” I:f i. i

” it 1 2+7,, Dci

Multivariate linear regression

Multivariate linear regression

COMP9417 ML & DM Regression (2) Term 2, 2022 11 / 40

Multivariate linear regression

Many variables

• Often, we are interesting in modelling the relationship of Y to several other variables

• In observational studies, the value of Y may be affected by the values of several variables. For example, carcinogenicity may be gender-specific. A regression model that ignores gender may find that carcinogenicity to be related to some surrogate variable (height, for example)3

• Including more variables can give a narrower confidence interval on the prediction being made

• However, more complex models are not always better

3A surrogate variable is a variable for which we have data that replaces one that cannot be observed, or is otherwise not in the dataset.

COMP9417 ML & DM Regression (2) Term 2, 2022 12 / 40

Multivariate linear regression

Multivariate linear model

• We now have multiple variables to estimate a model of the form Y =0 +1X1 +2X2 +···+pXp

• As before, this linear model is estimated from a sample by the equation yˆ = b0 + b1x1 + b2x2 + · · · + bpxp

• With many variables, we often need to ensure the regression does not overfit the training data, so we control the bi by regularisation

COMP9417 ML & DM Regression (2) Term 2, 2022 13 / 40

COMP9417 ML & DM Regression (2) Term 2, 2022 14 / 40

Regularisation

Parameter Estimation by Optimization

Regularisation is a general method to avoid overfitting by applying additional constraints to the weight vector. A common approach is to make sure the weights are, on average, small in magnitude: this is referred to as shrinkage.

Recall the setting for regression in terms of loss minimization.

• Can add penalty terms to a loss function, forcing coefficients to

shrink to zero

fθ0,θ1,…,θp (X1, X2, . . . , Xp) = ↑↑ ↑ ↑ ↑ ↑

Regression (2)

9.” large”

Term 2, 2022

Y ” sensitive to” Oj

COMP9417 ML & DM

Regularisation

Parameter Estimation by Optimization

• MSE as a loss function, given data 1 Xn

(yifθ(xi))2

L(✓) = n and with a penalty function:

” ridge” ←↳

sum: error+ size of 0’s

1 L(✓) = n

Xt e r m s 2 ✓ 2

e r r o ror: and

(yifθ(xi)) + ✓ 0

(x1,y1),…,(xn,yn):

(yi fθ(xi))2 +

• Parameter estimation by optimisation will attempt to find values for

✓0, ✓1, . . . , ✓p s.t. loss L(✓) is minimized variable selection COMP9417 ML & DM Regression (2) Term 2, 2022 16 / 40

complexity

lasso ” Ll

regression

The multivariate least-squares regression problem is often written as an optimisation problem using vector notation:

where w represents the parameters as “weights”.

The regularised version of this optimisation is then as follows:

w∗ = arg min (y Xw)T(y Xw) + ||w||2 ridge w

where ||w||2 = Pi wi2 is the squared norm of the parameters or weights vector w, or, equivalently, the dot product wTw; is a scalar determining the amount of regularisation.

w∗ =argmin(yXw)T(yXw) 0h5

COMP9417 ML & DM Regression (2) Term 2, 2022 17 / 40

regression

This regularised problem still has a closed-form solution:

wˆ = (XTX + I)−1XTy

where I denotes the identity matrix. Regularisation amounts to adding to the diagonal of XTX, a well-known trick to improve the numerical stability of matrix inversion. This form of least-squares regression is known as ridge regression.

COMP9417 ML & DM Regression (2) Term 2, 2022 18 / 40

regression

The alternative “absolute value” form of regularised regression is provided by the LASSO, which stands for ‘Least Absolute Shrinkage and Selection Operator’4.

It replacPes the ridge regularisation term Pi wi2 with the sum of absolute weights i |wi|. The result is that some weights are shrunk, but others are set to 0, and so the lasso form of regularised regression favours sparse solutions. ” some parameters are set to zero ”

Unlike ridge regression, the lasso form of regularised regression has to be solved by an iterative optimization algorithm.

4(Tibshirani, 1996).

COMP9417 ML & DM Regression (2) Term 2, 2022 19 / 40

Regularisation

Bias-Variance Decomposition

Bias-Variance Decomposition

COMP9417 ML & DM Regression (2) Term 2, 2022 20 / 40

Regularisation

Bias-Variance Decomposition

The Bias-Variance Tradeoff

iid independent , identically distributed

Linear regression assume f is ” linear” ✗

“→%≥. xp . ‘”‘”

different function 2 / =f( a) → I 3- I

Other models

• we can view the issue of error using a probabilistic model Y=f(X)+✏, (a)→ I

[“” ° ] ::

21 t C- (2)

• briefy, this model assumes the data we observe was generated according to the target function f and Gaussian or normally-distributed noise with zero mean and variance 2 was added to each data point

• this assumption leads to several useful properties maximum likelihood

• for regression, we assume that f is the same as before

COMP9417 ML & DM Regression (2) Term 2, 2022 21 / 40

✏ ⇠ N(0,2)

Regularisation

Bias-Variance Decomposition

The Bias-Variance Tradeoff

• When comparing unbiased estimators, we would like to select the one with minimum variance

• In general, we would be comparing estimators that have some bias and some variance

• We can combine the bias and variance of an estimator by obtaining the mean square error of the estimator, or MSE. This is the average value of squared deviations of an estimated value V from the true value of the parameter ✓. That is:

MSE = Avg. value of (V ✓)2

• Now, it can be shown that:

MSE = (variance) + (bias)2

• If, as sample size increases, the bias and the variance of an estimator approaches 0, then the estimator is said to be consistent.

COMP9417 ML & DM Regression (2) Term 2, 2022 22 / 40

Regularisation

Bias-Variance Decomposition

The Bias-Variance Tradeoff

MSE = (variance) + (bias)2

the lowest possible value of MSE is 0

• In general, we may not be able to get to the ideal MSE of 0.

Sampling theory tells us the minimum value of the variance of an estimator. This value is known as the Cramer-Rao bound. So, given an estimator with bias b, we can calculate the minimum value of the variance of the estimator using the CR bound (say, vmin). Then:

MSE vmin+b2

The value of vmin depends on whether the estimator is biased or

unbiased (that is b=0 or b6=0)

• It is not the case that vmin for an unbiased (b = 0) estimator is less

than vmin for a biased estimator. So, the MSE of a biased estimator

can end up being lower than the MSE of an unbiased estimator.

COMP9417 ML & DM Regression (2) Term 2, 2022 23 / 40

Regularisation

Bias-Variance Decomposition

Decomposition of MSE

” bias ” 1 bias term in a linear mod. Overloaded 2 bias part of error

3 inductive bias 4 Unfairness Suppose f(x) is the value of the (unknown) target function for input x, in

and yˆ = g(x) is the prediction of the learned regression model.

prediction

Imagine evaluating predictions yˆ of the model g(x) trained on dataset D of size n sampled at random from the target distribution, where error is based on the squared difference between predicted and actual values.

Averaged over all such datasets D, the MSE can be decomposed like this:

M S E = E D [ ( yˆ f ( x ) ) 2 ] –

= ED[(yˆ ED[yˆ])2] + (ED[yˆ f(x)])2

variance term bias term

Note that the first term in the error decomposition (variance) does not

refer to the actual value at all, although the second term (bias) does.

COMP9417 ML & DM Regression (2) Term 2, 2022 24 / 40

Some further issues in learning linear regression models

Some further issues in learning linear regression models

COMP9417 ML & DM Regression (2) Term 2, 2022 25 / 40

Some further issues in learning linear regression models

What do the Coefficients bi Mean?

•-Consider the two equations:

Yˆ = a + b X

Y = b0 + b1X1 + b2X2

act independently • b: change in Y that accompanies a unit change in X

• b1: change in Y that accompanies a unit change in X1 provided X2 remains constant

• More generally, bj (j > 0) is the change in Y that accompanies a unit change in Xj provided all other X’s are constant

• So: if all relevant variables are included, then we can assess the effect of each one in a controlled manner

COMP9417 ML & DM Regression (2) Term 2, 2022 26 / 40

Some further issues in learning linear regression models

Categoric Variables: X’s

• “Indicator” variables are those that take on the values 0 or 1

• They are used to include the effects of categoric variables

• For example, if D is a variable that takes the value 1 if a patient takes a drug and 0 if the patient does not. Suppose you want to know the effect of drug D on blood pressure Y keeping age (X) constant

Yˆ =70+5D+0.44X

• So, taking the drug (a unit change in D) makes a difference of 5

units, provided age is held constant

COMP9417 ML & DM Regression (2) Term 2, 2022 27 / 40

Some further issues in learning linear regression models

Categoric Variables: X’s

• How do we capture any interaction effect between age and drug

• Introduce a new indicator variable DX = D ⇥ X

engineering

based on knowledge

Yˆ = 70 + 5D + 0.44X + 0.21DX

engineering

construct a new

COMP9417 ML & DM

Regression (2)

Term 2, 2022 28 / 40

Some further issues in learning linear regression models

Categoric Variables: Y values

• Sometimes Y values may just be one of two values (say, 0 and 1)

• We can’t use the regression model as we described earlier, in which

the Y ’s can take any real value

• But, we can define a new linear regression model in which predicts

not the value of Y, but what are called the log odds of Y: = Odds =

• Once Odds are estimated, they can be used to calculate the probability of Y :

0-2 = Pr(Y =1)= eOdds

(1 + eOdds)

We can then use the value of Pr(Y =1) to decide if Y =1

• This procedure is called

log odds Y

b0 +b1X1 +···+bpXp

logistic regression

COMP9417 ML & DM Regression (2) Term 2, 2022 29 / 40

Some further issues in learning linear regression models

Is the Model Appropriate ?

• It should also be the case that there should be no pattern to the residual scatter all along the line. If the average size of the residuals varies along the line (this condition is called heteroscedasticity) then the relationship is probably more complex than a straight line

• Residuals from a well-fitting line should show an approximate symmetric, bell-shaped frequency distribution with a mean of 0

COMP9417 ML & DM Regression (2) Term 2, 2022 30 / 40

• If there is no systematic pattern to the residuals—that is, there are approximately half of them that are positive and half that are negative, then the line is a good fit

Ñ residua

Some further issues in learning linear regression models

Non-linear Relationships

A question: is it possible to do better than the line of best fit? Maybe. Linear regression assumes that the (xi,yi) examples in the data

are “generated” by the true (but unknown) function Y = f(X).

So any training set is a sample from the true distribution E(Y ) = f(X).

But what if f is non-linear ?

WPe may be able to reduce the mean squared error (MSE) value i(yi yˆ)2 by trying a different function.

COMP9417 ML & DM Regression (2) Term 2, 2022 31 / 40

Some further issues in learning linear regression models

Non-linear Relationships

” linear in the ” Y^=botb,✗, tbzxz parameters

feature engineering]

Some non-linear relationships can be captured in a linear model by a transformation (“trick”). For example, the curved model

Yˆ = b0 + b1X1 + b2X12 can be transformed by X2 = X12 into a linear model. This works for polynomial relationships.

Some other non-linear relationships may require more complicated transformations. For example, the relationship is Y = b0Xb1 Xb2 can

be transformed into the linear relationship

log(Y)=logb0 +b1logX1 +b2logX2

Other relationships cannot be transformed quite so easily, and will require full non-linear estimation (in subsequent topics in the ML course we will find out more about these)

COMP9417 ML & DM Regression (2) Term 2, 2022 32 / 40

Some further issues in learning linear regression models

Non-Linear Relationships

• Main difficulty with non-linear relationships is choice of function • How to learn ?

• Can use a form of gradient descent to estimate the parameters

• After a point, almost any sufficiently complex mathematical function

will do the job in a sufficiently small range

• Some kind of prior knowledge or theory is the only way to help here. • Otherwise, it becomes a process of trial-and-error, in which case,

beware of conclusions that can be drawn

COMP9417 ML & DM Regression (2) Term 2, 2022 33 / 40

Some further issues in learning linear regression models

Model Selection

• Suppose there are a lot of variables Xi, some of which may be representing products, powers, etc.

• Taking all the Xi will lead to an overly complex model. There are 3 ways to reduce complexity:

1 Subset-selection, by search over subset lattice. Each subset results in a new model, and the problem is one of model-selection

2 Shrinkage, or regularization of coefficients to zero, by optimization. There is a single model, and unimportant variables have near-zero coefficients.

3 Dimensionality-reduction, by projecting points into a lower dimensional space (this is different to subset-selection, and we will look at it later)

COMP9417 ML & DM Regression (2) Term 2, 2022 34 / 40

Some further issues in learning linear regression models

Model Selection as Search I

variable subset selection

• The subsets of the set of possible variables form a lattice with S1 S2 as the g.l.b. or meet and S1 [ S2 as the l.u.b. or join

• Each subset refers to a model, and a pair of subsets are connected if they differ by just 1 element

• A lattice is a graph, and we know how to search a graph

• A∗, greedy, randomised etc. cross-validation

程序代写 CS代考 加微信: cscodehelp QQ: 2235208643 Email: kyit630461@163.com