# 程序代写 COMP9417 Machine Learning and Data Mining Term 2, 2022 – cscodehelp代写

Regression (2)
COMP9417 Machine Learning and Data Mining Term 2, 2022
COMP9417 ML & DM Regression (2) Term 2, 2022 1 / 40
Acknowledgements

Material derived from slides for the book
“Elements of Statistical Learning (2nd Ed.)” by T. Hastie, R. Tibshirani & J. Friedman. Springer (2009) http://statweb.stanford.edu/~tibs/ElemStatLearn/
Material derived from slides for the book
“Machine Learning: A Probabilistic Perspective” by P. IT Press (2012)
http://www.cs.ubc.ca/~murphyk/MLbook
Material derived from slides for the book “Machine Learning” by P. Flach Cambridge University Press (2012) http://cs.bris.ac.uk/~flach/mlbook
Material derived from slides for the book
“Bayesian Reasoning and Machine Learning” by D. University Press (2012) http://www.cs.ucl.ac.uk/staff/d.barber/brml
Material derived from slides for the book “Machine Learning” by T. Graw-Hill (1997) http://www-2.cs.cmu.edu/~tom/mlbook.html
Material derived from slides for the course “Machine Learning” by A. Srinivasan BITS Pilani, Goa, India (2016)
COMP9417 ML & DM
Regression (2)
Term 2, 2022
COMP9417 ML & DM Regression (2) Term 2, 2022 3 / 40
Optimization
Basic problem1:
minimize f (x) x
subject to x 2 X
• The problem is to find some x which is called a design point, which is
a p-dimensional vector of values for p design variables.
• These values are manipulated by an optimization algorithm to find a
minimizer, a solution x∗ that minimizes the objective function f.
• A solution must be an element of the feasible set X, and may be
subject to additional constraints, called constrained optimization.
• Optimization is widely studied and applied in many fields such as
engineering, science, economics, . . .
1See: Kochenderfer and Wheeler (2019).
COMP9417 ML & DM Regression (2) Term 2, 2022 4 / 40

design points
Optimization
The formulation is general, for constrained or unconstrained optimization, since we can always replace a problem where we want to maximize the objective function f:
maximize f(x) subject to x
by an equivalent minimization expression:
minimize f(x) subject to x
COMP9417 ML & DM
Regression (2)
Term 2, 2022
P” ” t.MY#flectiion
Min . 3¥ at a
first- order FONC second – order SONC
minimum if d) f’ t
(but not ,
” “⇔≥o}÷÷÷
Optimization
Usually, we would like an optimization algorithm t-o quickly reach an answer that is close to being the right one.
There are many possible approaches to optimization. In machine learning methods based on finding the derivative, or gradient, are widely used.
• typically, need to minimize a function • e.g.,
• optimization is known as gradient descent or steepest descent • sometimes, need to maximize a function
• optimization is known as gradient ascent or steepest ascent
Requires function to be differentiable.
COMP9417 ML & DM Regression (2) Term 2, 2022 6 / 40
error or loss
probability or likelihood

What is an optimization algorithm ?
• many kinds of optimization algorithm
• depends on objective function, constraints, data
• approaches based on derivatives or gradients are widely used • derivatives provide the direction of the search for a minimizer • may be obtained analytically or estimated numerically
• can also used automatic differentiation
• not all approaches require derivatives . . .
COMP9417 ML & DM Regression (2) Term 2, 2022 7 / 40
Optimization
A general iterative algorithm 2 to optimise some function f:
2 select a search direction g, usually to decrease f(x)
3 select a step length ⌘ learning rate
5 setx=x+s
6 go to step 2, unless convergence criteria are met
For example, could minimize a real-valued function f : Rn ! R Note: convergence criteria will be problem-specific.
2See: Ripley (1996).
COMP9417 ML & DM Regression (2) Term 2, 2022 8 / 40
Least-Squares as Loss Minimization
• we saw before that the minimum value can be obtained analytically
by the usual process of differentiating and equating to 0
• an alternative is to apply an iterative gradient descent approach
• with MSE as a loss function, given data (x1, y1), . . . , (xn, yn)
Loss(✓) = n
• notethatLossisafunctionof✓,andyˆ=f(x),sohere✓isthe
• consider MSE as a loss function
• this is what we want to minimize in OLS regression
• finding the least-squPares solution is in effect finding the value of a and
b that minimizes 1 n (y yˆ)2, where n i=1 i i
parameters a and b = atbsci COMP9417 ML & DM Regression (2)
Term 2, 2022
(yi fθ(xi))2 iθi
yˆ = a + b x ii
Least-Squares as Loss Minimization
• the gradient descent alternative to the analytical approach is to take (small) steps that decreases the value of the function to be
minimised, stopping when we reach a minimum
• recall that at a point the gradient vector points in the direction of greatest increase of a function. So, the opposite direction to the gradient vector gives the direction of greatest decrease
• for univariate linear regression, suppose gb, ga give the gradient, then the iterative steps are:
•bi+1=bi-⌘⇥gb 0, . .. , i-7, i, it1, … • ai+1 =ai -⌘⇥ga
• Stop when bi+1 ⇡bi and ai+1 ⇡ai
are the steps in iteration
COMP9417 ML & DM Regression (2) Term 2, 2022 10 / 40

-¥” I:f i. i
” it 1 2+7,, Dci
Multivariate linear regression
Multivariate linear regression
COMP9417 ML & DM Regression (2) Term 2, 2022 11 / 40
Multivariate linear regression
Many variables
• Often, we are interesting in modelling the relationship of Y to several other variables
• In observational studies, the value of Y may be affected by the values of several variables. For example, carcinogenicity may be gender-specific. A regression model that ignores gender may find that carcinogenicity to be related to some surrogate variable (height, for example)3
• Including more variables can give a narrower confidence interval on the prediction being made
• However, more complex models are not always better
3A surrogate variable is a variable for which we have data that replaces one that cannot be observed, or is otherwise not in the dataset.
COMP9417 ML & DM Regression (2) Term 2, 2022 12 / 40
Multivariate linear regression
Multivariate linear model
• We now have multiple variables to estimate a model of the form Y =0 +1X1 +2X2 +···+pXp
• As before, this linear model is estimated from a sample by the equation yˆ = b0 + b1x1 + b2x2 + · · · + bpxp
• With many variables, we often need to ensure the regression does not overfit the training data, so we control the bi by regularisation
COMP9417 ML & DM Regression (2) Term 2, 2022 13 / 40

COMP9417 ML & DM Regression (2) Term 2, 2022 14 / 40
Regularisation
Parameter Estimation by Optimization
Regularisation is a general method to avoid overfitting by applying additional constraints to the weight vector. A common approach is to make sure the weights are, on average, small in magnitude: this is referred to as shrinkage.
Recall the setting for regression in terms of loss minimization.
• Can add penalty terms to a loss function, forcing coefficients to
shrink to zero
fθ0,θ1,…,θp (X1, X2, . . . , Xp) = ↑↑ ↑ ↑ ↑ ↑
Regression (2)
9.” large”
Term 2, 2022
Y ” sensitive to” Oj
COMP9417 ML & DM
Regularisation
Parameter Estimation by Optimization
• MSE as a loss function, given data 1 Xn
(yifθ(xi))2
L(✓) = n and with a penalty function:
” ridge” ←↳
sum: error+ size of 0’s
1 L(✓) = n
Xt e r m s 2 ✓ 2
e r r o ror: and
(yifθ(xi)) + ✓ 0
(x1,y1),…,(xn,yn):
(yi fθ(xi))2 +
• Parameter estimation by optimisation will attempt to find values for
✓0, ✓1, . . . , ✓p s.t. loss L(✓) is minimized variable selection COMP9417 ML & DM Regression (2) Term 2, 2022 16 / 40
complexity
lasso ” Ll
regression
The multivariate least-squares regression problem is often written as an optimisation problem using vector notation:
where w represents the parameters as “weights”.
The regularised version of this optimisation is then as follows:
w∗ = arg min (y Xw)T(y Xw) + ||w||2 ridge w
where ||w||2 = Pi wi2 is the squared norm of the parameters or weights vector w, or, equivalently, the dot product wTw; is a scalar determining the amount of regularisation.
w∗ =argmin(yXw)T(yXw) 0h5
COMP9417 ML & DM Regression (2) Term 2, 2022 17 / 40

regression
This regularised problem still has a closed-form solution:
wˆ = (XTX + I)−1XTy
where I denotes the identity matrix. Regularisation amounts to adding to the diagonal of XTX, a well-known trick to improve the numerical stability of matrix inversion. This form of least-squares regression is known as ridge regression.
COMP9417 ML & DM Regression (2) Term 2, 2022 18 / 40
regression
The alternative “absolute value” form of regularised regression is provided by the LASSO, which stands for ‘Least Absolute Shrinkage and Selection Operator’4.
It replacPes the ridge regularisation term Pi wi2 with the sum of absolute weights i |wi|. The result is that some weights are shrunk, but others are set to 0, and so the lasso form of regularised regression favours sparse solutions. ” some parameters are set to zero ”
Unlike ridge regression, the lasso form of regularised regression has to be solved by an iterative optimization algorithm.
4(Tibshirani, 1996).
COMP9417 ML & DM Regression (2) Term 2, 2022 19 / 40
Regularisation
Bias-Variance Decomposition
Bias-Variance Decomposition
COMP9417 ML & DM Regression (2) Term 2, 2022 20 / 40
Regularisation
Bias-Variance Decomposition
iid independent , identically distributed
Linear regression assume f is ” linear” ✗
“→%≥. xp . ‘”‘”
different function 2 / =f( a) → I 3- I
Other models
• we can view the issue of error using a probabilistic model Y=f(X)+✏, (a)→ I
[“” ° ] ::
21 t C- (2)
• briefy, this model assumes the data we observe was generated according to the target function f and Gaussian or normally-distributed noise with zero mean and variance 2 was added to each data point
• this assumption leads to several useful properties maximum likelihood
• for regression, we assume that f is the same as before
COMP9417 ML & DM Regression (2) Term 2, 2022 21 / 40
✏ ⇠ N(0,2)

Regularisation
Bias-Variance Decomposition
• When comparing unbiased estimators, we would like to select the one with minimum variance
• In general, we would be comparing estimators that have some bias and some variance
• We can combine the bias and variance of an estimator by obtaining the mean square error of the estimator, or MSE. This is the average value of squared deviations of an estimated value V from the true value of the parameter ✓. That is:
MSE = Avg. value of (V ✓)2
• Now, it can be shown that:
MSE = (variance) + (bias)2
• If, as sample size increases, the bias and the variance of an estimator approaches 0, then the estimator is said to be consistent.
COMP9417 ML & DM Regression (2) Term 2, 2022 22 / 40
Regularisation
Bias-Variance Decomposition
MSE = (variance) + (bias)2
the lowest possible value of MSE is 0
• In general, we may not be able to get to the ideal MSE of 0.
Sampling theory tells us the minimum value of the variance of an estimator. This value is known as the Cramer-Rao bound. So, given an estimator with bias b, we can calculate the minimum value of the variance of the estimator using the CR bound (say, vmin). Then:
MSE vmin+b2
The value of vmin depends on whether the estimator is biased or
unbiased (that is b=0 or b6=0)
• It is not the case that vmin for an unbiased (b = 0) estimator is less
than vmin for a biased estimator. So, the MSE of a biased estimator
can end up being lower than the MSE of an unbiased estimator.
COMP9417 ML & DM Regression (2) Term 2, 2022 23 / 40
Regularisation
Bias-Variance Decomposition
Decomposition of MSE
” bias ” 1 bias term in a linear mod. Overloaded 2 bias part of error
3 inductive bias 4 Unfairness Suppose f(x) is the value of the (unknown) target function for input x, in
and yˆ = g(x) is the prediction of the learned regression model.
prediction
Imagine evaluating predictions yˆ of the model g(x) trained on dataset D of size n sampled at random from the target distribution, where error is based on the squared difference between predicted and actual values.
Averaged over all such datasets D, the MSE can be decomposed like this:
M S E = E D [ ( yˆ f ( x ) ) 2 ] –
= ED[(yˆ ED[yˆ])2] + (ED[yˆ f(x)])2
variance term bias term
Note that the first term in the error decomposition (variance) does not
refer to the actual value at all, although the second term (bias) does.
COMP9417 ML & DM Regression (2) Term 2, 2022 24 / 40
Some further issues in learning linear regression models
Some further issues in learning linear regression models
COMP9417 ML & DM Regression (2) Term 2, 2022 25 / 40

Some further issues in learning linear regression models
What do the Coefficients bi Mean?
•-Consider the two equations:
Yˆ = a + b X
Y = b0 + b1X1 + b2X2
act independently • b: change in Y that accompanies a unit change in X
• b1: change in Y that accompanies a unit change in X1 provided X2 remains constant
• More generally, bj (j > 0) is the change in Y that accompanies a unit change in Xj provided all other X’s are constant
• So: if all relevant variables are included, then we can assess the effect of each one in a controlled manner
COMP9417 ML & DM Regression (2) Term 2, 2022 26 / 40
Some further issues in learning linear regression models
Categoric Variables: X’s
• “Indicator” variables are those that take on the values 0 or 1
• They are used to include the effects of categoric variables
• For example, if D is a variable that takes the value 1 if a patient takes a drug and 0 if the patient does not. Suppose you want to know the effect of drug D on blood pressure Y keeping age (X) constant
Yˆ =70+5D+0.44X
• So, taking the drug (a unit change in D) makes a difference of 5
units, provided age is held constant
COMP9417 ML & DM Regression (2) Term 2, 2022 27 / 40
Some further issues in learning linear regression models
Categoric Variables: X’s
• How do we capture any interaction effect between age and drug
• Introduce a new indicator variable DX = D ⇥ X
engineering
based on knowledge
Yˆ = 70 + 5D + 0.44X + 0.21DX
engineering
construct a new
COMP9417 ML & DM
Regression (2)
Term 2, 2022 28 / 40
Some further issues in learning linear regression models
Categoric Variables: Y values
• Sometimes Y values may just be one of two values (say, 0 and 1)
• We can’t use the regression model as we described earlier, in which
the Y ’s can take any real value
• But, we can define a new linear regression model in which predicts
not the value of Y, but what are called the log odds of Y: = Odds =
• Once Odds are estimated, they can be used to calculate the probability of Y :
0-2 = Pr(Y =1)= eOdds
(1 + eOdds)
We can then use the value of Pr(Y =1) to decide if Y =1
• This procedure is called
log odds Y
b0 +b1X1 +···+bpXp
logistic regression
COMP9417 ML & DM Regression (2) Term 2, 2022 29 / 40

Some further issues in learning linear regression models
Is the Model Appropriate ?
• It should also be the case that there should be no pattern to the residual scatter all along the line. If the average size of the residuals varies along the line (this condition is called heteroscedasticity) then the relationship is probably more complex than a straight line
• Residuals from a well-fitting line should show an approximate symmetric, bell-shaped frequency distribution with a mean of 0
COMP9417 ML & DM Regression (2) Term 2, 2022 30 / 40
• If there is no systematic pattern to the residuals—that is, there are approximately half of them that are positive and half that are negative, then the line is a good fit
Ñ residua
Some further issues in learning linear regression models
Non-linear Relationships
A question: is it possible to do better than the line of best fit? Maybe. Linear regression assumes that the (xi,yi) examples in the data
are “generated” by the true (but unknown) function Y = f(X).
So any training set is a sample from the true distribution E(Y ) = f(X).
But what if f is non-linear ?
WPe may be able to reduce the mean squared error (MSE) value i(yi yˆ)2 by trying a different function.
COMP9417 ML & DM Regression (2) Term 2, 2022 31 / 40
Some further issues in learning linear regression models
Non-linear Relationships
” linear in the ” Y^=botb,✗, tbzxz parameters
feature engineering]
Some non-linear relationships can be captured in a linear model by a transformation (“trick”). For example, the curved model
Yˆ = b0 + b1X1 + b2X12 can be transformed by X2 = X12 into a linear model. This works for polynomial relationships.
Some other non-linear relationships may require more complicated transformations. For example, the relationship is Y = b0Xb1 Xb2 can
be transformed into the linear relationship
log(Y)=logb0 +b1logX1 +b2logX2
Other relationships cannot be transformed quite so easily, and will require full non-linear estimation (in subsequent topics in the ML course we will find out more about these)
COMP9417 ML & DM Regression (2) Term 2, 2022 32 / 40
Some further issues in learning linear regression models
Non-Linear Relationships
• Main difficulty with non-linear relationships is choice of function • How to learn ?
• Can use a form of gradient descent to estimate the parameters
• After a point, almost any sufficiently complex mathematical function
will do the job in a sufficiently small range
• Some kind of prior knowledge or theory is the only way to help here. • Otherwise, it becomes a process of trial-and-error, in which case,
beware of conclusions that can be drawn
COMP9417 ML & DM Regression (2) Term 2, 2022 33 / 40

Some further issues in learning linear regression models
Model Selection
• Suppose there are a lot of variables Xi, some of which may be representing products, powers, etc.
• Taking all the Xi will lead to an overly complex model. There are 3 ways to reduce complexity:
1 Subset-selection, by search over subset lattice. Each subset results in a new model, and the problem is one of model-selection
2 Shrinkage, or regularization of coefficients to zero, by optimization. There is a single model, and unimportant variables have near-zero coefficients.
3 Dimensionality-reduction, by projecting points into a lower dimensional space (this is different to subset-selection, and we will look at it later)
COMP9417 ML & DM Regression (2) Term 2, 2022 34 / 40
Some further issues in learning linear regression models
Model Selection as Search I
variable subset selection
• The subsets of the set of possible variables form a lattice with S1 S2 as the g.l.b. or meet and S1 [ S2 as the l.u.b. or join
• Each subset refers to a model, and a pair of subsets are connected if they differ by just 1 element
• A lattice is a graph, and we know how to search a graph
• A∗, greedy, randomised etc. cross-validation