# Section 3 notes

STAT318/462 — Data Mining
Dr Gábor Erdélyi
University of Canterbury, Christchurch
Course developed by Dr B. Robertson. Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
This section provides a brief introduction to linear regression. Linear regression is a fundamental statistical learning method (and the basis of many other methods) and I expect that some of you will have studied it before. However, a number of students in this class have not covered linear regression. The purpose of this section is to introduce/refresh linear regression, rather than to provide a full treatment of the subject (STAT202, STAT315 and STAT448 cover linear regression in more detail). I encourage students who have not studied linear regression before to carefully read chapter 3 of the course textbook, including the sections that are not covered in these lecture notes. It is important to have a basic understanding of linear regression to fully appreciate the material covered later in the course.
G. Erdélyi, University of Canterbury 2021
STAT318/462 — Data Mining (slide 1/26)

Linear regression
Linear regression is a simple parametric approach to supervised learning that assumes there is an approximately linear relationship between the predictors X₁, X₂, …, Xₚ and the response Y.
Although true regression functions are never linear, linear regression is an extremely useful and widely used method.
(slide 2/26)
Although linear models are simple, in some cases they can perform better than more sophisticated non-linear models. They can be particularly useful when the number of training observations is relatively small, when the signal-to-noise ratio is low (the ε term is relatively large) or when the training data sparsely populate the predictor space.

[Figure: the Advertising data. Sales plotted against the TV, Radio and Newspaper advertising budgets, each panel with a least squares fit.]
(slide 3/26)

Simple linear regression
In simple (one predictor) linear regression, we assume the model

Y = β₀ + β₁X + ε,

where β₀ and β₁ are two unknown parameters and ε is an error term with E(ε) = 0.

Given some parameter estimates β̂₀ and β̂₁, the prediction of Y at X = x is given by

ŷ = β̂₀ + β̂₁x.
(slide 4/26)
The population regression line is

E(Y | X = x) = E(β₀ + β₁x + ε) = β₀ + β₁x,

where E(ε) = 0 by assumption. The parameters β₀ (intercept) and β₁ (slope) are called the regression coefficients. The line

ŷ = β̂₀ + β̂₁x

is the estimated regression line, which is an approximation to the population regression line based on observed training data. To derive statistical properties of the estimators (β̂₀, β̂₁), further model assumptions are required (other than a linear relationship and E(ε) = 0). Slide 10 requires the errors (ε) to be uncorrelated with constant variance σ². Slide 11 requires the errors to be independent and identically distributed normal random variables with mean 0 and variance σ² (in statistical notation: ε ∼ Normal(0, σ²)). These additional assumptions are only required for those specific statistical properties, not to fit the model. For example, you do not need the normality assumption to fit a useful linear model.

Estimating the parameters: least squares approach
Let ŷᵢ = β̂₀ + β̂₁xᵢ be the prediction of Y at X = xᵢ, where xᵢ is the predictor value at the ith training observation. Then, the ith residual is defined as

eᵢ = yᵢ − ŷᵢ,

where yᵢ is the response value at the ith training observation.

The least squares approach chooses β̂₀ and β̂₁ to minimize the residual sum of squares

RSS = Σᵢ₌₁ⁿ eᵢ² = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² = Σᵢ₌₁ⁿ (yᵢ − β̂₀ − β̂₁xᵢ)².
(slide 5/26)
We want the regression line to be as close to the data points as possible. A popular approach is the method of least squares: choose β̂₀ and β̂₁ to minimize

Σᵢ₌₁ⁿ (yᵢ − β̂₀ − β̂₁xᵢ)².

This quadratic function is relatively easy to minimize (by taking the partial derivatives and setting them equal to zero) and the solution is given on slide 8.
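To make the criterion concrete, here is a minimal sketch in Python (the course itself uses R's lm; the small data set below is made up for illustration). It evaluates the RSS for a candidate line and checks that the least squares line beats nearby perturbed lines:

```python
# Toy data, made up for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

def rss(b0, b1):
    """Residual sum of squares for the candidate line yhat = b0 + b1*x."""
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))

# Least squares estimates via the closed-form solution (given on slide 8).
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
b1_hat = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
         / sum((x - xbar) ** 2 for x in xs)
b0_hat = ybar - b1_hat * xbar

# The least squares line has smaller RSS than nearby perturbed lines.
print(rss(b0_hat, b1_hat) < rss(b0_hat + 0.1, b1_hat))  # True
print(rss(b0_hat, b1_hat) < rss(b0_hat, b1_hat - 0.1))  # True
```

Any perturbation of (β̂₀, β̂₁) increases the RSS, because the criterion is a convex quadratic in the two parameters.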

[Figure: the Advertising data with the least squares fit of Sales on TV: ŷ = 7.03 + 0.0475x.]
(slide 6/26)
The least squares solution to regressing Sales on TV (using TV to predict Sales) is

Sales = 7.03 + 0.0475 × TV,

which was computed using the lm function in R. For example, TV = 100 gives a predicted sales value of 7.03 + 0.0475 × 100 = 11.78.


[Figure: contour plot of the RSS on the advertising data, using TV as the predictor, over a grid of (β₀, β₁) values.]
(slide 7/26)
The RSS function is quadratic (bowl shaped) and hence has a unique minimizer (shown by the red dot).

Estimating the parameters: least squares approach
Using some calculus, we can show that

β̂₁ = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ₌₁ⁿ (xᵢ − x̄)² = Σᵢ₌₁ⁿ (xᵢ − x̄)yᵢ / Σᵢ₌₁ⁿ (xᵢ − x̄)²

and

β̂₀ = ȳ − β̂₁x̄,

where x̄ and ȳ are the sample means of x and y, respectively.
(slide 8/26)
There are two important consequences of the least squares fit. Firstly, the residuals sum to zero:

Σᵢ₌₁ⁿ eᵢ = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ) = Σᵢ₌₁ⁿ (yᵢ − β̂₀ − β̂₁xᵢ)
         = Σᵢ₌₁ⁿ (yᵢ − ȳ + β̂₁x̄ − β̂₁xᵢ)
         = nȳ − nȳ + nβ̂₁x̄ − nβ̂₁x̄ = 0.

Secondly, the regression line passes through the centre of mass (x̄, ȳ). The predicted response at X = x̄ is

ŷ = β̂₀ + β̂₁x̄ = ȳ − β̂₁x̄ + β̂₁x̄ = ȳ,

so (x̄, ȳ) is on the regression line.

It is also relatively easy to show that β̂₀ and β̂₁ are unbiased estimators. That is, E(β̂₀|X) = β₀ and E(β̂₁|X) = β₁ (we will not prove this result in this course).
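Both consequences are easy to check numerically. A short Python sketch (on a made-up data set) verifies that the residuals sum to zero and that (x̄, ȳ) lies on the fitted line:

```python
# Toy data, made up for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n

# Closed-form least squares estimates (slide 8).
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
     / sum((x - xbar) ** 2 for x in xs)
b0 = ybar - b1 * xbar

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(abs(sum(residuals)) < 1e-9)           # True: residuals sum to zero
print(abs((b0 + b1 * xbar) - ybar) < 1e-9)  # True: (xbar, ybar) is on the line
```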

Assessing the accuracy of the parameter estimates
[Figure: simulated data. True model (red) is Y = 2 + 3X + ε, where ε ∼ Normal(0, σ²); the right panel shows ten least squares fits obtained from different training sets.]
(slide 9/26)
The true regression function is linear, so we would expect the simple linear regression model to perform well. Ten least squares fits using different training data are shown in the right panel. Observations:
• All ten fits have slightly different slopes.
• All ten fits pivot around the mean (x̄, ȳ) = (0, 2).
• The population regression line and the fitted regression lines get further apart as x moves away from x̄.
• We are less certain about predictions for an x far from x̄ (the variability in the mean response increases as x moves away from x̄, as seen from the different fits).
• The linear model is useful for interpolation (predictions within the range of the training data), but not extrapolation (beyond the range of the training data).
To quantify the variability in the regression coefficients, we compute/estimate their standard errors (which are simply the standard deviations of the estimators).
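The repeated-fitting idea can be simulated directly. The Python sketch below (assumptions: a fixed design, known σ, and the same true model Y = 2 + 3X + ε used in the figure) refits the slope on many simulated training sets and compares the empirical standard deviation of β̂₁ with the formula σ/√Σ(xᵢ − x̄)² from the next slide:

```python
import random, math

random.seed(1)
sigma = 1.0
xs = [i / 10 for i in range(-20, 21)]  # fixed design, x in [-2, 2]
n = len(xs)
xbar = sum(xs) / n
sxx = sum((x - xbar) ** 2 for x in xs)

def fit_slope(ys):
    """Closed-form least squares slope for responses ys on the fixed xs."""
    ybar = sum(ys) / n
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx

# Refit the model on many simulated training sets from Y = 2 + 3X + eps.
slopes = []
for _ in range(2000):
    ys = [2 + 3 * x + random.gauss(0, sigma) for x in xs]
    slopes.append(fit_slope(ys))

mean = sum(slopes) / len(slopes)
sd = math.sqrt(sum((b - mean) ** 2 for b in slopes) / (len(slopes) - 1))
print(sd)                      # empirical SD of the slope estimates
print(sigma / math.sqrt(sxx))  # formula SE(beta1); the two should be close
```

The empirical standard deviation of the 2000 slope estimates agrees closely with the theoretical standard error, and the average slope is close to the true value 3 (unbiasedness).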

Assessing the accuracy of the parameter estimates
The standard errors for the parameter estimates are

SE(β̂₀) = √V(β̂₀|X) = σ √( 1/n + x̄² / Σᵢ₌₁ⁿ (xᵢ − x̄)² )

and

SE(β̂₁) = √V(β̂₁|X) = σ / √( Σᵢ₌₁ⁿ (xᵢ − x̄)² ),

where σ² = V(ε).

Usually σ is not known and needs to be estimated from data using the residual standard error (RSE)

RSE = √( Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² / (n − p − 1) ),

where p is the number of predictors (p = 1 here).
(slide 10/26)
The standard error reflects how much the estimator varies under repeated sampling. You can think about the SE of β̂₁ in the following way. Assume we have many training data sets from the population of interest. Then, for each training data set, we fit a linear model (each fit has a different β̂₁ value). The standard error of β̂₁ is the standard deviation of the β̂₁ values we obtained. If σ is large, the standard errors will tend to be large. This means β̂₁ can vary wildly for different training sets.

If the xᵢ's are well spread over the predictor's range, the estimators will tend to be more precise (small standard errors). If an xᵢ is far from x̄, xᵢ is called a high leverage point. These points can have a huge impact on the estimated regression line.
We can also construct confidence intervals (CIs) for the regression coefficients. For example, an approximate 95% CI for the slope parameter β₁ is

β̂₁ ± 2 SE(β̂₁).

Assumptions: the SE formulas require the errors to be uncorrelated with constant variance σ². The CI requires a stronger assumption: ε ∼ Normal(0, σ²). Bootstrap CIs can be constructed if the normality assumption fails (see Section 5).
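Putting the RSE, standard error and CI formulas together, a hedged Python sketch (made-up data; the course would use summary(lm(...)) and confint in R) looks like this:

```python
import math

# Toy data, made up for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.8, 4.3, 5.9, 8.4, 9.6, 12.2]

n, p = len(xs), 1
xbar, ybar = sum(xs) / n, sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)

# Closed-form least squares estimates (slide 8).
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
b0 = ybar - b1 * xbar

# Residual standard error: the estimate of sigma.
rss = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
rse = math.sqrt(rss / (n - p - 1))

# Estimated standard errors, with rse plugged in for sigma.
se_b1 = rse / math.sqrt(sxx)
se_b0 = rse * math.sqrt(1 / n + xbar ** 2 / sxx)

# Approximate 95% CI for the slope: b1 +/- 2 SE(b1).
ci = (b1 - 2 * se_b1, b1 + 2 * se_b1)
print(ci)
```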

Hypothesis testing
If—1 =0,thenthesimplelinearmodelreducestoY =—0+‘,andX isnot associated with Y .
To test whether X is associated with Y , we perform a hypothesis test: H0 : —1 = 0 (there is no relationship between X and Y )
HA : —1 ”= 0 (there is some relationship between X and Y ) If the null hypothesis is true (—1 = 0), then
t = —ˆ 1 ≠ 0 S E ( —ˆ 1 )
will have a t-distribution with n ≠ 2 degrees of freedom.
(slide 11/26)
We look for evidence to reject the null hypothesis (H₀) to establish the alternative hypothesis Hₐ. We reject H₀ if the p-value for the test is small. The p-value is the probability of observing a t-statistic more extreme than the observed statistic t* if H₀ is true. This is a two-sided test, so "more extreme" means t ≤ −|t*| or t ≥ |t*|.
• A large p-value is NOT strong evidence in favour of H₀.
• The p-value is NOT the probability that Hₐ is true.
• When we reject H₀ we say that the result is statistically significant (which does not imply scientific significance).
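The test statistic is easy to compute by hand. The Python sketch below (made-up data with a genuine trend; in R this is all reported by summary(lm(...))) forms t = β̂₁/SE(β̂₁). The standard library has no t-distribution CDF, so the two-sided p-value here uses a normal approximation to the t(n − 2) reference distribution, an assumption that is reasonable only for moderate-to-large n:

```python
import math
from statistics import NormalDist

xs = [float(i) for i in range(1, 21)]
# Made-up responses: a genuine positive trend plus fixed alternating "noise".
ys = [0.5 * x + ((-1) ** i) * 0.8 for i, x in enumerate(xs)]

n, p = len(xs), 1
xbar, ybar = sum(xs) / n, sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
b0 = ybar - b1 * xbar

rss = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
se_b1 = math.sqrt(rss / (n - p - 1)) / math.sqrt(sxx)

t = (b1 - 0) / se_b1  # t-statistic for H0: beta1 = 0
# Normal approximation to the t(n-2) distribution (assumption: adequate here,
# and it keeps the sketch to the standard library).
p_value = 2 * (1 - NormalDist().cdf(abs(t)))
print(t, p_value)  # large |t| and a tiny p-value: reject H0
```

Here |t| is far from 0 and the p-value is essentially zero, so we reject H₀ and conclude there is a relationship between X and Y.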