
STAT318/462 — Data Mining
Dr Gábor Erdélyi
University of Canterbury, Christchurch
Course developed by Dr B. Robertson. Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
G. Erdélyi, University of Canterbury 2021, STAT318/462 — Data Mining, 1/26
This section provides a brief introduction to linear regression. Linear regression is a fundamental statistical learning method (and the basis of many others), and I expect that some of you will have studied it before. However, a number of students in this class have not covered linear regression. The purpose of this section is to introduce or refresh linear regression, rather than provide a full treatment of the subject (STAT202, STAT315 and STAT448 cover linear regression in more detail). I encourage students who have not studied linear regression before to carefully read Chapter 3 of the course textbook, including the sections that are not covered in these lecture notes. A basic understanding of linear regression is important for fully appreciating the material covered later in the course.

Linear regression
Linear regression is a simple parametric approach to supervised learning that assumes there is an approximately linear relationship between the predictors X1,X2,…,Xp and the response Y.
Although true regression functions are never linear, linear regression is an extremely useful and widely used method.
Although linear models are simple, in some cases they can perform better than more sophisticated non-linear models. They can be particularly useful when the number of training observations is relatively small, when the signal-to-noise ratio is low (the ε term is relatively large) or when the training data sparsely populate the predictor space.

[Figure: scatter plots of Sales against advertising budget from the Advertising data set; surviving panel labels are TV and Newspaper, with Sales (5–25) on each vertical axis.]

Simple linear regression
In simple (one predictor) linear regression, we assume a model Y = β0 + β1X + ε,
where β0 and β1 are two unknown parameters and ε is an error term with E(ε) = 0.
Given some parameter estimates β̂0 and β̂1, the prediction of Y at X = x is given by
$$\hat{y} = \hat\beta_0 + \hat\beta_1 x.$$
The population regression line is
$$E(Y \mid X = x) = E(\beta_0 + \beta_1 x + \varepsilon) = \beta_0 + \beta_1 x,$$
where E(ε) = 0 by assumption. The parameters β0 (intercept) and β1 (slope) are called the regression coefficients. The line
$$\hat{y} = \hat\beta_0 + \hat\beta_1 x$$
is the estimated regression line, which is an approximation to the population regression line based on observed training data. To derive statistical properties of the estimators (β̂0, β̂1), further model assumptions are required (beyond a linear relationship and E(ε) = 0). Slide 10 requires the errors (ε) to be uncorrelated with constant variance σ². Slide 11 requires the errors to be independent and identically distributed normal random variables with mean 0 and variance σ² (in statistical notation: ε ∼ Normal(0, σ²)). These additional assumptions are only required for those specific statistical properties, not to fit the model. For example, you do not need the normality assumption to fit a useful linear model.
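To make the model concrete, here is a minimal simulation sketch (not from the notes; all parameter values are made up) that generates training data satisfying Y = β0 + β1X + ε with E(ε) = 0:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative values only -- beta0, beta1, sigma and n are made up.
beta0, beta1, sigma, n = 2.0, 3.0, 1.0, 200

x = rng.uniform(-2.0, 2.0, size=n)       # predictor values
eps = rng.normal(0.0, sigma, size=n)     # errors with mean 0, variance sigma^2
y = beta0 + beta1 * x + eps              # the assumed linear model

# Population regression line: E(Y | X = x) = beta0 + beta1 * x.
pop_line = beta0 + beta1 * x
print(np.mean(y - pop_line))             # sample mean of the errors, close to 0
```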

Estimating the parameters: least squares approach
Let ŷi = β̂0 + β̂1xi be the prediction of Y at X = xi, where xi is the predictor value at the ith training observation. Then the ith residual is defined as
$$e_i = y_i - \hat{y}_i,$$
where yi is the response value at the ith training observation.
The least squares approach chooses β̂0 and β̂1 to minimize the residual sum of squares (RSS):
$$\mathrm{RSS} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2.$$
We want the regression line to be as close to the data points as possible. A popular approach is the method of least squares:
$$\min_{\hat\beta_0,\,\hat\beta_1} \; \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2.$$
This quadratic function is relatively easy to minimize (by taking the partial derivatives and setting them equal to zero) and the solution is given on slide 8.
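As a quick numerical check (toy data invented for illustration), the least squares coefficients should give an RSS no larger than any perturbed pair of coefficients:

```python
import numpy as np

# Toy data, made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 7.9, 11.2, 13.8, 17.0])

def rss(b0, b1):
    """Residual sum of squares for the line y = b0 + b1 * x."""
    return float(np.sum((y - b0 - b1 * x) ** 2))

# Closed-form least squares solution.
b1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_hat = y.mean() - b1_hat * x.mean()

best = rss(b0_hat, b1_hat)
print(best, rss(b0_hat + 0.1, b1_hat), rss(b0_hat, b1_hat + 0.1))
```

Because the RSS surface is a quadratic bowl, nudging either coefficient away from the least squares solution can only increase the RSS.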

[Figure: least squares fit of Sales on TV for the Advertising data; Sales (5–25) on the vertical axis, TV budget (0–300) on the horizontal axis.]
The least squares solution to regressing Sales on TV (using TV to predict Sales) is
$$\widehat{\mathrm{Sales}} = 7.03 + 0.0475 \times \mathrm{TV},$$
that is, ŷ = 7.03 + 0.0475x, which was computed using the lm function in R.

[Figure: contour plot of the RSS on the Advertising data, using TV as the predictor; β0 on the horizontal axis, β1 on the vertical axis.]
The RSS function is quadratic (bowl shaped) and hence has a unique minimizer (shown by the red dot).

Estimating the parameters: least squares approach
Using some calculus, we can show that
$$\hat\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})\,y_i}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
and
$$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x},$$
where x̄ and ȳ are the sample means of x and y, respectively.
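These formulas are easy to verify numerically. The sketch below (toy data invented for illustration) computes β̂1 and β̂0 directly and cross-checks them against NumPy's least squares polynomial fit:

```python
import numpy as np

# Toy data, made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 7.9, 11.2, 13.8, 17.0])

xbar, ybar = x.mean(), y.mean()
b1_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0_hat = ybar - b1_hat * xbar

# np.polyfit with deg=1 returns [slope, intercept] from its own least squares fit.
b1_np, b0_np = np.polyfit(x, y, deg=1)

print(b0_hat, b1_hat)   # approximately 2.09 and 2.97
```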
There are two important consequences of the least squares fit. Firstly, the residuals sum to zero:
$$\sum_{i=1}^{n} e_i = \sum_{i=1}^{n}(y_i - \hat{y}_i) = \sum_{i=1}^{n}(y_i - \hat\beta_0 - \hat\beta_1 x_i) = \sum_{i=1}^{n}(y_i - \bar{y} + \hat\beta_1\bar{x} - \hat\beta_1 x_i) = n\bar{y} - n\bar{y} + n\hat\beta_1\bar{x} - n\hat\beta_1\bar{x} = 0,$$
using β̂0 = ȳ − β̂1x̄.
Secondly, the regression line passes through the centre of mass (x̄, ȳ). The predicted response at X = x̄ is
$$\hat{y} = \hat\beta_0 + \hat\beta_1\bar{x} = \bar{y} - \hat\beta_1\bar{x} + \hat\beta_1\bar{x} = \bar{y},$$
so (x̄, ȳ) is on the regression line.
It is also relatively easy to show that β̂0 and β̂1 are unbiased estimators. That is, E(β̂0 | X) = β0 and E(β̂1 | X) = β1 (we will not prove this result in this course).
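Both consequences can be checked numerically. In this sketch (simulated data with arbitrary parameter values), the residuals sum to zero and the fitted line passes through (x̄, ȳ):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, size=50)
y = 2.0 + 3.0 * x + rng.normal(size=50)   # arbitrary illustrative model

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

print(resid.sum())                        # residuals sum to (numerically) zero
print(b0 + b1 * x.mean() - y.mean())      # the line passes through (xbar, ybar)
```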

Assessing the accuracy of the parameter estimates
[Figure: simulated data in two panels; the true model (red) is Y = 2 + 3X + ε, where ε ∼ Normal(0, σ²). The right panel shows ten least squares fits from different training sets.]
The true regression function is linear, so we would expect the simple linear regression model to perform well. Ten least squares fits using different training data are shown in the right panel. Observations:
• All ten fits have slightly different slopes.
• All ten fits pivot around the mean (x ̄, y ̄) = (0, 2).
• The population regression line and the fitted regression lines get further apart as x moves away from x ̄.
• We are less certain about predictions for an x far from x ̄ (the variability in the mean response increases as x moves away from x ̄ as seen by the different fits).
• The linear model is useful for interpolation (predictions within the range of training data), but not extrapolation (beyond the range of the training data).
To quantify the variability in the regression coefficients, we compute or estimate their standard errors (which are simply the standard deviations of the estimators).
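The repeated-sampling picture above can be checked by simulation: refit the model on many training sets drawn from the same population and compare the spread of the β̂1 values with the theoretical standard error σ/√Σ(xi − x̄)². All values below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
beta0, beta1, sigma = 2.0, 3.0, 2.0       # illustrative population values
x = np.linspace(-2.0, 2.0, 30)            # fixed design points

def fit_slope(y):
    """Least squares slope estimate for the fixed design x."""
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Draw many training sets from the same population and refit each time.
slopes = [fit_slope(beta0 + beta1 * x + rng.normal(0.0, sigma, x.size))
          for _ in range(5000)]

mc_se = float(np.std(slopes))                                # simulated SE
formula_se = sigma / np.sqrt(np.sum((x - x.mean()) ** 2))    # theoretical SE
print(mc_se, formula_se)                                     # should agree closely
```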

Assessing the accuracy of the parameter estimates
The standard errors of the parameter estimates are
$$\mathrm{SE}(\hat\beta_0) = \sqrt{V(\hat\beta_0 \mid X)} = \sigma\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$$
and
$$\mathrm{SE}(\hat\beta_1) = \sqrt{V(\hat\beta_1 \mid X)} = \frac{\sigma}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}},$$
where σ = √V(ε).
Usually σ is not known and needs to be estimated from data using the residual standard error (RSE)
$$\mathrm{RSE} = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n - p - 1}},$$
where p is the number of predictors (p = 1 here).
The standard error reflects how much the estimator varies under repeated sampling. You can think about the SE of β̂1 in the following way. Assume we have many training data sets from the population of interest. Then, for each training data set, we fit a linear model (each fit gives a different β̂1 value). The standard error of β̂1 is the standard deviation of the β̂1 values we obtained. If σ is large, the standard errors will tend to be large, which means β̂1 can vary wildly across different training sets.
If the xi's are well spread over the predictor's range, the estimators will tend to be more precise (small standard errors). If an xi is far from x̄, xi is called a high leverage point. Such points can have a huge impact on the estimated regression line.
We can also construct confidence intervals (CIs) for the regression coefficients; for example, an approximate 95% CI for the slope parameter β1 is
$$\hat\beta_1 \pm 2\,\mathrm{SE}(\hat\beta_1).$$
Assumptions: the SE formulas require the errors to be uncorrelated with constant variance σ². The CI requires a stronger assumption: ε ∼ Normal(0, σ²). Bootstrap CIs can be constructed if the normality assumption fails (see Section 5).
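Putting the RSE, SE and CI formulas together in one short sketch (toy data invented for illustration):

```python
import numpy as np

# Toy data, made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 7.9, 11.2, 13.8, 17.0])
n, p = x.size, 1

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

rse = np.sqrt(np.sum(resid ** 2) / (n - p - 1))      # estimate of sigma
se_b1 = rse / np.sqrt(np.sum((x - x.mean()) ** 2))   # SE of the slope estimate

ci = (b1 - 2 * se_b1, b1 + 2 * se_b1)                # approximate 95% CI for beta1
print(f"slope = {b1:.3f}, SE = {se_b1:.4f}, CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```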

Hypothesis testing
If β1 = 0, then the simple linear model reduces to Y = β0 + ε, and X is not associated with Y.
To test whether X is associated with Y, we perform a hypothesis test:
H0 : β1 = 0 (there is no relationship between X and Y)
HA : β1 ≠ 0 (there is some relationship between X and Y)
If the null hypothesis is true (β1 = 0), then
$$t = \frac{\hat\beta_1 - 0}{\mathrm{SE}(\hat\beta_1)}$$
will have a t-distribution with n − 2 degrees of freedom.
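A minimal sketch of the test (toy data invented for illustration; SciPy is assumed to be available for the t-distribution):

```python
import numpy as np
from scipy import stats   # assumed available; provides the t-distribution

# Toy data, made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 7.9, 11.2, 13.8, 17.0])
n = x.size

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

rse = np.sqrt(np.sum(resid ** 2) / (n - 2))          # sigma estimate (p = 1)
se_b1 = rse / np.sqrt(np.sum((x - x.mean()) ** 2))

t_stat = (b1 - 0.0) / se_b1                          # statistic under H0: beta1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)      # two-sided p-value
print(t_stat, p_value)
```

Here the fitted line tracks the data closely, so the t-statistic is large and the p-value is tiny: we would reject H0 and conclude X is associated with Y.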
We look for evidence to reject the null hypothesis (H0) to establish the alternative hypothesis HA. We reject H0 if the p-value for the test is small. The p-value is the probability of observing a t-statistic more extreme than the observed statistic t∗ if H0 is true. This is a two-sided test so more extreme means t ≤ −|t∗| and t ≥ |t∗|.
[Figure: t-distribution density with shaded tails illustrating the p-value for t* = 2 (or t* = −2).]
• A large p-value is NOT strong evidence in favour of H0.
• The p-value is NOT the probability that HA is true.
• When we reject H0 we say that the result is statistically significant (which does not imply scientific significance).