STAT318/462 — Data Mining
Dr Gábor Erdélyi
University of Canterbury, Christchurch
Course developed by Dr B. Robertson. Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
G. Erdélyi, University of Canterbury 2021, STAT318/462 Data Mining, 1 / 26
This section provides a brief introduction to linear regression. Linear regression is a fundamental statistical learning method (and the basis of many other methods), and I expect that some of you will have studied it before. However, a number of students in this class have not covered linear regression. The purpose of this section is to introduce or refresh linear regression, rather than to provide a full treatment of the subject (STAT202, STAT315 and STAT448 cover linear regression in more detail). I encourage students who have not studied linear regression before to read chapter 3 of the course textbook carefully, including the sections that are not covered in these lecture notes. A basic understanding of linear regression is important for fully appreciating the material covered later in the course.
Linear regression is a simple parametric approach to supervised learning that assumes there is an approximately linear relationship between the predictors $X_1, X_2, \ldots, X_p$ and the response $Y$.
Although true regression functions are never linear, linear regression is an extremely useful and widely used method.
Although linear models are simple, in some cases they can perform better than more sophisticated non-linear models. They can be particularly useful when the number of training observations is relatively small, when the signal-to-noise ratio is low (the ε term is relatively large) or when the training data sparsely populate the predictor space.
Linear regression: advertising data
[Figure: scatter plots of sales against each advertising budget in the advertising data.]
Simple linear regression
In simple (one predictor) linear regression, we assume a model
\[ Y = \beta_0 + \beta_1 X + \varepsilon, \]
where $\beta_0$ and $\beta_1$ are two unknown parameters and $\varepsilon$ is an error term with $E(\varepsilon) = 0$.
Given some parameter estimates $\hat\beta_0$ and $\hat\beta_1$, the prediction of $Y$ at $X = x$ is given by
\[ \hat{y} = \hat\beta_0 + \hat\beta_1 x. \]
The population regression line is
\[ E(Y \mid X = x) = E(\beta_0 + \beta_1 x + \varepsilon) = \beta_0 + \beta_1 x, \]
where $E(\varepsilon) = 0$ by assumption. The parameters $\beta_0$ (intercept) and $\beta_1$ (slope) are called the regression coefficients. The line
\[ \hat{y} = \hat\beta_0 + \hat\beta_1 x \]
is the estimated regression line, which is an approximation to the population regression line based on the observed training data. To derive statistical properties of the estimators $(\hat\beta_0, \hat\beta_1)$, further model assumptions are required (beyond a linear relationship and $E(\varepsilon) = 0$). Slide 10 (standard errors) requires the errors $\varepsilon$ to be uncorrelated with constant variance $\sigma^2$. Slide 11 (the hypothesis test for the slope) requires the errors to be independent and identically distributed normal random variables with mean 0 and variance $\sigma^2$ (in statistical notation: $\varepsilon \sim \text{Normal}(0, \sigma^2)$). These additional assumptions are only required for these specific statistical properties, not to fit the model. For example, you do not need the normality assumption to fit a useful linear model.
Estimating the parameters: least squares approach
Let $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$ be the prediction of $Y$ at $X = x_i$, where $x_i$ is the predictor value at the $i$th training observation. Then, the $i$th residual is defined as
\[ e_i = y_i - \hat{y}_i, \]
where $y_i$ is the response value at the $i$th training observation.
The least squares approach chooses $\hat\beta_0$ and $\hat\beta_1$ to minimize the residual sum of squares (RSS)
\[ \text{RSS} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2. \]
We want the regression line to be as close to the data points as possible. A popular approach is the method of least squares:
\[ \min_{\hat\beta_0, \hat\beta_1} \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)^2. \]
This quadratic function is relatively easy to minimize (by taking the partial derivatives and setting them equal to zero), and the solution is given on slide 8 (the closed-form expressions for $\hat\beta_0$ and $\hat\beta_1$).
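As a sketch of the calculus involved (a standard derivation that is not spelled out in these notes, writing $\bar{x}$ and $\bar{y}$ for the sample means of the $x_i$ and $y_i$), setting the two partial derivatives of the RSS to zero gives:

```latex
\begin{aligned}
\frac{\partial\,\mathrm{RSS}}{\partial \hat\beta_0}
  &= -2 \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0
  &&\Longrightarrow\quad \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}, \\
\frac{\partial\,\mathrm{RSS}}{\partial \hat\beta_1}
  &= -2 \sum_{i=1}^{n} x_i (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0
  &&\Longrightarrow\quad \sum_{i=1}^{n} x_i \bigl( (y_i - \bar{y}) - \hat\beta_1 (x_i - \bar{x}) \bigr) = 0, \\
\hat\beta_1
  &= \frac{\sum_{i=1}^{n} x_i (y_i - \bar{y})}{\sum_{i=1}^{n} x_i (x_i - \bar{x})}
   = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.
\end{aligned}
```

The second line substitutes $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$ from the first, and the last equality uses $\sum_{i=1}^{n} (y_i - \bar{y}) = 0$ and $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$.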
[Figure: least squares fit of Sales on TV for the advertising data, with fitted line $\hat{y} = 7.03 + 0.0475x$.]
The least squares solution to regressing Sales on TV (using TV to predict Sales) is
\[ \widehat{\text{sales}} = 7.03 + 0.0475 \times \text{TV}, \]
which was computed using the lm function in R.
[Figure: contour plot of the RSS on the advertising data, using TV as the predictor.]
The RSS function is quadratic (bowl shaped) and hence has a unique minimizer (shown by the red dot).
Estimating the parameters: least squares approach
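Using some calculus, we can show that
\[ \hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})\, y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}, \]
where $\bar{x}$ and $\bar{y}$ are the sample means of $x$ and $y$, respectively.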
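There are two important consequences of the least squares fit. Firstly, the residuals sum to zero:
\[ \begin{aligned} \sum_{i=1}^{n} e_i = \sum_{i=1}^{n} (y_i - \hat{y}_i) &= \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i) \\ &= \sum_{i=1}^{n} (y_i - \bar{y} + \hat\beta_1 \bar{x} - \hat\beta_1 x_i) \\ &= n\bar{y} - n\bar{y} + n\hat\beta_1 \bar{x} - n\hat\beta_1 \bar{x} = 0. \end{aligned} \]
Secondly, the regression line passes through the centre of mass $(\bar{x}, \bar{y})$. The predicted response at $X = \bar{x}$ is
\[ \begin{aligned} \hat{y} &= \hat\beta_0 + \hat\beta_1 \bar{x} \\ &= \bar{y} - \hat\beta_1 \bar{x} + \hat\beta_1 \bar{x} \\ &= \bar{y}, \end{aligned} \]
so $(\bar{x}, \bar{y})$ is on the regression line.
It is also relatively easy to show that $\hat\beta_0$ and $\hat\beta_1$ are unbiased estimators. That is, $E(\hat\beta_0 \mid X) = \beta_0$ and $E(\hat\beta_1 \mid X) = \beta_1$ (we will not prove this result in this course).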
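The closed-form estimates and the two consequences above can be checked numerically. The following is a minimal sketch in plain Python rather than R's lm; the function name `simple_least_squares` and the toy data are made up for illustration and are not the advertising data.

```python
from statistics import fmean


def simple_least_squares(x, y):
    """Return (b0, b1), the closed-form least squares estimates."""
    x_bar, y_bar = fmean(x), fmean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx            # slope estimate
    b0 = y_bar - b1 * x_bar     # intercept estimate
    return b0, b1


# Hypothetical toy data (assumption: chosen only for illustration).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = simple_least_squares(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Consequence 1: the residuals sum to zero (up to rounding error).
residual_sum = sum(residuals)
# Consequence 2: the fitted line passes through (x_bar, y_bar).
prediction_at_mean = b0 + b1 * fmean(x)

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
print(f"sum of residuals = {residual_sum:.2e}")
print(f"prediction at x_bar = {prediction_at_mean:.4f}, y_bar = {fmean(y):.4f}")
```

For these toy values the sample means are 3 and 6.02, so the formulas give a slope of 19.9 / 10 = 1.99 and an intercept of 6.02 - 1.99 × 3 = 0.05.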
Assessing the accuracy of the parameter estimates
[Figure: two panels of simulated data showing the population regression line and least squares fits; the right panel shows ten least squares fits from different training data sets.]
The true regression function is linear, so we would expect the simple linear regression model to perform well. Ten least squares fits using different training data are shown in the right panel. Observations:
• All ten fits have slightly different slopes.
• All ten fits pivot around the mean $(\bar{x}, \bar{y}) \approx (0, 2)$.
• The population regression line and the fitted regression lines get further apart as $x$ moves away from $\bar{x}$.
• We are less certain about predictions for an $x$ far from $\bar{x}$ (the variability in the mean response increases as $x$ moves away from $\bar{x}$, as seen by the different fits).
• The linear model is useful for interpolation (predictions within the range of the training data), but not for extrapolation (beyond the range of the training data).
To quantify the variability in the regression coefficients, we compute (or estimate) their standard errors, which are simply the standard deviations of the estimators.
Assessing the accuracy of the parameter estimates
The standard errors for the parameter estimates are
\[ \text{SE}(\hat\beta_0) = \sqrt{V(\hat\beta_0 \mid X)} = \sigma \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}, \qquad \text{SE}(\hat\beta_1) = \sqrt{V(\hat\beta_1 \mid X)} = \frac{\sigma}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}, \]
where $\sigma = \sqrt{V(\varepsilon)}$.
Usually $\sigma$ is not known and needs to be estimated from data using the residual standard error (RSE)
\[ \text{RSE} = \sqrt{\frac{\text{RSS}}{n - p - 1}}, \]
where $p$ is the number of predictors ($p = 1$ here).
The standard error reflects how much the estimator varies under repeated sampling. You can think about the SE of $\hat\beta_1$ in the following way. Assume we have many training data sets from the population of interest. Then, for each training data set, we fit a linear model (each fit has a different $\hat\beta_1$ value). The standard error of $\hat\beta_1$ is the standard deviation of the $\hat\beta_1$ values we obtained. If $\sigma$ is large, the standard errors will tend to be large. This means $\hat\beta_1$ can vary wildly for different training sets.
If the $x_i$'s are well spread over the predictor's range, the estimators will tend to be more precise (small standard errors). If an $x_i$ is far from $\bar{x}$, $x_i$ is called a high leverage point. These points can have a huge impact on the estimated regression line.
We can also construct confidence intervals (CIs) for the regression coefficients. For example, an approximate 95% CI for the slope parameter $\beta_1$ is
\[ \hat\beta_1 \pm 2\,\text{SE}(\hat\beta_1). \]
Assumptions: The SE formulas require the errors to be uncorrelated with constant variance $\sigma^2$. The CI requires a stronger assumption: $\varepsilon \sim \text{Normal}(0, \sigma^2)$. Bootstrap CIs can be constructed if the normality assumption fails (see Section 5).
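The repeated-sampling view of the standard error can be imitated with a small simulation. This is only a sketch: the true coefficients, the noise level, the design points and the number of simulated training sets are all made-up values, and $\sigma$ is treated as known so that the formula for the standard error of the slope can be used directly instead of being estimated by the RSE.

```python
import math
import random
from statistics import fmean, stdev

# Assumed values, chosen only for illustration.
BETA0, BETA1, SIGMA = 2.0, 3.0, 2.0
X = [float(i) for i in range(20)]   # fixed predictor values x_1, ..., x_n
N_DATASETS = 2000                   # number of simulated training sets


def slope_estimate(x, y):
    """Closed-form least squares estimate of the slope."""
    x_bar, y_bar = fmean(x), fmean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx


rng = random.Random(318)  # fixed seed so the run is repeatable
slopes = []
for _ in range(N_DATASETS):
    # New training responses from the same linear model with normal errors.
    y = [BETA0 + BETA1 * xi + rng.gauss(0.0, SIGMA) for xi in X]
    slopes.append(slope_estimate(X, y))

x_bar = fmean(X)
theoretical_se = SIGMA / math.sqrt(sum((xi - x_bar) ** 2 for xi in X))
empirical_se = stdev(slopes)   # standard deviation of the simulated slopes
mean_slope = fmean(slopes)     # should be close to BETA1 (unbiasedness)

# Fraction of intervals (slope +/- 2 SE) that contain the true slope.
coverage = sum(abs(b - BETA1) <= 2 * theoretical_se for b in slopes) / N_DATASETS

print(f"theoretical SE = {theoretical_se:.4f}, empirical SE = {empirical_se:.4f}")
print(f"mean slope = {mean_slope:.4f}, coverage = {coverage:.3f}")
```

With normal errors, the empirical standard deviation of the simulated slopes should sit close to the formula value, and roughly 95% of the intervals should contain the true slope.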
If $\beta_1 = 0$, then the simple linear model reduces to $Y = \beta_0 + \varepsilon$, and $X$ is not associated with $Y$.
To test whether $X$ is associated with $Y$, we perform a hypothesis test:
$H_0: \beta_1 = 0$ (there is no relationship between $X$ and $Y$)
$H_A: \beta_1 \neq 0$ (there is some relationship between $X$ and $Y$)
If the null hypothesis is true ($\beta_1 = 0$), then
\[ t = \frac{\hat\beta_1 - 0}{\text{SE}(\hat\beta_1)} \]
will have a t-distribution with $n - 2$ degrees of freedom.
We look for evidence to reject the null hypothesis ($H_0$) in order to establish the alternative hypothesis ($H_A$). We reject $H_0$ if the p-value for the test is small. The p-value is the probability of observing a t-statistic as or more extreme than the observed statistic $t^*$ if $H_0$ is true. This is a two-sided test, so more extreme means $t \le -|t^*|$ or $t \ge |t^*|$.
[Figure: density of the t-statistic under $H_0$, illustrating the p-value for $t^* = 2$ (or $t^* = -2$).]
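A two-sided p-value can be approximated numerically straight from this definition. The sketch below uses only the Python standard library; the function names and the choice of Simpson's rule are illustrative assumptions, and this is not necessarily how R computes the p-values it reports.

```python
import math


def t_density(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))


def two_sided_p_value(t_star, df, n_intervals=2000):
    """Approximate P(|T| >= |t_star|) under H0 with Simpson's rule.

    n_intervals must be even.
    """
    b = abs(t_star)
    if b == 0.0:
        return 1.0
    h = b / n_intervals
    total = t_density(0.0, df) + t_density(b, df)
    for k in range(1, n_intervals):
        total += (4 if k % 2 else 2) * t_density(k * h, df)
    central_half = total * h / 3          # P(0 <= T <= |t_star|)
    return max(0.0, 1.0 - 2.0 * central_half)


# Hypothetical example: observed t* = 2 with 30 degrees of freedom.
p_value = two_sided_p_value(2.0, 30)
print(f"two-sided p-value for t* = 2, df = 30: {p_value:.4f}")
```

Because the t-distribution is symmetric, the two tail areas are equal, so the code integrates only over $[0, |t^*|]$ and doubles the result.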
• A large p-value is NOT strong evidence in favour of H0.
• The p-value is NOT the probability that HA is true.
• When we reject H0 we say that the result is statistically significant (which does not imply scientific significance).
• A level $\alpha$ test ($0 < \alpha < 1$) rejects $H_0: \beta_1 = 0$ if and only if the $(1 - \alpha)100\%$ confidence interval for $\beta_1$ does not include 0.

Results for the advertising data set

            Coefficient   Std. Error   t-statistic   p-value
Intercept   7.0325        0.4578       15.36         <0.0001
TV          0.0475        0.0027       17.67         <0.0001

We fit linear models using R and it performs hypothesis tests for us. We need to be able to interpret the output and draw valid conclusions.
• The intercept $\hat\beta_0$ tells us the predicted sales when the TV budget is set to zero.
• The p-value is very small for TV, so there is strong evidence of some relationship between TV budget and sales ($\beta_1 \neq 0$).
• An approximate (we are estimating the standard error using RSE) 95% confidence interval for $\beta_1$ is $0.0475 \pm 2(0.0027) \approx (0.042, 0.053)$, which does not contain zero. That is, $\beta_1 \neq 0$ at the 95% confidence level.

Assessing the overall accuracy
Once we have established that there is some relationship between $X$ and $Y$, we want to quantify the extent to which the linear model fits the data.
The residual standard error (RSE) provides an absolute measure of lack of fit for the linear model, but it is not always clear what a good RSE is.
An alternative measure of fit is R-squared ($R^2$),
\[ R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, \]
where TSS is the total sum of squares.
The $R^2$ statistic measures the fraction of variation in $y$ that is explained by the model and satisfies $0 \le R^2 \le 1$. The closer $R^2$ is to 1, the better the model fits the training data.

Results for the advertising data set

Quantity                        Value
Residual standard error (RSE)   3.26
$R^2$                           0.612

The $R^2$ statistic has an interpretational advantage over the RSE because it always lies between 0 and 1. A good $R^2$ value usually depends on the application.
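Both measures of fit follow directly from the residuals of a fitted model. The sketch below computes them for a simple linear regression in plain Python; the function name `fit_summary` and the toy data are assumptions made for illustration (in the course itself these numbers come from R's output).

```python
import math
from statistics import fmean


def fit_summary(x, y):
    """Fit simple least squares and return (b0, b1, rse, r_squared)."""
    n, p = len(x), 1                       # p = 1 predictor
    x_bar, y_bar = fmean(x), fmean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))  # residual SS
    tss = sum((yi - y_bar) ** 2 for yi in y)                     # total SS
    rse = math.sqrt(rss / (n - p - 1))
    r_squared = 1.0 - rss / tss
    return b0, b1, rse, r_squared


# Hypothetical toy data (not the advertising data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1, rse, r_squared = fit_summary(x, y)
print(f"fit: y = {b0:.2f} + {b1:.2f} x")
print(f"RSE = {rse:.4f}, R^2 = {r_squared:.4f}")
```

For this toy data RSS = 2.4 and TSS = 6, so $R^2 = 1 - 2.4/6 = 0.6$ and RSE $= \sqrt{2.4/3} \approx 0.894$. The RSE is in the units of the response, whereas $R^2$ is unit free.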
Approximately 61% of the variation in sales is explained by the linear model with TV as a predictor.

Multiple linear regression
In multiple linear regression, we assume a model
\[ Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p + \varepsilon, \]
where $\beta_0, \beta_1, \ldots, \beta_p$ are $p + 1$ unknown parameters and $\varepsilon$ is an error term with $E(\varepsilon) = 0$.
Given some parameter estimates $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p$, the prediction of $Y$ at $X = x$ is given by
\[ \hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \ldots + \hat\beta_p x_p. \]
The slope parameters have a mathematical interpretation in multiple linear regression: $\beta_j$ is the expected change in $Y$ per unit change in $X_j$, with all other predictors held fixed. This is a useful mathematical property, but usually the predictors are correlated and hence tend to change together (an increase in one predictor tends to be accompanied by an increase in another, etc.).

Multiple linear regression
[Figure: least squares regression plane $\hat{y} = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2$ for a response $Y$ and two predictors $X_1$ and $X_2$.]

Estimating the parameters: least squares approach
The parameters $\beta_0, \beta_1, \ldots, \beta_p$ are estimated using the least squares approach. We choose $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p$ to minimize the sum of squared residuals
\[ \text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \ldots - \hat\beta_p x_{ip})^2. \]
We will calculate these parameter estimates using R.

Results for the advertising data

            Coefficient   Std. Error   t-statistic   p-value
Intercept    2.939        0.3119         9.42        <0.0001
TV           0.046        0.0014        32.81        <0.0001
Radio        0.189        0.0086        21.89        <0.0001
Newspaper   -0.001        0.0059        -0.18         0.8599

When reading this output, each statement is made conditional on the other predictors being in the model.
• Given that TV and Radio are in the model, Newspaper is not useful for predicting sales (high p-value).
• Possible cause: There is a moderate correlation between Newspaper and Radio of ≈ 0.35. If Radio is included in the model, Newspaper is not needed.
Revision material: The sample correlation for variables $x$ and $y$ is
\[ r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \]
and satisfies $-1 \le r_{xy} \le 1$. The sample correlation measures how close the pairs $(x_i, y_i)$ are to falling on a line. If $r_{xy} > 0$, then $x$ and $y$ are positively correlated. If $r_{xy} < 0$, then $x$ and $y$ are negatively correlated. Finally, if $r_{xy} \approx 0$, then $x$ and $y$ are uncorrelated (not linearly related, but they could be non-linearly related).

Is there a relationship between Y and X?
To test whether $X$ is associated with $Y$, we perform a hypothesis test:
$H_0: \beta_1 = \beta_2 = \ldots = \beta_p = 0$ (there is no relationship)
$H_A$: at least one $\beta_j$ is non-zero (there is some relationship)
If the null hypothesis is true (no relationship), then
\[ F = \frac{(\text{TSS} - \text{RSS})/p}{\text{RSS}/(n - p - 1)} \]
will have an F-distribution with parameters $p$ and $n - p - 1$.
We will use R to compute the F-statistic and its corresponding p-value. The F-test is a one-sided test with a distribution whose shape is determined by $p$ and $n - p - 1$:
[Figure: comparison of $F(p, n - p - 1)$ densities for F(20, 1000), F(10, 50), F(5, 10) and F(3, 5); horizontal axis: F value, vertical axis: density.]
We reject the null hypothesis (that there is no relationship) if $F$ is sufficiently large (small p-value).

Is the model a good fit?
Once we have established that there is some relationship between the response and the predictors, we want to quantify the extent to which the multiple linear model fits the data.
The residual standard error (RSE) and $R^2$ are commonly used. For the advertising data we have:

Quantity                        Value
Residual standard error (RSE)   1.69
$R^2$                           0.897
F-statistic                     570 (p-value << 0.0001)

The F-statistic is very large (the p-value is essentially zero), which gives us strong evidence for a relationship (we reject $H_0$ and accept $H_A$). By including all three predictors in the linear model, we have explained ≈ 90% of the variation in sales. The F-statistic does not tell us which predictors are important, only that at least one of the slope parameters is non-zero. To determine which predictors are important, further analysis is required.
Warning: $R^2$ increases (or remains the same) if more predictors are added to the model. Hence, we should not use $R^2$ for comparing models that contain different numbers of predictors. We will consider model selection in Section 5.

Extensions to the linear model
We can remove the additive assumption and allow for interaction effects. Consider the standard linear model with two predictors
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon. \]
An interaction term is included by adding a third predictor to the standard model
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon. \]
For example, spending money on radio advertising could increase the effectiveness of TV advertising.

Results for the advertising data
Consider the model
\[ \text{Sales} = \beta_0 + \beta_1 \text{TV} + \beta_2 \text{Radio} + \beta_3 (\text{TV} \times \text{Radio}) + \varepsilon. \]
The results are:

             Coefficient   Std. Error   t-statistic   p-value
Intercept    6.7502        0.248        27.23         <0.0001
TV           0.0191        0.002        12.70         <0.0001
Radio        0.0289        0.009         3.24          0.0014
TV × Radio   0.0011        0.000        20.73         <0.0001

The estimated coefficient for the interaction term is statistically significant when TV and Radio are included in the model.
• There is strong evidence for $\beta_3 \neq 0$.
• $R^2$ has gone from ≈ 90% (the model with all three predictors) to ≈ 97% by including the interaction term. Note: these models have the same complexity (4 parameters, $\beta_0, \ldots, \beta_3$) and hence can be compared using $R^2$.
Hierarchical Principle: If we include an interaction term in the model, we should also include the main predictors (even if they are not significant).
• We include them for better interpretation (as interpretable results are often one of the reasons for choosing a linear model in the first place).

Extensions to the linear model
We can accommodate non-linear relationships using polynomial regression. Consider the simple linear model
\[ Y = \beta_0 + \beta_1 X + \varepsilon. \]
Non-linear relationships can be captured by including powers of $X$ in the model. For example, a quadratic model is
\[ Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon. \]
The model is linear in $\beta_0$, $\beta_1$ and $\beta_2$. The hierarchical principle applies here as well. If you include $X^2$ in your model, you should also include $X$ (even if it is not significant).

Polynomial regression: Auto data
[Figure: miles per gallon against horsepower for the Auto data, with linear, degree 2 and degree 5 polynomial fits.]
The linear fit fails to capture the non-linear structure in the training data. To objectively choose the best fit, we need to use model selection methods (more about this in Section 5). We cannot use $R^2$ because
\[ R^2 \text{ for degree 1} \le R^2 \text{ for degree 2} \le R^2 \text{ for degree 5}. \]
Subjectively, we could argue that the quadratic model fits best (the degree 5 polynomial is more complex and does not appear to capture much more than the quadratic).

Results for the auto data
The figure suggests that
\[ \text{mpg} = \beta_0 + \beta_1 \text{Horsepower} + \beta_2 \text{Horsepower}^2 + \varepsilon \]
may fit the data better than a simple linear model. The results are:

                Coefficient   Std. Error   t-statistic   p-value
Intercept       56.9001       1.8004        31.6         <0.0001
Horsepower      -0.4662       0.0311       -15.0         <0.0001
Horsepower^2     0.0012       0.0001        10.1         <0.0001

Horsepower and Horsepower^2 are significant and hence useful for predicting mpg.
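Polynomial regression is multiple linear regression on the columns $1, X, X^2, \ldots$, so it can be fitted by the same least squares machinery. The sketch below solves the normal equations with a small hand-written Gaussian elimination so that it needs no external packages. This is an illustrative choice rather than the method R's lm uses, and it becomes numerically unreliable for high degrees. All function names and the toy data are assumptions; the data are not the Auto data.

```python
def solve_linear_system(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        tail = sum(m[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (m[r][n] - tail) / m[r][r]
    return z


def polynomial_fit(x, y, degree):
    """Least squares coefficients [b0, ..., b_degree] from the normal equations."""
    rows = [[xi ** k for k in range(degree + 1)] for xi in x]  # 1, x, x^2, ...
    size = degree + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(size)]
           for i in range(size)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(size)]
    return solve_linear_system(xtx, xty)


def rss_and_tss(x, y, coefs):
    """Residual and total sums of squares for a fitted polynomial."""
    y_bar = sum(y) / len(y)
    fitted = [sum(b * xi ** k for k, b in enumerate(coefs)) for xi in x]
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return rss, tss


# Hypothetical toy data with a curved trend.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [9.8, 7.1, 5.2, 3.9, 3.1, 3.0, 3.4, 4.6]
n = len(x)

r2_by_degree = {}
for degree in (1, 2, 3):
    rss, tss = rss_and_tss(x, y, polynomial_fit(x, y, degree))
    r2_by_degree[degree] = 1.0 - rss / tss
    print(f"degree {degree}: R^2 = {r2_by_degree[degree]:.4f}")

# Overall F-statistic for the quadratic model (p = 2 predictors: x and x^2).
p = 2
rss, tss = rss_and_tss(x, y, polynomial_fit(x, y, p))
f_statistic = ((tss - rss) / p) / (rss / (n - p - 1))
print(f"quadratic model: F = {f_statistic:.1f}")
```

Because each lower-degree model is nested in the higher-degree one, the training $R^2$ cannot decrease as the degree grows, which is why $R^2$ alone cannot be used to choose the degree.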
What we did not cover
• Qualitative predictors need to be coded using dummy variables for linear regression (R does this automatically for us).
• Deciding on important variables.
• Outliers (unusual $y$ value) and high leverage points ($x$ value far from $\bar{x}$).
• Non-constant variance and correlation of error terms.
• Collinearity.
The textbook covers these topics and you are encouraged to read these sections if you are unfamiliar with this material. A basic knowledge of linear regression is essential to fully appreciate the methods covered in this course. The previous lecture slides (and the course textbook) provide this basic knowledge.