Machine Learning and Data Mining in Business

Lecture 7: Nonlinear Modelling

Discipline of Business Analytics



Learning objectives

• Generalised additive models.

• Regression splines.

• Smoothing splines.


1. Nonlinear modelling

2. Basis functions

3. Regression splines

4. Smoothing splines and penalised splines

5. Curse of dimensionality

6. Generalised additive models

Nonlinear modelling

Generalised additive model

Our objective is to move beyond the linear predictive function

f(x) = β0 + Σ_{j=1}^p βj xj,

to study learning algorithms that can estimate nonlinear

relationships.

The generalised additive model (GAM) is

f(x) = β0 + Σ_{j=1}^p fj(xj),

where we choose a suitable model fj(xj) for each predictor.

Generalised additive model

f(x) = β0 + Σ_{j=1}^p fj(xj)

• The key question when specifying a GAM is how to model each term fj(xj).

• The model for fj(xj) is known as a smoother in this context.

• Therefore, most of this lecture will focus on methods for learning a univariate function f(x).

Under-smoothing

Insufficient smoothing leads to overfitting.

Over-smoothing

Too much smoothing leads to underfitting.

Adequate smoothing

Polynomial regression

A basic method is the polynomial regression model

f(x) = β0 + β1x + β2x^2 + … + βd x^d,

where d is the polynomial degree.
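A minimal sketch of polynomial regression as a linear model on powers of x: build the design matrix [1, x, x^2, …, x^d] and fit it by ordinary least squares (noise is omitted here so the fit is exact; all names are illustrative).

```python
import numpy as np

# Polynomial regression: a linear regression on the powers of x.
x = np.linspace(0.0, 1.0, 50)
y = 1.0 + 2.0 * x - 3.0 * x**2          # a degree-2 target function

d = 2                                    # polynomial degree
X = np.vander(x, N=d + 1, increasing=True)  # columns: 1, x, x^2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def f(x_new):
    """Predict with the fitted polynomial."""
    x_new = np.atleast_1d(np.asarray(x_new, dtype=float))
    return np.vander(x_new, N=d + 1, increasing=True) @ beta
```

Because the model is linear in β, everything from linear regression (least squares, regularisation, inference) carries over unchanged.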

Example: Fuel Economy


Example: cubic polynomial

Polynomial regression

Simplicity: the model is a linear regression based on an extended set of features.

Flexibility: a polynomial of sufficiently high degree can approximate any continuous function as closely as desired inside an interval.

Polynomial regression

Seeks a global fit that can be highly unstable: observations in one region, especially outliers, can seriously affect the fit in another region.

Polynomials are unstable near the boundaries of the data. You should never extrapolate polynomial regressions to generate predictions outside the observed range of the predictor.

Polynomials can overfit. A polynomial of degree d can interpolate d + 1 points exactly, so increasing d produces a wiggly curve that gets close to the training data but predicts poorly.

Example: polynomial overfitting

k-Nearest Neighbours

In contrast to polynomial regression, kNN is completely local.

k-Nearest Neighbours

Very flexible: with sufficient data, it can approximate any reasonable regression function without the need for structural assumptions.

Lack of smoothness.

High variance since each prediction is based on only a few training points.
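A minimal sketch of kNN regression for one predictor, illustrating how local the method is: the prediction at a point is simply the average response of the k nearest training points (function names are illustrative).

```python
import numpy as np

def knn_predict(x_train, y_train, x0, k=5):
    """Average the responses of the k training points closest to x0."""
    dist = np.abs(x_train - x0)
    nearest = np.argsort(dist)[:k]   # indices of the k closest points
    return y_train[nearest].mean()

x_train = np.linspace(0, 10, 100)
y_train = np.sin(x_train)
pred = knn_predict(x_train, y_train, 5.0, k=3)
```

With small k each prediction depends on only a handful of points, which is exactly why kNN has high variance.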

Basis functions

Basis functions

Polynomial regression:

f(x) = β0 + β1x + β2x^2 + … + βd x^d.

• The key idea of several learning algorithms is to augment the input vector x with additional features which are transformations of x.

• We then specify models that are linear on the derived features.

Basis functions

Let hm(x) be a deterministic transformation of x, m = 1,…,M. We call hm a basis function. The model

f(x) = β0 + Σ_{m=1}^M βm hm(x)

is a linear basis expansion in x.

Basis functions: examples

Linear basis expansion:

f(x) = β0 + Σ_{m=1}^M βm hm(x)

• h1(x) = x, M = 1 recovers the original linear model.

• h1(x) = x, h2(x) = x^2, h3(x) = x^3, M = 3 leads to a cubic regression.

• hm(x) = log(x), hm(x) = √x, etc., permit other nonlinear transformations of the predictor.
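The examples above can be sketched in a few lines: the basis functions below (log and square root, chosen purely for illustration) are nonlinear in x, but the model remains linear in the coefficients and is fitted by ordinary least squares.

```python
import numpy as np

# Linear basis expansion with h1(x) = log(x), h2(x) = sqrt(x).
x = np.linspace(1, 10, 40)
y = 2.0 + 0.5 * np.log(x) + 1.5 * np.sqrt(x)   # noiseless toy target

# Design matrix: intercept plus the two derived features.
H = np.column_stack([np.ones_like(x), np.log(x), np.sqrt(x)])
beta, *_ = np.linalg.lstsq(H, y, rcond=None)
```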

Regression splines

Linear spline

A linear spline is a linear regression with changing slopes at discrete points. If there is one split point (called a knot) at ξ, the model is:

f(x)=β0 +β1x+β2(x−ξ)+,

where we define (x − ξ)+ as

(x − ξ)+ = 0 if x ≤ ξ, and (x − ξ)+ = x − ξ if x > ξ.

We also write (x − ξ)+ = I(x > ξ)(x − ξ).
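The truncated basis function (x − ξ)+ is a one-liner in numpy (the name `hinge` is illustrative):

```python
import numpy as np

def hinge(x, xi):
    """The truncated linear basis (x - xi)_+ : zero below the knot."""
    return np.maximum(x - xi, 0.0)
```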

Example: linear spline with one knot


Regression Splines

The model

f(x) = β0 + β1x + β2(x − ξ)+

has the interpretation

f(x) = β0 + β1x if x ≤ ξ, and f(x) = β0 + β1x + β2(x − ξ) if x > ξ.

Note that at point x = ξ the two linear components take the same value. Thus, the resulting function f is continuous.

Piecewise linear regressions

Figure from ESL

Example: piecewise linear regression

Linear splines

The linear spline model with K knots ξ1, ξ2, …, ξK is

f(x) = β0 + β1x + β2(x − ξ1)+ + … + βK+1(x − ξK)+.

Note that this representation uses K + 1 basis functions: h1(x) = x, h2(x) = (x − ξ1)+, …, hK+1(x) = (x − ξK)+.
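A minimal sketch of fitting this model: expand x into the basis above and run least squares. The toy target below changes slope from 1 to 3 at x = 5, so the fitted coefficient on the hinge term should be 2 (all names are illustrative).

```python
import numpy as np

def linear_spline_design(x, knots):
    """Design matrix: intercept, x, and (x - xi_k)_+ for each knot."""
    cols = [np.ones_like(x), x]
    cols += [np.maximum(x - xi, 0.0) for xi in knots]
    return np.column_stack(cols)

x = np.linspace(0, 10, 200)
y = np.where(x < 5, x, 5 + 3 * (x - 5))   # true slope changes at x = 5
X = linear_spline_design(x, knots=[5.0])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```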

Example: linear spline with two knots

Regression splines

A regression spline is a piecewise degree-d polynomial regression that restricts the regression function f(x) to be continuous and have continuous first d − 1 derivatives.

This general approach extends the idea of splines by fitting polynomials instead of linear functions in each region.

Piecewise cubic polynomials

Figure from ESL

Cubic spline

The cubic spline model with K knots is

f(x) = β0 + β1x + β2x^2 + β3x^3 + β4(x − ξ1)^3_+ + … + βK+3(x − ξK)^3_+, which has K + 3 basis functions.

For example, when there is K = 1 knot:

f(x) = β0 + β1x + β2x^2 + β3x^3 + β4(x − ξ1)^3_+.
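The cubic spline design matrix is built the same way as the linear one, with polynomial terms up to x^3 plus one truncated cubic per knot (names are illustrative):

```python
import numpy as np

def cubic_spline_design(x, knots):
    """Columns: 1, x, x^2, x^3, and (x - xi_k)^3_+ for each knot."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.maximum(x - xi, 0.0) ** 3 for xi in knots]
    return np.column_stack(cols)

x = np.linspace(0, 1, 20)
X = cubic_spline_design(x, knots=[0.5])
# K = 1 knot gives K + 3 = 4 basis functions, so 5 columns
# once the intercept column is included.
```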

Example: Cubic Spline

Regression splines vs. polynomial regression

• Regression splines are more flexible than polynomials and tend to perform better for prediction.

• The reason is that we can increase the flexibility of a regression spline by increasing the number of knots to make it more local.

• This leads to more stable estimates compared to increasing d, which is the only option for polynomial regression.

Illustration: regression splines vs. polynomial regression


Natural splines

• Natural splines add the constraint that the function is linear at the boundary (i.e., to the left of the first knot and to the right of the last knot).

• This increases the stability of the fit near the boundaries of the data, where it would otherwise tend to have high variance.

Natural splines

Figure from ISL: comparison of a natural cubic spline and a cubic spline (Wage data, age on the horizontal axis).

Illustration: natural spline vs. polynomial regression

Figure from ISL

Regression splines: modelling choices

1. Polynomial order: linear and cubic splines are the most common choices.

2. Placement of the knots: typically at uniformly spaced quantiles. For example, if there is one knot then we would place it at the sample median. If there are three, we would place them at the first quartile, median, and third quartile.

3. Number of knots: model selection.
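The quantile-based knot placement in point 2 can be sketched as follows: for K knots, use the k/(K + 1) quantiles of the predictor, k = 1, …, K (the function name is illustrative).

```python
import numpy as np

def quantile_knots(x, K):
    """Place K knots at the k/(K+1) sample quantiles of x."""
    probs = np.arange(1, K + 1) / (K + 1)
    return np.quantile(x, probs)

x = np.arange(1, 101)   # predictor values 1, 2, ..., 100
# K = 1 puts the knot at the median; K = 3 at the quartiles.
```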

Degrees of freedom

The degrees of freedom or effective number of parameters of a regression estimator is

df(f) = Σ_{i=1}^n Cov(Yi, f(xi)) / σ^2,

where we assume an additive error model.

The degrees of freedom is a measure of model complexity. For a linear regression without regularisation, the degrees of freedom equals the number of parameters.

Degrees of freedom

• A cubic spline with K interior knots uses K + 3 degrees of freedom (excluding the intercept).

• A natural cubic spline uses K − 1 degrees of freedom (excluding the intercept).

• Advanced: in practice, natural splines include boundary knots that correspond to the minimum and maximum value of the input. In this case, the degrees of freedom equal the number of interior knots plus one (excluding the intercept).

Smoothing splines and penalised splines

Smoothing splines

A smoothing spline solves the regularised risk minimisation problem

min_f Σ_{i=1}^n (yi − f(xi))^2 + λ ∫ f′′(x)^2 dx,

where λ ≥ 0 is a roughness penalty, f′′(x) is the second derivative of the function, and ∫ f′′(x)^2 dx measures the roughness of the function.

Smoothing splines

Σ_{i=1}^n (yi − f(xi))^2 + λ ∫ f′′(x)^2 dx

• Remember from calculus that the first derivative f′(x) measures the slope of a function at x, while the second derivative measures how fast the slope increases or decreases as we change x.

• Hence, the second derivative of a function indicates its roughness: it is large in absolute value if f(x) is very wiggly near x, and it is close to zero otherwise.

• ∫ f′′(x)^2 dx is a measure of the total change in f′(x) over its entire range.

Smoothing splines

Smoothing spline:

Σ_{i=1}^n (yi − f(xi))^2 + λ ∫ f′′(x)^2 dx

• λ = 0: the solution f can be any function that interpolates the data. There are many such solutions.

• λ = ∞: the simple least squares line fit, since the penalty tolerates no second derivative (f′′(x) = 0 everywhere, hence the fit is linear).

• λ ∈ (0, ∞) allows functions that vary from very rough to very smooth. We hope that in between, there is an interesting class of functions that leads to good generalisation.

Smoothing splines

Remarkably, the unique minimiser is a natural cubic spline with knots at x1, x2, . . . , xn. Because the solution is a natural spline, it has the form

f(x) = β0 + Σ_j βj gj(x),

where gj(x) is the j-th basis function for the natural spline evaluated at x. However, unlike in the previous section, the solution is regularised.
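As an illustration, SciPy's `UnivariateSpline` provides a smoothing-spline-style fit. Note one caveat: its parameter `s` is a target for the residual sum of squares rather than the roughness penalty λ itself, but it plays the same role, with larger `s` giving a smoother fit.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(1)
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)

# Set s near n * (noise variance) so residuals match the noise level.
smooth = UnivariateSpline(x, y, s=len(x) * 0.2**2)
pred = smooth(x)   # evaluate the fitted curve at the training points
```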

Choosing the penalty parameter

• We choose the penalty parameter by varying the degrees of freedom of the model, which we can directly compute.

• As with regression splines, we can quickly compute the leave-one-out cross-validation (LOOCV) error for smoothing splines.

Smoothing splines vs. regression splines

• A smoothing spline avoids the choice of the number and placement of the knots, since it places knots at each unique data point.

• Even though the smoothing spline solution has many parameters, the use of regularisation avoids overfitting.

• On the other hand, smoothing splines have a higher computational cost which increases with the number of observations.

Penalised splines

• Penalised splines (P-splines) are a compromise between regression splines and smoothing splines.

• In practice, the smoothing spline uses many more knots than necessary for the model to perform well, leading to wasted computational effort.

• In penalised splines, we specify regression splines with potentially many knots and perform regularised estimation as in smoothing splines.

Curse of dimensionality

Curse of dimensionality

• The curse of dimensionality refers to the fact that many machine learning problems become substantially more difficult as the number of inputs increases.

• In highly flexible exemplar-based models such as kNN and local regression, generalisation becomes exponentially harder as the number of inputs increases.

• In highly flexible parameter-based models, the number of parameters may increase quickly (at worst, exponentially) with the number of inputs.

Illustration

Figure from ESL

Inductive bias

• Inductive bias refers to the set of assumptions that the model uses to generalise. That is, any part of a model that is not learned.

• In practice, learning algorithms must embody knowledge or assumptions in order to predict well.

Generalised additive models

Generalised additive models

We’re now ready to return to the GAM,

f(x) = β0 + f1(x1) + f2(x2) + … + fp(xp),

where we choose a suitable smoother fj(xj) for each predictor (linear, polynomial, kNN, regression spline, smoothing spline, penalised spline, local regression, etc.).

Generalised additive models

• Depending on the choice of smoothers, the GAM may be simply a linear model with many basis functions. In this case, we use standard methods to train the GAM.

• Otherwise, we can use an approach known as backfitting to train the model. This method fits a model involving multiple inputs by repeatedly updating each term fj(xj) in turn, holding the others fixed.
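The backfitting idea above can be sketched in numpy. This is a toy version, assuming a crude running-mean smoother (any univariate smoother from this lecture could be substituted); all names are illustrative.

```python
import numpy as np

def smooth_1d(x, r, bandwidth=0.5):
    """Crude running-mean smoother evaluated at the training points."""
    fitted = np.empty_like(r)
    for i, xi in enumerate(x):
        w = np.abs(x - xi) <= bandwidth
        fitted[i] = r[w].mean()
    return fitted

def backfit(X, y, n_iter=20):
    """Fit an additive model by cycling over the terms: each f_j is
    re-fitted to the partial residuals with the other terms held fixed."""
    n, p = X.shape
    beta0 = y.mean()
    f = np.zeros((p, n))                       # fitted term values
    for _ in range(n_iter):
        for j in range(p):
            partial = y - beta0 - f[np.arange(p) != j].sum(axis=0)
            fj = smooth_1d(X[:, j], partial)
            f[j] = fj - fj.mean()              # centre each term
    return beta0, f

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)
beta0, f = backfit(X, y)
fitted = beta0 + f.sum(axis=0)
```

Centring each term after its update keeps the decomposition identified, since otherwise constants could shift freely between β0 and the fj.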

Generalised additive models

GAMs allow us to fit nonlinear functions fj(xj) for each predictor, so that we can automatically model nonlinear relationships that a linear model would miss.

The additive structure avoids the curse of dimensionality.

GAMs often display good predictive performance.

Interpretability.

We can summarise the complexity of each function fj(xj) by the degrees of freedom.

Generalised additive models

The model is restricted to be additive, which can miss important interaction effects.

The backfitting algorithm can have high computational cost.

GAMs require you to specify a suitable smoother for each predictor, which may be inconvenient in practice.
