
Lecture 1: Introduction to Forecasting

UCSD, January 9 2017

Allan Timmermann1

1UC San Diego

Timmermann (UCSD) Forecasting Winter, 2017 1 / 64

1 Course objectives

2 Challenges facing forecasters

3 Forecast Objectives: the Loss Function

4 Common Assumptions on Loss

5 Specific Types of Loss Functions

6 Multivariate loss

7 Does the loss function matter?

8 Informal Evaluation Methods

9 Out-of-Sample Forecast Evaluation

10 Some easy and hard to predict variables

11 Weak predictability but large economic gains


Course objectives: Develop

Skills in analyzing, modeling and working with time series data from finance and economics

Ability to construct forecasting models and generate forecasts

formulating a class of models – using information intelligently

model selection

estimation – making best use of historical data

Develop creativity in posing forecasting questions, collecting and using often incomplete data

which data help me build a better forecasting model?

Ability to critically evaluate and compare forecasts

reasonable (simple) benchmarks

skill or luck? Overfitting (data mining)

Compete or combine?

Ranking forecasters: Mexican inflation


Forecast situations

Forecasts are used to guide current decisions that affect the future welfare of a decision maker (forecast user)

Predicting my grade – updating information on the likely grade as the course progresses

Choosing between a fixed-rate mortgage (interest rate fixed for 20 years) versus a floating-rate (variable) mortgage

Depends on interest rate and inflation forecasts

Political or sports outcomes – prediction markets

Investing in the stock market. How volatile will the stock market be?

Predicting Chinese property prices. Supply and demand considerations, economic growth

Structural versus reduced-form approaches

Depends on the forecast horizon: 1 month vs. 10 years

Forecasting and decisions

Credit card company deciding which transactions are potentially fraudulent and should be denied (in real time)

requires fitting a model to past credit card transactions

binary data (zero-one)

Central Bank predicting the state of the economy – timing issues

Predicting which fund manager (if any) or asset class will outperform

Forecasting the outcome of the World Cup:
http://www.goldmansachs.com/our-thinking/outlook/world-cup-sections/world-cup-book-2014-statistical-model.html

Forecasting the outcome of the world cup


Key issues

Decision maker’s actions depend on predicted future outcomes

Trade off the relative costs of over- or underpredicting outcomes

Actions and forecasts are inextricably linked

good forecasts are expected to lead to good decisions

bad forecasts are expected to lead to poor decisions

A forecast is an intermediate input in a decision process, rather than an end product of separate interest

The loss function weighs the cost of possible forecast errors – just as a utility function uses preferences to weigh different outcomes

Loss functions

Forecasts play an important role in almost all decision problems where a decision maker’s utility or wealth is affected by his current and future actions and depends on unknown future events

Central Banks

Forecast inflation, unemployment, GDP growth

Action: interest rate; monetary policy

Trade off cost of over- vs. under-predictions

Firms

Forecast sales

Action: production level, new product launch

Trade off inventory vs. stock-out/goodwill costs

Money managers

Forecast returns (mean, variance, density)

Action: portfolio weights/trading strategy

Trade off Risk vs. return


Ways to generate forecasts

Rule of thumb. A simple decision rule that is not optimal, but may be robust

Judgmental/subjective forecast, e.g., expert opinion

Combine with other information/forecasts

Quantitative models

“… an estimated forecasting model provides a characterization of what we expect in the present, conditional upon the past, from which we infer what to expect in the future, conditional upon the present and the past. Quite simply, we use the estimated forecasting model to extrapolate the observed historical data.” (Frank Diebold, Elements of Forecasting)

Combine different types of forecasts

Forecasts: key considerations

Forecasting models are simplified approximations to a complex reality

How do we make the right shortcuts?

Which methods seem to work in general or in specific situations?

Economic theory may suggest relevant predictor variables, but is silent about the functional form and dynamics of the forecasting model

combine art (judgment) and science

how much can we learn from the past?

Forecast object – what are we trying to forecast?

Event outcome: predict whether a certain event will happen

Will a bank or hedge fund close?

Will oil prices fall below $40/barrel in 2017?

Will Europe experience deflation in 2017?

Event timing: it is known that an event will happen, but unknown when it will occur

When will US stocks enter a “bear” market (Dow drops by 10%)?

Time series: forecasting future values of a continuous variable by means of current and past data

Predicting the level of the Dow Jones Index on March 15, 2017

Forecast statement

Point forecast

A single number summarizing the “best guess”. It carries no information on how certain or precise the point forecast is. Random shocks affect all time series, so a non-zero forecast error is to be expected even from a very good forecast

Ex: US GDP growth for 2017 is expected to be 2.5%

Interval forecast

Lower and upper bound on the outcome. Gives a range of values inside which we expect the outcome to fall with some probability (e.g., 50% or 95%) – a confidence interval for the predicted variable. The length of the interval conveys information about forecast uncertainty

Ex: 90% chance that US GDP growth will fall between 1% and 4%

Density or probability forecast

The entire probability distribution of the future outcome

Ex: US GDP growth for 2017 is Normally distributed, N(2.5, 1)

Forecast horizon

The best forecasting model is likely to depend on whether we are forecasting 1 minute, 1 day, 1 month or 1 year ahead

We refer to an h-step-ahead forecast, where h (short for “horizon”) is the number of time periods ahead that we predict

Often you hear the argument that “fundamentals matter in the long run, psychological factors are more important in the short run”

Information set

Do we simply use past values of the series itself, or do we include a larger information set?

Suppose we wish to forecast some outcome y for period T + 1 and have historical data on this variable for t = 1, ..., T. The univariate information set consists of the series itself up to time T:

I_T^univariate = {y_1, ..., y_T}

If data on other series z_t (typically an N × 1 vector) are available, we have a multivariate information set:

I_T^multivariate = {y_1, ..., y_T, z_1, ..., z_T}

It is often important to establish whether a forecast can benefit from using such additional information

Loss function: notations

Outcome: Y

Forecast: f

Forecast error: e = Y − f

Observed data: Z

Loss function: L(f, Y) → R

maps the inputs f, Y to the real number line R

yields a complete ordering of forecasts

describes in relative terms how costly it is to make forecast errors

Loss Function Considerations

The choice of a loss function that appropriately measures trade-offs is important for every facet of the forecasting exercise and affects

which forecasting models are preferred

how parameters are estimated

how forecasts are evaluated and compared

The loss function reflects the economics of the decision problem

Financial analysts’ forecasts: Hong and Kubik (2003), Lim (2001)

Analysts tend to bias their earnings forecasts (the walk-down effect)

Sometimes a forecast is best viewed as a signal in a strategic game that explicitly accounts for the forecast provider’s incentives

Constructing a loss function

For profit-maximizing investors the natural choice of loss is the function relating payoffs (through the trading rule) to the forecast and realized returns

Link between loss and utility functions: both are used to minimize the risk arising from economic decisions

Loss is sometimes viewed as the negative of utility:

U(f, Y) ≈ −L(Y, f)

The majority of forecasting papers use simple ‘off-the-shelf’ statistical loss functions such as Mean Squared Error (MSE)

Common Assumptions on Loss

Granger (1999) proposes three ‘required’ properties for error loss functions, L(f, y) = L(y − f) = L(e):

A1. L(0) = 0 (minimal loss of zero for a perfect forecast);

A2. L(e) ≥ 0 for all e;

A3. L(e) is monotonically non-decreasing in |e|:

L(e₁) ≥ L(e₂) if e₁ > e₂ > 0

L(e₁) ≥ L(e₂) if e₁ < e₂ < 0
A1: normalization

A2: imperfect forecasts are more costly than perfect ones

A3: regularity condition – bigger forecast mistakes are (weakly) costlier than smaller mistakes (of the same sign)

Additional Assumptions on Loss

Symmetry: L(e) = L(−e), i.e., positive and negative errors of the same size are equally costly

Granger and Newbold (1986, p. 125): “... an assumption of symmetry about the conditional mean ... is likely to be an easy one to accept ... an assumption of symmetry for the cost function is much less acceptable.”

Homogeneity: for some positive function h(a),

L(ae) = h(a)L(e)

scaling doesn’t matter

Differentiability of loss with respect to the forecast (regularity condition)

Squared Error (MSE) Loss

L(e) = ae²,  a > 0

Satisfies the three Granger properties

Homogeneous, symmetric, differentiable everywhere

Convex: penalizes large forecast errors at an increasing rate

Optimal forecast:

f* = arg min_f ∫ (y − f)² p_Y(y) dy

First order condition:

f* = ∫ y p_Y(y) dy = E(y)

The optimal forecast under MSE loss is the conditional mean
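This first-order condition can be checked numerically. A minimal sketch (the simulated Gaussian outcome distribution, seed, and search grid are illustrative assumptions, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.5, scale=1.0, size=100_000)  # simulated outcomes

def expected_mse_loss(f, y):
    """Sample analogue of E[(y - f)^2] for a constant point forecast f."""
    return np.mean((y - f) ** 2)

# Search a grid of candidate point forecasts for the loss minimizer
grid = np.linspace(0.0, 5.0, 501)
losses = [expected_mse_loss(f, y) for f in grid]
f_star = grid[int(np.argmin(losses))]

# The minimizer should coincide with the sample mean (with i.i.d. draws,
# the conditional mean is just the unconditional mean)
print(f_star, y.mean())
```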


Piece-wise Linear (lin-lin) Loss

L(e) = (1 − α)·e·1{e>0} − α·e·1{e≤0},  0 < α < 1

where 1{e>0} = 1 if e > 0 and 1{e>0} = 0 otherwise (an indicator variable)

Weight on positive forecast errors: (1− α)

Weight on negative forecast errors: α

Lin-lin loss satisfies the three Granger properties and is homogeneous and differentiable everywhere with respect to f, except at zero

Lin-lin loss does not penalize large errors as much as MSE loss

Mean absolute error (MAE) loss arises (up to a constant scale factor) when α = 1/2:

L(e) = |e|


MSE vs. piece-wise Linear (lin-lin) Loss

[Figure: lin-lin loss plotted against MSE loss for e ∈ [−3, 3], in three panels: α = 0.25, α = 0.5 (MAE loss), and α = 0.75]

Optimal forecast under lin-lin Loss

Expected loss under lin-lin loss:

E_Y[L(Y − f)] = (1 − α)·E[(Y − f)·1{Y>f}] − α·E[(Y − f)·1{Y≤f}]

First order condition:

f* = P_Y^{−1}(1 − α)

P_Y: the CDF of Y

The optimal forecast is the (1 − α) quantile of Y

α = 1/2: the optimal forecast is the median of Y

As α increases towards one, the optimal forecast moves further into the left tail of the predicted outcome distribution
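The quantile result can be verified by brute force. A sketch (standard normal outcomes and α = 0.75 are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.standard_normal(200_000)  # outcomes drawn from N(0, 1)

def linlin_loss(f, y, alpha):
    """Sample analogue of E[(1-a)*e*1{e>0} - a*e*1{e<=0}], with e = y - f."""
    e = y - f
    return np.mean(np.where(e > 0, (1 - alpha) * e, -alpha * e))

alpha = 0.75
grid = np.linspace(-2.0, 2.0, 801)
losses = [linlin_loss(f, y, alpha) for f in grid]
f_star = grid[int(np.argmin(losses))]

# Theory: f* is the (1 - alpha) quantile; for alpha = 0.75 and N(0,1)
# this is the 25% quantile, roughly -0.67
print(f_star, np.quantile(y, 1 - alpha))
```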


Optimal forecast of N(0,1) variable under lin-lin loss

[Figure: the optimal forecast f* of an N(0,1) variable under lin-lin loss, plotted against α ∈ [0, 1]; f* declines from about 2.5 to about −2.5 as α rises]

Linex Loss

L(e) = exp(a₂e) − a₂e − 1,  a₂ ≠ 0

Differentiable everywhere

Asymmetric: a₂ controls both the degree and direction of asymmetry

a₂ > 0: loss is approximately linear for e < 0 and approximately exponential for e > 0

Large underpredictions are very costly (f < y, so e = y − f > 0)

The converse is true when a₂ < 0

MSE versus Linex Loss

[Figure: MSE loss plotted against right-skewed linex loss (a₂ = 1) and left-skewed linex loss (a₂ = −1) for e ∈ [−3, 3]]

Linex Loss (cont.)

Suppose Y ∼ N(μ_Y, σ_Y²). Then

E[L(e)] = exp(a₂(μ_Y − f) + (a₂²/2)σ_Y²) − a₂(μ_Y − f) − 1

Optimal forecast:

f* = μ_Y + (a₂/2)σ_Y²

Under linex loss the optimal forecast depends on both the mean and variance of Y (μ_Y and σ_Y²), as well as on the curvature parameter of the loss function, a₂

Optimal bias under Linex Loss for N(0,1) variable

[Figure: distribution of forecast errors under MSE loss (centered at zero) and under linex loss with a₂ = 1 and a₂ = −1 (shifted away from zero)]

Multivariate Loss Functions

Multivariate MSE loss with n errors e = (e₁, ..., e_n)′:

MSE(A) = E[e′Ae]

A is a positive semi-definite n × n matrix

This satisfies the basic assumptions for a loss function

When A = I_n, covariances can be ignored and the loss function simplifies to MSE(I_n) = E[e′e] = ∑_{i=1}^{n} E[e_i²], i.e., the sum of the individual mean squared errors

Does the loss function matter?

Cenesizoglu and Timmermann (2012) compare statistical and economic measures of forecasting performance across a large set of stock return prediction models with time-varying mean and volatility

Economic performance is measured through the certainty equivalent return (CER), i.e., the risk-adjusted return

Statistical performance is measured through the mean squared error (MSE)

Performance is measured relative to that of a constant expected return (prevailing mean) benchmark

It is common for forecast models to produce a worse mean squared error but better return performance than the benchmark

The relation between statistical and economic measures of forecasting performance can be weak

Percentage of models with worse statistical but better economic performance than prevailing mean (CT, 2012)

[Table: CER is the certainty equivalent return; Sharpe is the Sharpe ratio; RAR is risk-adjusted return; RMSE is root mean squared (forecast) error]

Example: Directional Trading System

Consider the decisions of a risk-neutral ‘market timer’ whose utility is linear in the return on the market portfolio (y):

U(δ(f), y) = δy

The investor’s decision rule δ(f): go ‘long’ one unit in the risky asset if a positive return is predicted (f > 0), otherwise go short one unit:

δ(f) = +1 if f ≥ 0, and δ(f) = −1 if f < 0
Let sign(y) = 1 if y > 0, otherwise sign(y) = 0. Payoff:

U(y, δ(f)) = (2·sign(f) − 1)·y

Sign and magnitude of y and sign of f matter to trader’s utility
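The decision rule and payoff can be sketched in a few lines (a toy illustration; the numerical return values are invented):

```python
def direction(f):
    """delta(f): long one unit if the forecast is nonnegative, else short one unit."""
    return 1.0 if f >= 0 else -1.0

def payoff(f, y):
    """Realized utility of the risk-neutral market timer: delta(f) * y."""
    return direction(f) * y

# Only the sign of the forecast matters, not its magnitude
print(payoff(0.1, 0.03), payoff(5.0, 0.03), payoff(-0.1, 0.03))
```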


Example: Directional Trading system (cont.)

Which forecast approach is best under the directional trading rule?

Since the trader ignores information about the magnitude of the

forecast, an approach that focuses on predicting only the sign of the

excess return could make sense

Leitch and Tanner (1991) studied forecasts of T-bill futures:

Professional forecasters reported predictions with higher mean squared

error (MSE) than those from simple time-series models

Puzzling since the time-series models incorporate far less information

than the professional forecasts

When measured by their ability to generate profits or correctly forecast

the direction of future interest rate movements the professional

forecasters did better than the time-series models

Professional forecasters’ objectives are poorly approximated by MSE loss – they are closer to directional or ‘sign’ loss

Common estimates of forecasting performance

Define the forecast error e_{t+h|t} = y_{t+h} − f_{t+h|t}. Then

MSE = T^{−1} ∑_{t=1}^{T} e_{t+h|t}²

RMSE = √( T^{−1} ∑_{t=1}^{T} e_{t+h|t}² )

MAE = T^{−1} ∑_{t=1}^{T} |e_{t+h|t}|

Directional accuracy (DA): let I_{x>0} = 1 if x > 0 and I_{x>0} = 0 otherwise. Then an estimate of DA is

DA = T^{−1} ∑_{t=1}^{T} I_{y_{t+h}·f_{t+h|t} > 0}
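The four estimates above can be computed in a few lines. A sketch (the toy outcome and forecast series are invented for illustration):

```python
import numpy as np

def forecast_accuracy(y, f):
    """MSE, RMSE, MAE and directional accuracy for forecasts f of outcomes y."""
    y, f = np.asarray(y), np.asarray(f)
    e = y - f
    mse = np.mean(e ** 2)
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(e)),
        # fraction of periods in which forecast and outcome share the same sign
        "DA": np.mean(y * f > 0),
    }

# Toy series of returns and forecasts
y = np.array([0.02, -0.01, 0.03, -0.02])
f = np.array([0.01, 0.01, 0.02, -0.03])
print(forecast_accuracy(y, f))
```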


Forecast evaluation

ft+h|t : forecast of yt+h given information available at time t

Given a sequence of forecasts f_{t+h|t} and outcomes y_{t+h}, t = 1, ..., T, it is natural to ask whether the forecast was “optimal” or obviously deficient

Questions posed by forecast evaluation are related to the

measurement of predictive accuracy

Absolute performance measures the accuracy of an individual

forecast relative to the outcome, using either an economic

(loss-based) or a statistical metric

Relative performance compares the performance of one or several

forecasts against some benchmark


Forecast evaluation (cont.)

Forecast evaluation amounts to understanding whether the loss from a given forecast is “small enough”

Informal methods – graphical plots, decompositions

Formal methods – the distribution of test statistics for sample averages of loss estimates can depend on how the forecasts were constructed, e.g. which estimation method was used

The method (not only the model) used to construct the forecast matters – expanding vs. rolling estimation window

Formal evaluation of an individual forecast requires testing whether the forecast is optimal with respect to some loss function and a specific information set

Rejection of forecast optimality suggests that the forecast can be improved

Efficient Forecast: Definition

A forecast is efficient (optimal) if no other forecast using the available data, x_t ∈ I_t, can be used to generate a smaller expected loss

Under MSE loss:

f̂*_{t+h|t} = arg min_{f̂(x_t)} E[(y_{t+h} − f̂(x_t))²]

If we could use information in I_t to produce a more accurate forecast, then the original forecast would be suboptimal

Efficiency is conditional on the information set:

weak-form forecast efficiency tests include only past forecasts and past outcomes, I_t = {y_t, y_{t−1}, ..., f̂_{t|t−1}, e_{t|t−1}, ...}

strong-form efficiency tests extend this to include all other variables x_t ∈ I_t

Optimality under MSE loss

First order condition for an optimal forecast under MSE loss:

E[ ∂(y_{t+h} − f_{t+h|t})² / ∂f_{t+h|t} ] = −2E[y_{t+h} − f_{t+h|t}] = −2E[e_{t+h|t}] = 0

Similarly, conditional on information at time t, It :

E [et+h|t |It ] = 0

The expected value of the forecast error must equal zero given current information, I_t

Test E[e_{t+h|t}·x_t] = 0 for all variables x_t ∈ I_t known at time t

If the forecast is optimal, no variable known at time t can predict its future forecast error e_{t+h|t}; otherwise the forecast wouldn’t be optimal

If I can predict that my forecast will be too low, I should increase my forecast
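One way to operationalize this test is to regress realized forecast errors on a candidate predictor x_t and check whether the slope is zero. A simulated sketch (the data-generating process, sample size, and slope of 0.5 are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
T = 5_000
x = rng.standard_normal(T)      # predictor known at time t
eps = rng.standard_normal(T)    # genuinely unpredictable noise

e_bad = 0.5 * x + eps   # deficient forecast: its error is predictable from x
e_good = eps            # efficient forecast: its error is pure noise

def error_slope(e, x):
    """OLS slope from regressing forecast errors on a time-t predictor."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, e, rcond=None)
    return beta[1]

print(error_slope(e_bad, x))   # near 0.5: the forecast can be improved
print(error_slope(e_good, x))  # near 0: no detectable inefficiency
```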


Optimality properties under Squared Error Loss

1. Optimal forecasts are unbiased: the forecast error e_{t+h|t} has zero mean, both conditionally and unconditionally:

E[e_{t+h|t}] = E[e_{t+h|t} | I_t] = 0

2. h-period forecast errors e_{t+h|t} are uncorrelated with information available at the time the forecast was computed (I_t). In particular, single-period forecast errors e_{t+1|t} are serially uncorrelated:

E[e_{t+1|t}·e_{t|t−1}] = 0

3. The variance of the forecast error e_{t+h|t} increases (weakly) in the forecast horizon h:

Var(e_{t+h+1|t}) ≥ Var(e_{t+h|t}) for all h ≥ 1

Optimality properties under Squared Error Loss (cont.)

Forecasts should be unbiased. Why? If they were biased, we could improve the forecast simply by correcting for the bias

Suppose f_{t+1|t} is biased:

y_{t+1} = 1 + f_{t+1|t} + ε_{t+1},  ε_{t+1} ∼ WN(0, σ²)

The bias-corrected forecast

f*_{t+1|t} = 1 + f_{t+1|t}

is more accurate than f_{t+1|t}

Forecast errors should be unpredictable:

Suppose y_{t+1} − f_{t+1|t} = e_{t+1} = 0.5e_t + ε_{t+1}, so the one-step forecast error is serially correlated

Adding back 0.5e_t to the original forecast yields a more accurate forecast: f*_{t+1|t} = f_{t+1|t} + 0.5e_t is better than f_{t+1|t}
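The serial-correlation example can be verified by simulation. A sketch (the AR(1) coefficient of 0.5 follows the example; the sample size and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
T = 10_000
eps = rng.standard_normal(T)

# Serially correlated forecast errors: e_{t+1} = 0.5 e_t + eps_{t+1}
e = np.zeros(T)
for t in range(1, T):
    e[t] = 0.5 * e[t - 1] + eps[t]

# Error left over after adding back 0.5*e_t to the original forecast
corrected = e[1:] - 0.5 * e[:-1]

# Var(e) is about 1/(1 - 0.25) ~ 1.33; the corrected error has variance ~ 1
print(np.mean(e[1:] ** 2), np.mean(corrected ** 2))
```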

The variance of the forecast error increases in the forecast horizon: we learn more as we get closer to the forecast “target”

Informal evaluation methods (Greenbook forecasts)

Time-series graph of forecasts and outcomes {f_{t+h|t}, y_{t+h}}, t = 1, ..., T

[Figure: actual vs. forecast annualized changes, 1965–2010; top panel: GDP growth; bottom panel: inflation rate]

Informal evaluation methods (Greenbook forecasts)

Scatterplots of {f_{t+h|t}, y_{t+h}}, t = 1, ..., T

[Figure: actual outcomes plotted against forecasts; left panel: GDP growth; right panel: inflation rate]

Informal evaluation methods (Greenbook Forecasts)

Plots of f_{t+h|t} − y_t against y_{t+h} − y_t: directional accuracy

[Figure: actual changes plotted against forecast changes; left panel: GDP growth; right panel: inflation rate]

Informal evaluation methods (Greenbook forecasts)

Plot of forecast errors e_{t+h} = y_{t+h} − f_{t+h|t}

[Figure: forecast errors over 1965–2010; top panel: GDP growth; bottom panel: inflation rate]

Informal evaluation methods

Theil (1961) suggested the following decomposition:

E[(y − f)²] = E[(y − Ey) − (f − Ef) + (Ey − Ef)]²
       = (Ey − Ef)² + (σ_y − σ_f)² + 2σ_y σ_f (1 − ρ)

MSE depends on

the squared bias (Ey − Ef)²

the squared difference in standard deviations (σ_y − σ_f)²

the correlation ρ between the forecast and the outcome
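Because the decomposition is an identity in population moments, it can be verified numerically. A sketch (the biased, noisy forecast process below is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
T = 100_000
y = rng.standard_normal(T)
f = 0.3 + 0.6 * y + 0.5 * rng.standard_normal(T)  # hypothetical forecast

mse = np.mean((y - f) ** 2)

# Theil's three components: squared bias, std-dev gap, correlation term
bias2 = (y.mean() - f.mean()) ** 2
s_y, s_f = y.std(), f.std()
rho = np.corrcoef(y, f)[0, 1]
decomp = bias2 + (s_y - s_f) ** 2 + 2 * s_y * s_f * (1 - rho)

print(mse, decomp)  # identical up to floating-point error
```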


Pseudo out-of-sample Forecasts

Simulated (“pseudo”) out-of-sample (OoS) forecasts seek to mimic the “real time” updating underlying most forecasts

What would a forecaster have done (historically) at a given point in time?

The method splits the data into an initial estimation sample (in-sample period) and a subsequent evaluation sample (OoS period)

Forecasts are based on parameter estimates that use data only up to the date when the forecast is computed

As the sample expands, the model parameters get updated, resulting in a sequence of forecasts

Why do out-of-sample forecasting?

control for data mining – harder to “game”

feasible in real time (less “look-ahead” bias)

Pseudo out-of-sample forecasts (cont.)

Out-of-sample (OoS) forecasts impose the constraint that the parameter estimates of the forecasting model only use information available at the time the forecast was computed

Only information known at time t can be used to estimate and select the forecasting model and generate forecasts f_{t+h|t}

Many variants of OoS forecast estimation methods exist. They can be illustrated for the linear regression model:

y_{t+1} = β′x_t + ε_{t+1}

f̂_{t+1|t} = β̂′_t x_t

β̂_t = ( ∑_{s=1}^{t} ω(s, t) x_{s−1} x′_{s−1} )^{−1} ( ∑_{s=1}^{t} ω(s, t) x_{s−1} y_s )

Different methods use different weighting functions ω(s, t)
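The estimation schemes described on the following slides differ only in their choice of ω(s, t). A sketch collecting them in one helper (the window length of 60 and discount factor of 0.95 are illustrative defaults, not values from the lecture):

```python
def weight(scheme, s, t, window=60, lam=0.95):
    """Observation weight w(s, t) for several pseudo out-of-sample schemes.

    s and t are 1-based observation indices; `window` (rolling/fixed length)
    and `lam` (discount factor) are illustrative defaults.
    """
    if scheme == "expanding":
        return 1.0 if 1 <= s <= t else 0.0
    if scheme == "rolling":
        return 1.0 if t - window + 1 <= s <= t else 0.0
    if scheme == "fixed":
        return 1.0 if 1 <= s <= window else 0.0
    if scheme == "exponential":
        return lam ** (t - s) if 1 <= s <= t else 0.0
    raise ValueError(f"unknown scheme: {scheme}")

# At t = 100 the expanding window uses all 100 observations,
# while the rolling window uses only the most recent 60
t = 100
print(sum(weight("expanding", s, t) for s in range(1, t + 1)))
print(sum(weight("rolling", s, t) for s in range(1, t + 1)))
```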


Expanding window

Expanding or recursive estimation windows put equal weight on all observations s = 1, ..., t when estimating the parameters of the model:

ω(s, t) = 1 for 1 ≤ s ≤ t, and 0 otherwise

As time progresses, the estimation sample grows larger, It ⊆ It+1

If the parameters of the model do not change (“stationarity”), the expanding window approach makes efficient use of the data and leads to consistent parameter estimates

If the model parameters are subject to change, the approach leads to biased forecasts

The approach works well empirically because its use of all available data reduces the effect of estimation error on the forecasts

Expanding window

[Diagram: expanding window – the estimation sample always starts at observation 1 and grows through the forecast dates t, t+1, t+2, ..., T−1]

Rolling window

The rolling window uses an equal-weighted kernel of the most recent ω̄ observations to estimate the parameters of the forecasting model:

ω(s, t) = 1 for t − ω̄ + 1 ≤ s ≤ t, and 0 otherwise

Only one ‘design’ parameter: ω̄ (the length of the window)

A practical way to account for slowly-moving changes to the data generating process

Does this address “breaks”?

the window is too long immediately after a break

the window is too short further away from the break

Rolling window

[Diagram: rolling window – the estimation sample covers the most recent ω̄ observations, t − ω̄ + 1 through t, and slides forward with each forecast date t, t+1, t+2, ..., T−1]

Fixed window

The fixed window uses only the first ω̄₀ observations to estimate the parameters of the forecasting model once and for all:

ω(s, t) = 1 for 1 ≤ s ≤ ω̄₀, and 0 otherwise

This method is typically employed when the costs of estimation are very high, so that re-estimating the model with new data is prohibitively expensive or impractical in real time

The method also makes analytical results easier to derive

Fixed window

[Diagram: fixed window – the parameters are estimated once on observations 1 through ω̄₀ and reused for all forecast dates t, t+1, t+2, ..., T−1]

Exponentially declining weights

In the presence of model instability, it is common to discount past observations using weights that get smaller the older the data

Exponentially declining weights take the form:

ω(s, t) = λ^{t−s} for 1 ≤ s ≤ t, and 0 otherwise

where 0 < λ < 1. This method is sometimes called discounted least squares, as the discount factor λ puts less weight on past observations

Comparisons

Expanding estimation window: the number of observations available for estimating the model parameters increases with the sample size

the effect of estimation error is reduced

Fixed/rolling/discounted window: parameter estimation error continues to affect the forecasts even as the sample grows large

parameter estimates are inconsistent

Forecasts vary more under the short (fixed and rolling) estimation windows than under the expanding window

[Figures: US stock index; monthly US stock returns; monthly inflation; US T-bill rate; US stock market volatility]

Example: Portfolio Choice under Mean-Variance Utility

T-bills with known payoff r_f vs. stocks with uncertain return r^s_{t+1} and excess return r_{t+1} = r^s_{t+1} − r_f

W_t = $1: initial wealth

ω_t: portion of the portfolio held in stocks at time t

(1 − ω_t): portion of the portfolio held in T-bills

W_{t+1}: future wealth

W_{t+1} = (1 − ω_t)r_f + ω_t(r_{t+1} + r_f) = r_f + ω_t r_{t+1}

The investor chooses ω_t to maximize mean-variance utility:

E_t[U(W_{t+1})] = E_t[W_{t+1}] − (A/2)·Var_t(W_{t+1})

E_t[W_{t+1}] and Var_t(W_{t+1}): conditional mean and variance of W_{t+1}

Portfolio Choice under Mean-Variance Utility (cont.)

Suppose stock returns follow the process

r_{t+1} = μ + x_t + ε_{t+1},  x_t ∼ (0, σ_x²), ε_{t+1} ∼ (0, σ_ε²), cov(x_t, ε_{t+1}) = 0

x_t: predictable component given information at t

ε_{t+1}: unpredictable innovation (shock)

The uninformed investor’s (no information on x_t) stock holding:

ω*_t = arg max_{ω_t} { ω_t μ + r_f − (A/2)ω_t²(σ_x² + σ_ε²) } = μ / (A(σ_x² + σ_ε²))

E[U(W_{t+1}(ω*_t))] = r_f + μ² / (2A(σ_x² + σ_ε²)) = r_f + S²/(2A)

S = μ/√(σ_x² + σ_ε²): the unconditional Sharpe ratio

Portfolio Choice under Mean-Variance Utility (cont.)

The informed investor knows x_t. His stock holdings are

ω*_t = (μ + x_t) / (Aσ_ε²)

E_t[U(W_{t+1}(ω*_t))] = r_f + (μ + x_t)² / (2Aσ_ε²)

The average (unconditional expectation) of this is

E[E_t[U(W_{t+1}(ω*_t))]] = r_f + (μ² + σ_x²) / (2Aσ_ε²)

Increase in expected utility due to knowing the predictor variable:

E[U^informed] − E[U^uninformed] = σ_x² / (2Aσ_ε²) = R² / (2A(1 − R²))

Plausible empirical numbers, e.g., R² = 0.005 and A = 3, give an annualized certainty equivalent return gain of about 1%

Lecture 2: Univariate Forecasting Models

UCSD, January 18 2017

Allan Timmermann, UC San Diego

1 Introduction to ARMA models

2 Covariance Stationarity and Wold Representation Theorem

3 Forecasting with ARMA models

4 Estimation and Lag Selection for ARMA Models (Choice of Lag Order)

5 Random walk model

6 Trend and Seasonal Components (Seasonal components; Trended Variables)

Introduction: ARMA models

When building a forecasting model for an economic or financial variable, the variable’s own past time series is often the first thing that comes to mind

Many time series are persistent

The effect of past and current shocks takes time to evolve

Auto Regressive Moving Average (ARMA) models:

the work horse of the forecasting profession since Box and Jenkins (1970)

remain the centerpiece of many applied forecasting courses

used extensively commercially

Why are ARMA models so popular?

1. Minimalist demand on the forecaster’s information set: we need only the past history of the variable, I_T = {y_1, y_2, ..., y_{T−1}, y_T}

“Reduced form”: no need to derive a fully specified model for y

By excluding other variables, ARMA forecasts show how useful the past of a time series is for predicting its future

2. Empirical success: ARMA forecasts often provide a good ‘benchmark’ and have proven surprisingly difficult to beat in empirical work

3. ARMA models are underpinned by theoretical arguments

Wold Representation Theorem: covariance stationary processes can be represented as a (possibly infinite order) moving average process

ARMA models have certain optimality properties among linear projections of a variable on its own past and past shocks to the series

ARMA models are not optimal in a global sense – it may be optimal to use nonlinear transformations of past values of the series or to condition on a wider information set (“other variables”)

Covariance Stationarity: Definition

A time series, or stochastic process, {y_t}, is covariance stationary if

the mean of y_t, μ_t = E[y_t], is the same for all values of t: μ_t = μ

without loss of generality we set μ_t = 0 for all t [de-meaning]

the autocovariance exists and does not depend on t, but only on the “distance” j, i.e., E[y_t y_{t−j}] ≡ γ(j, t) = γ(j) for all t

Autocovariance measures how strong the covariation is between current and past values of a time series

If y_t is independently distributed over time, then E[y_t y_{t−j}] = 0 for all j ≠ 0

Covariance Stationarity: Interpretation

History repeats: if the series changed fundamentally over time, the past would not be useful for predicting the future of the series. To rule out this situation, we have to assume a certain degree of stability of the series. This is known as covariance stationarity

Covariance stationarity rules out shifting patterns such as

trends in the mean of a series

breaks in the mean, variance, or autocovariance of a series

Covariance stationarity allows us to use historical information to construct a forecasting model and predict the future

Under covariance stationarity, Cov(y_2016, y_2015) = Cov(y_2017, y_2016). This allows us to predict y_2017 from y_2016

White noise

Covariance stationary processes can be built from white noise:

Definition. A stochastic process ε_t is called white noise if it has zero mean, constant variance, and is serially uncorrelated:

E[ε_t] = 0, Var(ε_t) = σ², E[ε_t ε_s] = 0 for all t ≠ s

Wold Representation Theorem

Any covariance stationary process can be written as an infinite-order MA model, MA(∞), with coefficients θ_j that are independent of t:

Theorem (Wold’s Representation Theorem). Any covariance stationary stochastic process {y_t} can be represented as a linear combination of serially uncorrelated lagged white noise terms ε_t and a linearly deterministic component μ_t:

y_t = ∑_{j=0}^{∞} θ_j ε_{t−j} + μ_t

where the {θ_j} are independent of time and ∑_{j=0}^{∞} θ_j² < ∞

Wold Representation Theorem: Discussion

Since E[ε_t] = 0, E[ε_t²] = σ² ≥ 0, and E[ε_t ε_s] = 0 for all t ≠ s, ε_t is not predictable using linear models of past data

Practical concern: the MA order is potentially infinite

Since ∑_{j=0}^{∞} θ_j² < ∞, the coefficients are likely to die off over time – a finite approximation to the infinite MA process could be appropriate

In practice we need to construct ε_t from data (filtering)

The MA representation holds apart from a possible deterministic term, μ_t, which is perfectly predictable infinitely far into the future

e.g., a constant, linear time trend, seasonal pattern, or sinusoid with known periodicity

Estimation of Autocovariances

Autocovariances and autocorrelations can be estimated from sample data (t = 1, ..., T):

Ĉov(y_t, y_{t−j}) = (1/(T − j − 1)) ∑_{t=j+1}^{T} (y_t − ȳ)(y_{t−j} − ȳ)

ρ̂_j = Ĉov(y_t, y_{t−j}) / V̂ar(y_t)

where ȳ = (1/T) ∑_{t=1}^{T} y_t is the sample mean of y

Testing for autocorrelation: the Q-statistic can be used to test for serial correlation of orders 1, ..., m:

Q = T ∑_{j=1}^{m} ρ̂_j² ∼ χ²_m

Small p-values (below 0.05) suggest significant serial correlation

Autocovariances in matlab

autocorr: computes the sample autocorrelation

parcorr: computes the sample partial autocorrelation

lbqtest: computes the Ljung-Box Q-test for residual autocorrelation

[Figures: sample autocorrelation for the US T-bill rate and for US stock returns]

Autocorrelations and predictability

The more strongly autocorrelated a variable is, the easier it is to predict its mean

strong serial correlation means the series is slowly mean reverting, so the past is useful for predicting the future

strongly serially correlated variables include interest rates (in levels) and the level of the inflation rate (year on year)

weakly serially
correlated or uncorrelated variables include stock returns changes in inflation growth rate in corporate dividends Timmermann (UCSD) ARMA Winter, 2017 13 / 59 Lag Operator and Lag Polynomials The lag operator, L, when applied to any variable simply lags the variable by one period: Lyt = yt−1 Lpyt = yt−p Lag polynomials such as φ(L) take the form φ(L) = p ∑ i=0 φiL i For example, if p = 2 and φ(L) = 1− φ1L− φ2L 2, then φ(L)yt = 1× yt − φ1Lyt − φ2L 2yt = yt − φ1yt−1 − φ2yt−2 Timmermann (UCSD) ARMA Winter, 2017 14 / 59 ARMA Models Autoregressive models specify y as a function of its own lags Moving average models specify y as a weighted average of past shocks (innovations) to the series ARMA(p, q) specification for a stationary variable yt : yt = φ1yt−1 + ...+ φpyt−p + εt + θ1εt−1 + ...+ θqεt−q In lag polynomial notation φ(L)yt = θ(L)εt φ(L) = 1− p ∑ j=0 φiL i θ(L) = q ∑ i=0 θiL i = 1+ θ1L+ ...+ θqL q Timmermann (UCSD) ARMA Winter, 2017 15 / 59 AR(1) Model ARMA(1, 0) or AR(1) model takes the form: yt = φ1yt−1 + εt (1− φ1L)yt = εt , θ(L) = 1 By recursive backward substitution, yt = φ1(φ1yt−2 + εt−1)︸ ︷︷ ︸ yt−1 + εt = φ 2 1yt−2 + εt + φ1εt−1 Iterating further backwards, we have, for h ≥ 1, yt = φ h 1yt−h + h−1 ∑ s=0 φs1εt−s = φh1yt−h + θ(L)εt , where θ(L) : θi = φ i 1 (for i = 1, .., h− 1) Timmermann (UCSD) ARMA Winter, 2017 16 / 59 AR(1) Model AR(1) model is equivalent to an MA(∞) model as long as φh1yt−h becomes “small” in a mean square sense: E [ yt − h−1 ∑ s=0 φs1εt−s ]2 = E [ φh1yt−h ]2 ≤ φ2h1 γy (0)→ 0 as h→ ∞, provided that φ2h1 → 0, i.e., |φ1| < 1 Stationary AR(1) process has an equivalent MA(∞) representation The root of the polynomial φ(z) = 1− φ1L = 0 is L ∗ = 1/φ1, so |φ1| < 1 means that the root exceeds one. 
This is a necessary and suffi cient condition for stationarity of an AR(1) process Stationarity of an AR(p) model requires that all roots of the equation φ(z) = 0 exceed one (fall outside the unit circle) Timmermann (UCSD) ARMA Winter, 2017 17 / 59 MA(1) Model ARMA(0, 1) or MA(1) model: yt = εt + θ1εt−1, i.e., φ(L) = 1, θ(L) = 1+ θ1L Backwards substitution yields εt = yt 1+ θ1L = h ∑ s=0 (−θ1)syt−s + (−θ1)hεt−h εt is equivalent to an AR(h) process with coeffi cients φs = (−θ1) s provided that E [(−θ1)hεt−h ] gets small as h increases, i.e., |θ1| < 1 MA(q) is invertible if the roots of θ(z) exceed one Invertible MA process can be written as an infinite order AR process A stationary and invertible ARMA(p, q) process can be written as either an AR model or as an MA model, typically of infinite order yt = φ(L) −1θ(L)εt or θ(L) −1φ(L)yt = εt Timmermann (UCSD) ARMA Winter, 2017 18 / 59 ARIMA representation for nonstationary processes Suppose that d of the roots of φ(L) equal unity (one), while the remaining roots of φ̃(L) fall outside the unit circle. Factorization: φ(L) = φ̃(L)(1− L)d Applying (1− L) to a series is called differencing Let ỹt = (1− L)dyt be the d th difference of yt . 
Then φ̃(L)ỹt = θ(L)εt By assumption, the roots of φ̃(L) lie outside the unit circle so the differenced process, ỹt , is stationary and can be studied instead of yt Processes with d 6= 0 need to be differenced to achieve stationarity and are called ARIMA(p, d , q) Timmermann (UCSD) ARMA Winter, 2017 19 / 59 US stock index Timmermann (UCSD) ARMA Winter, 2017 20 / 59 Monthly US stock returns (first-differenced prices) Timmermann (UCSD) ARMA Winter, 2017 21 / 59 Forecasting with AR models Prediction is straightforward for AR(p) models yT+1 = φ1yT + ...+ φpyT−p+1 + εT+1, εT+1 ∼ WN(0, σ 2) Treat parameters as known and ignore estimation error Using that E [εT+1|IT ] = 0 and {yT−p+1, ..., yT } ∈ IT , the forecast of yT+1 given IT becomes fT+1|T = φ1yT + ...+ φpyT−p+1 fT+1|T means the forecast of yT+1 given information at time T x ∈ IT means "x is known at time T , i.e., belongs to the information set at time T" Timmermann (UCSD) ARMA Winter, 2017 22 / 59 Forecasting with AR models: The Chain Rule When generating forecasts multiple steps ahead, unknown values of yT+h (h ≥ 1) can be replaced with their forecasts, fT+h|T , setting up a recursive system of forecasts: fT+2|T = φ1fT+1|T + φ2yT + ...+ φpyT−p+2 fT+3|T = φ1fT+2|T + φ2fT+1|T + φ3yT + ...+ φpyT−p+3 ... 
fT+p+1|T = φ1fT+p|T + φ2fT+p−1|T + φ3fT+p−2|T + ...+ φp fT+1|T ‘Chain rule’is equivalent to recursively expressing unknown future values yT+i as a function of yT and its past Known values of y affect the forecasts of an AR (p) model up to horizon T + p, while forecasts further ahead only depend on past forecasts themselves Timmermann (UCSD) ARMA Winter, 2017 23 / 59 Forecasting with MA models Consider the MA(q) model yT+1 = εT+1 + θ1εT + ...+ θqεT−q+1 One-step-ahead forecast: fT+1|T = θ1εT + ...+ θqεT−q+1 Sequence of shocks {εt} are not directly observable but can be computed recursively (estimated) given a set of assumptions on the initial values for εt , t = 0, ..., q − 1 For the MA(1) model, we can set ε0 = 0 and use the recursion ε1 = y1 ε2 = y2 − θ1ε1 = y2 − θ1y1 ε3 = y3 − θ1ε2 = y3 − θ1(y2 − θy1) Unobserved shocks can be written as a function of the parameter value θ1 and current and past values of y Timmermann (UCSD) ARMA Winter, 2017 24 / 59 Forecasting with MA models (cont.) Simple recursions using past forecasts can also be employed to update the forecasts. For the MA(1) model we have ft+1|t = θ1εt = θ1(yt − ft |t−1) MA processes of infinite order: yT+h for h ≥ 1 is yT+h = θ(L)εT+h = (εT+h + θ1εT+h−1 + ...+ θh−1εT+1︸ ︷︷ ︸ unpredictable + θhεT + θh+1εT−1 + ...︸ ︷︷ ︸ predictable . Hence, if εT were observed, the forecast would be fT+h|T = θhεT + θh+1εT−1 + ... 
= ∞ ∑ j=h θj εT+h−j MA(q) model has limited memory: values of an MA(q) process more than q periods into the future are not predictable Timmermann (UCSD) ARMA Winter, 2017 25 / 59 Forecasting with mixed ARMA models Consider a mixed ARMA(p, q) model yT+1 = φ1yT + φ2yT−1 + ...+ φpyT−p+1 + εT+1 + θ1εT + ...+ θq εT−q+1 Separate AR and MA prediction steps can be combined by recursively replacing future values of yT+i with their predicted values and setting E [εT+j |IT ] = 0 for j ≥ 1 : fT+1|T = φ1yT + φ2yT−1 + ...+ φpyT−p+1 + θ1εT + ...+ θq εT−q+1 fT+2|T = φ1fT+1|T + φ2yT + ...+ φpyT−p+2 + θ2εT + ...+ θq εT−q+2 ... fT+h|T = φ1fT+h−1|T + φ2fT+h−2|T + ...+ φp fT−p+h|T + θhεT + ...+ θq εT−q+h Note: fT−j+h|T = yT−j+h if j ≥ h, and we assumed q ≥ h Timmermann (UCSD) ARMA Winter, 2017 26 / 59 Mean Square Forecast Errors By the Wold Representation Theorem, all stationary ARMA processes can be written as an MA process with associated forecast error yT+h − fT+h|T = εT+h + θ1εT+h−1 + ...+ θh−1εT+1 Mean square forecast error: E [ (yT+h − fT+h|T )2 ] = E [(εT+h + θ1εT+h−1 + ...+ θh−1εT+1) 2 ] = σ2(1+ θ21 + ...+ θ 2 h−1) For the AR(1) model, θi = φ i 1 and so the MSE becomes E [(yT+h − fT+h|T )2] = σ2(1+ φ21 + ...+ φ 2(h−1) 1 ) = σ2(1− φ2h1 ) 1− φ21 Timmermann (UCSD) ARMA Winter, 2017 27 / 59 Direct vs. Iterated multi-period forecasts Two ways to generate multi-period forecasts (h > 1):

Iterated approach: forecasting model is estimated at the highest

frequency and iterated upon to obtain forecasts at longer horizons

Direct approach: forecasting model is matched with the desired

forecast horizon: One model for each horizon, h. The dependent

variable is yt+h while all predictor variables are dated period t

Example: AR(1) model yt = φ1yt−1 + εt

Iterated approach: use the estimated value, φ̂1, to obtain a forecast

fT+h|T = φ̂1^h yT

Direct approach: estimate the h−period lag relationship

yt+h = φ̃1h yt + ε̃t+h , where φ̃1h = φ1^h and ε̃t+h = ∑_{s=0}^{h−1} φ1^s εt+h−s

Timmermann (UCSD) ARMA Winter, 2017 28 / 59
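The two approaches can be sketched numerically. The snippet below is illustrative only (the course's own examples use matlab; φ1, T and h are assumed values): it simulates an AR(1), forms the iterated forecast φ1^h·yT, and compares it with a direct regression of y_{t+h} on y_t.

```python
import numpy as np

# Illustrative sketch: iterated vs. direct h-step forecasts for an AR(1)
# with (assumed) phi1 = 0.9. Iterated: f_{T+h|T} = phi1^h * y_T.
# Direct: regress y_{t+h} on y_t and use the fitted slope (~ phi1^h).

rng = np.random.default_rng(0)
phi1, T, h = 0.9, 5000, 4

y = np.zeros(T)
for t in range(1, T):
    y[t] = phi1 * y[t - 1] + rng.standard_normal()

# Iterated forecast from the last observation
f_iterated = phi1**h * y[-1]

# Direct forecast: OLS of y_{t+h} on y_t (no intercept, mean-zero process)
x, z = y[:-h], y[h:]
phi_direct = (x @ z) / (x @ x)          # slope estimate, close to phi1^h
f_direct = phi_direct * y[-1]
```

With a correctly specified AR(1), the direct slope converges to φ1^h, so the two forecasts agree in large samples.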

Direct vs. Iterated multi-period forecasts: Trade-offs

When the autoregressive model is correctly specified, the iterated

approach makes more efficient use of the data and so tends to

produce better forecasts

Conversely, by virtue of being a linear projection, the direct approach

tends to be more robust towards misspecification

When the model is grossly misspecified, iteration on the misspecified

model can exacerbate biases and may result in a larger MSE

Which approach performs best depends on the true DGP, the degree

of model misspecification (both unknown), and the sample size

Empirical evidence in Marcellino et al. (2006) suggests that the

iterated approach works best on average for macro variables

Timmermann (UCSD) ARMA Winter, 2017 29 / 59

Estimation of ARIMA models

ARIMA models can be estimated by maximum likelihood methods

ARIMA models are based on linear projections (regressions) which

provide reasonable forecasts of linear processes under MSE loss

There may be nonlinear models of past data that provide better

predictors:

Under MSE loss the best predictor is the conditional mean, which need

not be a linear function of the past

Timmermann (UCSD) ARMA Winter, 2017 30 / 59

Estimation (continued)

AR(p) models with known p > 0 can be estimated by ordinary least

squares by regressing yt on yt−1, ..., yt−p

Assuming the data are covariance stationary, OLS estimates of the

coefficients φ1, ..., φp are consistent and asymptotically normal

If the AR model is correctly specified, such estimates are also

asymptotically efficient

Least squares estimates are not optimal in finite samples and will be

biased

For the AR(1) model, φ̂1 has a downward bias of approximately (1+ 3φ1)/T

For higher order models, the biases are complicated and can go in

either direction

Timmermann (UCSD) ARMA Winter, 2017 31 / 59
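The small-sample bias can be illustrated with a quick Monte Carlo. This Python sketch (an assumed setup, not course code; the course uses matlab) estimates φ1 with an intercept over many short samples and compares the average bias with the −(1+3φ1)/T approximation.

```python
import numpy as np

# Illustrative Monte Carlo: OLS estimates of phi1 in an AR(1) are biased
# downward in small samples, roughly by -(1 + 3*phi1)/T.
# phi1, T and the number of simulations are assumed values.

rng = np.random.default_rng(1)
phi1, T, n_sims = 0.8, 50, 4000

est = np.empty(n_sims)
for s in range(n_sims):
    y = np.zeros(T)
    for t in range(1, T):
        y[t] = phi1 * y[t - 1] + rng.standard_normal()
    x, z = y[:-1], y[1:]
    xd, zd = x - x.mean(), z - z.mean()   # demean = include an intercept
    est[s] = (xd @ zd) / (xd @ xd)

bias = est.mean() - phi1                  # roughly -(1 + 3*0.8)/50 = -0.068
```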

Estimation and forecasting with ARMA models in matlab

regARIMA: creates regression model with ARIMA time series errors

estimate: estimates parameters of regression models with ARIMA

errors

Pure AR models: can be estimated by OLS

forecast: forecast ARIMA models

Timmermann (UCSD) ARMA Winter, 2017 32 / 59

Lag length selection

In most situations, forecasters do not know the true or optimal lag

orders, p and q

Judgmental approaches based on examining the autocorrelations and

partial autocorrelations of the data

Model selection criteria: Different choices of (p, q) result in a set of

models {Mk}Kk=1, where Mk represents model k and the search is

conducted over K different combinations of p and q

Information criteria trade off fit versus parsimony

Timmermann (UCSD) ARMA Winter, 2017 33 / 59

Information criteria

Information criteria (IC) for linear ARMA specifications:

ICk = ln σ̂k² + nk g(T )

ICs trade off fit (improves with more parameters) against parsimony (fewer parameters is better). Choose k to minimize ICk

σ̂k² : estimated residual variance of model k. Lower σ̂k² ⇔ better fit

nk = pk + qk + 1 : number of estimated parameters for model k

g(T ) : penalty term that depends on the sample size, T :

Criterion g(T )
AIC (Akaike (1974)) 2/T
BIC (Schwarz (1978)) ln(T )/T

In matlab: aicbic

Timmermann (UCSD) ARMA Winter, 2017 34 / 59
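A minimal sketch of lag selection by information criteria (illustrative Python with assumed data and a hypothetical helper `ar_ic`; the course itself points to matlab's aicbic): fit AR(p) for several p, compute IC = ln σ̂² + n·g(T), and minimize.

```python
import numpy as np

# Illustrative sketch: choose the AR lag order by minimizing
# IC = ln(sigma2_hat) + n_params * g(T), with g(T) = 2/T (AIC)
# or ln(T)/T (BIC). The DGP below is an assumed AR(2).

rng = np.random.default_rng(2)
T = 400
y = np.zeros(T)
for t in range(2, T):
    y[t] = 0.5 * y[t - 1] + 0.3 * y[t - 2] + rng.standard_normal()

def ar_ic(y, p, penalty):
    """Fit AR(p) with intercept by OLS and return the information criterion."""
    T = len(y)
    X = np.column_stack([y[p - i - 1:T - i - 1] for i in range(p)])
    X = np.column_stack([np.ones(len(X)), X])
    z = y[p:]
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    resid = z - X @ beta
    sigma2 = resid @ resid / len(z)
    n_params = p + 1
    return np.log(sigma2) + n_params * penalty(len(z))

aic = {p: ar_ic(y, p, lambda n: 2 / n) for p in range(1, 6)}
bic = {p: ar_ic(y, p, lambda n: np.log(n) / n) for p in range(1, 6)}
best_bic = min(bic, key=bic.get)
```

With a true AR(2) and T = 400, BIC should penalize over-long lags and favor an order of at least 2.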

Marcellino, Stock and Watson (2006)

Timmermann (UCSD) ARMA Winter, 2017 35 / 59

Random walk model

The random walk model is an AR(1) with φ1 = 1 :

yt = yt−1 + εt , εt ∼ WN(0, σ2)

This model implies that the change in yt is unpredictable:

∆yt = yt − yt−1 = εt

For example, the level of stock prices is easy to predict, but not its

change (rate of return if using logarithm of stock index)

Shocks to the random walk have permanent effects: A one unit shock

moves the series by one unit forever. This is in sharp contrast to a

mean-reverting process

Timmermann (UCSD) ARMA Winter, 2017 36 / 59

Random walk model (cont)

The variance of a random walk increases over time so the distribution

of yt changes over time. Suppose that yt started at zero, y0 = 0 :

y1 = y0 + ε1 = ε1

y2 = y1 + ε2 = ε1 + ε2

…

yt = ε1 + ε2 + …+ εt−1 + εt

From this we have

E [yt ] = 0

var(yt ) = tσ², lim_{t→∞} var(yt ) = ∞

The variance of y grows proportionally with time

A random walk does not revert to the mean but wanders up and

down at random

Timmermann (UCSD) ARMA Winter, 2017 37 / 59
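The linear variance growth is easy to confirm by simulation. The following Python sketch (illustrative; path count and σ are assumed values) generates many random-walk paths and checks that the cross-sectional variance at date t is roughly tσ².

```python
import numpy as np

# Illustrative simulation: the variance of a random walk grows linearly
# with t, var(y_t) = t * sigma^2 (y_0 = 0). Parameters are assumed.

rng = np.random.default_rng(3)
n_paths, T, sigma = 20000, 100, 1.0

shocks = rng.normal(0.0, sigma, size=(n_paths, T))
paths = shocks.cumsum(axis=1)            # y_t = e_1 + ... + e_t

var_t10 = paths[:, 9].var()              # roughly 10 * sigma^2
var_t100 = paths[:, 99].var()            # roughly 100 * sigma^2
```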

Forecasts from random walk model

Recall that forecasts from the AR(1) process yt = φ1yt−1 + εt ,

εt ∼ WN(0, σ2) are simply

ft+h|t = φ1^h yt

For the random walk model φ1 = 1, so for all forecast horizons, h, the

forecast is simply the current value:

ft+h|t = yt

The basic random walk model says that the value of the series next

period (given the history of the series) equals its current value plus an

unpredictable change:

Forecast of tomorrow = today’s value

The random steps, εt , make yt a “random walk”

Timmermann (UCSD) ARMA Winter, 2017 38 / 59

Random walk with a drift

Introduce a non-zero drift term, δ :

yt = δ+ yt−1 + εt , εt ∼ WN(0, σ2)

This is a popular model for the logarithm of stock prices

The drift term, δ, plays the same role as a time trend. Assuming

again that the series started at y0, we have

yt = δt + y0 + ε1 + ε2 + …+ εt−1 + εt

Similarly,

E [yt ] = y0 + δt

var(yt ) = tσ², lim_{t→∞} var(yt ) = ∞

Timmermann (UCSD) ARMA Winter, 2017 39 / 59

Summary of properties of random walk

Changes in random walk are unpredictable

Shocks have permanent effects

Variance grows in proportion with the forecast horizon

These points are important for forecasting:

point forecasts never revert to a mean

since the variance goes to infinity, the width of interval forecasts

increases without bound as the forecast horizon grows

Uncertainty grows without bounds

Timmermann (UCSD) ARMA Winter, 2017 40 / 59

Logs, levels and growth rates

Certain transformations of economic variables such as their logarithm

are often easier to forecast than the “raw” data

If the standard deviation of a time series is approximately proportional

to its level, then the standard deviation of the change in the logarithm

of the series is approximately constant:

Yt+1 = Yt exp(εt+1), εt+1 ∼ (0, σ2)⇔

ln(Yt+1)− ln(Yt ) = εt+1

Example: US GDP follows an upward trend. Instead of studying the

level of US GDP, we can study its growth rate which is not trending

The first difference of the log of Yt is ∆ ln(Yt ) = ln(Yt )− ln(Yt−1)

The percentage change in Yt between t − 1 and t is approximately

100∆ ln(Yt ). This can be interpreted as a growth rate

Timmermann (UCSD) ARMA Winter, 2017 41 / 59

Unit root processes

Random walk is a special case of a unit root process which has a unit

root in the AR polynomial, i.e.,

(1− L)φ̃(L)yt = θ(L)εt

where the roots of φ̃(L) lie outside the unit circle

We can test for a unit root using an Augmented Dickey Fuller (ADF)

test:

∆yt = α + βyt−1 + ∑_{i=1}^p γi ∆yt−i + εt

In matlab: adftest

Under the null of a unit root, β = 0. Under the alternative of

stationarity, β < 0
Test is based on the t-stat of β. Test statistic follows a non-standard
distribution
Timmermann (UCSD) ARMA Winter, 2017 42 / 59
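The mechanics of the test regression can be sketched in a few lines. This Python example (an assumed simplification with no augmentation lags; it is not a replacement for matlab's adftest, whose critical values are non-standard) regresses ∆yt on a constant and yt−1 and reports the t-statistic on yt−1 for a simulated unit-root and a stationary series.

```python
import numpy as np

# Sketch of the (unaugmented) Dickey-Fuller regression: regress dy_t on
# a constant and y_{t-1}; beta near 0 suggests a unit root, beta < 0 with
# a large negative t-stat suggests stationarity. Data are simulated.

def df_tstat(y):
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    s2 = resid @ resid / (len(dy) - 2)
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    return beta[1], beta[1] / se

rng = np.random.default_rng(4)
e = rng.standard_normal(500)
y_rw = e.cumsum()                        # unit root: beta should be near 0
y_st = np.zeros(500)                     # stationary AR(1) with phi = 0.5
for t in range(1, 500):
    y_st[t] = 0.5 * y_st[t - 1] + e[t]

b_rw, t_rw = df_tstat(y_rw)
b_st, t_st = df_tstat(y_st)
```

For the stationary series, β estimates φ − 1 = −0.5 and the t-statistic is strongly negative.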
Critical values for Dickey-Fuller test
Timmermann (UCSD) ARMA Winter, 2017 43 / 59
Classical decomposition of time series into three
components
Cycles (stochastic) - captured using ARMA models
Trend
trend captures the slow, long-run evolution in the outcome
for many series in levels, this is the most important component for
long-run predictions
Seasonals
regular (deterministic) patterns related to time of the year (day), public
holidays, etc.
Timmermann (UCSD) ARMA Winter, 2017 44 / 59
Example: CO2 concentration (ppm) - Dave Keeling,
Scripps, 1957-2005
Timmermann (UCSD) ARMA Winter, 2017 45 / 59
Seasonality
Sources of seasonality: technology, preferences and institutions are
linked to the calendar
weather (agriculture, construction)
holidays, religious events
Many economic time series display seasonal variations:
home sales
unemployment figures
stock prices (?)
commodity prices?
Timmermann (UCSD) ARMA Winter, 2017 46 / 59
Handling seasonalities
One strategy is to remove the seasonal component and work with
seasonally adjusted series
Problem: We might be interested in forecasting the actual
(non-adjusted) series, not just the seasonally adjusted part
Timmermann (UCSD) ARMA Winter, 2017 47 / 59
Seasonal components
Seasonal patterns can be deterministic or stochastic
Stochastic modeling approach uses differencing to incorporate
seasonal components - e.g., year-on-year changes
Box and Jenkins (1970) considered seasonal ARIMA, or SARIMA,
models of the form
φ(L)(1− L^S )yt = θ(L)εt

(1− L^S )yt = yt − yt−S : computes year-on-year changes
Timmermann (UCSD) ARMA Winter, 2017 48 / 59
Modeling seasonality
Seasonality can be modeled through seasonal dummies. Let S be the
number of seasons per year.
S = 4 (quarterly data)
S = 12 (monthly data)
S = 52 (weekly data)
For example, the following set of dummies would be used to model
quarterly variation in the mean:
D1t = (1 0 0 0 1 0 0 0 1 0 0 0)
D2t = (0 1 0 0 0 1 0 0 0 1 0 0)
D3t = (0 0 1 0 0 0 1 0 0 0 1 0)
D4t = (0 0 0 1 0 0 0 1 0 0 0 1)
D1 picks up mean effects in the first quarter. D2 picks up mean
effects in the second quarter, etc. At any point in time only one of
the quarterly dummies is activated
Timmermann (UCSD) ARMA Winter, 2017 49 / 59
Pure seasonal dummy model
The pure seasonal dummy model is
yt = ∑_{s=1}^S δs Dst + εt
We only regress yt on intercept terms (seasonal dummies) that vary
across seasons. δs summarizes the seasonal pattern over the year
Alternatively, we can include an intercept and S − 1 seasonal
dummies.
Now the intercept captures the mean of the omitted season and the
remaining seasonal dummies give the seasonal increase/decrease
relative to the omitted season
Never include both a full set of S seasonal dummies and an intercept
term - perfect collinearity
Timmermann (UCSD) ARMA Winter, 2017 50 / 59
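The pure seasonal dummy model is just a regression on the dummy matrix. The Python sketch below (simulated quarterly data; the seasonal means are assumed values) shows that each estimated δs recovers the corresponding quarter's mean.

```python
import numpy as np

# Pure seasonal dummy model sketch: regress y_t on four quarterly dummies
# (no intercept), so each coefficient estimates that quarter's mean.
# The "true" seasonal pattern below is an assumption for illustration.

rng = np.random.default_rng(5)
seasonal_means = np.array([2.0, -1.0, 0.5, 3.0])
n_years = 200
season = np.tile(np.arange(4), n_years)             # 0,1,2,3,0,1,2,3,...
y = seasonal_means[season] + 0.5 * rng.standard_normal(4 * n_years)

D = np.eye(4)[season]                               # T x 4 dummy matrix
delta, *_ = np.linalg.lstsq(D, y, rcond=None)       # delta_s = quarter mean
```

Adding an intercept on top of all four dummies would make D perfectly collinear, which is why the slide warns against it.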
General seasonal effects
Holiday variation (HDV ) variables capture dates of holidays which
may change over time (Easter, Thanksgiving) - v1 of these:
yt = ∑_{s=1}^S δs Dst + ∑_{i=1}^{v1} δHDVi HDVit + εt
Timmermann (UCSD) ARMA Winter, 2017 51 / 59
Seasonals
ARMA model with seasonal dummies takes the form
φ(L)yt = ∑_{s=1}^S δs Dst + θ(L)εt
Application of seasonal dummies can sometimes yield large
improvements in predictive accuracy
Example: day of the week, seasonal, and holiday dummies:
µt = ∑_{day=1}^7 βday Dday ,t + ∑_{holiday=1}^H βholiday Dholiday ,t + ∑_{month=1}^{12} βmonth Dmonth,t
Adding deterministic seasonal terms to the ARMA component, the
value of y at time T + h can be predicted as follows:
yT+h = ∑_{day=1}^7 βday Dday ,T+h + ∑_{holiday=1}^H βholiday Dholiday ,T+h + ∑_{month=1}^{12} βmonth Dmonth,T+h + ỹT+h ,

where φ(L)ỹT+h = θ(L)εT+h
Timmermann (UCSD) ARMA Winter, 2017 52 / 59
Deterministic trends
Let Timet be a deterministic time trend so that
Timet = t, t = 1, ....,T
This time trend is perfectly predictable (deterministic)
Linear trend model:
Trendt = β0 + β1Timet
β0 is the intercept (value at time zero)
β1 is the slope which is positive if the trend is increasing or negative if
the trend is decreasing
Timmermann (UCSD) ARMA Winter, 2017 53 / 59
Examples of trended variables
US stock price index
Number of residents in Beijing, China
US labor participation rate for women (up) or men (down)
Exchange rates over long periods (?)
Interest rates (?)
Global mean temperature (?)
Timmermann (UCSD) ARMA Winter, 2017 54 / 59
Quadratic trend
Sometimes the trend is nonlinear (curved) as when the variable
increases at an increasing or decreasing rate
For such cases we can use a quadratic trend:
Trendt = β0 + β1Timet + β2Timet²
Caution: quadratic trends are mostly considered adequate local
approximations and can give rise to a variety of unrealistic shapes for
the trend if the forecast horizon is long
Timmermann (UCSD) ARMA Winter, 2017 55 / 59
Log-linear trend
log-linear trends are used to describe time series that grow at a
constant exponential rate:
Trendt = β0 exp(β1Timet )
Although the trend is non-linear in levels, it is linear in logs:
ln(Trendt ) = ln(β0) + β1Timet
Timmermann (UCSD) ARMA Winter, 2017 56 / 59
Deterministic Time Trends: summary
Three common time trend specifications:
Linear : µt = µ0 + β0t
Quadratic : µt = µ0 + β0t + β1t²
Exponential : µt = exp(µ0 + β0t)
These global trends are unlikely to provide accurate descriptions of
the future value of most time series at long forecast horizons
Timmermann (UCSD) ARMA Winter, 2017 57 / 59
Estimating trend models
Assuming MSE loss, we can estimate the trend parameters by solving
θ̂ = arg min_θ ∑_{t=1}^T (yt − Trendt (θ))²
Example: with a linear trend model we have
Trendt (θ) = β0 + β1Timet
θ = {β0, β1}
and we can estimate β0, β1 by OLS
(β̂0, β̂1) = arg min_{β0,β1} ∑_{t=1}^T (yt − β0 − β1Timet )²
Timmermann (UCSD) ARMA Winter, 2017 58 / 59
Forecasting Trend
Suppose a time series is generated by the linear trend model
yt = β0 + β1Timet + εt , εt ∼ WN(0, σ²)
Future values of εt are unpredictable given current information, It :
E [εt+h |It ] = 0
Suppose we want to predict the series at time T + h given
information IT :
yT+h = β0 + β1TimeT+h + εT+h
Since TimeT+h = T + h is perfectly predictable while εT+h is
unpredictable, our best forecast (under MSE loss) becomes
fT+h|T = β̂0 + β̂1TimeT+h
Timmermann (UCSD) ARMA Winter, 2017 59 / 59
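The trend-forecasting recipe above can be sketched end to end: fit β0, β1 by OLS and extrapolate the deterministic time variable. The Python snippet is illustrative (the trend parameters and sample size are assumptions).

```python
import numpy as np

# Sketch of trend forecasting under MSE loss: fit y_t = b0 + b1*t + e_t
# by OLS and forecast f_{T+h|T} = b0_hat + b1_hat * (T + h).
# All numbers below are assumed for illustration.

rng = np.random.default_rng(6)
beta0, beta1, T = 10.0, 0.5, 200
time = np.arange(1, T + 1)
y = beta0 + beta1 * time + rng.standard_normal(T)

X = np.column_stack([np.ones(T), time])
(b0, b1), *_ = np.linalg.lstsq(X, y, rcond=None)

h = 12
forecast = b0 + b1 * (T + h)             # point forecast at horizon h
```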
Lecture 3: Model Selection
UCSD, January 23 2017
Allan Timmermann1
1UC San Diego
Timmermann (UCSD) Model Selection Winter, 2017 1 / 35
1 Estimation Methods
2 Introduction to Model Selection
3 In-Sample Selection Methods
4 Sequential Selection
5 Information Criteria
6 Cross-Validation
7 Lasso
Timmermann (UCSD) Model Selection Winter, 2017 2 / 35
Least Squares Estimation I
The multivariate linear regression model with k regressors
yt = ∑_{j=1}^k βj xjt−1 + ut , t = 1, ...,T

can be written more compactly as

yt = β′xt−1 + ut , β = (β1, ..., βk )′, xt−1 = (x1t−1, ..., xkt−1)′

or, in matrix form,

y = X β + u

where y is the T×1 vector (y1, y2, ..., yT )′ and X is the T×k matrix whose row t is (x1,t−1, x2,t−1, ..., xk ,t−1)
Timmermann (UCSD) Model Selection Winter, 2017 2 / 35
Least Squares Estimation II
Ordinary Least Squares (OLS) minimizes the sum of squared (forecast) errors:
β̂ = arg min_β ∑_{t=1}^T (yt − β′xt−1)²
This is the same as minimizing (y − X β)′(y − X β) and yields the solution
β̂ = (X ′X )−1X ′y
(assuming that X ′X is of full rank). Four assumptions on the disturbance terms,
u, ensure that the OLS estimator has the smallest variance among all linear
unbiased estimators of β, i.e., it is the Best Linear Unbiased Estimator (BLUE)
1 Zero mean: E (ut ) = 0 for all t
2 Homoskedastic: Var(ut |X ) = σ2 for all t
3 No serial correlation: Cov(ut , us |X ) = 0 for all s, t
4 Orthogonal: E [ut |X ] = 0 for all t
Timmermann (UCSD) Model Selection Winter, 2017 3 / 35
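The closed-form OLS solution β̂ = (X′X)⁻¹X′y is a one-liner. The sketch below is illustrative (simulated data with assumed coefficients), using a linear solve rather than an explicit inverse for numerical stability.

```python
import numpy as np

# Sketch of the OLS formula beta_hat = (X'X)^{-1} X'y on simulated data.
# The true coefficients and sample size are assumptions for illustration.

rng = np.random.default_rng(7)
T, k = 500, 3
X = rng.standard_normal((T, k))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.standard_normal(T)

# Solve (X'X) beta = X'y instead of inverting X'X explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```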
Least Squares Estimation III
We can also write the OLS estimator as follows:
β̂ = β+ (X ′X )−1X ′u
Provided that u is normally distributed, u ∼ N(0, σ2IT ), we have
β̂ ∼ N(β, σ2(X ′X )−1)
A similar result can be established asymptotically
Thus we can use standard t−tests or F -tests to test if β is statistically
significant
Timmermann (UCSD) Model Selection Winter, 2017 4 / 35
Maximum Likelihood Estimation (MLE)
Suppose the residuals u are independently, identically and normally distributed
ut ∼ N(0, σ2). Then the likelihood of u1, ..., uT as a function of the parameters
θ = (β′, σ2), becomes
L(θ) = (2πσ²)^{−T/2} exp( −(1/(2σ²)) ∑_{t=1}^T ut² ) = (2πσ²)^{−T/2} exp( −(1/(2σ²)) (y − X β)′(y − X β) )

Taking logs, we get the log-likelihood LL(θ) = log(L(θ)):

LL(θ) = −(T/2) ln(2πσ²) − (1/(2σ²)) (y − X β)′(y − X β)

The following parameter estimates maximize the LL:

β̂MLE = (X ′X )−1X ′y

σ̂²MLE = u′u/T
Timmermann (UCSD) Model Selection Winter, 2017 5 / 35
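The MLE formulas are easy to verify numerically: under Gaussian errors β̂MLE coincides with OLS and σ̂²MLE is the residual sum of squares divided by T (no degrees-of-freedom correction). This Python sketch uses simulated data with assumed parameters.

```python
import numpy as np

# Sketch checking the Gaussian-likelihood estimates: beta_hat_MLE = OLS
# and sigma2_hat_MLE = u'u / T. Data and coefficients are assumptions.

rng = np.random.default_rng(10)
T = 250
X = np.column_stack([np.ones(T), rng.standard_normal(T)])
y = X @ np.array([0.5, 1.5]) + rng.standard_normal(T)

beta_mle, *_ = np.linalg.lstsq(X, y, rcond=None)
u = y - X @ beta_mle
sigma2_mle = u @ u / T

def log_lik(beta, sigma2):
    r = y - X @ beta
    return -T / 2 * np.log(2 * np.pi * sigma2) - (r @ r) / (2 * sigma2)

# The MLE attains a higher likelihood than a perturbed parameter vector
ll_hat = log_lik(beta_mle, sigma2_mle)
ll_other = log_lik(beta_mle + 0.1, sigma2_mle)
```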
Generalized Method of Moments (GMM) I
Suppose we have data (y1, x0), ..., (yT , xT−1) drawn from a probability
distribution p((y1, x0), ..., (yT , xT−1)|θ0) with true parameters θ0. The
parameters can be identified from a set of population moment conditions
E [m((yt , xt−1), θ0)] = 0 for all t
Parameter estimates can be based on sample moments, m((yt , xt−1), θ) :
(1/T ) ∑_{t=1}^T m((yt , xt−1), θ̂T ) = 0

If we have the same number of (non-redundant) moment conditions as we have parameters, the parameters θ̂T are exactly identified by the moment conditions.

For the linear regression model, the moment conditions are that the regression residuals (yt − x ′t−1β) are uncorrelated with the predictors, xt−1:

E [xt−1ut ] = E [xt−1(yt − x ′t−1β)] = 0, t = 1, ...,T ⇒ β̂MoM = (X ′X )−1X ′y
Timmermann (UCSD) Model Selection Winter, 2017 6 / 35
Generalized Method of Moments (GMM) II
Under broad conditions the GMM estimator is consistent and asymptotically
normally distributed
GMM estimator allows for heteroskedastic (time-varying covariance) and
autocorrelated (persistent) errors
GMM estimator has certain robustness properties
GMM is widely used throughout finance
Lars Peter Hansen (Chicago) shared the Nobel prize in 2013 for his work on
GMM estimation (and other topics)
Timmermann (UCSD) Model Selection Winter, 2017 7 / 35
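For the linear model, the method-of-moments logic is transparent: the sample moment condition (1/T)·X′(y − Xβ) = 0 is solved exactly by the OLS estimator. A minimal Python sketch (simulated data, assumed coefficients):

```python
import numpy as np

# Sketch of the exactly identified moment condition for linear regression:
# (1/T) X'(y - X beta) = 0 holds at the OLS/MoM estimate.

rng = np.random.default_rng(11)
T = 200
X = np.column_stack([np.ones(T), rng.standard_normal(T)])
y = X @ np.array([1.0, 2.0]) + rng.standard_normal(T)

beta_mom = np.linalg.solve(X.T @ X, X.T @ y)
sample_moments = X.T @ (y - X @ beta_mom) / T   # numerically zero
```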
Shrinkage and Ridge estimators
Estimation errors often lead to bad forecasts
A simple "trick" is to penalize for large parameter estimates
Shrinkage estimators do this by solving the problem
β̃ = arg min_β T−1 ∑_{t=1}^T (yt − β′xt−1)² + λ ∑_{i=1}^{nk} βi²   [second term: penalty]

λ > 0 : penalizes large parameters

With a single regressor, the solution to this problem is simple:

β̃shrink = β̂OLS / (1 + λ)

In the multivariate case we get the ridge estimator

β̃Ridge = (X ′X + λI )−1X ′y

Even though β̃shrink is now biased, the variance of the forecast is reduced

Timmermann (UCSD) Model Selection Winter, 2017 8 / 35
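The ridge formula can be sketched directly, with λ = 0 recovering OLS. The Python snippet below is illustrative (data and the λ value are assumptions); it also shows the defining property that shrinkage pulls the coefficient vector toward zero.

```python
import numpy as np

# Ridge estimator sketch: beta = (X'X + lambda*I)^{-1} X'y.
# lambda = 0 gives OLS; larger lambda shrinks the coefficients.
# Data and lambda are assumed values for illustration.

rng = np.random.default_rng(8)
T, k = 100, 5
X = rng.standard_normal((T, k))
y = X @ np.ones(k) + rng.standard_normal(T)

def ridge(X, y, lam):
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

beta_ols = ridge(X, y, 0.0)
beta_shrunk = ridge(X, y, 50.0)
```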

Model selection

Typically a set of models, rather than just a single model is considered when

constructing a forecast of a particular outcome

Models could differ by

dynamic specification (ARMA lags)

predictor variables (covariates)

functional form (nonlinearities)

estimation method (OLS, GMM, MLE)

Can a single ‘best’ model be identified?

Model selection methods attempt to choose such a ‘best’ model

might be hard if the space of models is very large

what if many models have similar performance?

Different from forecast combination which combines forecasts from several

models

Timmermann (UCSD) Model Selection Winter, 2017 9 / 35

Model selection: setup

MK = {M1, …,MK } : Finite set of K forecasting models

Mk : individual model used to generate a forecast, fk (xt , βk ), k = 1, …,K

xt : data (conditioning information or predictors) at time t

βk : parameters for model k

Model selection involves searching overMK to find the best forecasting

model

Data sample: {yt+1, xt}, t = 0, …,T − 1

One model nests another model if the second model is a special case

(smaller version) of the first one. Example:

M1 : yt+1 = β1x1t + ε1t+1 (small model)

M2 : yt+1 = β21x1t + β22x2t + ε2t+1 (big model)

Timmermann (UCSD) Model Selection Winter, 2017 10 / 35

In-sample comparison of models

Two models: M1 = f1(x1, β1) and M2 = f2(x2, β2)

Squared error loss: e1 = y − f1, e2 = y − f2

The second (“large”) model nests the first (“small”)

Coefficient estimates for both models are selected such that

β̂i = arg min_{βi} T−1 ∑_{t=1}^T (yt − fit |t−1(βi ))²

Because f2t |t−1(β2) nests f1t |t−1(β1), it follows that, in a given sample,

T−1 ∑_{t=1}^T (yt − f2t |t−1(β̂2))² ≤ T−1 ∑_{t=1}^T (yt − f1t |t−1(β̂1))²

The larger model (M2) always provides at least as good a fit as the smaller

model (M1) and in most cases will provide a strictly better in-sample fit

Timmermann (UCSD) Model Selection Winter, 2017 11 / 35

In-sample comparison of models

The smaller model’s in-sample fit is always worse even if the true expected

loss under the (first) small model is less than or equal to the expected loss

under the second (large) model, i.e., even if the following holds in population:

E [(yt+1 − f1t+1|t (β∗1))²] ≤ E [(yt+1 − f2t+1|t (β∗2))²]

β∗1, β∗2 : true parameters. These are unknown

Superior in-sample fit does not by itself suggest that a particular forecast

model is necessarily better out-of-sample

Large (complex) models often perform well in comparisons of in-sample fit

even when they perform poorly compared with smaller models when

evaluated on new (out-of-sample) data

Take-away: Overfitting matters. Be careful with large models

Timmermann (UCSD) Model Selection Winter, 2017 12 / 35

Model selection methods

Popular in-sample model selection methods

Information criteria (IC)

Sequential hypothesis testing

Cross validation

LASSO (large dimensional models)

Advantages of each approach depends on whether there are few or very many

potential predictor variables

Also depends on the true, but unknown, model

Are many or few of the predictor variables truly significant?

sparseness

Timmermann (UCSD) Model Selection Winter, 2017 13 / 35

Sequential Hypothesis Testing

Sequential hypothesis tests choose the ‘best’ submodel from a larger set of

models through a sequence of specification tests that identify the relevant

parts of a model and exclude the remainder

Approach reflects how applied researchers construct their models in practice

Remove terms found not to be useful when tested against a smaller model

that omits such variables

Use t−tests, F−tests, or p−values

Difficulties may arise if models are nonnested or include nonlinearities

In matlab: stepwisefit

Timmermann (UCSD) Model Selection Winter, 2017 14 / 35

Sequential Hypothesis Testing

Different orders of the sequence in which variables are tested

forward stepwise

backward stepwise

General-to-specific – start big: include all potential variables in the initial

model. Then remove variables thought not to be useful through a sequence

of tests. The final model typically depends on the sequence of tests

Specific-to-general – start small: begin with a small baseline model with

the ‘main’ variables or simply a constant, then add further variables if they

improve the prediction model

Forward and backward methods can be mixed

Timmermann (UCSD) Model Selection Winter, 2017 15 / 35

Sequential Testing with Linear Model (backwise stepwise)

Kitchen sink with K potential predictors {x1, …, xK }:

yt+1 = β0 + β1x1t + β2x2t + …+ βK−1xK−1t + βK xKt + εt+1

Suppose the smallest absolute value of the t−statistic of any variable falls below some threshold, t̄, such as t̄ = 2:

t_min = min_{k=1,…,K} |t_{β̂_k}| < t̄ = 2

Eliminate the variable with the smallest t−statistic (equivalently, the one with the largest p−value, e.g., exceeding 0.05)
Timmermann (UCSD) Model Selection Winter, 2017 16 / 35
Sequential Testing with Linear Model (cont.)
Suppose xK is dropped. The trimmed model with the remaining K − 1 variables is next re-estimated:

y_{t+1} = β_0 + β_1 x_{1t} + β_2 x_{2t} + … + β_{K−1} x_{K−1,t} + ε_{t+1}

Recalculate the t-statistics for all regressors. Check if

min_{k=1,…,K−1} |t_{β̂_k}| < t̄

and drop the variable with the smallest t−statistic if this condition holds

Repeat the procedure until no further variable is dropped
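The elimination loop above can be sketched in a few lines of Python (a hypothetical helper, not the lecture's matlab stepwisefit; it uses homoskedastic OLS t-statistics):

```python
import numpy as np

def backward_stepwise(X, y, t_threshold=2.0):
    """Backward stepwise selection: repeatedly drop the regressor with
    the smallest absolute t-statistic until every remaining |t-stat|
    exceeds the threshold. Returns indices of the retained columns."""
    T, K = X.shape
    keep = list(range(K))
    while keep:
        Xk = np.column_stack([np.ones(T), X[:, keep]])
        beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
        resid = y - Xk @ beta
        sigma2 = resid @ resid / (T - Xk.shape[1])   # residual variance
        cov = sigma2 * np.linalg.inv(Xk.T @ Xk)
        tstats = np.abs(beta[1:] / np.sqrt(np.diag(cov)[1:]))  # skip intercept
        i_min = int(np.argmin(tstats))
        if tstats[i_min] >= t_threshold:
            break                                    # all survivors significant
        keep.pop(i_min)                              # drop the weakest variable
    return keep
```

With one truly relevant predictor and several irrelevant ones, the loop typically retains the relevant variable and discards most of the noise.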
Timmermann (UCSD) Model Selection Winter, 2017 17 / 35
Forecasts from sequential tests
Backward stepwise forecast:

f_{t+1|t} = β̂_0 + ∑_{k=1}^{K} β̂_k I_k x_{kt}

I_k = 1 if the kth variable is included in the final model. Otherwise I_k = 0

I_k depends on the entire sequence of t−statistics, not only for the kth variable itself but also for all other variables. Why?

In matlab:

stepwisefit(X(1:t-1,:),y(2:t),’display’,’off’,’inmodel’,ones(1,size(X,2))); % backward stepwise model selection
Timmermann (UCSD) Model Selection Winter, 2017 18 / 35
Sequential Hypothesis Testing: Specific to general
Begin from a simple model that only includes an intercept
yt+1 = β0 + εt+1
Next consider all K univariate models (forward stepwise approach):

y_{t+1} = β_{0k} + β_k x_{kt} + ε_{k,t+1},  k = 1, …, K

Add to the model the variable with the highest t−statistic, subject to the condition that this exceeds some threshold value t̄, e.g., t̄ = 2:

t_max = max_{k=1,…,K} |t_{β̂_k}| > t̄

Add regressors from the remaining pool to this univariate model one by one.

New regressor is included if its t−statistic exceeds t̄

Repeat until no further variable is included

matlab: stepwisefit(X(1:t-1,:),y(2:t),’display’,’off’); % forward stepwise model

selection

Timmermann (UCSD) Model Selection Winter, 2017 19 / 35

Sequential approach

Forecasts from the backward and forward stepwise approaches take the form

f_{t+1|t} = β̂_0 + ∑_{k=1}^{K} β̂_k I_k x_{kt}

I_k = 1 if the kth variable is included in the final model. Otherwise I_k = 0

Advantages:

intuitive appeal

simplicity

computationally easy

Disadvantages:

no comprehensive search across all possible models

outcome can be path dependent – no guarantee that it finds the globally

optimal model

pre-test biases: hard to control the size

Timmermann (UCSD) Model Selection Winter, 2017 20 / 35

Information Criteria (IC)

ICs trade off model fit against a penalty for model complexity measured by

the number of freely estimated parameters

ICs were developed under different assumptions regarding the ‘correct’

underlying model and hence have different properties

ICs ‘correct’ a minimization criterion for the effect of parameter estimation

which tends to make large models appear better in-sample than they really are

Popular information criteria:

Bayes information criterion (BIC or SIC)

Akaike information criterion (AIC)

In matlab: aicbic . Choose model with smallest

aic = −2 ∗ logL+ 2 ∗ numParam

bic = −2 ∗ logL+ numParam ∗ log (numObs)
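For a linear regression, the same criteria can be computed directly from the residual variance. A small Python sketch (illustrative; uses the concentrated Gaussian likelihood, so constants are dropped, matching the ln σ̂² + penalty form on the slides):

```python
import numpy as np

def ols_ic(X, y):
    """AIC and BIC for an OLS regression, in the
    ln(sigma^2) + penalty form (additive constants omitted)."""
    T, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / T          # ML estimate of residual variance
    aic = np.log(sigma2) + 2 * k / T
    bic = np.log(sigma2) + k * np.log(T) / T
    return aic, bic
```

For the same model and more than e² ≈ 7.4 observations, BIC exceeds AIC, reflecting its heavier penalty; across competing models one picks the specification with the smallest criterion value.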

Timmermann (UCSD) Model Selection Winter, 2017 21 / 35

Information Criteria

How much additional parameters improve the in-sample fit depends on the

true model, so differences across information criteria hinge on how to

practically get around our ignorance about the true model

Information criteria are used to rank a set of parametric models, M_k

Each model M_k ∈ M_K requires estimating n_k parameters, β_k

To conduct a comprehensive search over all possible models with K potential

predictor variables, {x1, …., xK } means considering 2K possible model

specifications

Example: K = 2 : Two possible predictors {x1, x2} yield four models {0, 0},

{1, 0}, {0, 1}, and {1, 1}

with K = 11, 211 = 2, 048 models

with K = 20, 2K = 220 > 1, 000, 000 models

Timmermann (UCSD) Model Selection Winter, 2017 22 / 35

Bayesian/Schwarz Information Criterion

BIC = −2LogLk + nk ln(T )

nk : number of parameters of model k

T : sample size – penalty depends on the sample size: bigger T , bigger

penalty

For linear regressions, the BIC takes the form

BIC = ln σ̂2k + nk ln(T )/T

σ̂²_k = e′_k e_k / T : sample estimate of the residual variance

Select the model with the smallest BIC

BIC is a consistent model selection criterion: It selects the true model in a

very large sample (big T ) if this is included in M_K

Timmermann (UCSD) Model Selection Winter, 2017 23 / 35

Akaike Information Criterion

AIC = −2LogLk + 2nk

AIC minimizes the distance between the true model and the fitted model

For linear regression models

AIC(k) = ln σ̂²_k + 2n_k / T

AIC penalizes inclusion of extra parameters less than the SIC

AIC is not a consistent model selection criterion – it tends to select models

with too many parameters

AIC selects the best “approximate” model – asymptotic efficiency

Timmermann (UCSD) Model Selection Winter, 2017 24 / 35

Cross-validation

Cross validation (CV) avoids overfitting by removing the correlation that

causes the estimated in-sample loss to be “small” due to the use of the same

observations for both parameter estimation and model evaluation

CV makes use of the full dataset for both estimation and evaluation

CV averages over all possible combinations of estimation and evaluation

samples obtainable from a given data set

‘Leave one out’ CV estimator holds out one observation for model evaluation

Remaining observations are used for estimation of the parameters

The loss is calculated solely from the evaluation sample – this breaks the

connection leading to overfitting

Repeat calculation for all possible ways to leave out one data point for model

evaluation

CV can be computationally slow if T is large

Timmermann (UCSD) Model Selection Winter, 2017 25 / 35

Illustration: Estimation of sample mean under MSE

Estimation of sample mean ȳ_T = T^{−1} ∑_{t=1}^{T} y_t for an i.i.d. time series, y_t

Mean Squared Error (MSE):

T^{−1} ∑_{t=1}^{T} (y_t − ȳ_T)² = T^{−1} ∑_{t=1}^{T} ε²_t − (ȳ_T − µ)²  ⇒

E[ T^{−1} ∑_{t=1}^{T} (y_t − ȳ_T)² ] = σ² − E[(ȳ_T − µ)²] ≤ σ²

The estimation error, E[(ȳ_T − µ)²], gets subtracted from the MSE!

The in-sample MSE based on the fitted mean will on average be smaller than

the true MSE computed under known µ

Cross validation breaks the correlation between the forecast error and the

estimation error

Separate observations used to estimate the parameters of the prediction model

(the sample mean) from observations used to compute the MSE

Timmermann (UCSD) Model Selection Winter, 2017 26 / 35

How does the classical ‘leave one out’ CV work?

At each point in time, t, use the sample mean ȳ_{−t} = (T − 1)^{−1} ∑_{s=1, s≠t}^{T} y_s that leaves out observation y_t

We can show that

E[ T^{−1} ∑_{t=1}^{T} (y_t − ȳ_{−t})² ] = E[ T^{−1} ∑_{t=1}^{T} ε²_t ] + (T − 1)^{−1} E[ T^{−1} ∑_{t=1}^{T} ε²_t ] = σ²(1 + (T − 1)^{−1}) > σ²

The expected squared error based on the leave-one-out estimator ȳ_{−t} is larger than that based on the usual estimate, ȳ_T, which does not leave anything out

The CV estimator tells us (correctly) that the out-of-sample MSE exceeds σ²
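For the sample-mean example, the bias removal can be checked directly. In this special case the leave-one-out error is simply the in-sample error scaled up by T/(T − 1), as the following Python sketch (illustrative, i.i.d. data) verifies:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
y = rng.normal(size=T)               # i.i.d. sample, true mean 0, variance 1

# In-sample MSE around the full-sample mean (downward biased)
mse_in = np.mean((y - y.mean()) ** 2)

# Leave-one-out CV: predict y_t with the mean of the other T - 1 observations
loo_pred = (y.sum() - y) / (T - 1)
mse_loo = np.mean((y - loo_pred) ** 2)

# Identity: y_t - loo_pred_t = (y_t - ybar) * T/(T-1),
# so mse_loo = mse_in * (T/(T-1))^2 > mse_in
assert mse_loo > mse_in
```

The CV estimate of the MSE is always above the optimistic in-sample estimate, mirroring the σ²(1 + (T − 1)^{−1}) result.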

Timmermann (UCSD) Model Selection Winter, 2017 27 / 35

How many predictor variables do we have?

Low-dimensional set of variables

Large-dimensional: hundreds or thousands

The Federal Reserve Bank of St. Louis database, FRED, has 429,000 time series

Timmermann (UCSD) Model Selection Winter, 2017 28 / 35

Lasso Model Selection

Lasso (Least Absolute Shrinkage and Selection Operator) is a type of

shrinkage estimator for least squares regression

Shrinkage estimators (“pull towards zero”) reduce the effect of sampling

errors

Lasso estimates linear regression coefficients by minimizing the sum of least

squares residuals subject to a penalty function

T^{−1} ∑_{t=1}^{T} (y_t − β′x_{t−1})² + λ ∑_{i=1}^{n_k} |β_i|

where the second term is the penalty

λ : scalar tuning parameter determining the size of the penalty

Shrinks the parameter estimates towards zero

λ = 0 gives OLS estimates. Big values of λ pull β̂ towards zero

No closed form solution for minimizing the expression – computational

methods are required
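The minimization can be carried out by cyclic coordinate descent with soft-thresholding, which is what sets small coefficients exactly to zero. A minimal Python sketch (illustrative; assumes roughly standardized regressors and is not matlab's lasso routine):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso via cyclic coordinate descent.
    Minimizes (1/T) * sum_t (y_t - x_t'b)^2 + lam * sum_i |b_i|."""
    T, K = X.shape
    b = np.zeros(K)
    for _ in range(n_iter):
        for j in range(K):
            r = y - X @ b + X[:, j] * b[j]       # partial residual excluding x_j
            rho = X[:, j] @ r / T
            z = X[:, j] @ X[:, j] / T
            # soft-thresholding: small coefficients are set exactly to zero
            b[j] = np.sign(rho) * max(abs(rho) - lam / 2, 0.0) / z
    return b
```

Larger λ zeroes out more coefficients; λ = 0 reproduces OLS, matching the slide's description of the tuning parameter.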

Timmermann (UCSD) Model Selection Winter, 2017 29 / 35

Lasso Model Selection

Common to re-estimate parameters of selected variables by OLS

In matlab: lasso

“lasso performs lasso or elastic net regularization for linear regression.

[B,STATS] = lasso(X,Y,…) Performs L1-constrained linear least squares fits

(lasso) or L1- and L2-constrained fits (elastic net) relating the predictors in X

to the responses in Y. The default is a lasso fit, or constraint on the L1-norm

of the coefficients B.”

matlab uses cross-validation to choose the weight on the penalty term, λ

Lasso tends to set many coefficients to zero and can thus be used for model

selection

Timmermann (UCSD) Model Selection Winter, 2017 30 / 35

Variable selection and Lasso (Patrick Breheny slides)

Timmermann (UCSD) Model Selection Winter, 2017 31 / 35

Empirical example

Forecasts of quarterly (excess) stock returns

Twelve predictor variables:

dividend-price ratio,

dividend-earnings (payout) ratio,

stock volatility,

book-to-market ratio,

net equity issues,

T-bill rate,

long term return,

term spread,

default yield,

default return,

inflation

investment-capital ratio

Timmermann (UCSD) Model Selection Winter, 2017 32 / 35

Time-series forecasts of quarterly stock returns

[Figure: four panels of time-series forecasts of quarterly stock returns, 1970Q1–2010Q4: AIC/BIC/CrossVal; Forward/Backward; Bagging α=2%/α=3%; Lasso λ=8/λ=3]

Timmermann (UCSD) Model Selection Winter, 2017 33 / 35

Inclusion of individual variables

[Figure: four panels, 1970Q1–2010Q4, showing when each method (AIC, BIC, CV, Forward, Backward, Lasso) includes the dp, tbl, tms, and dfy predictors]

Timmermann (UCSD) Model Selection Winter, 2017 34 / 35

Conclusion

Model selection increases the “space” over which the search for the

forecasting model is conducted

Model uncertainty matters and can be as important as parameter

estimation error

When one model is clearly superior to others it will nearly always be selected

No free lunch – when a single model is not obviously superior to all other

models, different models are selected by different criteria in different samples

statistical techniques for model selection are used precisely because models are

hard to tell apart, and not because one model is obviously much better than

the others

Timmermann (UCSD) Model Selection Winter, 2017 35 / 35

Lecture 4: Updating and Forecasting with New

Information

UCSD, January 30, 2017

Allan Timmermann1

1UC San Diego

Timmermann (UCSD) Filtering Winter, 2017 1 / 53

1 Bayes rule and updating beliefs

2 The Kalman Filter

Application to inflation forecasting

3 Nowcasting

4 Jagged Edge Data

5 Daily Business Cycle Indicator for the U.S.

6 Markov Switching models

Empirical examples

Timmermann (UCSD) Filtering Winter, 2017 2 / 53

Updating and forecasting

A good forecasting model closely tracks how data evolve over time

Important to update beliefs about the predicted variable or the

forecasting model as new information arrives

Filtering

Suppose the current “state” is unobserved. For example we may not

know in real time if the economy is in a recession

Nowcasting: obtaining the best estimate of the current state

How do we accurately and efficiently update our forecasts as new

information arrives?

Kalman filter (continuous state)

Regime switching model (small number of states)

Timmermann (UCSD) Filtering Winter, 2017 2 / 53

Bayes’ rule

Bayes rule for two random variables A and B:

P(B|A) = P(A|B) P(B) / P(A)

P(A) : probability of event A

P(B) : probability of event B

P(B |A) : probability of event B given that event A occurred

Timmermann (UCSD) Filtering Winter, 2017 3 / 53

Bayes’ rule (cont.)

Let θ be some unknown model parameters while y are observed data.

By Bayes’ rule (setting B = θ, A = y)

P(θ|y) = P(y|θ) P(θ) / P(y)

Given some data, y , what do we know about model parameters θ?

If we are only interested in θ, we can ignore P(y):

P(θ|y) ∝ P(y|θ) × P(θ)

(posterior ∝ likelihood × prior)

We start with prior beliefs before seeing the data. Updating these

beliefs with the observed data, we get posterior beliefs

Parameters, θ, that do not fit the data become less likely

For example, if θ_H is ’high mean returns’ and we observe a data sample with low mean returns, then we put less weight on θ_H

Timmermann (UCSD) Filtering Winter, 2017 4 / 53

Examples of Bayes’ rule

B : European recession. A : European growth rate of −1%

P(recession | g = −1%) = P(g = −1% | recession) P(recession) / P(g = −1%)

B : bear market. A : negative returns of −5%

P(bear | r = −5%) = P(r = −5% | bear) P(bear) / P(r = −5%)

Here P (recession) and P(bear) are the initial probabilities of being in

a recession/bear market (before observing the data)

Timmermann (UCSD) Filtering Winter, 2017 5 / 53

Understanding the updating process

Suppose two random variables Y and X are jointly normally distributed:

(Y, X)′ ∼ N( (µ_y, µ_x)′, [ σ²_y  σ_xy ; σ_xy  σ²_x ] )

µ_y and µ_x are the initial (unconditional) expected values of Y and X

σ²_y, σ²_x, σ_xy are the variances and covariance

Conditional mean and variance of Y given an observation X = x:

E[Y | X = x] = µ_y + (σ_xy / σ²_x)(x − µ_x)

Var(Y | X = x) = σ²_y − σ²_xy / σ²_x

If Y and X are positively correlated (σ_xy > 0) and we observe a high value of X (x > µ_x), then we increase our expectation of Y

Just like a linear regression! σ_xy/σ²_x is the beta coefficient
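A tiny Python helper (illustrative names, not part of the lecture) makes the regression analogy concrete:

```python
def normal_update(mu_y, mu_x, var_y, var_x, cov_xy, x_obs):
    """Conditional mean and variance of Y given X = x_obs
    when (Y, X) is bivariate normal."""
    beta = cov_xy / var_x                      # the regression beta
    cond_mean = mu_y + beta * (x_obs - mu_x)
    cond_var = var_y - cov_xy ** 2 / var_x     # never exceeds var_y
    return cond_mean, cond_var

# e.g. mu_y = mu_x = 0, unit variances, covariance 0.5, observe x = 2:
# the updated mean is 0 + 0.5 * 2 = 1 and the variance shrinks to 0.75
```

Observing X always (weakly) reduces the uncertainty about Y, and by more the stronger the correlation.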

Timmermann (UCSD) Filtering Winter, 2017 6 / 53

Kalman Filter: Background

The Kalman filter is an algorithm for linear updating and prediction

Introduced by Kalman in 1960 for engineering applications

Method has found great use in many disciplines, including economics

and finance

Kalman Filter gives an updating rule that can be used to revise our

beliefs as we see more and more data

For models with normally distributed variables, the filter can be used

to write down the likelihood function

Timmermann (UCSD) Filtering Winter, 2017 7 / 53

Kalman Filter (Wikipedia) I

“Kalman filtering, also known as linear quadratic estimation (LQE), is

an algorithm that uses a series of measurements observed over time,

containing noise (random variations) and other inaccuracies, and

produces estimates of unknown variables that tend to be more precise

than those based on a single measurement alone. More formally, the

Kalman filter operates recursively on streams of noisy input data to

produce a statistically optimal estimate of the underlying system

state. The filter is named for Rudolf (Rudy) E. Kálmán, one of the

primary developers of its theory.

The Kalman filter has numerous applications in technology. A

common application is for guidance, navigation and control of

vehicles, particularly aircraft and spacecraft. Furthermore, the

Kalman filter is a widely applied concept in time series analysis used

in fields such as signal processing and econometrics.

Timmermann (UCSD) Filtering Winter, 2017 8 / 53

Kalman Filter (Wikipedia) II

The algorithm works in a two-step process. In the prediction step,

the Kalman filter produces estimates of the current state variables,

along with their uncertainties. Once the outcome of the next

measurement (necessarily corrupted with some amount of error,

including random noise) is observed, these estimates are updated

using a weighted average, with more weight being given to estimates

with higher certainty. Because of the algorithm’s recursive nature, it

can run in real time using only the present input measurements and

the previously calculated state and its uncertainty matrix; no

additional past information is required.”

Timmermann (UCSD) Filtering Winter, 2017 9 / 53

Kalman Filter: Models in state space form

Let St be an unobserved (state) variable while yt is an observed

variable. A model that shows how yt is related to St and how St

evolves is called a state space model. This has two equations:

State equation (unobserved/latent):

S_t = φ S_{t−1} + ε_{st},  ε_{st} ∼ (0, σ²_s)  (1)

Measurement equation (observed):

y_t = B S_t + ε_{yt},  ε_{yt} ∼ (0, σ²_y)  (2)

Innovations are uncorrelated with each other: Cov(ε_{st}, ε_{yt}) = 0

Timmermann (UCSD) Filtering Winter, 2017 10 / 53

Example 1: AR(1) model in state space form

AR(1) model

yt = φyt−1 + εt

This can be written in state space form as

S_t = φ S_{t−1} + ε_t  (state eq.)

y_t = S_t  (measurement eq.)

with B = 1, σ²_s = σ²_ε, and σ²_y = 0

very simple: no error in the measurement equation: yt is observed

without error

Timmermann (UCSD) Filtering Winter, 2017 11 / 53

Example 2: MA(1) model in state space form

MA(1) model with unobserved shocks εt :

yt = εt + θεt−1

This can be written in state space form:

(S_{1t}, S_{2t})′ = [ 0 0 ; 1 0 ] (S_{1,t−1}, S_{2,t−1})′ + (1, 0)′ ε_t

y_t = (1  θ) (S_{1t}, S_{2t})′ = ε_t + θ ε_{t−1}

where the 2×2 transition matrix plays the role of φ and (1  θ) the role of B

Note that S_{1t} = ε_t and S_{2t} = S_{1,t−1} = ε_{t−1}

Timmermann (UCSD) Filtering Winter, 2017 12 / 53

Example 3: Unobserved components model

The unobserved components model consists of two equations

yt = St + εyt (B = 1)

St = St−1 + εst (φ = 1)

yt is observed with noise

St is the underlying “mean” of yt . This is smoother than yt

This model can be written as an ARIMA(0,1,1):

yt − yt−1 = St − St−1 + εyt − εyt−1

= εst + εyt − εyt−1 : MA(1)

Timmermann (UCSD) Filtering Winter, 2017 13 / 53

Kalman Filter: Advantages

The state equation in (1) is in AR(1) form and so is easy to iterate

forward. The h−step-ahead forecast of the state given its current

value, St , is given by

E_t[S_{t+h} | S_t] = φ^h S_t

In practice we don’t observe St and so need an estimate of this given

current information, St |t , or past information, St |t−1

Updating the Kalman filter through newly arrived information is easy

Timmermann (UCSD) Filtering Winter, 2017 14 / 53

Kalman Filter Updating Equations

It = {yt , yt−1, yt−2, …}. Current information

It−1 = {yt−1, yt−2, yt−3, …}. Lagged information

yt : random variable we want to predict

yt |t−1 : best prediction of yt given information at t − 1, It−1

St |t−1 : best prediction of St given information at t − 1, It−1

St |t : best “prediction” (or nowcast) of St given information It

Define mean squared error (MSE) values associated with the forecasts of S_t and y_t:

MSE^S_{t|t−1} = E[(S_t − S_{t|t−1})²]

MSE^S_{t|t} = E[(S_t − S_{t|t})²]

MSE^y_{t|t−1} = E[(y_t − y_{t|t−1})²]

Timmermann (UCSD) Filtering Winter, 2017 15 / 53

Prediction and Updating Equations

Using the state, measurement, and MSE equations, the Kalman filter gives a set of prediction equations:

S_{t|t−1} = φ S_{t−1|t−1}

MSE^S_{t|t−1} = φ² MSE^S_{t−1|t−1} + σ²_s

y_{t|t−1} = B S_{t|t−1}

MSE^y_{t|t−1} = B² MSE^S_{t|t−1} + σ²_y

Similarly, we have a pair of updating equations for S:

S_{t|t} = S_{t|t−1} + B (MSE^S_{t|t−1} / MSE^y_{t|t−1}) (y_t − y_{t|t−1})

MSE^S_{t|t} = MSE^S_{t|t−1} [1 − B² (MSE^S_{t|t−1} / MSE^y_{t|t−1})]
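The scalar case of these prediction and updating recursions fits in a short Python function (a sketch of the textbook filter, not matlab's ssm; parameter names follow the slides):

```python
import numpy as np

def kalman_filter(y, phi, B, var_s, var_y, S0=0.0, P0=1.0):
    """Scalar Kalman filter for
        S_t = phi * S_{t-1} + eps_s,  eps_s ~ (0, var_s)   (state)
        y_t = B * S_t + eps_y,        eps_y ~ (0, var_y)   (measurement)
    Returns the filtered states S_{t|t}."""
    S, P = S0, P0                        # S_{t-1|t-1} and its MSE
    filtered = []
    for yt in y:
        # prediction step
        S_pred = phi * S
        P_pred = phi ** 2 * P + var_s
        y_pred = B * S_pred
        F = B ** 2 * P_pred + var_y      # MSE of the forecast of y_t
        # updating step: gain times the surprise in y_t
        gain = B * P_pred / F
        S = S_pred + gain * (yt - y_pred)
        P = P_pred * (1 - B * gain)
        filtered.append(S)
    return np.array(filtered)
```

On data simulated from an unobserved-components model (phi = B = 1), the filtered states track the true state more closely than the raw, noisy observations do.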

Timmermann (UCSD) Filtering Winter, 2017 16 / 53

Prediction and Updating Equations

Intuition for updating equations (B = 1):

S_{t|t} = S_{t|t−1} + (MSE^S_{t|t−1} / MSE^y_{t|t−1}) (y_t − y_{t|t−1})

S_{t|t} : estimate of the current (t) state given current information I_t

S_{t|t−1} : old (t − 1) estimate of the state S_t given I_{t−1}

MSE^S_{t|t−1} / MSE^y_{t|t−1} : amount by which we update our estimate of the current state after we observe y_t. This is small if MSE^y_{t|t−1} is big (noisy data) relative to MSE^S_{t|t−1}, i.e., σ²_y >> σ²_s

(y_t − y_{t|t−1}) : surprise (news) about y_t

If y_t is higher than we expected, (y_t − y_{t|t−1}) > 0 and we increase our expectations about the state: S_{t|t} > S_{t|t−1}. The updating equation tells us by how much

Timmermann (UCSD) Filtering Winter, 2017 17 / 53

Starting the Algorithm

At t = 0, we have not observed any data, so we must make our best guesses of S_{1|0} and MSE^S_{1|0} without data by picking a pair of initial conditions. This gives y_{1|0} and MSE^y_{1|0} from the prediction equations

At t = 1 we observe y_1. The updating equations generate S_{1|1} and MSE^S_{1|1}. The prediction equations then generate forecasts for the second period

At t = 2 we observe y2, and the cycle continues to give sequences of

predictions of the states, {St |t} and {St |t−1}

Keep on iterating to get a sequence of estimates

Timmermann (UCSD) Filtering Winter, 2017 18 / 53

Filtered versus smoothed states

St |t : filtered states: estimate of the state at time t given information

up to time t

Uses only historical information

What is my best guess of St given my current information?

“Filters” past historical information for noise

St |T : smoothed states: estimate of the state at time t given

information up to time T

Uses the full sample up to time T ≥ t

Less sensitive to noise and thus tends to be smoother than the filtered

states

Information on yt−1, yt , yt+1 help us more precisely estimate the state

at time t, St

Timmermann (UCSD) Filtering Winter, 2017 19 / 53

Practical applications of the Kalman filter

Common to use Kalman filter to estimate adaptive forecasting models

with time-varying relations:

yt+1 = βtxt + εt+1

Two alternative specifications for βt :

βt − β̄ = φ(βt−1 − β̄) + ut : mean-reverts to β̄

βt = βt−1 + ut : random walk

yt , xt : observed variables

β_t : time-varying coefficient (unobserved state variable)

Timmermann (UCSD) Filtering Winter, 2017 20 / 53

Kalman filter in matlab

Matlab has a good Kalman filter called ssm (state space model). The

setup is

xt = Atxt−1 + Btut

yt = Ctxt +Dtet

xt : unobserved state (our St)

yt : observed variable

ut , et : uncorrelated noise processes with variance of one

model = ssm(A,B,C,D,’StateType’,stateType); % state space model

modelEstimate = estimate(model,variable,params0,’lb’,[0; 0])

filtered = filter(modelEstimate,variable)

smoothed = smooth(modelEstimate,variable)

Timmermann (UCSD) Filtering Winter, 2017 21 / 53

Kalman filter example: monthly inflation

Unobserved components model for inflation

xt = xt−1 + σuut

yt = xt + σeet

A = 1; % state-transition matrix (A = φ in our notation)

B = NaN; % state-disturbance-loading matrix (B = σS )

C = 1; % measurement-sensitivity matrix (C = B in our notation)

D = NaN; % observation-innovation matrix (D = σy )

stateType = 2; % sets state equation to be a random walk

Timmermann (UCSD) Filtering Winter, 2017 22 / 53

Application of Kalman filter to monthly US inflation

[Figure: monthly US inflation, 1930–2010]

Timmermann (UCSD) Filtering Winter, 2017 23 / 53

Kalman filter estimates of inflation (last 100 obs.)

[Figure: monthly inflation together with filtered and smoothed Kalman estimates, 2005–2012]

Timmermann (UCSD) Filtering Winter, 2017 24 / 53

Kalman filter take-aways

Kalman filter is a very popular approach for dynamically updating

linear forecasts

Used to estimate ARMA models

Used throughout engineering and the social sciences

Fast, easy algorithm

Optimal updating equations for normally distributed data

Timmermann (UCSD) Filtering Winter, 2017 25 / 53

Nowcasting

Nowcasting refers to “estimating the present”

Nowcasting extracts information about the present state of some

variable or system of variables

distinct from traditional forecasting

Nowcasting only makes sense if the present state is unknown; otherwise nowcasting would just amount to checking the current value

Example: Use a single unobserved state variable to summarize the

state of the economy, e.g., the daily point in the business cycle

Variables such as GDP are actually observed with large measurement

errors (revisions)

Timmermann (UCSD) Filtering Winter, 2017 26 / 53

Jagged edge data

Macroeconomic data such as GDP, monetary aggregates,

consumption, unemployment figures or housing starts as well as

financial data extracted from balance sheets and income statements

are published infrequently and sometimes at irregular intervals

Delays in the publication of macro variables differ across variables

Irregular data releases (release date changes from month to month)

generate what is often called “jagged edge”data

A forecaster can only use the data that is available on any given date

and needs to pay careful attention to which variables are in the

information set

Timmermann (UCSD) Filtering Winter, 2017 27 / 53

Aruoba, Diebold and Scotti daily business cycle indicator

ADS model the daily business cycle, St , as an unobserved variable

that follows a (zero-mean) AR(1) process:

St = φSt−1 + et

Although St is unobserved, we can extract information about it from

its relation with a set of observed economic variables y1t , y2t , …

At the daily horizon these variables follow processes:

y_{it} = k_i + β_i S_t + γ_i y_{i,t−D_i} + u_{it},  i = 1, …, n

Di equals seven days if the variable is observed weekly, etc.

Timmermann (UCSD) Filtering Winter, 2017 28 / 53

Aruoba, Diebold and Scotti index from Philly Fed

Timmermann (UCSD) Filtering Winter, 2017 29 / 53

ADS five variable model

The ADS model can be written in state-space form

For example, a model could use the following observables:

interest rates (daily, y1t)

initial jobless claims (weekly, y2t)

personal income (monthly, y3t)

industrial production (monthly, y4t)

GDP (quarterly, y5t)

Kalman filter can be used to extract and update estimates of the

unobserved common variable that tracks the state of the economy

Kalman filter is well suited for handling missing data

If all elements of yt are missing on a given day, we skip the updating

step

Timmermann (UCSD) Filtering Winter, 2017 30 / 53

Markov Chains: Basics

Updating equations simplify a great deal if we only have two states,

states 1 and 2, and want to know which state we are currently in

recession/expansion

inflation/deflation

bull/bear market

high volatility/low volatility

Timmermann (UCSD) Filtering Winter, 2017 31 / 53

Markov Chains: Basics

A first order (constant) Markov chain, St , is a random process that

takes integer values {1, 2, ….,K} with state transitions that depend

only on the most recent state, St−1

Probability of moving from state i at time t − 1 to state j at time t is p_{ij}:

P(S_t = j | S_{t−1} = i) = p_{ij},  0 ≤ p_{ij} ≤ 1,  ∑_{j=1}^{K} p_{ij} = 1

Timmermann (UCSD) Filtering Winter, 2017 32 / 53

Fitted values, 3-state model for monthly T-bill rates

[Figure: fitted values from the 3-state model for monthly T-bill rates, 1920–2020]

Timmermann (UCSD) Filtering Winter, 2017 33 / 53

Two-state Markov Chain

With K = 2 states, the transition probabilities can be collected in a 2×2 matrix

P = P(S_{t+1} = j | S_t = i) = [ p_{11}  p_{21} ; p_{12}  p_{22} ] = [ p_{11}  1−p_{22} ; 1−p_{11}  p_{22} ]

Law of total probability: pi1 + pi2 = 1: we either stay in state i or we

leave to state j

p_{ii} : “stayer” probability – measure of state i’s persistence

Timmermann (UCSD) Filtering Winter, 2017 34 / 53

Basic regime switching model

Simple two-state regime-switching model

yt+1 = µst+1 + σst+1 εt+1, εt+1 ∼ N(0, 1)

P(st+1 = j |st = i) = pij

µst+1 : mean in state st+1

σst+1 : volatility in state st+1

st+1 matters for both the mean and volatility of yt+1

if st+1 = 1 : yt+1 = µ1 + σ1εt+1

if st+1 = 2 : yt+1 = µ2 + σ2εt+1

Timmermann (UCSD) Filtering Winter, 2017 35 / 53

Updating state probabilities

To predict yt+1, we need to predict st+1. This depends on the current

state, st

Let p1t |t = prob(st = 1|It ) be the current probability of being in

state 1 given all information up to time t, It

If p1t |t = 1, we know for sure that we are in state 1 at time t

Typically p1t |t < 1 and there is uncertainty about the present state
Let p1t+1|t = prob(st+1 = 1|It ) be the predicted probability of being
in state 1 next period (t + 1), given It
Timmermann (UCSD) Filtering Winter, 2017 36 / 53
Updating state probabilities
To be in state 1 at time t + 1, we must have come from either state 1
or from state 2:
p1t+1|t = p11 × p1t |t + (1− p22)× p2t |t
p2t+1|t = (1− p11)× p1t |t + p22 × p2t |t
If p1t |t = 1, we know for sure that we are in state 1 at time t. Then
the equations simplify to
p1t+1|t = p11 × 1+ (1− p22)× 0 = p11
p2t+1|t = (1− p11)× 1+ p22 × 0 = 1− p11
Timmermann (UCSD) Filtering Winter, 2017 37 / 53
Updating with two states
Let P(st = 1|yt−1) and P(st = 2|yt−1) be our initial estimates of being in states 1 and 2 given information at time t − 1
In period t we observe a new data point: yt
If we are in state 1 the likelihood of observing yt is P(yt |st = 1)
If we are in state 2 the likelihood of yt is P(yt |st = 2)
If these are normally distributed, we have

P(y_t | s_t = 1) = (1/√(2πσ²_1)) exp( −(y_t − µ_1)² / (2σ²_1) )

P(y_t | s_t = 2) = (1/√(2πσ²_2)) exp( −(y_t − µ_2)² / (2σ²_2) )  (3)
Timmermann (UCSD) Filtering Winter, 2017 38 / 53
Bayesian updating with two states: examples I
Use Bayes’ rule to compute the updated state probabilities:

P(s_t = 1 | y_t) = P(y_t | s_t = 1) P(s_t = 1) / P(y_t), where

P(y_t) = P(y_t | s_t = 1) P(s_t = 1) + P(y_t | s_t = 2) P(s_t = 2)

Similarly,

P(s_t = 2 | y_t) = P(y_t | s_t = 2) P(s_t = 2) / P(y_t)

Suppose that µ_1 < 0 and σ²_1 is "large", so state 1 is a high-volatility state with negative mean, while µ_2 > 0 with small σ²_2, so state 2 is a “normal” state

Timmermann (UCSD) Filtering Winter, 2017 39 / 53

Bayesian updating with two states: examples II

If we see a large negative yt , this is most likely drawn from state 1

and so P(yt |st = 1) > P(yt |st = 2). Then we revise upward the

probability that we are currently (at time t) in state 1

Example:

µ1 = −3, σ1 = 5, µ2 = 1, σ2 = 2

P(st = 1|yt−1) = 0.70, P(st = 2|yt−1) = 0.30 : initial estimates

p11 = 0.8, p22 = 0.9

Suppose we observe y_t = −4. Then from (3)

p(y_t | s_t = 1) = Normpdf(−1/5) = 0.0782

p(y_t | s_t = 2) = Normpdf(−5/2) = 0.0088

P(s_t = 1 | y_t) = (0.0782 × 0.70) / (0.0782 × 0.70 + 0.0088 × 0.30) = 0.954

P(s_t = 2 | y_t) = (0.0088 × 0.30) / (0.0782 × 0.70 + 0.0088 × 0.30) = 0.046
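The whole updating-and-forecasting cycle of this example can be reproduced in a few lines of Python (values taken from the slides; normpdf is a hypothetical helper for the normal density):

```python
from math import exp, pi, sqrt

def normpdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# state 1: volatile with negative mean; state 2: "normal" state
mu1, s1, mu2, s2 = -3.0, 5.0, 1.0, 2.0
prior1, prior2 = 0.70, 0.30          # P(s_t = i | y_{t-1})
p11, p22 = 0.8, 0.9                  # stayer probabilities

y = -4.0                             # observed return
like1, like2 = normpdf(y, mu1, s1), normpdf(y, mu2, s2)
post1 = like1 * prior1 / (like1 * prior1 + like2 * prior2)   # ~0.954
post2 = 1.0 - post1                                          # ~0.046

# one-step-ahead state probabilities via the transition matrix
pred1 = post1 * p11 + post2 * (1 - p22)                      # ~0.768
pred2 = 1.0 - pred1                                          # ~0.232

# implied mean and variance forecasts for y_{t+1}
mean_fc = mu1 * pred1 + mu2 * pred2                          # ~ -2.07
var_fc = (s1 ** 2 * pred1 + s2 ** 2 * pred2
          + pred1 * pred2 * (mu2 - mu1) ** 2)                # ~ 22.98
```

Changing the observation to y = +1 reproduces the numbers in the later example, where the probability of the volatile state falls instead of rising.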

Timmermann (UCSD) Filtering Winter, 2017 40 / 53

Bayesian updating with two states: examples III

Because the observed value (-4%) is far more likely to have been

drawn from state 1 than from state 2, we revise upwards our beliefs

that we are currently in the first state from 70% to 95.4%

Using p11 and p22, our forecast of being in state 1 next period (at

time t + 1) is

P(st+1 = 1|yt ) = 0.954× 0.8+ 0.046× (1− 0.9) = 0.768

Our forecast of being in state 2 next period is

P(st+1 = 2|yt ) = 0.954× (1− 0.8) + 0.046× 0.9 = 0.232

Timmermann (UCSD) Filtering Winter, 2017 41 / 53

Bayesian updating with two states: examples IV

Similarly, the mean and variance forecasts in this case are given by

E [yt+1|yt ] = µ1P(st+1 = 1|yt ) + µ2P(st+1 = 2|yt )
= −3 × 0.768 + 1 × 0.232 = −2.07

Var(yt+1|yt ) = σ1²P(st+1 = 1|yt ) + σ2²P(st+1 = 2|yt ) + P(st+1 = 1|yt )P(st+1 = 2|yt )(µ2 − µ1)²
= 5² × 0.768 + 2² × 0.232 + 0.768 × 0.232 × (1 + 3)²
= 22.98

Timmermann (UCSD) Filtering Winter, 2017 42 / 53
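The update-and-predict steps in this example take only a few lines of code. This is a minimal Python sketch (not the course's Matlab code on Ted); the function names are made up, and the numbers reproduce the example above.

```python
import math

def normpdf(y, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def update_and_predict(y, prior1, mu, sigma, p11, p22):
    """One step of the two-state filter: Bayes update, then transition."""
    lik1 = normpdf(y, mu[0], sigma[0])             # P(y_t | s_t = 1)
    lik2 = normpdf(y, mu[1], sigma[1])             # P(y_t | s_t = 2)
    denom = lik1 * prior1 + lik2 * (1 - prior1)    # P(y_t)
    post1 = lik1 * prior1 / denom                  # P(s_t = 1 | y_t)
    pred1 = post1 * p11 + (1 - post1) * (1 - p22)  # P(s_{t+1} = 1 | y_t)
    return post1, pred1

# Example from the slides: mu1 = -3, sigma1 = 5, mu2 = 1, sigma2 = 2, y_t = -4
post1, pred1 = update_and_predict(-4.0, 0.70, (-3.0, 1.0), (5.0, 2.0), 0.8, 0.9)
mean_fc = -3.0 * pred1 + 1.0 * (1 - pred1)
var_fc = 25.0 * pred1 + 4.0 * (1 - pred1) + pred1 * (1 - pred1) * (1.0 + 3.0) ** 2
print(round(post1, 3), round(pred1, 3), round(mean_fc, 2), round(var_fc, 2))
# 0.954 0.768 -2.07 22.98
```

Iterating this one-step function over a whole sample gives the filtered state probabilities discussed below.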

Bayesian updating with two states: example (cont.)

Suppose instead we observe a value yt = +1. Then

p(yt |st = 1) = φ(4/5)/5 = 0.0579
p(yt |st = 2) = φ(0)/2 = 0.1995

P(st = 1|yt ) = 0.0579 × 0.70 / (0.0579 × 0.70 + 0.1995 × 0.30) = 0.4038
P(st = 2|yt ) = 0.1995 × 0.30 / (0.0579 × 0.70 + 0.1995 × 0.30) = 0.5962

Now, we reduce the probability of being in state 1 from 70% to 40%,

while we increase the chance of being in state 2 from 30% to 60%

Our forecasts of being in states 1 and 2 next period are

P(st+1 = 1|yt ) = 0.4038 × 0.8 + 0.5962 × (1 − 0.9) = 0.3827

P(st+1 = 2|yt ) = 0.4038 × (1 − 0.8) + 0.5962 × 0.9 = 0.6173

Timmermann (UCSD) Filtering Winter, 2017 43 / 53

Estimation of Markov switching models

The MS model is neither Gaussian, nor linear: the state st might lead

to changes in regression coefficients and the covariance matrix

Two common estimation methods:

Maximum likelihood estimation (MLE)

Bayesian estimation using Gibbs sampler

Filtered states: P(st = i |It ) : probability of being in state i at time t

given information at time t, It

Smoothed states: P(st = i |IT ) : probability of being in state i at

time t given information at the end of the sample, IT

Choice of number of states can be tricky. We can use AIC or BIC

Timmermann (UCSD) Filtering Winter, 2017 44 / 53

Filtered states (Ang-Timmermann, 2012)

Timmermann (UCSD) Filtering Winter, 2017 45 / 53

Smoothed state probabilities (Ang-Timmermann, 2012)

Timmermann (UCSD) Filtering Winter, 2017 46 / 53

Smoothed state probabilities (Ang-Timmermann)

Timmermann (UCSD) Filtering Winter, 2017 47 / 53

Parameter estimates (Ang-Timmermann, 2012)

yt = µst + φst yt−1 + σst εt , εt ∼ iiN(0, 1)

Timmermann (UCSD) Filtering Winter, 2017 48 / 53

Take-away for MS models

Markov switching models are popular in finance and economics

MS models are easy to interpret economically

Empirically often one state is highly persistent (“normal” state) with

parameters not too far from the average of the series

The other state is often more transitory and captures spells of high

volatility (asset returns) or negative outliers (GDP growth)

Forecasts are easy to compute with MS models

One state often has high volatility – regime switching can be

important for risk management

Try to experiment with the Markov switching and Kalman filter codes

on Ted

Timmermann (UCSD) Filtering Winter, 2017 49 / 53

Estimates, 3-state model for monthly stock returns

P ′ =

0.9881 0.0119 0.000
0.000 0.9197 0.0803
0.8437 0.000 0.1563

µ = (0.0651 −0.1321 0.3756)

σ = (0.0571 0.1697 0.0154)

P : state transition probabilities, µ : means, σ : volatilities

State 1: highly persistent, medium mean, medium volatility

State 2: negative mean, high volatility, medium persistence

State 3: transitory bounce-back state with high mean

Timmermann (UCSD) Filtering Winter, 2017 50 / 53

Smoothed state probabilities, monthly stock returns

[Figure: smoothed state probabilities plotted against time, monthly stock returns, 1930–2010]

Timmermann (UCSD) Filtering Winter, 2017 51 / 53

Fitted versus actual stock returns (3 state model)

[Figure: fitted values vs. actual monthly stock returns, 1920–2020]

Timmermann (UCSD) Filtering Winter, 2017 52 / 53

Volatility of monthly stock returns

[Figure: fitted volatility of monthly stock returns, 1920–2020]

Timmermann (UCSD) Filtering Winter, 2017 53 / 53

Lecture 5: Random walk and spurious correlation

UCSD, Winter 2017

Allan Timmermann1

1UC San Diego

Timmermann (UCSD) Random walk Winter, 2017 1 / 21

1 Random walk model

2 Logs, levels and growth rates

3 Spurious correlation

Timmermann (UCSD) Random walk Winter, 2017 2 / 21

Random walk model

The random walk model is an AR(1) yt = φ1yt−1 + εt with φ1 = 1 :

yt = yt−1 + εt , εt ∼ WN(0, σ2).

This model implies that the change in yt is unpredictable:

∆yt = yt − yt−1 = εt

For example, the level of (log-) stock prices is easy to predict, but not

its change (rate of return for log-prices)

Shocks to the random walk have permanent effects: A one unit shock

moves the series by one unit forever. This is in sharp contrast to a

mean-reverting process such as yt = 0.8yt−1 + εt

Timmermann (UCSD) Random walk Winter, 2017 2 / 21

Random walk model (cont)

The variance of a random walk increases over time so the distribution

of yt changes over time. Suppose that yt started at zero, y0 = 0 :

y1 = y0 + ε1 = ε1

y2 = y1 + ε2 = ε1 + ε2

…

yt = ε1 + ε2 + …+ εt−1 + εt , so

E [yt ] = 0

var(yt ) = var(ε1 + ε2 + …+ εt ) = tσ² ⇒ limt→∞ var(yt ) = ∞

The variance of y grows proportionally with time

A random walk does not revert back to the mean but wanders up and

down at random

Timmermann (UCSD) Random walk Winter, 2017 3 / 21
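The var(yt ) = tσ² property is easy to check by Monte Carlo. A small illustrative sketch in plain Python (fixed seed, numbers are simulation estimates, so only approximately t):

```python
import random
import statistics

random.seed(0)
T, n_paths = 200, 2000
vals_at_50, vals_at_200 = [], []
for _ in range(n_paths):
    y = 0.0
    y50 = 0.0
    for t in range(1, T + 1):
        y += random.gauss(0, 1)  # y_t = y_{t-1} + eps_t, sigma = 1
        if t == 50:
            y50 = y
    vals_at_50.append(y50)
    vals_at_200.append(y)

# cross-sectional variances at t = 50 and t = 200
v50 = statistics.pvariance(vals_at_50)    # roughly t * sigma^2 = 50
v200 = statistics.pvariance(vals_at_200)  # roughly 200
print(round(v50, 1), round(v200, 1))
```

The variance at t = 200 comes out close to four times the variance at t = 50, as the tσ² formula predicts.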

Forecasts from random walk model

Recall that forecasts from the AR(1) process yt = φ1yt−1 + εt ,

εt ∼ WN(0, σ2) are simply

ft+h|t = φ1^h yt

For the random walk model φ1 = 1, so for all forecast horizons, h, the

forecast is simply the current value:

ft+h|t = yt

Forecast of tomorrow = today’s value

The basic random walk model says that the value of the series next

period (given the history of the series) equals its current value plus an

unpredictable change. Random steps, εt , make yt a “random walk”

Timmermann (UCSD) Random walk Winter, 2017 4 / 21

Random walk with a drift

Introduce a non-zero drift term, δ :

yt = δ+ yt−1 + εt , εt ∼ WN(0, σ2).

This is a popular model for the logarithm of stock prices

The drift term, δ, plays the same role as a time trend. Assuming

again that the series started at y0, we have

yt = 2δ+ yt−2 + εt + εt−1

= δt + y0 + ε1 + ε2 + …+ εt−1 + εt , so

E [yt ] = y0 + δt

var(yt ) = tσ² ⇒ limt→∞ var(yt ) = ∞

Timmermann (UCSD) Random walk Winter, 2017 5 / 21

Summary of properties of random walk

Changes in a random walk are unpredictable

Shocks have permanent effects

Variance grows in proportion with the forecast horizon

These points are important for forecasting:

point forecasts never revert to a mean or a trend

since the variance goes to infinity, the width of interval forecasts

increases without bound as the forecast horizon grows. Uncertainty

grows without bounds.

Timmermann (UCSD) Random walk Winter, 2017 6 / 21

Logs, levels and growth rates

Certain transformations of economic variables such as their logarithm

are often easier to model than the “raw” data

If the standard deviation of a time series is proportional to its level,

then the standard deviation of the logarithm of the series is

approximately constant:

Yt = Yt−1 exp(εt ), εt ∼ (0, σ²) ⇔

ln(Yt ) = ln(Yt−1) + εt

The first difference of the log of Yt is ∆ ln(Yt ) = ln(Yt )− ln(Yt−1)

The percentage change in Yt between t − 1 and t is approximately

100∆ ln(Yt ). This can be interpreted as a growth rate

Example: US GDP follows an upward trend. Instead of studying the

level of US GDP, we can study its growth rate which is not trending

Timmermann (UCSD) Random walk Winter, 2017 7 / 21
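A quick numeric check of the 100·∆ ln(Yt ) approximation (illustrative numbers only):

```python
import math

y_prev, y_now = 100.0, 103.0
pct_change = 100 * (y_now - y_prev) / y_prev             # exact growth rate: 3.0%
log_growth = 100 * (math.log(y_now) - math.log(y_prev))  # approximation: about 2.956%
print(pct_change, round(log_growth, 3))
```

The two agree closely for small changes; the approximation deteriorates as the change gets large.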

Unit root processes

Random walk is a special case of a unit root process which has a unit

root in the AR polynomial, i.e.,

(1− L)yt = θ(L)εt ,

We can test for a unit root using an Augmented Dickey Fuller (ADF)

test:

∆yt = α + βyt−1 + ∑_{i=1}^p λi∆yt−i + εt .

Under the null of a unit root, H0 : β = 0. Under the alternative of

stationarity, H1 : β < 0
Timmermann (UCSD) Random walk Winter, 2017 8 / 21
Unit root processes (cont.)
Example: suppose p = 0 (no autoregressive terms for ∆yt) and
β = −0.2. Then
∆yt = yt − yt−1 = α− 0.2yt−1 + εt ⇔
yt = α+ 0.8yt−1 + εt (which is stationary)
If instead β = 0.2, we have
yt − yt−1 = α+ 0.2yt−1 + εt ⇔
yt = α+ 1.2yt−1 + εt (which is explosive)
Test is based on the t-stat of β. Test statistic follows a non-standard
distribution with wider tails than the normal distribution
Timmermann (UCSD) Random walk Winter, 2017 9 / 21
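With p = 0, the ADF regression is just OLS of ∆yt on a constant and yt−1, so the point estimate of β is easy to compute by hand. The sketch below simulates the stationary AR(1) from the example and recovers β ≈ φ1 − 1 = −0.2; it does not compute the non-standard Dickey-Fuller critical values, which is what adftest in Matlab adds.

```python
import random

random.seed(1)
# simulate a stationary AR(1): y_t = 0.8 * y_{t-1} + eps_t
T = 2000
y = [0.0]
for _ in range(T):
    y.append(0.8 * y[-1] + random.gauss(0, 1))

# OLS of dy_t on (constant, y_{t-1}): slope = cov(dy, ylag) / var(ylag)
ylag = y[:-1]
dy = [y[t + 1] - y[t] for t in range(T)]
mx = sum(ylag) / T
md = sum(dy) / T
sxx = sum((x - mx) ** 2 for x in ylag)
sxy = sum((x - mx) * (d - md) for x, d in zip(ylag, dy))
beta = sxy / sxx
print(round(beta, 3))  # close to phi1 - 1 = -0.2
```

A β̂ well below zero is what the stationary alternative predicts; judging its significance still requires the Dickey-Fuller distribution, not the usual t-table.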
Unit root test in matlab
In matlab: adftest
[h,pValue,stat,cValue,reg] = adftest(y)
[h,pvalue,stat,cvalue] = adftest(logprice,’lags’,1,’model’,’AR’);
Timmermann (UCSD) Random walk Winter, 2017 10 / 21
Critical values for Dickey-Fuller test
Timmermann (UCSD) Random walk Winter, 2017 11 / 21
Shanghai SE stock price (monthly, 1991-2014)
t-statistic: 1.0362. p-value: 0.92. Fail to reject null of a unit root.
[Figure: Shanghai SE stock price (log scale), 1990–2015]
Timmermann (UCSD) Random walk Winter, 2017 12 / 21
Changes in Shanghai SE stock price
t-statistic: −11.15. p-value: 0.001. Reject null of a unit root.

[Figure: changes in Shanghai SE stock price, 1990–2015]
Timmermann (UCSD) Random walk Winter, 2017 13 / 21
Spurious correlation
Time series that are trending systematically up or down may appear
to be significantly correlated even though they are completely
independent
Correlation between a city’s ice cream sales and the number of
drownings in city swimming pools: Both peak at the same time even
though there is no causal relationship between the two. In fact, a
heat wave may drive both variables
Dutch statistics reveal a positive correlation between the number of
storks nesting in the spring and the number of human babies born at
that time. Any causal relation?
Cumulative rainfall in Brazil and US stock prices
Timmermann (UCSD) Random walk Winter, 2017 14 / 21
Spurious correlation
Two series with a random walk (unit root) component may appear to
be related even when they are not. Consider an example:
y1t = y1t−1 + ε1t
y2t = y2t−1 + ε2t ,
cov(ε1t , ε2t ) = 0
Regressing one variable on the other y2t = α+ βy1t + ut often leads
to apparently high values of R2 and of the associated t−statistic for
β. Both are unreliable! Solutions:
instead of regressing y1t on y2t in levels, regress ∆y1t on ∆y2t
use cointegration analysis
Timmermann (UCSD) Random walk Winter, 2017 15 / 21
Simulations of stationary processes
1,000 simulations (T = 500) of uncorrelated stationary AR(1)
processes:
y1t = 0.5y1t−1 + ε1t
y2t = 0.5y2t−1 + ε2t
cov(ε1t , ε2t ) = 0
The two time series y1t and y2t are independent by construction
Next, estimate a regression
y1t = β0 + β1y2t + ut
What do you expect to find?
Timmermann (UCSD) Random walk Winter, 2017 16 / 21
Simulation from stationary AR(1) process
Average t−stat: 1.02. Rejection rate: 5.7%. Average R2 : 0.003
[Figure: distribution of t-stats and of R-squared across simulations, stationary AR(1)]
Timmermann (UCSD) Random walk Winter, 2017 17 / 21
Spurious correlation: simulations
1,000 simulations of uncorrelated random walk processes:
y1t = y1t−1 + ε1t
y2t = y2t−1 + ε2t
cov(ε1t , ε2t ) = 0
Then estimate regression
y1t = β0 + β1y2t + ut
What do we find now?
Timmermann (UCSD) Random walk Winter, 2017 18 / 21
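Both simulation designs can be replicated in a few lines. The plain-Python sketch below uses fewer replications and a shorter sample than the slides (300 simulations, T = 200), so the exact numbers differ, but the pattern is the same: a rejection rate near the nominal size for the stationary AR(1) pairs, and massive over-rejection for the independent random walks.

```python
import math
import random

random.seed(2)

def simulate(phi, T):
    """Two independent AR(1) series; phi = 1 gives random walks."""
    y1, y2 = [0.0], [0.0]
    for _ in range(T):
        y1.append(phi * y1[-1] + random.gauss(0, 1))
        y2.append(phi * y2[-1] + random.gauss(0, 1))
    return y1[1:], y2[1:]

def t_stat(y, x):
    """Conventional OLS t-statistic for the slope in y = b0 + b1*x + u."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((v - mx) ** 2 for v in x)
    b1 = sum((v - mx) * (w - my) for v, w in zip(x, y)) / sxx
    b0 = my - b1 * mx
    rss = sum((w - b0 - b1 * v) ** 2 for v, w in zip(x, y))
    return b1 / math.sqrt(rss / (n - 2) / sxx)

def rejection_rate(phi, n_sims=300, T=200):
    """Fraction of simulations where |t| > 1.96 despite independence."""
    rejections = sum(abs(t_stat(*simulate(phi, T))) > 1.96
                     for _ in range(n_sims))
    return rejections / n_sims

rate_stationary = rejection_rate(0.5)  # modest over-rejection at most
rate_rw = rejection_rate(1.0)          # spurious: rejects far too often
print(rate_stationary, rate_rw)
```

Re-running the same exercise on first differences of the random walks brings the rejection rate back down toward the nominal level, which is the remedy discussed below.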
Spurious correlation: simulation from random walk
Average t−stat: 13.4. Rejection rate: 44%. Average R2 : 0.25
[Figure: distribution of t-stats and of R-squared across simulations, random walk (levels regression)]
Timmermann (UCSD) Random walk Winter, 2017 19 / 21
Spurious correlation: dealing with the problem
1,000 simulations of uncorrelated random walk processes:
y1t = y1t−1 + ε1t
y2t = y2t−1 + ε2t
cov(ε1t , ε2t ) = 0
Next, estimate regression on first-differenced series:
∆y1t = β0 + β1∆y2t + ut
Timmermann (UCSD) Random walk Winter, 2017 20 / 21
Spurious correlation: simulation from random walk
Average t−stat: 0.78. Rejection rate: 1.5%. Average R2 : 0.002
[Figure: distribution of t-stats and of R-squared across simulations, random walk, first differences]
Timmermann (UCSD) Random walk Winter, 2017 21 / 21
Lecture 5: Vector Autoregressions and Factor Models
UCSD, Winter 2017
Allan Timmermann1
1UC San Diego
Timmermann (UCSD) VARs and Factors Winter, 2017 1 / 41
1 Vector Autoregressions
2 Forecasting with VARs
Present value example
Impulse response analysis
3 Cointegration
4 Forecasting with Factor Models
5 Forecasting with Panel Data
Timmermann (UCSD) VARs and Factors Winter, 2017 2 / 41
From univariate to multivariate models
Often information other than a variable’s own past values is relevant for
forecasting
Think of forecasting Hong Kong house prices
exchange rate, GDP growth, population growth, interest rates might be
relevant
past house prices in Hong Kong also matter (AR model)
In general we can get better models by using richer information sets
How do we incorporate additional information sources?
Vector Auto Regressions (VARs) (small set of predictors)
Factor models (many possible predictors)
Timmermann (UCSD) VARs and Factors Winter, 2017 2 / 41
Vector Auto Regressions (VARs)
Vector autoregressions generalize univariate autoregressions to the
multivariate case by letting yt be an n× 1 vector and so extend the
information set to It = {yit , yit−1, ..., yi1} for i = 1, ..., n
Many of the properties of VARs are simple multivariate generalizations of the
univariate AR model
The Wold representation theorem also extends to the multivariate case
and hence VARs and VARMA models can be used to approximate covariance
stationary multivariate (vector) processes
VARMA: Vector AutoRegressive Moving Average
Timmermann (UCSD) VARs and Factors Winter, 2017 3 / 41
VARs: definition
A pth order VAR for an n× 1 vector yt takes the form:
yt = c + A1yt−1 + A2yt−2 + ...+ Apyt−p + εt , εt ∼ WN(0,Σ)
Ai : n× n matrix of autoregressive coefficients for i = 1, ..., p :

Ai =
Ai11 Ai12 · · · Ai1n
Ai21 Ai22 · · · Ai2n
...
Ain1 · · · Ainn
εt : n× 1 vector of innovations. These can be correlated across variables
VARs have the same regressors appearing in each equation
Number of parameters: n²p (from A1, ..., Ap) + n (from c) + n(n + 1)/2 (from Σ)
Timmermann (UCSD) VARs and Factors Winter, 2017 4 / 41
Why do we need VARs for forecasting?
Consider a VAR with two variables: yt = (y1t , y2t )′

IT = {y1T , y1T−1, ..., y11, y2T , y2T−1, ..., y21}
Suppose y1 depends on past values of y2. Forecasting y1 one step ahead
(y1T+1) given IT is possible if we know today’s values, y1T , y2T
Suppose we want to predict y1 two steps ahead (y1T+2)
Since y1T+2 depends on y2T+1, we need a forecast of y2T+1, given IT
We need a joint model for predicting y1 and y2 given their past values
This is provided by the VAR
Timmermann (UCSD) VARs and Factors Winter, 2017 5 / 41
Example: Bivariate VAR(1)
Joint model for the dynamics in y1t and y2t :
y1t = φ11y1t−1 + φ12y2t−1 + ε1t , ε1t ∼ WN(0, σ1²)
y2t = φ21y1t−1 + φ22y2t−1 + ε2t , ε2t ∼ WN(0, σ2²)
Each variable depends on one lag of the other variable and one lag of itself
φ12 measures the impact of the past value of y2, y2t−1, on current y1t .
When φ12 6= 0, y2t−1 affects y1t
φ21 measures the impact of the past value of y1, y1t−1, on current y2t .
When φ21 6= 0, y1t−1 affects y2t
The two variables can also be contemporaneously correlated if the
innovations ε1t and ε2t are correlated and are influenced by common shocks:
Cov(ε1t , ε2t ) = σ12
If σ12 6= 0, shocks to y1t and y2t are contemporaneously correlated
Timmermann (UCSD) VARs and Factors Winter, 2017 6 / 41
Forecasting with Bivariate VAR I
One-step-ahead forecast given IT = {y1T , y2T , ..., y11, y21} :
f1T+1|T = φ11y1T + φ12y2T
f2T+1|T = φ21y1T + φ22y2T
To compute two-step-ahead forecasts, use the chain rule:
f1T+2|T = φ11f1T+1|T + φ12f2T+1|T
f2T+2|T = φ21f1T+1|T + φ22f2T+1|T
Using the expressions for f1T+1|T and f2T+1|T , we have
f1T+2|T = φ11(φ11y1T + φ12y2T ) + φ12(φ21y1T + φ22y2T )
f2T+2|T = φ21(φ11y1T + φ12y2T ) + φ22(φ21y1T + φ22y2T )
Timmermann (UCSD) VARs and Factors Winter, 2017 7 / 41
Forecasting with Bivariate VAR II
Collecting terms, we have
f1T+2|T = (φ11² + φ12φ21)y1T + φ12(φ11 + φ22)y2T
f2T+2|T = φ21(φ11 + φ22)y1T + (φ12φ21 + φ22²)y2T
To forecast y1 two steps ahead we need to forecast both y1 and y2 one step
ahead.
This can only be done if we have forecasting models for both y1 and y2
Therefore, we need to use a VAR for multi-step forecasting of time series that
depend on other variables
Timmermann (UCSD) VARs and Factors Winter, 2017 8 / 41
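The chain rule above is just applying the coefficient matrix twice, so it can be checked numerically; the φ values below are made up for illustration only.

```python
# made-up stable bivariate VAR(1) coefficients, illustrative only
phi11, phi12, phi21, phi22 = 0.5, 0.2, 0.1, 0.4
y1T, y2T = 1.0, 2.0

# one-step-ahead forecasts
f1_1 = phi11 * y1T + phi12 * y2T
f2_1 = phi21 * y1T + phi22 * y2T

# two-step-ahead via the chain rule: feed the one-step forecasts back in
f1_2 = phi11 * f1_1 + phi12 * f2_1
f2_2 = phi21 * f1_1 + phi22 * f2_1

# same thing via the collected-terms expressions
f1_2_direct = (phi11 ** 2 + phi12 * phi21) * y1T + phi12 * (phi11 + phi22) * y2T
f2_2_direct = phi21 * (phi11 + phi22) * y1T + (phi12 * phi21 + phi22 ** 2) * y2T

print(f1_2, f1_2_direct)  # identical
```

The two routes agree exactly, which is the point: multi-step VAR forecasts only require iterating the one-step map.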
Predictive (Granger) causality
Clive Granger (1969) used a variable’s predictive content to develop a
definition of causality that depends on the conditional distribution of the
predicted variable, given other information
Statistical concept of causality closely related to forecasting
Basic principles:
cause should precede (come before) effect
a causal series should contain information useful for forecasting that is not
available from the other series (including their past)
Granger causality in the bivariate VAR:
If φ12 = 0, then y2 does not Granger cause y1 : past values of y2 do not
improve our predictions of future values of y1
If φ21 = 0, then y1 does not Granger cause y2 : past values of y1 do not
improve our predictions of future values of y2
For all other values of φ12 and φ21, y1 will Granger cause y2 and/or y2 will
Granger cause y1
Include more lags?
Timmermann (UCSD) VARs and Factors Winter, 2017 9 / 41
Granger causality tests
Each variable predicts every other variable in the general VAR
In VARs with many variables, it is quite likely that some variables are not
useful for forecasting all the other variables
Granger causality findings might be overturned by adding more variables to
the model – y2t may simply predict y1t+1 because other information (y3t
which causes both y1t+1 and y2t+1) has been left out (omitted variable)
Timmermann (UCSD) VARs and Factors Winter, 2017 10 / 41
Estimation of VARs
In sufficiently large samples and under conventional assumptions, the least
squares estimates of (A1, ...,Ap) will be normally distributed around the true
parameter value
Standard errors for each regression are computed using the OLS estimates
OLS estimation is asymptotically efficient
OLS estimates are generally biased in small samples
Timmermann (UCSD) VARs and Factors Winter, 2017 11 / 41
VARs in matlab
5-variable sample code on Triton Ed: varExample.m
model = vgxset('n',5,'nAR',nlags,'Constant',true); % set up the VAR model
[modelEstimate,modelStdEr,LL] = vgxvarx(model,Y(1:estimationEnd,:)); % estimate the VAR
numParams = vgxcount(model); % number of parameters
[aicForecast,aicForecastCov] = vgxpred(modelEstimates{aicLags,1},forecastHorizon); % forecast with VAR
Timmermann (UCSD) VARs and Factors Winter, 2017 12 / 41
Difficulties with VARs
VARs initially became a popular forecasting tool because of their relative
simplicity in terms of which choices need to be made by the forecaster
When estimating a VAR by classical methods, only two choices need to be
made to construct forecasts
which variables to include (choice of y1, ..., yn)
how many lags of the variables to include (choice of p)
Risk of overparameterization of VARs is high
The general VAR has n(np + 1) mean parameters plus another n(n+ 1)/2
covariance parameters
For n = 5, p = 4 this is 105 mean parameters and 15 covariance parameters
Bayesian procedures reduce parameter estimation error by shrinking the
parameter estimates towards some target value
Timmermann (UCSD) VARs and Factors Winter, 2017 13 / 41
Choice of lag length
Typically we search over VARs with different numbers of lags, p
With a vector of constants, n variables, p lags, and T observations, the BIC
and AIC information criteria take the forms
BIC (p) = ln |Σ̂p |+ n(np + 1) ln(T )/T
AIC (p) = ln |Σ̂p |+ n(np + 1) · 2/T

Σ̂p = T^{−1} ∑_{t=1}^T ε̂t ε̂t′ is the estimate of the residual covariance matrix
The objective is to identify the model (indexed by p) that minimizes the
information criterion
The sample code varExample.m chooses the VAR, using up to 12 lags
(maxLags)
Timmermann (UCSD) VARs and Factors Winter, 2017 14 / 41
Multi-period forecasts
VARs are ideally designed for generating multi-period forecasts. For the
VAR(1) specification
yt+1 = Ayt + εt+1, εt+1 ∼ WN(0,Σ)
the h−step-ahead value can be written
yt+h = A^h yt + ∑_{i=1}^h A^{h−i} εt+i

The forecast under MSE loss is then

ft+h|t = A^h yt
Just like in the case with an AR(1) model!
Timmermann (UCSD) VARs and Factors Winter, 2017 15 / 41
Multi-period forecasts: 4-variable example
Forecasts using 4-variable VAR with quarterly inflation rate, unemployment
rate, GDP growth and 10-year yield
vgxplot(modelEstimates,Y,aicForecast,aicForecastCov); % plot forecast
[Figure: multi-period VAR forecasts with 1-σ bands for inflation, GDP growth, unemployment and the 10-year Treasury bond rate]
Timmermann (UCSD) VARs and Factors Winter, 2017 16 / 41
Multi-period forecasts of 10-year yield (cont.)
Reserve last 5-years of data for forecast evaluation
[Figure: actual 10-year yield, 2009.5–2014, vs. forecasts (in percentage points) from the AIC- and BIC-selected VARs and an AR(4)]
Timmermann (UCSD) VARs and Factors Winter, 2017 17 / 41
Example: Campbell-Shiller present value model I
Campbell and Shiller (1988) express the continuously compounded stock
return in period t + 1, rt+1, as an approximate linear function of the
logarithms of current and future stock prices, pt , pt+1 and dividends, dt+1:
rt+1 = k + ρpt+1 + (1− ρ)dt+1 − pt
ρ is a scalar close to (but below) one, and k is a constant
Rearranging, we get a recursive equation for log-prices:
pt = k + ρpt+1 + (1− ρ)dt+1 − rt+1
Iterating forward and taking expectations conditional on current information,
we have
pt = k/(1 − ρ) + (1 − ρ)Et [∑_{j=0}^∞ ρ^j dt+1+j ] − Et [∑_{j=0}^∞ ρ^j rt+1+j ]
Timmermann (UCSD) VARs and Factors Winter, 2017 18 / 41
Example: Campbell-Shiller present value model II
Stock prices depend on an infinite sum of expected future dividends and
expected returns
Key to the present value model is therefore how such expectations are formed
VARs can address this question since they can be used to generate
multi-period forecasts
To illustrate this point, let zt be a vector of state variables with z1t = pt ,
z2t = dt , z3t = xt ; xt are predictor variables
Define selection vectors e1 = (1 0 0)′, e2 = (0 1 0)′, e3 = (0 0 1)′ so

pt = e1′zt , dt = e2′zt , xt = e3′zt

Suppose that zt follows a VAR(1):

zt+1 = Azt + εt+1 ⇒ Et [zt+j ] = A^j zt
Timmermann (UCSD) VARs and Factors Winter, 2017 19 / 41
Example: Campbell-Shiller present value model III
If expected returns Et [rt+1+j ] are constant and stock prices only move due
to variation in dividends, we have (ignoring the constant and assuming that
we can invert (I − ρA))
pt = (1 − ρ)Et [∑_{j=0}^∞ ρ^j dt+1+j ] = (1 − ρ)e2′ ∑_{j=0}^∞ ρ^j A^{j+1} zt = (1 − ρ)e2′A(I − ρA)^{−1} zt
Nice and simple expression for the present value stock price!
The VAR gives us a simple way to compute expected future dividends
Et [dt+1+j ] for all future points in time given the current information in zt
Can you suggest other ways of doing this?
Timmermann (UCSD) VARs and Factors Winter, 2017 20 / 41
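The matrix-geometric identity behind the last equality, ∑_{j=0}^∞ ρ^j A^{j+1} = A(I − ρA)^{−1}, can be verified numerically by truncating the sum; the 3×3 matrix A below is made up for illustration (stable, so the sum converges).

```python
# illustrative stable 3x3 VAR coefficient matrix (eigenvalues inside unit circle)
A = [[0.5, 0.1, 0.0],
     [0.2, 0.3, 0.1],
     [0.0, 0.1, 0.4]]
rho = 0.96

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def madd(X, Y, s=1.0):
    """Return X + s*Y elementwise."""
    return [[X[i][j] + s * Y[i][j] for j in range(3)] for i in range(3)]

I = [[float(i == j) for j in range(3)] for i in range(3)]

# truncated sum S = sum_{j=0}^{N} rho^j * A^{j+1}
S = [[0.0] * 3 for _ in range(3)]
Apow = A  # holds A^{j+1}, starting at j = 0
for j in range(500):
    S = madd(S, Apow, rho ** j)
    Apow = matmul(Apow, A)

# identity check: S (I - rho*A) should equal A up to truncation error
residual = madd(matmul(S, madd(I, A, -rho)), A, -1.0)
max_err = max(abs(residual[i][j]) for i in range(3) for j in range(3))
print(max_err)  # tiny
```

The residual is at machine-precision level, confirming that the infinite discounted sum of expected dividends collapses to the closed-form matrix expression.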
Impulse response analysis
Stationary vector autoregressions (VARs) can equivalently be expressed as
vector moving average (VMA) processes:
yt = εt + θ1εt−1 + θ2εt−2 + ...
Impulse response analysis shows how variable i in a VAR is affected by a
shock to variable j at different horizons:
∂yi,t+1/∂εjt : 1-period impulse
∂yi,t+2/∂εjt : 2-period impulse
∂yi,t+h/∂εjt : h-period impulse
Suppose we find out that variable j is higher than we expected (by one unit).
Impulse responses show how much we revise our forecasts of future values of
yit+h due to this information
How does an interest rate shock affect future unemployment and inflation?
Timmermann (UCSD) VARs and Factors Winter, 2017 21 / 41
Impulse response analysis in matlab
Four-variable model: inflation, GDP growth, unemployment, 10-year Treasury
bond rate
impulseHorizon = 24; % 24 months out
W0 = zeros(impulseHorizon,4); % baseline scenario of zero shocks
W1 = W0;
W1(1,4) = sqrt(modelEstimates{aicLags,1}.Q(4,4)); % one standard deviation shock to variable number four (interest rate)
Yimpulse = vgxproc(modelEstimates{aicLags,1},W1,[],Y(1:estimationEnd,:)); % impulse response
Ynoimpulse = vgxproc(modelEstimates{aicLags,1},W0,[],Y(1:estimationEnd,:)); % baseline path
Timmermann (UCSD) VARs and Factors Winter, 2017 22 / 41
Impulse response analysis: shock to 10-year yield
[Figure: impulse responses (% change) over a 24-period horizon for inflation, GDP growth, unemployment and the 10-year Treasury bond rate]
Timmermann (UCSD) VARs and Factors Winter, 2017 23 / 41
Nobel Prize Award, 2003 press release
“Most macroeconomic time series follow a stochastic trend, so that a temporary
disturbance in, say, GDP has a long-lasting effect. These time series are called
nonstationary; they differ from stationary series which do not grow over time, but
fluctuate around a given value. Clive Granger demonstrated that the statistical
methods used for stationary time series could yield wholly misleading results when
applied to the analysis of nonstationary data. His significant discovery was that
specific combinations of nonstationary time series may exhibit stationarity, thereby
allowing for correct statistical inference. Granger called this phenomenon
cointegration. He developed methods that have become invaluable in systems
where short-run dynamics are affected by large random disturbances and long-run
dynamics are restricted by economic equilibrium relationships. Examples include
the relations between wealth and consumption, exchange rates and price levels,
and short and long-term interest rates.”
This work was done at UCSD
Timmermann (UCSD) VARs and Factors Winter, 2017 24 / 41
Cointegration
Consider the variables
xt = xt−1 + εt x follows a random walk (nonstationary)
y1t = xt + u1t y1 is a random walk plus noise
y2t = xt + u2t y2 is a random walk plus noise
εt , u1t , u2t are all white noise (or at least stationary)
xt is a unit root process: (1− L)xt = εt , so L = 1 is a "root"
y1 and y2 behave like random walks. However, their difference
y1t − y2t = u1t − u2t
is stationary (mean-reverting)
Over the long run, y1 − y2 will revert to its equilibrium value of zero
Timmermann (UCSD) VARs and Factors Winter, 2017 25 / 41
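This construction is easy to simulate: y1 and y2 each inherit the stochastic trend of xt, but y1t − y2t = u1t − u2t is stationary with a fixed variance. An illustrative sketch (fixed seed, numbers approximate):

```python
import random
import statistics

random.seed(3)
T, n_paths = 200, 1000
levels, diffs = [], []
for _ in range(n_paths):
    x = 0.0
    for _ in range(T):
        x += random.gauss(0, 1)          # common random walk x_t
    u1, u2 = random.gauss(0, 1), random.gauss(0, 1)
    y1, y2 = x + u1, x + u2              # cointegrated pair: same trend
    levels.append(y1)
    diffs.append(y1 - y2)

var_level = statistics.pvariance(levels)  # grows with T (about T + 1 here)
var_diff = statistics.pvariance(diffs)    # bounded (about 2)
print(round(var_level, 1), round(var_diff, 2))
```

The level of y1 wanders without bound while the spread y1 − y2 stays tightly anchored, which is exactly why long-horizon forecasts of the spread are feasible when forecasts of the levels are not.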
Cointegration (cont.)
Future levels of random walk variables are difficult to predict
It is much easier to predict differences between two sets of cointegrated
variables
Example: Forecasting the level of Brent or WTI (West Texas Intermediate)
crude oil prices five years from now is difficult
Forecasting the difference between these prices (or the logs of their prices) is
likely to be easier
In practice we often study the logarithm of prices (instead of their level), so
percentage differences cannot become too large
Timmermann (UCSD) VARs and Factors Winter, 2017 26 / 41
Cointegration (cont.)
If two variables are cointegrated, they must both individually have a
stochastic trend (follow a unit root process) and their individual paths can
wander arbitrarily far away from their current values
There exists a linear combination that ties the two variables closely together
Future values cannot deviate too far away from this equilibrium relation
Granger representation theorem: Equilibrium errors (deviations from the
cointegrating relationship) can be used to predict future changes
Examples of possible cointegrated variables:
Oil prices in Shanghai and Hong Kong: if they differ by too much, there is an
arbitrage opportunity
Long and short interest rates
Baidu and Alibaba stock prices (pairs trading)
House prices in two neighboring cities
Chinese A and H share prices for same company. Arbitrage opportunities?
Timmermann (UCSD) VARs and Factors Winter, 2017 27 / 41
Vector Error Correction Models (VECM)
Vector error correction models (VECMs) can be used to analyze VARs with
nonstationary variables that are cointegrated
Cointegration relation restricts the long-run behavior of the variables so they
converge to their cointegrating relationship (long-run equilibrium)
Cointegration term is called the error-correction term
This measures the deviation from the equilibrium and allows for short-run
predictability
In the long-run equilibrium, the error correction term equals zero
Timmermann (UCSD) VARs and Factors Winter, 2017 28 / 41
Vector Error Correction Models (cont.)
VECM for changes in two variables, y1t , y2t with cointegrating equation
y2t = βy1t and lagged error correction term (y2t−1 − βy1t−1) :
∆y1t = α1(y2t−1 − βy1t−1) + λ1∆y1t−1 + ε1t
∆y2t = α2(y2t−1 − βy1t−1) + λ2∆y2t−1 + ε2t

where (y2t−1 − βy1t−1) is the lagged error correction term
In the short run y1 and y2 can deviate from the equilibrium y2t = βy1t
Lagged error correction term (y2t−1 − βy1t−1) pulls the variables back towards
their equilibrium
α1 and α2 measure the speed of adjustment of y1 and y2 towards equilibrium
Larger values of α1 and α2 mean faster adjustment
Timmermann (UCSD) VARs and Factors Winter, 2017 29 / 41
Vector Error Correction Models (cont.)
In many applications (particularly with variables in logs), β = 1. Then a
forecasting model for the changes ∆y1t and ∆y2t could take the form
∆y1t = c1 + ∑_{i=1}^p λ1i∆y1t−i + α1(y2t−1 − y1t−1) + ε1t
∆y2t = c2 + ∑_{i=1}^p λ2i∆y2t−i + α2(y2t−1 − y1t−1) + ε2t

(p AR lags in each equation, plus the error correction term)
This can be estimated by OLS since you know the cointegrating coefficient,
β = 1
Include more lags of the error correction term (y2t−1 − y1t−1)? Adjustments
may be slow
Timmermann (UCSD) VARs and Factors Winter, 2017 30 / 41
House prices in San Diego and San Francisco
[Figure: home price indices for San Diego (SD) and San Francisco (SF), 1990–2010]
Timmermann (UCSD) VARs and Factors Winter, 2017 31 / 41
Simple test for cointegration
Regress San Diego house prices on San Francisco house prices and test if the
residuals are non-stationary
use logarithm of prices (?)
Null hypothesis is that there is no cointegration (so there is no linear
combination of the two prices that is stationary)
If you reject the null hypothesis (get a low p-value), this means that the
series are cointegrated
If you don’t reject the null hypothesis (high p-value), the series are not
cointegrated
Often the test has low power (fails to reject even when the series are
cointegrated)
Timmermann (UCSD) VARs and Factors Winter, 2017 32 / 41
Test for cointegration in matlab
See VecmExample.m file on Triton Ed
In matlab: egcitest : Engle-Granger cointegration test
[h, pValue, stat, cValue] = egcitest(Y )
"Engle-Granger tests assess the null hypothesis of no cointegration among
the time series in Y. The test regresses Y(:,1) on Y(:,2:end), then tests the
residuals for a unit root.
Values of h equal to 1 (true) indicate rejection of the null in favor of the
alternative of cointegration. Values of h equal to 0 (false) indicate a failure
to reject the null."
p−value of test for SD and SF house prices: 0.9351. We fail to reject the
null that the house prices are not cointegrated. Why?
Timmermann (UCSD) VARs and Factors Winter, 2017 33 / 41
House prices in San Diego and San Francisco
[Figure: estimated cointegrating relation between SD and SF house prices, 1990–2010]
Timmermann (UCSD) VARs and Factors Winter, 2017 34 / 41
Forecasting with Factor models I
Suppose we have a very large set of predictor variables, xit , i = 1, ...n, where
n could be in the hundreds or thousands
The simplest forecasting approach would be to consider a linear model with
all predictors included:
yt+1 = α + ∑_{i=1}^{n} βi xit + φ1yt + εyt+1
This model can be estimated by OLS, assuming that the total number of
parameters, n+ 2, is small relative to the length of the time series, T
Often n > T and so linear regression methods are not feasible

Instead it is commonly assumed that the x-variables only affect y through a small set of r common factors, Ft = (F1t, ..., Frt)′, where r is much smaller than n (typically less than ten)

Timmermann (UCSD) VARs and Factors Winter, 2017 35 / 41

Forecasting with Factor models II

This suggests using a common factor forecasting model of the form

yt+1 = α + ∑_{i=1}^{r} βiF Fit + φ1yt + εyt+1

Suppose that n = 200 and r = 3 common factors

The general forecasting model requires fitting 202 mean parameters:

α, β1, …, β200, φ1

The simple factor model only requires estimating 5 mean parameters:

α, β1F , β2F , β3F , φ1

Timmermann (UCSD) VARs and Factors Winter, 2017 36 / 41

Forecasting with Factor models

The identity of the common factors is usually unknown and so must be

extracted from the data

Forecasting with common factor models can therefore be thought of as a

two-step process

1 Extract estimates of the common factors from the data

2 Use the factors, along with past values of the predicted variable, to select and

estimate a forecasting model

Suppose a set of factor estimates, F̂it , has been extracted. These are then

used along with past values of y to estimate a model and generate forecasts

of the form:

ŷt+1|t = α̂ + ∑_{i=1}^{r} β̂iF F̂it + φ̂1yt

Common factors can be extracted using the principal components method

Timmermann (UCSD) VARs and Factors Winter, 2017 37 / 41
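The two-step process above can be sketched in a few lines of Python. Everything below is illustrative: synthetic data stand in for the macro predictors, factors are extracted by an SVD of the standardized predictors (the principal components method), and the forecasting regression is estimated by OLS.

```python
import numpy as np

rng = np.random.default_rng(1)
T, n, r = 200, 50, 3
F = rng.normal(size=(T, r))                       # unobserved common factors
X = F @ rng.normal(size=(r, n)) + 0.5 * rng.normal(size=(T, n))
y = np.empty(T)
y[0] = 0.0
y[1:] = F[:-1] @ np.array([1.0, -0.5, 0.25]) + 0.1 * rng.normal(size=T - 1)

# Step 1: extract r principal components from the standardized predictors
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
Fhat = U[:, :r] * S[:r]                           # estimated factors F-hat

# Step 2: regress y_{t+1} on the estimated factors and lagged y, then forecast
W = np.column_stack([np.ones(T - 1), Fhat[:-1], y[:-1]])
coef, *_ = np.linalg.lstsq(W, y[1:], rcond=None)
forecast = np.concatenate([[1.0], Fhat[-1], [y[-1]]]) @ coef   # yhat_{T+1|T}
print(f"one-step forecast: {forecast:.3f}")
```

The estimated factors are only identified up to a rotation of the true ones, but the fitted regression still recovers most of the predictable variation in y.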

Principal components

Wikipedia: “Principal component analysis (PCA) is a statistical procedure

that uses an orthogonal transformation to convert a set of observations of

possibly correlated variables into a set of values of linearly uncorrelated

variables called principal components. The number of principal components is

less than or equal to the number of original variables. This transformation is

defined in such a way that the first principal component has the largest

possible variance (that is, accounts for as much of the variability in the data

as possible), and each succeeding component in turn has the highest variance

possible under the constraint that it is orthogonal to (i.e., uncorrelated with)

the preceding components. The principal components are orthogonal because

they are the eigenvectors of the covariance matrix, which is symmetric. PCA

is sensitive to the relative scaling of the original variables.”

Timmermann (UCSD) VARs and Factors Winter, 2017 38 / 41

Empirical example

Data set with n = 132 predictor variables

Available in macro_raw_data.xlsx on Triton Ed. Uses data from Sydney

Ludvigson’s NYU website

Data series have to be transformed (e.g., from levels to growth rates) before

they are used to form principal components

Extract r = 8 common factors using principal components methods

Timmermann (UCSD) VARs and Factors Winter, 2017 39 / 41

Empirical example (cont.): 8 principal components

[Figure: time-series plots of the eight extracted principal components, PC 1 through PC 8]

Timmermann (UCSD) VARs and Factors Winter, 2017 40 / 41

Forecasting with panel data I

Forecasting methods can also be applied to cross-sections or panel data

Key requirement is that the predictors are predetermined in time. For

example, we could build a forecasting model for a large cross-section of

credit-card holders using data on household characteristics, past payment

records etc.

The implicit time dimension is that we know whether a payment in the data

turned out fraudulent

Panel regressions take the form

yit = αi + λt + X′it β + uit , i = 1, ..., n, t = 1, ..., T

αi : fixed effect (e.g., firm, stock, or country level)

λt : time fixed effect

How do we predict λt+1?

Panel models can be estimated using regression methods

Do slope coefficients β vary across units (βi )?

bias-variance trade-off

Timmermann (UCSD) VARs and Factors Winter, 2017 41 / 41

Lecture 7: Event, Density and Volatility Forecasting

UCSD, Winter 2017

Allan Timmermann1

1UC San Diego

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 1 / 57

1 Event forecasting

2 Point, interval and density forecasts

Location-Scale Models of Density Forecasts

GARCH Models

Realized Volatility Measures

3 Interval and Density Forecasts

Mean reverting processes

Random walk model

Alternative Distribution Models

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 2 / 57

Event forecasting

Discrete events are important in economics and finance

Mergers & Acquisitions – do they happen (yes = 1) or not (no = 0)?

Will a credit card transaction be fraudulent (yes = 1, no = 0)?

Will Europe enter into a recession in 2017 (yes = 1, no = 0)?

What will my course grade be? A, B, C

Change in Fed funds rate is usually in increments of 25 bps or zero. Create

bins of 0 = no change, 1 = 25 bp change, 2 = 50 bp change, etc.

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 2 / 57

University of Iowa Electronic markets: value of contract on

Republican presidential nominee

Contracts trading for a total exposure of $500 with a $1 payoff on each

contract

y = 1 : you get paid one dollar if Trump wins the Republican nomination

y = 0 : you get nothing if Trump fails to win the Republican nomination

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 3 / 57

University of Iowa Electronic markets: Democrat vs

republican win

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 4 / 57

Limited dependent variables

Limited dependent variables have a restricted range of values (bins)

Example: A binary variable takes only two values: y = 1 or y = 0

In a binary response model, interest lies in the response probability given

some predictor variables x1t , …, xkt :

P(yt+1 = 1|x1t , x2t , .., xkt )

Example: what is the probability that the Fed will raise interest rates by more

than 75 bps in 2017 given the current level of inflation, changes in oil prices,

bank lending, unemployment rate and past interest rate decisions?

Suppose y is a binary variable taking values of zero or one

E [yt+1 |xt ] = P(yt+1 = 1|xt )× 1+ P(yt+1 = 0|xt )× 0 = P(yt+1 = 1|xt )

E [.] : Expectation. P(.) : Probability

The probability of “success” (yt+1 = 1) equals the expected value of yt+1

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 5 / 57

Linear probability model

Linear probability model:

P(yt+1 = 1|x1t , .., xkt ) = β0 + β1x1t + …+ βk xkt

x1t , …, xkt : predictor variables

In the linear probability model, βj measures the change in the probability of

success when xjt changes, holding other variables constant:

∆P(yt+1 = 1|x1t , ..,∆xjt , …, xkt ) = βj∆xjt

Problems with linear probability model:

Probabilities can be bigger than one or less than zero

Is the effect of x linear? What if you are close to a probability of zero or one?

Often the linear model gives a good first idea of the slope coefficient βj

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 6 / 57

Linear probability model: forecasts outside [0,1]

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 7 / 57

Binary response models

To address the limitations of the linear probability model, consider a class of

binary response models of the form

P(yt+1 = 1|x1t , x2t , …, xkt ) = G (β0 + β1x1t + ..+ βk xkt )

for functions G (.) satisfying

0 ≤ G (.) ≤ 1

Probabilities are now guaranteed to fall between zero and one

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 8 / 57

Logit and Probit models

Two popular choices for G (.)

Logit model:

G (x) =

exp(x)

1+ exp(x)

Probit model:

G (x) = Φ(x) =

∫ x

−∞

φ(z)dz , φ(z) = (2π)−1/2 exp(−z2/2)

Φ(x) is the standard normal cumulative distribution function (CDF)

Logit and probit functions are increasing and steepest at x = 0

G (x)→ 0 as x → −∞, and G (x)→ 1 as x → ∞

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 9 / 57

Logit and Probit models

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 10 / 57

Logit, probit and LPM in matlab

binaryResponseExample.m on Triton Ed

lpmBeta = [ones(T,1) X]\y; % OLS estimates for linear probability model

probitBeta = glmfit(X,y,’binomial’,’link’,’probit’); % Probit model

logitBeta = glmfit(X,y,’binomial’,’link’,’logit’); % Logit model

lpmFit = [ones(T,1) X]*lpmBeta; % Calculate fitted values

probitFit = glmval(probitBeta,X,’probit’); % fitted values, probit

logitFit = glmval(logitBeta,X,’logit’); % fitted values, logit

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 11 / 57
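For readers working outside matlab, the logit fit behind glmfit can be sketched directly with Newton-Raphson on the log-likelihood. This is an illustrative sketch on synthetic data, not the course data set; the Newton iteration here is the standard maximum-likelihood recipe for the logit.

```python
import numpy as np

def logit_fit(X, y, iters=25):
    """Maximum-likelihood logit fit: G(x) = exp(x) / (1 + exp(x))."""
    Xc = np.column_stack([np.ones(len(y)), X])   # add intercept
    beta = np.zeros(Xc.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xc @ beta))     # fitted probabilities
        W = p * (1 - p)
        grad = Xc.T @ (y - p)                    # score vector
        hess = (Xc * W[:, None]).T @ Xc          # information matrix
        beta += np.linalg.solve(hess, grad)      # Newton step
    return beta

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 2))
true_beta = np.array([0.5, 1.0, -2.0])
p = 1.0 / (1.0 + np.exp(-(np.column_stack([np.ones(1000), X]) @ true_beta)))
y = (rng.uniform(size=1000) < p).astype(float)

beta_hat = logit_fit(X, y)
probs = 1.0 / (1.0 + np.exp(-(np.column_stack([np.ones(1000), X]) @ beta_hat)))
print(beta_hat)                                  # close to [0.5, 1.0, -2.0]
```

Unlike the linear probability model, the fitted values here are guaranteed to lie between zero and one.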

Application: Directional investment strategy

yt+1 = r^s_{t+1} − Tbillt+1 : excess return on stocks (r^s_{t+1}) over T-bills (Tbillt+1)
I^s_{yt+1>0} = 1 if yt+1 > 0, 0 otherwise

Investment strategy: buy stocks if we predict yt+1 > 0, otherwise hold T-bills

forecast        stocks   T-bills
f_{t+1|t} > 0     +1        0
f_{t+1|t} ≤ 0      0       +1

Logit/Probit model estimates the probability of a positive excess return,

prob(I syt+1>0 = 1|It )

I syt+1 = yt+1 >= 0; % create indicator for dependent variable

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 12 / 57

Fitted probabilities of a positive excess return

Use logit, probit or linear model to forecast the probability of a positive

(monthly) excess return using the lagged T-bill rate, dividend yield and

default spread as predictors

[Figure: fitted probabilities of a positive excess return, 1930–2010, from the LPM, probit and logit models]

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 13 / 57

Switching strategy (cont.)

Decision rule: define the stock “weight” ω^s_{t+1|t}
ω^s_{t+1|t} = 1 if prob(I^s_{yt+1>0} = 1|It ) > 0.5, 0 otherwise

Payoff on stock-bond switching (market timing) portfolio:
rt+1 = ω^s_{t+1|t} r^s_{t+1} + (1 − ω^s_{t+1|t}) Tbillt+1

Payoff depends on both the sign and magnitude of the predicted excess

return, even though the forecast ignores information about magnitudes

Cumulated wealth: Starting from initial wealth W0 we get
WT = W0 ∏_{τ=1}^{T} (1 + rτ)

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 14 / 57
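The wealth arithmetic for the switching rule is a one-line cumulative product. The Python sketch below uses made-up returns and fitted probabilities purely for illustration: hold stocks when the predicted probability of a positive excess return exceeds 0.5, otherwise hold T-bills.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 120                                     # months
stock = rng.normal(0.007, 0.04, T)          # illustrative stock returns
tbill = np.full(T, 0.002)                   # illustrative T-bill returns
prob_up = rng.uniform(0.3, 0.7, T)          # fitted prob of positive excess return

weight = (prob_up > 0.5).astype(float)      # stock weight: 1 or 0
port = weight * stock + (1 - weight) * tbill
wealth = 1.0 * np.cumprod(1 + port)         # W_t = W_0 * prod(1 + r_tau)
print(f"terminal wealth: {wealth[-1]:.3f}")
```

Note that the payoff depends on the magnitude of the realized returns even though the rule itself only uses the sign forecast.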

Cumulated wealth from T-bills, stocks and switching rule

[Figure: cumulated wealth, 1930–2010, from the switching strategy, stocks, and T-bills]

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 15 / 57

Point forecasts

Point forecasts provide a summary statistic for the predictive density of the

predicted variable (Y ) given the data

This is all we need under MSE loss (sufficient statistic)

Limitations to point forecasts:

Different loss functions L give different point forecasts (Lecture 1)

Point forecasts convey no sense of the precision of the forecast —how

aggressively should an investor act on a predicted stock return of +1%?

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 16 / 57

Interval forecasts

It is always useful to report a measure of forecast uncertainty

Addresses “how certain am I of my forecast?”

many forecasts are surrounded by considerable uncertainty

Alternatively, use scenario analysis

specify outcomes in possible future scenarios along with the probabilities of the

scenarios

Under the assumption that the forecast errors are normally distributed, we

can easily construct an interval forecast, i.e., an interval that contains the

future value of Y with a probability such as 50%, 90% or 95%

forecast the variance as well

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 17 / 57
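Under normality, the interval construction is just the point forecast plus or minus a quantile multiple of the volatility forecast. A quick Python check with illustrative numbers (and a simulation to confirm the 95% coverage):

```python
import numpy as np

mu, sigma = 0.01, 0.05            # illustrative mean and volatility forecasts
lo = mu - 1.96 * sigma            # lower bound of 95% interval forecast
hi = mu + 1.96 * sigma            # upper bound
print(f"95% interval: [{lo:.3f}, {hi:.3f}]")

# Coverage check: about 95% of normal outcomes land inside the interval
rng = np.random.default_rng(5)
y = rng.normal(mu, sigma, 100_000)
coverage = np.mean((y >= lo) & (y <= hi))
print(f"coverage: {coverage:.3f}")
```

Replacing 1.96 with 0.674 or 1.645 gives 50% and 90% intervals in the same way.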

Distribution (density) forecasts

Distribution forecasts provide a complete characterization of the forecast

uncertainty

Calculation of expected utility for many risk-averse investors requires a

forecast of the full probability distribution of returns —not just its mean

Parametric approaches assume a known distribution such as the normal

(Gaussian)

Non-parametric methods treat the distribution as unknown

bootstrap draws from the empirical distribution of residuals

Hybrid approaches that mix different distributions can also be used

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 18 / 57

Density forecasts (cont.)

To construct density forecasts, typically three estimates are used:

Estimate of the conditional mean given the data, µt+1|t

Estimate of the conditional volatility given the data, σt+1|t

Estimate of the distribution function of the innovations/shocks, Pt+1|t

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 19 / 57

Conditional location-scale processes with normal errors

yt+1 = µt+1|t + σt+1|tut+1, ut+1 ∼ N(0, 1)

µt+1|t : conditional mean of yt+1, given current information, It

σt+1|t : conditional standard deviation or volatility, given It

P(yt+1 ≤ y |It ) = P((yt+1 − µt+1|t)/σt+1|t ≤ (y − µt+1|t)/σt+1|t)
= P(ut+1 ≤ (y − µt+1|t)/σt+1|t)
≡ N((y − µt+1|t)/σt+1|t)

P : probability

N : cumulative distribution function of Normal(0, 1)

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 20 / 57

Nobel prize committee press release (2003)

“On financial markets, random fluctuations over time —volatility —are

particularly significant because the value of shares, options and other

financial instruments depends on their risk. Fluctuations can vary

considerably over time; turbulent periods with large fluctuations are followed

by calmer periods with small fluctuations. Despite such time-varying

volatility, in want of a better alternative, researchers used to work with

statistical methods that presuppose constant volatility. Robert Engle’s

discovery was therefore a major breakthrough. He found that the concept of

autoregressive conditional heteroskedasticity (ARCH) accurately

captures the properties of many time series and developed methods for

statistical modeling of time-varying volatility. His ARCH models have become

indispensable tools not only for researchers, but also for analysts on financial

markets, who use them in asset pricing and in evaluating portfolio risk.”

Robert Engle did the work on ARCH models at UCSD

This work is critical for modeling σt+1|t

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 21 / 57

ARCH models

Returns, rt+1, at short horizons (daily, 5-minute, weekly, even monthly) are

hard to predict – they are not strongly serially correlated

However, squared returns, r²_{t+1}, are serially correlated and easier to predict

Volatility clustering: periods of high market volatility alternate with periods

of low volatility

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 22 / 57

Daily stock returns

Ten years of daily US stock returns
[Figure: daily S&P 500 returns (percentage points), 2005–2014]

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 23 / 57

Daily stock return levels: AR(4) model estimates

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 24 / 57

Squared daily stock returns

[Figure: squared daily S&P 500 returns (percentage points), 2005–2014]

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 25 / 57

Squared daily stock returns: AR(4) estimates

Much stronger evidence of serial persistence (autocorrelation) in squared returns

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 26 / 57

GARCH models

Generalized AutoRegressive Conditional Heteroskedasticity

GARCH(p, q) model for the conditional variance:

εt+1 = σt+1|t ut+1, ut+1 ∼ N(0, 1)
σ²t+1|t = ω + ∑_{i=1}^{p} βi σ²t+1−i|t−i + ∑_{i=1}^{q} αi ε²t+1−i

GARCH(1, 1) is the empirically most popular specification:
σ²t+1|t = ω + β1σ²t|t−1 + α1ε²t
= ω + (α1 + β1)σ²t|t−1 + α1σ²t|t−1(u²t − 1), where the last term has zero mean
Difficult to beat GARCH(1,1) in many volatility forecasting contests

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 27 / 57

GARCH(1,1) model: h-step forecasts I

α1 + β1 : measures persistence of GARCH(1,1) model

As long as α1 + β1 < 1, the volatility process will converge
Long-run (unconditional) variance is
E[σ²t+1|t] ≡ σ² = ω/(1 − α1 − β1)
GARCH(1,1) is similar to an ARMA(1,1) model in squares:
σ²t+1|t = σ² + (α1 + β1)(σ²t|t−1 − σ²) + α1σ²t|t−1(u²t − 1)
Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 28 / 57
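The convergence claim can be checked by simulating the GARCH(1,1) recursion directly. The parameter values below are illustrative (persistence α1 + β1 = 0.98); the sample variance of the simulated series should settle near the unconditional variance ω/(1 − α1 − β1).

```python
import numpy as np

rng = np.random.default_rng(4)
omega, alpha, beta = 0.05, 0.08, 0.90        # illustrative GARCH(1,1) parameters
T = 200_000
eps = np.empty(T)
var = omega / (1 - alpha - beta)             # start at the long-run variance
for t in range(T):
    eps[t] = np.sqrt(var) * rng.normal()     # eps_t = sigma_t * u_t
    var = omega + alpha * eps[t] ** 2 + beta * var   # sigma^2_{t+1|t} recursion

print(f"unconditional variance: {omega / (1 - alpha - beta):.3f}")  # 2.500
print(f"sample variance:        {eps.var():.3f}")
```

The simulated series also shows the volatility clustering and fat tails discussed above, even though the shocks u_t are standard normal.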
Volatility modeling
GARCH(1,1) model can generate fat tails
The standard GARCH(1,1) model does not generate a skewed distribution —
that’s because the shocks are normally distributed (symmetric)
Conditional volatility estimate: estimate of the current volatility level given
all current information. This varies over time
Mean reversion: If the current conditional variance forecast σ²t+1|t > σ², the multi-period variance forecast will exceed the average forecast by an amount that declines in the forecast horizon
Unconditional volatility estimate: long-run (“average”) estimate of volatility. This is constant over time

GARCH models can be estimated by maximum likelihood methods

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 29 / 57

Asymmetric GARCH models I

GJR model of Glosten, Jagannathan, and Runkle (1993):

σ²t+1|t = ω + α1ε²t + λε²t I(εt < 0) + β1σ²t|t−1
I(εt < 0) = 1 if εt < 0, 0 otherwise
Positive and negative shocks affect volatility differently if λ ≠ 0
If λ > 0, negative shocks will affect future conditional variance more strongly

than positive shocks

The bigger effect of negative shocks is sometimes attributed to leverage (for

stock returns)

λ measures the magnitude of the leverage

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 30 / 57

Asymmetric GARCH models II

EGARCH (exponential GARCH) model of Nelson (1991):

log(σ²t+1|t) = ω + α1(|εt| − E[|εt|]) + γεt + β1 log(σ²t|t−1)

EGARCH volatility estimates are always positive in levels (the exponential of

a negative number is positive)

If γ < 0, negative shocks (εt < 0) will have a bigger effect on future conditional volatility than positive shocks
γ measures the magnitude of the leverage (sign different from GJR model)
Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 31 / 57
GARCH estimation in matlab
garchExample.m: code on Triton Ed
[h, pValue, stat, cValue] = archtest(res, 'lags', 10); % test for ARCH effects
model = garch(P,Q); % creates a conditional variance GARCH model with GARCH degree P and ARCH degree Q
model = egarch(P,Q); % creates an EGARCH model with P lags of log(σ²t|t−1) and Q lags of ε²t
model = gjr(P,Q); % creates a GJR model
modelEstimate = estimate(model, spReturns); % estimate GARCH model
modelVariances = infer(modelEstimate, spReturns); % generate conditional variance estimates
varianceForecasts = forecast(modelEstimate, 10, 'V0', modelVariances); % generate variance forecasts
Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 32 / 57
GARCH(1,1) and EGARCH estimates
Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 33 / 57
GJR estimates
Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 34 / 57
Comparison of fitted volatility estimates
[Figure: fitted volatility estimates (percentage points) from the GARCH(1,1), EGARCH(1,1) and GJR(1,1) models]
Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 35 / 57
Comparison of 3 models: out-of-sample forecasts
[Figure: out-of-sample volatility forecasts (percentage points) by forecast horizon for the GARCH(1,1), EGARCH(1,1) and GJR(1,1) models]
Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 36 / 57
Realized variance
True variance is unobserved. How do we estimate it?
Intuition: the higher the variance of y is in a given period, the more y fluctuates in small intervals during that period
Idea: sum the squared changes in y during small intervals between time markers τ0, τ1, τ2, ..., τN within a given period
Realized variance:
RVt = ∑_{j=1}^{N} (yτj − yτj−1)², where t − 1 = τ0 < τ1 < ... < τN = t
Example: use 5-minute sampled data over 8 hours to estimate the daily stock market volatility: N = 8 × 12 = 96 observations, τ0 = 8am, τ1 = 8:05am, τ2 = 8:10am, ..., τN = 4pm
Example: use squared daily returns to estimate volatility within a month: N = 22 daily observations (trading days), τ0 = Jan 31, τ1 = Feb 01, τ2 = Feb 02, ..., τN = Feb 28
Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 37 / 57
Realized variance
Treating the realized variance as a noisy estimate of the true (unobserved) variance, we can use simple ARMA models to predict future volatility
AR(1) model for the realized variance: RVt+1 = β0 + β1RVt + εt+1
The realized volatility is the square root of the realized variance
Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 38 / 57
Monthly realized volatility
[Figure: monthly realized volatility (percentage points), 2005–2014]
Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 39 / 57
Example: Linear regression model
Consider the linear regression model
yt+1 = β1yt + εt+1, εt+1 ∼ N(0, σ²)
The point forecast computed at time T using an estimated model is f̂T+1|T = β̂1yT
The forecast error is the difference between actual value and forecast:
yT+1 − f̂T+1|T = εT+1 + (β1 − β̂1)yT
The MSE is
MSE = E[(yT+1 − f̂T+1|T)²] = σ²ε + Var(β1 − β̂1) × y²T
This depends on σ²ε and also on the estimation error (β1 − β̂1)
Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 40 / 57
Example (cont.): interval forecasts
Interval forecasts are similar to confidence intervals: A 95% interval forecast is
an interval that contains the future value of the outcome 95% of the time
If the variable is normally distributed, we can construct this as
f̂T+1|T ± 1.96 × SE(YT+1 − f̂T+1|T)
SE(YT+1 − f̂T+1|T) is the standard error of the forecast error eT+1 = YT+1 − f̂T+1|T
Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 41 / 57
Interval forecasts
Consider the simple model yt+1 = µ + σεt+1, εt+1 ∼ N(0, 1)
A typical interval forecast is that the outcome yt+1 falls in the interval [f^l, f^u] with some given probability, e.g., 95%
If εt+1 is normally distributed this simplifies to
f^l = µ − 1.96σ, f^u = µ + 1.96σ
More generally, with time-varying mean (µt+1|t) and volatility (σt+1|t):
f^l_{t+1|t} = µt+1|t − 1.96σt+1|t, f^u_{t+1|t} = µt+1|t + 1.96σt+1|t
What happens to forecasts for longer horizons?
Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 42 / 57
Interval forecasts
Mean reverting AR(1) process starting at the mean (yT = 1, E[y] = 1, φ = 0.9, σ = 0.5)
Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 43 / 57
Interval forecasts
Mean reverting AR(1) process starting above the mean (yT = 2, E[y] = 1, σ = 0.5)
Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 44 / 57
Uncertainty and forecast horizon I
Consider the AR(1) process yt+1 = φyt + εt+1, εt+1 ∼ N(0, σ²)
yt+h = φyt+h−1 + εt+h
= φ(φyt+h−2 + εt+h−1) + εt+h
= φ²yt+h−2 + φεt+h−1 + εt+h
...
yt+h = φ^h yt + φ^{h−1}εt+1 + φ^{h−2}εt+2 + ... + φεt+h−1 + εt+h (the ε terms are unpredictable future shocks)
Using this expression, if |φ| < 1 (mean reversion) we have
ft+h|t = φ^h yt
Var(yt+h |It) = σ²(1 − φ^{2h})/(1 − φ²) → σ²/(1 − φ²) (for large h)
Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 45 / 57
Uncertainty and forecast horizon II
The 95% interval forecast and probability (density) forecast for the mean reverting AR(1) process (|φ| < 1) with Gaussian shocks are
95% interval forecast:
φ^h yt ± 1.96σ√((1 − φ^{2h})/(1 − φ²))
density forecast: N(φ^h yt, σ²(1 − φ^{2h})/(1 − φ²))
This ignores parameter estimation error (φ is taken as known)
Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 46 / 57
Interval and density forecasts for random walk model
For the random walk model yt+1 = yt + εt+1, εt+1 ∼ N(0, σ²), so
yt+h = yt + εt+1 + εt+2 + ... + εt+h−1 + εt+h
Using this expression, we get
ft+h|t = yt
Var(yt+h |It) = hσ²
The 95% interval and probability forecasts are
95% interval forecast: yt ± 1.96σ√h
density forecast: N(yt, hσ²)
Width of confidence interval continues to expand as h → ∞
Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 47 / 57
Interval forecasts: random walk model
Interval forecasts for random walk model (yT = 2, σ = 1)
Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 48 / 57
Alternative distributions: Two-piece normal distribution
Two-piece normal distribution:
dist(yt+1) = exp(−(yt+1 − µt+1|t)²/2σ²1) / (√(2π)(σ1 + σ2)/2) for yt+1 ≤ µt+1|t
dist(yt+1) = exp(−(yt+1 − µt+1|t)²/2σ²2) / (√(2π)(σ1 + σ2)/2) for yt+1 > µt+1|t

The mean of this distribution is
Et[yt+1] = µt+1|t + √(2/π)(σ2 − σ1)

If σ2 > σ1, the distribution is positively skewed

The distribution has fat tails provided that σ1 ≠ σ2

This distribution is used by Bank of England to compute “fan charts”

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 49 / 57
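The mean formula can be verified by simulation. A standard way to sample the two-piece normal (an assumption of this sketch, not spelled out on the slide) is to draw from the left half-normal with probability σ1/(σ1 + σ2) and from the right half-normal otherwise; parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(8)
mu, s1, s2 = 0.0, 1.0, 2.0        # illustrative mode and the two scale parameters
n = 500_000

below = rng.uniform(size=n) < s1 / (s1 + s2)     # which side of the mode
draw = np.abs(rng.normal(size=n))                # half-normal magnitude
y = np.where(below, mu - s1 * draw, mu + s2 * draw)

# Slide formula: E[y] = mu + sqrt(2/pi) * (sigma2 - sigma1)
mean_formula = mu + np.sqrt(2 / np.pi) * (s2 - s1)
print(f"formula mean: {mean_formula:.3f}  sample mean: {y.mean():.3f}")
```

With σ2 > σ1 the simulated distribution is visibly right-skewed, consistent with the positive-skew statement above.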

Bank of England fan charts: Inflation report 02/2017

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 50 / 57

Bank of England fan charts: Inflation report 02/2016

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 51 / 57

IMF World Economic Outlook, October 2016

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 52 / 57

Alternative distributions: Mixtures of normals

Suppose

y1t+1 ∼ N(µ1, σ²1)
y2t+1 ∼ N(µ2, σ²2)
cov(y1t+1, y2t+1) = σ12
Sums of normal distributions are normally distributed:
y1t+1 + y2t+1 ∼ N(µ1 + µ2, σ²1 + σ²2 + 2σ12)
Mixtures of normal distributions are not normally distributed: Let
st+1 = {0, 1} be a random indicator variable. Then
st+1 × y1t+1 + (1 − st+1) × y2t+1 ≠ N(., .)

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 53 / 57

Moments of Gaussian mixture models

Let p1 = probability of state 1, p2 = probability of state 2

mean and variance of y:
E[y] = p1µ1 + p2µ2
Var(y) = p2σ²2 + p1σ²1 + p1p2(µ2 − µ1)²
skewness:
skew(y) = p1p2(µ1 − µ2){3(σ²1 − σ²2) + (1 − 2p1)(µ2 − µ1)²}
kurtosis:
kurt(y) = p1p2(µ1 − µ2)²[(p³2 + p³1)(µ1 − µ2)²] + 6p1p2(µ1 − µ2)²[p1σ²2 + p2σ²1] + 3p1σ⁴1 + 3p2σ⁴2

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 54 / 57
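The mean and variance formulas are easy to check by simulating the two-state mixture; the state probabilities and component parameters below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
p1, p2 = 0.3, 0.7                         # state probabilities
mu1, mu2, s1, s2 = -1.0, 2.0, 1.0, 0.5    # component means and std devs

# Slide formulas for the mixture mean and variance
mean_formula = p1 * mu1 + p2 * mu2
var_formula = p2 * s2**2 + p1 * s1**2 + p1 * p2 * (mu2 - mu1)**2

# Simulate: draw the state, then draw from the corresponding normal
n = 500_000
state1 = rng.uniform(size=n) < p1
y = np.where(state1, rng.normal(mu1, s1, n), rng.normal(mu2, s2, n))
print(f"mean:     formula {mean_formula:.3f}  sample {y.mean():.3f}")
print(f"variance: formula {var_formula:.3f}  sample {y.var():.3f}")
```

With the two component means far apart, the simulated density is clearly bimodal and non-normal, illustrating the point on the previous slide.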

Mixtures of normals: Ang and Timmermann (2012)

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 55 / 57

Mixtures of normals: Marron and Wand (1992)

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 56 / 57

Mixtures of normals: Marron and Wand (1992)

Timmermann (UCSD) Event, Density and Volatility Forecasting Winter, 2017 57 / 57

Lecture 8: Forecast Combination

UCSD, February 27, 2017

Allan Timmermann1

1UC San Diego

Timmermann (UCSD) Combination Winter, 2017 1 / 49

1 Introduction: When, why and what to combine?

2 Survey of Professional Forecasters

3 Optimal Forecast Combinations: Theory

Optimal Combinations under MSE loss

4 Estimating Forecast Combination Weights

Weighting schemes under MSE loss

Forecast Combination Puzzle

Rapach, Strauss and Zhou, 2010

Elliott, Gargano, and Timmermann, 2013

Time-varying combination weights

5 Model Combination

Optimal Pool

6 Bayesian Model Averaging

7 Conclusion

Timmermann (UCSD) Combination Winter, 2017 2 / 49

Key issues in forecast combination

Why combine?

Many models or forecasts with ‘similar’ predictive accuracy

Diffi cult to identify a single best forecast

State-dependent performance

Diversification gains

When to combine?

Individual forecasts are misspecified (“all models are wrong but some are

useful.”)

Unstable forecasting environment (past track record is unreliable)

Short track record; use “one-over-N” weights? (N forecasts)

What to combine?

Forecasts using different information sets

Forecasts based on different modeling approaches

Surveys, econometric model forecasts: surveys are truly forward-looking.

Econometric models are better calibrated to data

Timmermann (UCSD) Combination Winter, 2017 2 / 49

Essentials of forecast combination

Dimensionality reduction: Combination reduces the information in a large

set of forecasts to a single summary forecast using a set of combination

weights

Optimal combination chooses “optimal” weights to produce the most

accurate combined forecast

More accurate forecasts get larger weights

Combination weights also reflect correlations across forecasts

Estimation error is important for combination weights

Irrelevance Proposition: In a world with no model misspecification, infinite

data samples (no estimation error) and complete access to the information

sets underlying the individual forecasts, there is no need for forecast

combination

just use the single best model

Timmermann (UCSD) Combination Winter, 2017 3 / 49

When to combine?

Notations:

y : outcome

f̂1, f̂2 : individual forecasts

ω : combination weight

Simple combined forecast: f̂ com = ωf̂1 + (1−ω)f̂2

The combined forecast f̂ com dominates the individual forecasts f̂1 and f̂2

under MSE loss if

E[(y − f̂1)²] > E[(y − f̂com)²] and E[(y − f̂2)²] > E[(y − f̂com)²]
Both conditions need to hold

Timmermann (UCSD) Combination Winter, 2017 4 / 49
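A quick numerical illustration of the dominance condition: with two unbiased forecasts whose errors are independent with equal variance, the equal-weighted combination (ω = 1/2) halves the error variance and beats both individual forecasts. The data below are simulated purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
y = rng.normal(size=n)                      # outcome
f1 = y + rng.normal(0, 1.0, n)              # forecast 1: unbiased, error var 1
f2 = y + rng.normal(0, 1.0, n)              # forecast 2: independent errors
fc = 0.5 * f1 + 0.5 * f2                    # equal-weight combination

def mse(f):
    return np.mean((y - f) ** 2)

print(f"MSE f1: {mse(f1):.3f}  MSE f2: {mse(f2):.3f}  MSE comb: {mse(fc):.3f}")
# With independent unit-variance errors, the combined error variance is 0.5
```

If the two error series were perfectly correlated, the combination would offer no gain, which is why the combination weights must account for error correlations as well as accuracies.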

Applications of forecast combinations

Forecast combinations have been successfully applied in several areas of

forecasting:

Gross National Product

currency market volatility and exchange rates

inflation, interest rates, money supply

stock returns

meteorological data

city populations

outcomes of football games

wilderness area use

check volume

political risks

Estimation of GDP based on income and production measures

Averaging across values of unknown parameters

Timmermann (UCSD) Combination Winter, 2017 5 / 49

Two types of forecast combinations

1 Data used to construct the individual forecasts are not observed:

Treat individual forecasts like any other information (data) and estimate the

best possible mapping from the forecasts to the outcome

Examples: survey forecasts, analysts’ earnings forecasts

2 Data underlying the model forecasts are observed: ‘model combination’

First generate forecasts from individual models. Then combine these forecasts

Why not simply construct a single “super” model?

Timmermann (UCSD) Combination Winter, 2017 6 / 49

Survey of Economic Forecasters: Participation

1995 2000 2005 2010 2015

400

420

440

460

480

500

520

540

560

580

Timmermann (UCSD) Combination Winter, 2017 7 / 49

SPF: median, interquartile range, min, max real GDP

forecasts

1995 2000 2005 2010 2015

0

5

10

15

Timmermann (UCSD) Combination Winter, 2017 8 / 49

SPF: identity of best forecaster (unemployment rate), ranked by MSE

[Figures: ID of the best forecaster by MSE over rolling 5-year and 10-year windows, 1975–2010]

Timmermann (UCSD) Combination Winter, 2017 9 / 49

Forecast combinations: simple example

Two forecasting models using x1 and x2 as predictors:

yt+1 = β1x1t + ε1t+1 ⇒ f̂1t+1|t = β̂1tx1t

yt+1 = β2x2t + ε2t+1 ⇒ f̂2t+1|t = β̂2tx2t

Combined forecast:

f̂ comt+1|t = ωf̂1t+1|t + (1−ω)f̂2t+1|t

Could the combined forecast be better than the forecast based on the

“super” model?

yt+1 = β1x1t + β2x2t + εt+1 ⇒ f̂ Supert+1|t = β̂1tx1t + β̂2tx2t

Timmermann (UCSD) Combination Winter, 2017 10 / 49

Combinations of forecasts: theory

Suppose the information set consists of m individual forecasts:

I = {f̂1, …., f̂m}

Find an optimal combination of the individual forecasts:

f̂ com = ω0 +ω1 f̂1 +ω2 f̂2 + …+ωm f̂m

ω0,ω1,ω2, …,ωm : unknown combination weights

The combined forecast uses the individual forecasts {f̂1, f̂2, …, f̂m} rather than

the underlying information sets used to construct the forecasts (f̂i = β̂′i xi)

Timmermann (UCSD) Combination Winter, 2017 11 / 49

Combinations of forecasts: theory

Because the underlying ‘data’ are forecasts, they can be expected to receive

non-negative weights that sum to unity,

0 ≤ ωi ≤ 1, i = 1, …, m, ∑ᵢ₌₁ᵐ ωi = 1

Such constraints on the weights can be used to reduce the effect of

estimation error

Should we allow ωi < 0 and go "short" in a forecast?
Negative ωi doesn’t mean that the ith forecast was bad. It just means that
forecast i can be used to offset the errors of other forecasts
Timmermann (UCSD) Combination Winter, 2017 12 / 49
Combinations of two forecasts
Two individual forecasts f1, f2 with forecast errors e1 = y − f1, e2 = y − f2
Both forecasts are assumed to be unbiased: E [e1 ] = E [e2 ] = 0
Variances of forecast errors: σ2i , i = 1, 2. Covariance is σ12
The combined forecast will also be unbiased if the weights add up to one:
f = ωf1 + (1−ω)f2 ⇒
e = y − f = y −ωf1 − (1−ω)f2 = ωe1 + (1−ω)e2
Forecast error from the combination is a weighted average of the individual
forecast errors
E [e] = 0
Var(e) = ω²σ1² + (1−ω)²σ2² + 2ω(1−ω)σ12
Like a portfolio of two correlated assets
Timmermann (UCSD) Combination Winter, 2017 13 / 49
Combination of two unbiased forecasts: optimal weights
Solve for the optimal combination weight, ω∗:

ω∗ = (σ2² − σ12) / (σ1² + σ2² − 2σ12)

1 − ω∗ = (σ1² − σ12) / (σ1² + σ2² − 2σ12)

Combination weight can be negative if σ12 > σ1² or σ12 > σ2²

If σ12 = 0: weights are proportional to the inverse variances of the forecasts:

ω∗ = σ2² / (σ1² + σ2²) = σ1⁻² / (σ1⁻² + σ2⁻²)

Greater weight is assigned to more precise models (small σi²)
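As a quick sketch (the function name and numbers are illustrative, not from the slides), the optimal two-forecast weight follows directly from the error variances and covariance:

```python
# Optimal weight on forecast 1 in a two-forecast combination under MSE loss:
# omega* = (var2 - cov12) / (var1 + var2 - 2*cov12)
def optimal_weight(var1, var2, cov12):
    return (var2 - cov12) / (var1 + var2 - 2 * cov12)


w = optimal_weight(1.0, 2.0, 0.0)    # zero covariance: w = var2/(var1+var2) = 2/3
w2 = optimal_weight(1.0, 2.0, 1.2)   # cov12 > var1: weight on forecast 2 goes negative
```

In the second call the implied weight on forecast 2, 1 − w2, is negative: forecast 2 is used to offset the errors of forecast 1 rather than being discarded.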

Timmermann (UCSD) Combination Winter, 2017 14 / 49

Combinations of multiple unbiased forecasts

f : m×1 vector of forecasts

e = ιm y − f : vector of m forecast errors

ιm = (1, 1, …, 1)′ : m×1 vector of ones

Assume that the individual forecast errors are unbiased:

E[e] = 0, Σe = Cov(e)

Choosing ω to minimize the MSE subject to the weights summing to one, we

get the optimal combination weights ω∗:

ω∗ = argminω ω′Σeω s.t. ι′mω = 1, so ω∗ = (ι′mΣe⁻¹ιm)⁻¹ Σe⁻¹ιm

Special case: if Σe is diagonal, ω∗i = σi⁻² / ∑ⱼ₌₁ᵐ σⱼ⁻² : inverse MSE weights
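A minimal numerical sketch of this solution (function name and covariance matrix are my own): solve Σe⁻¹ι and normalize; in the diagonal case this reproduces the inverse-MSE weights.

```python
import numpy as np


def optimal_combo_weights(Sigma_e):
    """omega* = Sigma_e^{-1} iota / (iota' Sigma_e^{-1} iota)."""
    m = Sigma_e.shape[0]
    iota = np.ones(m)
    x = np.linalg.solve(Sigma_e, iota)   # Sigma_e^{-1} iota without explicit inverse
    return x / (iota @ x)


# Diagonal case: weights proportional to inverse variances (inverse MSE weights)
Sigma = np.diag([1.0, 2.0, 4.0])
w = optimal_combo_weights(Sigma)
# Inverse variances 1, 1/2, 1/4 normalize to 4/7, 2/7, 1/7
```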

Timmermann (UCSD) Combination Winter, 2017 15 / 49

Optimality of equal weights

Equal weights (EW) play a special role in forecast combination

EW are optimal when the individual forecast errors have identical variance,

σ2, and identical pair-wise correlations ρ

nothing to distinguish between the forecasts

This situation holds to a close approximation when all models are based on

similar data and produce more or less equally accurate forecasts

Similarity to portfolio analysis: An equal-weighted portfolio is optimal if all

assets have the same mean and variance and pairwise identical covariances

Timmermann (UCSD) Combination Winter, 2017 16 / 49

Estimating combination weights

In practice, combination weights need to be estimated using past data

Once we use estimated combination weights it is difficult to show that any

particular weighting scheme will dominate other weighting methods

We prefer one method for some data and different methods for other data

Equal-weights avoid estimation error entirely

Why not always use equal weights then?

Timmermann (UCSD) Combination Winter, 2017 17 / 49

Estimating combination weights

If we try to estimate the optimal combination weights, estimation error

creeps in

In the case of forecast combination, the “data” (individual forecasts) are not

random draws but (possibly unbiased, if not precise) forecasts of the outcome

This suggests imposing special restrictions on the combination weights

We might impose that the weights sum to one and are non-negative:

∑

i

ωi = 1, ωi ∈ [0, 1]

Simple combination schemes such as EW satisfy these constraints and do not

require estimation of any parameters

EW can be viewed as a reasonable prior when no data has been observed

Timmermann (UCSD) Combination Winter, 2017 18 / 49

Estimating combination weights

Simple estimation methods are difficult to beat in practice

Common baseline is to use a simple EW average of forecasts:

f ew = (1/m) ∑ᵢ₌₁ᵐ fi

No estimation error since the combination weights are imposed rather than

estimated (data independent)

Also works if the number of forecasts (m) changes over time or some

forecasts have short track records

Timmermann (UCSD) Combination Winter, 2017 19 / 49

Simple combination methods

Equal-weighted forecast

f ew = (1/m) ∑ᵢ₌₁ᵐ fi

Median forecast (robust to outliers)

f median = median{fi}ᵢ₌₁ᵐ

Trimmed mean. Order the forecasts

{f1 ≤ f2 ≤ … ≤ fm−1 ≤ fm}

Then trim the top/bottom α% before taking an average:

f trim = (1/(m(1 − 2α))) ∑ fi, summing over i = ⌊αm⌋ + 1, …, ⌊(1 − α)m⌋
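These three combination rules fit in a few lines (function and variable names are mine; the trimming follows the index bounds in the trimmed-mean formula):

```python
import numpy as np


def simple_combinations(f, alpha=0.1):
    """Equal-weighted mean, median, and alpha-trimmed mean of a set of forecasts."""
    f = np.sort(np.asarray(f, dtype=float))
    m = len(f)
    lo = int(np.floor(alpha * m))          # drop the bottom alpha share
    hi = int(np.floor((1 - alpha) * m))    # drop the top alpha share
    return f.mean(), float(np.median(f)), f[lo:hi].mean()


forecasts = [1.8, 2.0, 2.1, 2.2, 9.0]      # one outlier
ew, med, trim = simple_combinations(forecasts, alpha=0.2)
# The outlier drags the mean up, while the median and trimmed mean stay at 2.1
```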

Timmermann (UCSD) Combination Winter, 2017 20 / 49

Weights inversely proportional to MSE or rankings

Ignore correlations across forecast errors and set weights proportional to the

inverse of the models’ MSE (mean squared error) values:

ωi = MSEi⁻¹ / ∑ⱼ₌₁ᵐ MSEⱼ⁻¹

Robust weighting scheme that weights forecast models inversely to their rank,

Ranki:

ω̂i = Ranki⁻¹ / ∑ⱼ₌₁ᵐ Rankⱼ⁻¹

Best model gets a rank of 1, second best model a rank of 2, etc. Weights

proportional to 1/1, 1/2, 1/3, etc.

Timmermann (UCSD) Combination Winter, 2017 21 / 49

Bates-Granger restricted least squares

Bates and Granger (1969): use plug-in weights in the optimal solution based

on the estimated variance-covariance matrix

Numerically identical to a restricted least squares estimator of the weights from

regressing the outcome on the vector of forecasts ft+h|t with no intercept,

subject to the restriction that the coefficients sum to one:

f BGt+h|t = ω̂′OLS ft+h|t = (ι′Σ̂e⁻¹ι)⁻¹ ι′Σ̂e⁻¹ ft+h|t

Σ̂e = (T − h)⁻¹ ∑ₜ₌₁ᵀ⁻ʰ et+h|t e′t+h|t : sample estimator of the error covariance

matrix

Timmermann (UCSD) Combination Winter, 2017 22 / 49

Forecast combination puzzle

Empirical studies often find that simple equal-weighted forecast combinations

perform very well compared with more sophisticated combination schemes

that rely on estimated combination weights

Smith and Wallis (2009): “Why is it that, in comparisons of combinations of

point forecasts based on mean-squared forecast errors …, a simple average

with equal weights often outperforms more complicated weighting schemes?”

Errors introduced by estimation of the optimal combination weights could

overwhelm any gains relative to using 1/N weights

Timmermann (UCSD) Combination Winter, 2017 23 / 49

Combination forecasts using Goyal-Welch Data

RMSE: PrevMean = 1.9640, Ksink = 1.9924, EW = 1.9592

[Figure: combination return forecasts, 1975–2010: PrevMean, Ksink, EW]

Timmermann (UCSD) Combination Winter, 2017 24 / 49

Combination forecasts using Goyal-Welch Data

RMSE: EW = 1.9592, rolling = 1.9875, PrevBest = 2.0072

[Figure: combination return forecasts, 1975–2010: EW, rolling, PrevBest]

Timmermann (UCSD) Combination Winter, 2017 25 / 49

Rapach-Strauss-Zhou (2010)

Quarterly stock returns data, 1947-2005, 15 predictor variables

Individual univariate prediction models (i = 1, ..,N = 15):

rt+1 = αi + βi xit + εit+1 (estimated model)

r̂t+1|i = α̂i + β̂i xit (generated forecast)

Combination forecast of returns, r̂ ct+1|t:

r̂ ct+1|t = ∑ᵢ₌₁ᴺ ωi r̂t+1|i with weights

ωi = 1/N or ωi = DMSPEi⁻¹ / ∑ⱼ₌₁ᴺ DMSPEⱼ⁻¹

DMSPEi = ∑ θ^(t−1−s) (rs+1 − r̂s+1|i)², summing over s = T0, …, t − 1, with θ ≤ 1

DMSPE: discounted mean squared prediction error
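A sketch of these discounted inverse-MSPE weights (function name and example errors are mine; the discounting follows the DMSPE formula above):

```python
import numpy as np


def dmspe_weights(errors, theta=0.9):
    """DMSPE combination weights from a (T, N) array of past forecast errors.

    Each model's squared errors are discounted with factor theta (most recent
    error gets weight 1), and the combination weight is proportional to the
    inverse of the resulting discounted MSPE.
    """
    T, N = errors.shape
    discounts = theta ** np.arange(T - 1, -1, -1)
    dmspe = discounts @ (errors ** 2)
    inv = 1.0 / dmspe
    return inv / inv.sum()


# Model 1 is persistently 5 times more accurate (errors 0.1 vs 0.5)
errs = np.array([[0.1, 0.5], [0.1, 0.5], [0.1, 0.5]])
w = dmspe_weights(errs, theta=0.9)
```

Because squared errors differ by a constant factor of 25, the discounting cancels and the weights are exactly 25/26 and 1/26.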

Timmermann (UCSD) Combination Winter, 2017 26 / 49

Rapach-Strauss-Zhou (2010): results

Timmermann (UCSD) Combination Winter, 2017 27 / 49

Rapach-Strauss-Zhou (2010): results

Timmermann (UCSD) Combination Winter, 2017 28 / 49

Empirical Results (Rapach, Strauss and Zhou, 2010)

Timmermann (UCSD) Combination Winter, 2017 29 / 49

Rapach, Strauss, Zhou: main results

Forecast combinations dominate individual prediction models for stock

returns out-of-sample

Forecast combination reduces the variance of the return forecast

Return forecasts are most accurate during economic recessions

“Our evidence suggests that the usefulness of forecast combining methods

ultimately stems from the highly uncertain, complex, and constantly evolving

data-generating process underlying expected equity returns, which are related

to a similar process in the real economy.”

Timmermann (UCSD) Combination Winter, 2017 30 / 49

Elliott, Gargano, and Timmermann (JoE, 2013):

K possible predictor variables

Generalizes equal-weighted combination of K univariate models

r̂t+1|i = α̂i + β̂i xit to consider EW combination of all possible 2-variate,

3-variate, etc. models:

r̂t+1|i,j = α̂i + β̂i xit + β̂j xjt (2 predictors)

r̂t+1|i,j,k = α̂i + β̂i xit + β̂j xjt + β̂k xkt (3 predictors)

For K = 12, there are 12 univariate models (k = 1), 66 bivariate models

(k = 2), 220 trivariate models (k = 3) to combine, etc.

Take equal-weighted averages over the forecasts from these models –

complete subset regressions
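A compact sketch of a complete subset regression (function name and simulated data are illustrative; an intercept is included in every model, matching the univariate specification above):

```python
import itertools

import numpy as np


def complete_subset_forecast(X, y, x_new, k):
    """EW average of OLS forecasts from all k-predictor models.

    X: (T, K) predictors, y: (T,) outcomes, x_new: (K,) predictors at the
    forecast date. Returns the averaged forecast and the number of models.
    """
    T, K = X.shape
    preds = []
    for subset in itertools.combinations(range(K), k):
        Z = np.column_stack([np.ones(T), X[:, subset]])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        preds.append(beta[0] + x_new[list(subset)] @ beta[1:])
    return float(np.mean(preds)), len(preds)


rng = np.random.default_rng(1)
X = rng.normal(size=(200, 12))
y = 0.5 * X[:, 0] + rng.normal(size=200)
fhat, n_models = complete_subset_forecast(X, y, X[-1], k=2)
# For K = 12 and k = 2 there are C(12, 2) = 66 models, matching the slide
```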

Timmermann (UCSD) Combination Winter, 2017 31 / 49

Elliott, Gargano, and Timmermann (JoE, 2013)

Timmermann (UCSD) Combination Winter, 2017 32 / 49

Elliott, Gargano, and Timmermann (JoE, 2013)

Timmermann (UCSD) Combination Winter, 2017 33 / 49

Adaptive combination weights

Bates and Granger (1969) propose several adaptive estimation schemes

Rolling window of the forecast models’ relative performance over the most

recent win observations:

ω̂i,t|t−h = (∑ e²i,τ|τ−h)⁻¹ / ∑ⱼ₌₁ᵐ (∑ e²j,τ|τ−h)⁻¹, sums over τ = t − win + 1, …, t

Adaptive updating scheme discounts older performance, λ ∈ (0, 1):

ω̂i,t|t−h = λ ω̂i,t−1|t−h−1 + (1 − λ) (∑ e²i,τ|τ−h)⁻¹ / ∑ⱼ₌₁ᵐ (∑ e²j,τ|τ−h)⁻¹

The closer λ is to unity, the smoother the combination weights

Timmermann (UCSD) Combination Winter, 2017 34 / 49

Time-varying combination weights

Time-varying parameter (Kalman filter):

yt+1 = ω′t f̂t+1|t + εt+1

ωt = ωt−1 + ut, cov(ut, εt+1) = 0

Discrete (observed) state switching (Deutsch et al., 1994), conditional on an

observed event happening (et ∈ At):

yt+1 = I{et∈At}(ω01 + ω′1 f̂t+1|t) + (1 − I{et∈At})(ω02 + ω′2 f̂t+1|t) + εt+1

Regime switching weights (Elliott and Timmermann, 2005):

yt+1 = ω0,st+1 + ω′st+1 f̂t+1|t + εt+1

Pr(St+1 = st+1 | St = st) = pst+1,st

Timmermann (UCSD) Combination Winter, 2017 35 / 49

Combinations as a hedge against instability

Forecast combinations can work well empirically because they provide

insurance against model instability

The performance of combined forecasts tends to be more stable than that of

individual forecasts used in the empirical combination study of Stock and

Watson (2004)

Combination methods that attempt to explicitly model time-variations in the

combination weights often fail to perform well, suggesting that regime

switching or model ‘breakdown’ can be difficult to predict or even to track

through time

Use simple, robust methods (rolling window)?

Timmermann (UCSD) Combination Winter, 2017 36 / 49

Combinations as a hedge against instability (cont.)

Suppose a particular forecast is correlated with the outcome only during

times when other forecasts break down. This creates a role for the forecast as

a hedge against model breakdown

Consider two forecasts and two regimes

first forecast works well only in the first state (normal state) but not in the

second state (rare state)

second forecast works well in the second state but not in the first state

second model serves as “insurance” against the breakdown of the first model

like a portfolio asset

Timmermann (UCSD) Combination Winter, 2017 37 / 49

Classical approach to density combination

Problem: we do not directly observe the outcome density (we only observe a

draw from it) and so cannot directly choose the weights to minimize the

loss between this object and the combined density

Kullback-Leibler (KL) loss for a linear combination of densities ∑ᵢ₌₁ᵐ ωi pi(y)

relative to some unknown true density p(y) is given by

KL = ∫ p(y) ln( p(y) / ∑ᵢ₌₁ᵐ ωi pi(y) ) dy

= ∫ p(y) ln(p(y)) dy − ∫ p(y) ln( ∑ᵢ₌₁ᵐ ωi pi(y) ) dy

= C − E[ln( ∑ᵢ₌₁ᵐ ωi pi(y) )]

C is constant for all choices of the weights ωi

Minimizing the KL distance is the same as maximizing the log score in

expectation

Timmermann (UCSD) Combination Winter, 2017 38 / 49

Classical approach to density combination

Use of the log score to evaluate the density combination is popular in the

literature

Geweke and Amisano (2011) use this approach to combine GARCH and

stochastic volatility models for predicting the density of daily stock returns

Under the log score criterion, estimation of the combination weights becomes

equivalent to maximizing the log likelihood. Given a sequence of observed

outcomes {yt}ᵀₜ₌₁, the sample analog is to maximize

ω̂ = argmaxω T⁻¹ ∑ₜ₌₁ᵀ ln( ∑ᵢ₌₁ᵐ ωi pit(yt) ) s.t. ωi ≥ 0 for all i, ∑ᵢ₌₁ᵐ ωi = 1
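A sketch of this estimation for two Gaussian predictive densities, using a simple grid search over the weight (all names and the data-generating setup are illustrative; a real application would use an optimizer):

```python
import numpy as np


def norm_pdf(y, sigma):
    return np.exp(-0.5 * (y / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))


def log_score_weight(y, sigma1, sigma2):
    """Grid-search the weight on density 1 maximizing the average log score."""
    grid = np.linspace(0.0, 1.0, 101)
    p1, p2 = norm_pdf(y, sigma1), norm_pdf(y, sigma2)
    scores = [np.mean(np.log(w * p1 + (1 - w) * p2 + 1e-300)) for w in grid]
    return grid[int(np.argmax(scores))]


rng = np.random.default_rng(2)
y = rng.normal(scale=1.0, size=5000)          # true density is N(0, 1)
w_star = log_score_weight(y, sigma1=1.0, sigma2=3.0)
# The weight should concentrate on the correctly specified density
```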

Timmermann (UCSD) Combination Winter, 2017 39 / 49

Prediction pools with two models (Geweke-Amisano, 2011)

With two models, M1, M2, we have a predictive density

p(yt |Yt−1, M) = ω p(yt |Yt−1, M1) + (1 − ω) p(yt |Yt−1, M2)

and a predictive log score

∑ₜ₌₁ᵀ ln[ω p(yt |Yt−1, M1) + (1 − ω) p(yt |Yt−1, M2)], ω ∈ [0, 1]

Empirical example: Combine GARCH and stochastic volatility models for

predicting the density of daily stock returns

Timmermann (UCSD) Combination Winter, 2017 40 / 49

Log predictive score as a function of model weight,

S&P500, 1976-2005 (Geweke-Amisano, 2011)

Timmermann (UCSD) Combination Winter, 2017 41 / 49

Weights in pools of multiple models, S&P500, 1976-2005

(Geweke-Amisano, 2011)

Timmermann (UCSD) Combination Winter, 2017 42 / 49

Optimal prediction pool – time-varying combination weights

Pettenuzzo-Timmermann (2014)

Timmermann (UCSD) Combination Winter, 2017 43 / 49

Model combination – Bayesian Model Averaging

When constructing the individual forecasts ourselves, we can base the

combined forecast on information on the individual models’ fit

Methods such as BMA (Bayesian Model Averaging) can be used

BMA weights predictive densities by the posterior probabilities (fit) of the

models, Mi

Models that fit the data better get higher weights in the combination

Timmermann (UCSD) Combination Winter, 2017 44 / 49

Bayesian Model Averaging (BMA)

pc(y) = ∑ᵢ₌₁ᵐ ωi p(y |Mi)

m models: M1, …, Mm

BMA weights predictive densities by the posteriors of the models, Mi

BMA is a model averaging procedure rather than a predictive density

combination procedure per se

BMA assumes the availability of both the data underlying each of the

densities, pi (y) = p(y |Mi ), and knowledge of how that data is employed to

obtain a predictive density

Timmermann (UCSD) Combination Winter, 2017 45 / 49

Bayesian Model Averaging (BMA)

The combined model average, given data Z, is

pc(y |Z) = ∑ᵢ₌₁ᵐ p(y |Mi, Z) p(Mi |Z)

p(Mi |Z) : posterior probability for model i, given the data Z

p(Mi |Z) = p(Z |Mi) p(Mi) / ∑ⱼ₌₁ᵐ p(Z |Mj) p(Mj)

Marginal likelihood of model i is

p(Z |Mi) = ∫ p(Z |θi, Mi) p(θi |Mi) dθi

p(θi |Mi) : prior density of model i’s parameters

p(Z |θi, Mi) : likelihood of the data given the parameters and the model

Timmermann (UCSD) Combination Winter, 2017 46 / 49

Constructing BMA estimates

Requirements:

List of models M1, …,Mm

Prior model probabilities p(M1), …, p(Mm)

Priors for the model parameters P(θ1 |M1), …,P(θm |Mm)

Computation of p(Mi |Z ) requires computation of the marginal likelihood

p(Z |Mi ) which can be time consuming

Timmermann (UCSD) Combination Winter, 2017 47 / 49

Alternative BMA schemes

Raftery, Madigan and Hoeting (1997) MC3

If the models’ marginal likelihoods are difficult to compute, one can use a

simple approximation based on BIC:

ωi = p(Mi |Z) ≈ exp(−0.5 BICi) / ∑ⱼ₌₁ᵐ exp(−0.5 BICⱼ)
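A sketch of these approximate BMA weights (function name is mine; subtracting the minimum BIC before exponentiating is a standard numerical-stability trick that leaves the weights unchanged):

```python
import numpy as np


def bic_weights(bics):
    """Approximate BMA weights: w_i proportional to exp(-0.5 * BIC_i)."""
    b = np.asarray(bics, dtype=float)
    w = np.exp(-0.5 * (b - b.min()))   # shift by min(BIC) to avoid underflow
    return w / w.sum()


w = bic_weights([100.0, 102.0, 110.0])
# The model with the lowest BIC receives the largest weight
```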

Remove models that appear not to be very good

Madigan and Raftery (1994) suggest removing models for which p(Mi |Z ) is

much smaller than the posterior probability of the best model

Timmermann (UCSD) Combination Winter, 2017 48 / 49

Conclusion

Combination of forecasts is motivated by

misspecified forecasting models due to parameter instability, omitted variables

etc.

diversification across forecasts

private information used to compute individual forecasts (surveys)

Simple, robust estimation schemes tend to work well

optimal combination weights are hard to estimate in small samples

Even if they do not always deliver the most precise forecasts, forecast

combinations generally do not deliver poor performance and so represent a

relatively safe choice

Empirically, equal-weighted survey forecasts work well for many

macroeconomic variables, but they tend to be biased and not very precise for

stock returns

Timmermann (UCSD) Combination Winter, 2017 49 / 49

Lecture 9: Forecast Evaluation

UCSD, Winter 2017

Allan Timmermann1

1UC San Diego

Timmermann (UCSD) Forecast Evaluation Winter, 2017 1 / 50

1 Forecast Evaluation: Absolute vs. relative performance

2 Properties of Optimal Forecasts – Theoretical concepts

3 Evaluation of Sign (Directional) Forecasts

4 Evaluating Interval Forecasts

5 Evaluating Density Forecasts

6 Comparing forecasts I: Tests of equal predictive accuracy

7 Comparing Forecasts II: Tests of Forecast Encompassing

Timmermann (UCSD) Forecast Evaluation Winter, 2017 2 / 50

Forecast evaluation

Given an observed series of forecasts, ft+h|t, and outcomes, yt+h,

t = 1, …, T, we want to know if the forecasts were “optimal” or poor

Forecast evaluation is closely related to how we measure forecast accuracy

Absolute performance measures the accuracy of an individual forecast

relative to the outcome, using either economic (loss-based) or statistical

measures of performance

Forecast optimality, efficiency

A forecast that isn’t obviously deficient could still be poor

Relative performance compares the performance of one or several forecasts

against some benchmark—horse race between competing forecast models

Forecast comparisons: test of equal predictive accuracy

Two forecasts could be poor, but one is less bad than the other

Forecast encompassing tests (tests for dominance)

Timmermann (UCSD) Forecast Evaluation Winter, 2017 2 / 50

Forecast evaluation (cont.)

Forecast evaluation amounts to understanding if a model’s predictive

accuracy is “good enough”

How accurately should we be able to forecast Chinese GDP growth? What’s a

reasonable R2 or RMSE?

How about forecasting stock returns? Expect low R2

How much does the forecast horizon matter to the degree of predictability?

Some variables are easier to predict than others. Why?

Unconditional forecast or random walk forecast are natural benchmarks

ARMA models are sometimes used

Timmermann (UCSD) Forecast Evaluation Winter, 2017 3 / 50

Forecast evaluation (cont.)

Informal methods – graphical plots, decompositions

Formal methods – deal with how to formally test if a forecasting model

satisfies certain “optimality criteria”

Evaluation of a forecasting model requires an estimate of its expected loss

Good forecasting models produce ‘small’ average losses, while bad models

produce ‘large’ average losses

Good performance in a given sample could be due to luck or could reflect

the performance of a genuinely good model

Power of statistical tests varies. Can we detect the difference between an

R2 = 1% versus R2 = 1.5%? Depends on the sample size

Test results depend on the loss function and the information set

Rejection of forecast optimality suggests that we can improve the forecast (at

least in theory)

Timmermann (UCSD) Forecast Evaluation Winter, 2017 4 / 50

Optimality Tests

Establish a benchmark for what constitutes an optimal or a “good” forecast

Efficient forecasts: constructed given knowledge of the true data generating

process (DGP) using all currently available information

sets the bar very high: in practice we don’t know the true DGP

Forecasts are efficient (rational) if they fully utilize all available information

and this information cannot be used to construct a better forecast

weak versus strong rationality (just like tests of market efficiency)

unbiasedness (forecast error should have zero mean)

orthogonality tests (forecast error should be unpredictable)

Timmermann (UCSD) Forecast Evaluation Winter, 2017 5 / 50

Efficient Forecast: Definition

A forecast is efficient (optimal) if no other forecast using the available data,

xt ∈ It, can be used to generate a smaller expected loss

Under MSE loss:

f̂ ∗t+h|t = argmin over f̂(xt) of E[(yt+h − f̂(xt))²]

If we can use information in It to produce a more accurate forecast, then the

original forecast is suboptimal

Efficiency is conditional on the information set

weak form forecast efficiency tests include only past forecasts and past

outcomes: It = {yt, yt−1, …, f̂t|t−1, et|t−1, …}

strong form efficiency tests extend this to include all other variables xt ∈ It

Timmermann (UCSD) Forecast Evaluation Winter, 2017 6 / 50

Optimality under MSE loss

First order condition for an optimal forecast under MSE loss:

E[∂(yt+h − ft+h|t)² / ∂ft+h|t] = −2E[yt+h − ft+h|t] = −2E[et+h|t] = 0

Similarly, conditional on information at time t, It :

E [et+h|t |It ] = 0

Expected value of the forecast error must equal zero given current

information, It

Test E [et+h|txt ] = 0 for all variables xt ∈ It known at time t

If the forecast is optimal, no variable known at time t can predict its future

forecast error et+h|t . Otherwise the forecast wouldn’t be optimal

If I can predict my forecast will be too low, I should increase my forecast

Timmermann (UCSD) Forecast Evaluation Winter, 2017 7 / 50

Optimality properties under Squared Error Loss

1 Forecasts are unbiased: the forecast error et+h|t has zero mean, both

conditionally and unconditionally:

E [et+h|t ] = E [et+h|t |It ] = 0

2 h-period forecast errors (et+h|t ) are uncorrelated with information available

at the time the forecast was computed (It ). In particular, single-period

forecast errors, et+1|t , are serially uncorrelated:

E[et+1|t et|t−1] = 0

3 The variance of the forecast error (et+h|t ) increases (weakly) in the forecast

horizon, h :

Var(et+h+1|t ) ≥ Var(et+h|t ), for all h ≥ 1

On average it’s harder to predict distant outcomes than outcomes in the near

future

Timmermann (UCSD) Forecast Evaluation Winter, 2017 8 / 50

Optimality properties under Squared Error Loss (cont.)

Optimal forecasts are unbiased. Why? If they were biased, we could improve

the forecast simply by correcting for the bias

Suppose ft+1|t is biased:

yt+1 = 1 + ft+1|t + εt+1, εt+1 ∼ WN(0, σ²)

Bias-corrected forecast:

f ∗t+1|t = 1 + ft+1|t

is more accurate than ft+1|t

Forecast errors from an optimal model should be unpredictable:

Suppose et+1 = 0.5et, so the one-step forecast error is serially correlated

Adding back 0.5et to the original forecast yields a more accurate forecast:

f ∗t+1|t = ft+1|t + 0.5et is better than ft+1|t

Variance of (optimal) forecast error increases in the forecast horizon

We learn more information as we get closer to the forecast “target” and

increase our information set

Timmermann (UCSD) Forecast Evaluation Winter, 2017 9 / 50

Illustration for MA(2) process

Yt = εt + θ1εt−1 + θ2εt−2, εt ∼ WN(0, σ²)

T+1 : yT+1 = εT+1 + θ1εT + θ2εT−1, fT+1|T = θ1εT + θ2εT−1, eT+1|T = εT+1

T+2 : yT+2 = εT+2 + θ1εT+1 + θ2εT, fT+2|T = θ2εT, eT+2|T = εT+2 + θ1εT+1

T+3 : yT+3 = εT+3 + θ1εT+2 + θ2εT+1, fT+3|T = 0, eT+3|T = εT+3 + θ1εT+2 + θ2εT+1

From these results we see that

E[eT+h|T] = 0 for h = 1, 2, 3

Var(eT+3|T) ≥ Var(eT+2|T) ≥ Var(eT+1|T)
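A Monte Carlo sketch of this last property (parameter values and names are illustrative): the error variances implied by the table are σ², σ²(1 + θ1²), and σ²(1 + θ1² + θ2²), and the three-step error is the series itself, so its variance equals Var(y).

```python
import numpy as np

rng = np.random.default_rng(3)
theta1, theta2, sigma = 0.8, 0.4, 1.0
T = 200_000
eps = rng.normal(scale=sigma, size=T + 2)
y = eps[2:] + theta1 * eps[1:-1] + theta2 * eps[:-2]   # simulated MA(2) sample

# h-step forecast-error variances implied by the table
var_e1 = sigma**2                                  # e_{T+1|T} = eps_{T+1}
var_e2 = sigma**2 * (1 + theta1**2)                # e_{T+2|T} = eps_{T+2} + theta1*eps_{T+1}
var_e3 = sigma**2 * (1 + theta1**2 + theta2**2)    # e_{T+3|T} = y_{T+3} itself
```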

Timmermann (UCSD) Forecast Evaluation Winter, 2017 10 / 50

Regression tests of optimality under MSE loss

Efficiency regressions test if any variable xt known at time t can predict the

future forecast error et+1|t:

et+1|t = β′xt + εt+1, εt+1 ∼ WN(0, σ²)

H0 : β = 0 vs H1 : β ≠ 0

Unbiasedness tests set xt = 1:

et+1|t = β0 + εt+1

Mincer-Zarnowitz regression uses yt+1 on the LHS and sets xt = (1, f̂t+1|t):

yt+1 = β0 + β1 f̂t+1|t + εt+1

H0 : β0 = 0, β1 = 1

Zero intercept, unit slope; use an F test
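A sketch of the Mincer-Zarnowitz regression by OLS (names and the simulated data are illustrative; the forecast here is deliberately biased, with intercept 0.2 and slope 0.9, so the estimates should deviate from (0, 1)):

```python
import numpy as np


def mincer_zarnowitz(y, f):
    """OLS of the outcome on a constant and the forecast: y = b0 + b1*f + e.

    Under forecast optimality (MSE loss), b0 = 0 and b1 = 1.
    """
    X = np.column_stack([np.ones_like(f), f])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta  # [b0, b1]


rng = np.random.default_rng(4)
f = rng.normal(size=20_000)
y = 0.2 + 0.9 * f + rng.normal(scale=0.5, size=20_000)   # biased forecast
b0, b1 = mincer_zarnowitz(y, f)
```

In practice the joint hypothesis b0 = 0, b1 = 1 would be assessed with an F test, as the slide notes.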

Timmermann (UCSD) Forecast Evaluation Winter, 2017 11 / 50

Regression tests of optimality: Example

Suppose that f̂t+1|t is biased:

yt+1 = 0.2 + 0.9 f̂t+1|t + εt+1, εt+1 ∼ WN(0, σ²)

Q: How can we easily produce a better forecast?

Answer:

f̂ ∗t+1|t = 0.2+ 0.9f̂t+1|t

will be an unbiased forecast

What if

yt+1 = 0.3 + f̂t+1|t + εt+1,

εt+1 = ut+1 + θ1ut, ut ∼ WN(0, σ²)

Can we improve on this forecast?

Timmermann (UCSD) Forecast Evaluation Winter, 2017 12 / 50

A question of power

In small samples with little predictability, forecast optimality tests may not

have much power (ability to detect deviations from forecast optimality)

Rare to find individual forecasters with a long track record

Predictive ability changes over time

Need a long out-of-sample data set (evaluation sample) to be able to tell

with statistical confidence if a forecast is suboptimal

Timmermann (UCSD) Forecast Evaluation Winter, 2017 13 / 50

Testing non-decreasing variance of forecast errors

Suppose we have forecasts recorded for three different horizons, h = S ,M, L,

with S < M < L (short, medium, long)
µe = [E[e²t+S|t], E[e²t+M|t], E[e²t+L|t]]′ : MSE values

MSE differentials (Long−Medium, Medium−Short):

∆eL−M ≡ E[e²t+L|t] − E[e²t+M|t]

∆eM−S ≡ E[e²t+M|t] − E[e²t+S|t]

We can test if the expected value of the squared forecast errors is weakly

increasing in the forecast horizon:

∆eL−M ≥ 0, ∆eM−S ≥ 0

The distant future is harder to predict than the near future
Distant future is harder to predict than the near future
Timmermann (UCSD) Forecast Evaluation Winter, 2017 14 / 50
Evaluating the rationality of the “Greenbook” forecasts
Patton and Timmermann (2012) study the Fed’s “Greenbook” forecasts of
GDP growth, GDP deflator and CPI inflation
Data are quarterly, over the period 1982Q1 to 2000Q4, approx. 80
observations
Greenbook forecasts and actuals constructed from real-time Federal Reserve
publications. These are aligned in “event time”
6 forecast horizons: h = 0, 1, 2, 3, 4, 5
Increasing MSE and decreasing mean squared forecast (MSF)
Greenbook forecasts of GDP growth, 1982Q1-2000Q4
[Figure: “Forecasts and forecast errors, GDP growth”: MSE, V[forecast], and V[actual] versus forecast horizon]
Analysts’ EPS forecasts at different horizons: biases

[Figure: bias in forecast errors (AAPL), in percentage points, versus forecast horizon in months]
Analysts’ EPS forecasts at different horizons: RMSE

[Figure: RMSE (AAPL) versus forecast horizon in months]
Optimality tests that do not rely on the outcome
Many macroeconomic data series are revised: preliminary, first release, latest
data vintage could be used to measure the outcome
volatility is also unobserved
Under MSE loss, forecast revisions should be unpredictable
If I could predict today that my future forecast of the same event will be
different in a particular direction (higher or lower), then I should incorporate
this information into my current forecast
Let ∆ft+h = ft+h|t+1 − ft+h|t be the forecast revision. Then
E [∆ft+h |It ] = 0
Forecast revisions are a martingale difference process (zero mean)
This can be tested through a simple regression that doesn’t use the outcome:
∆ft+h = α+ δxt + εt+h
Timmermann (UCSD) Forecast Evaluation Winter, 2017 19 / 50
Forecast evaluation for directional forecasting
Suppose we are interested in evaluating the forecast (f ) of the sign of a
variable, y . There are four possible outcomes:
forecast/outcome sign(y) > 0 sign(y) ≤ 0

sign(f ) > 0 true positive false positive

sign(f ) ≤ 0 false negative true negative

If stock returns (y) are positive 60% of the time and we always predict a

positive return, we have a “hit rate” of 60%. Is this good?

We need a test statistic that doesn’t reward “broken clock” forecasts (always

predict the same sign) with no informational content or value

Timmermann (UCSD) Forecast Evaluation Winter, 2017 20 / 50

Who is the better forecaster?

Timmermann (UCSD) Forecast Evaluation Winter, 2017 21 / 50

Information in forecasts

Both forecasters have a ‘hit rate’ of 80%, with 8 out of 10 correct predictions

(sum elements on the diagonal)

There is no information in the first forecast (the forecaster always says

“increase”)

There is some information in the second forecast: both increases and

decreases are successfully predicted

Not enough to only look at the overall hit rate

Timmermann (UCSD) Forecast Evaluation Winter, 2017 22 / 50

Forecast evaluation for sign forecasting

Suppose we are interested in predicting the sign of yt+h using the sign

(direction) of a forecast ft+h|t

P : probability of a correctly predicted sign (positive or negative)

Py : probability of a positive sign of y

Pf : probability of a positive sign of the forecast, f

Define the sign indicator

I(zt) = 1 if zt ≥ 0, 0 if zt < 0

Sample estimates of sign probabilities with T observations:

P̂ = (1/T) ∑ₜ₌₁ᵀ I(yt+h ft+h|t)

P̂y = (1/T) ∑ₜ₌₁ᵀ I(yt+h)

P̂f = (1/T) ∑ₜ₌₁ᵀ I(ft+h|t)

Timmermann (UCSD) Forecast Evaluation Winter, 2017 23 / 50

Sign test

In large samples we can test for sign predictability using the

Pesaran-Timmermann sign statistic

ST = (P̂ − P̂∗) / √(v̂ar(P̂) − v̂ar(P̂∗)) ∼ N(0, 1), where

P̂∗ = P̂y P̂f + (1 − P̂y)(1 − P̂f)

v̂ar(P̂) = T⁻¹ P̂∗(1 − P̂∗)

v̂ar(P̂∗) = T⁻¹(2P̂y − 1)² P̂f(1 − P̂f) + T⁻¹(2P̂f − 1)² P̂y(1 − P̂y) + 4T⁻² P̂y P̂f(1 − P̂y)(1 − P̂f)
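A sketch of the statistic (names are mine; signs are classified with I(z ≥ 0) as defined above). An informative forecast should produce a large positive statistic, while a pure-noise forecast should not:

```python
import numpy as np


def pesaran_timmermann(y, f):
    """Pesaran-Timmermann test of sign predictability; asymptotically N(0, 1)."""
    y, f = np.asarray(y), np.asarray(f)
    T = len(y)
    P = np.mean((y >= 0) == (f >= 0))      # fraction of correctly predicted signs
    Py, Pf = np.mean(y >= 0), np.mean(f >= 0)
    Pstar = Py * Pf + (1 - Py) * (1 - Pf)  # expected hit rate under independence
    var_P = Pstar * (1 - Pstar) / T
    var_Pstar = ((2 * Py - 1) ** 2 * Pf * (1 - Pf) / T
                 + (2 * Pf - 1) ** 2 * Py * (1 - Py) / T
                 + 4 * Py * Pf * (1 - Py) * (1 - Pf) / T ** 2)
    return (P - Pstar) / np.sqrt(var_P - var_Pstar)


rng = np.random.default_rng(5)
y = rng.normal(size=2000)
st_skill = pesaran_timmermann(y, y + rng.normal(scale=0.5, size=2000))  # informative
st_noise = pesaran_timmermann(y, rng.normal(size=2000))                 # pure noise
```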

This test is very simple to compute and has been used in studies of market

timing (financial returns) and studies of business cycle forecasting

Timmermann (UCSD) Forecast Evaluation Winter, 2017 24 / 50

Forecast evaluation for event forecasting

Leitch and Tanner (American Economic Review, 1990)

Timmermann (UCSD) Forecast Evaluation Winter, 2017 25 / 50

Forecasts of binary variables: Liu and Moench (2014)

Liu and Moench (2014): What predicts U.S. Recessions? Federal Reserve

Bank of New York working paper

St ∈ {0, 1} : true state of the economy (recession indicator)

St = 1 in recession

St = 0 in expansion

Forecast the probability of a recession using a probit model

Pr(St+1 = 1|Xt ) = Φ(β0 + β1Xt ) ≡ Pt+1|t

The log-likelihood function for β = (β0, β1)′ is

ln l(β) = ∑ₜ₌₀ᵀ⁻¹ [St+1 ln(Φ(β0 + β1Xt)) + (1 − St+1) ln(1 − Φ(β0 + β1Xt))]

Timmermann (UCSD) Forecast Evaluation Winter, 2017 26 / 50

Evaluating recession forecasts: Liu-Moench

Pt+1|t ∈ [0, 1] : prediction of St+1 given information known at time t, Xt

Blue: Xt = {term spread}

Green: Xt = {term spread, lagged term spread}

Red: Xt = {term spread, lagged term spread, additional predictor}

Timmermann (UCSD) Forecast Evaluation Winter, 2017 27 / 50

Evaluating binary recession forecasts

Split [0, 1] using a grid of evenly spaced thresholds

ci ∈ {0, 0.01, 0.02, …, 0.98, 0.99, 1}

For each threshold, ci, compute the prediction model’s classification:

Ŝt+1|t(ci) = 1 if Pt+1|t ≥ ci, 0 if Pt+1|t < ci
True positive (TP) and false positive (FP) indicators:
I tpt+1(ci ) =
{
1 if St+1 = 1 and Ŝt+1|t (ci ) = 1
0 otherwise
I fpt+1(ci ) =
{
1 if St+1 = 0 and Ŝt+1|t (ci ) = 1
0 otherwise
Timmermann (UCSD) Forecast Evaluation Winter, 2017 28 / 50
Estimating the true positive and false positive rates

Using the true St+1 and the classifications, Ŝt+1|t, calculate the percentage of true positives, PTP, and the percentage of false positives, PFP

PTP(ci) = (1/n1) ∑_{t=1}^{T} I^tp_t(ci)

PFP(ci) = (1/n0) ∑_{t=1}^{T} I^fp_t(ci)

n1 : number of times St = 1 (recessions)

n0 : number of times St = 0 (expansions)

n0 + n1 = n : sample size
Creating the ROC curve

Each ci produces a pair of values (PFP(ci), PTP(ci))

Plot (PFP(ci), PTP(ci)) across all thresholds ci, with PFP on the x-axis and PTP on the y-axis

Connecting these points gives the Receiver Operating Characteristic (ROC) curve

The ROC curve

plots all possible combinations of PTP(ci) and PFP(ci) for ci ∈ [0, 1]

is an increasing function in [0, 1]

as c → 0, TP(c) = FP(c) = 1

as c → 1, TP(c) = FP(c) = 0

The Area Under the ROC (AUROC) curve measures the accuracy of the classification

Perfect forecast: ROC curve lies in the top left corner

Random guess: ROC curve follows the 45 degree diagonal line
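The threshold sweep above can be sketched as follows (function names are illustrative):

```python
import numpy as np

def roc_points(S, P, thresholds=None):
    """PTP and PFP across thresholds for binary outcomes S and forecast probabilities P."""
    S = np.asarray(S)
    P = np.asarray(P)
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 101)   # c_i = 0, 0.01, ..., 1
    n1 = (S == 1).sum()
    n0 = (S == 0).sum()
    ptp = np.array([((P >= c) & (S == 1)).sum() / n1 for c in thresholds])
    pfp = np.array([((P >= c) & (S == 0)).sum() / n0 for c in thresholds])
    return pfp, ptp

def auroc(S, P):
    """Nonparametric AUROC: chance a recession-period score exceeds an expansion-period score."""
    P = np.asarray(P)
    S = np.asarray(S)
    yr = P[S == 1]
    ye = P[S == 0]
    greater = (yr[:, None] > ye[None, :]).mean()
    ties = (yr[:, None] == ye[None, :]).mean()
    return greater + 0.5 * ties
```

A perfect classifier yields AUROC = 1; a random guess yields 0.5, matching the 45 degree line.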
Evaluating recession forecasts: Liu-Moench
Estimation and inference on AUROC

Y^R_t : observations of Yt classified as recessions (St = 1)

Y^E_t : observations of Yt classified as expansions (St = 0)

Nonparametric estimate of AUROC:

ÂUROC = (1/(n1 n0)) ∑_{i=1}^{n1} ∑_{j=1}^{n0} [ I(Y^R_i > Y^E_j) + (1/2) I(Y^R_i = Y^E_j) ]

Asymptotic variance of ÂUROC:

σ² = (1/(n1 n0)) [ AUROC(1 − AUROC) + (n1 − 1)(Q1 − AUROC²) + (n0 − 1)(Q2 − AUROC²) ]

Q1 = AUROC / (2 − AUROC) ; Q2 = 2 AUROC² / (1 + AUROC)
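A minimal sketch of the estimator and its standard error above (the function name is illustrative):

```python
import numpy as np

def auroc_with_se(yR, yE):
    """Nonparametric AUROC with the asymptotic standard error above.

    yR: scores in recession periods (S = 1); yE: scores in expansion periods (S = 0).
    """
    yR = np.asarray(yR, dtype=float)
    yE = np.asarray(yE, dtype=float)
    n1, n0 = len(yR), len(yE)
    # pairwise comparisons, counting ties as 1/2
    cmp = (yR[:, None] > yE[None, :]) + 0.5 * (yR[:, None] == yE[None, :])
    A = cmp.mean()
    Q1 = A / (2 - A)
    Q2 = 2 * A ** 2 / (1 + A)
    var = (A * (1 - A) + (n1 - 1) * (Q1 - A ** 2) + (n0 - 1) * (Q2 - A ** 2)) / (n1 * n0)
    return A, np.sqrt(var)
```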

Comparing AUROC for two forecasts

Suppose we have two sets of AUROC estimates, ÂUROC1 and ÂUROC2, with variances σ̂1² and σ̂2²

We also need an estimate, r, of the correlation between AUROC1 and AUROC2

We can test whether the two AUROCs are the same using a t-statistic:

t = (ÂUROC1 − ÂUROC2) / √( σ̂1² + σ̂2² − 2 r σ̂1 σ̂2 )

Evaluating recession forecasts: Liu-Moench (table 3)


Evaluating Interval forecasts

Interval forecasts predict that the future outcome yt+1 should lie in some interval

[p^L_{t+1|t}(α), p^U_{t+1|t}(α)]

α ∈ (0, 1) : probability that the outcome falls inside the interval forecast (coverage)

p^L_{t+1|t}(α) : lower bound of the interval forecast

p^U_{t+1|t}(α) : upper bound of the interval forecast

Unconditional test of correctly specified interval forecast

Define the indicator variable

1_{yt+1} = 1{ yt+1 ∈ [p^L_{t+1|t}(α), p^U_{t+1|t}(α)] } = { 1 if the outcome falls inside the interval; 0 if the outcome falls outside the interval }

Test for correct unconditional (“average”) coverage:

E[1_{yt+1}] = α

Use this to evaluate fan charts, which show interval forecasts for different values of α

Test correct coverage for α = 25%, 50%, 75%, 90%, 95%, etc.

Test of correct conditional coverage

Test for correct conditional coverage: for all Xt

E[1_{yt+1} | Xt] = α

Test this implication by estimating, say, a probit model

Pr(1_{yt+1} = 1) = Φ(β0 + β1xt)

Under the null of a correct interval forecast model, H0 : β1 = 0
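The unconditional coverage test reduces to a test that the hit rate equals α; a minimal sketch (function name illustrative, treating the hits as serially independent):

```python
import numpy as np
from scipy import stats

def unconditional_coverage_test(hits, alpha):
    """z-test of E[1_y] = alpha, where hits[t] = 1 if y_{t+1} fell inside the interval."""
    hits = np.asarray(hits, dtype=float)
    T = len(hits)
    phat = hits.mean()
    z = (phat - alpha) / np.sqrt(alpha * (1 - alpha) / T)
    pval = 2 * (1 - stats.norm.cdf(abs(z)))
    return z, pval
```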

Evaluating Density forecasts: Probability integral transform

The probability integral transform (PIT), Ut+1, is the model-specified conditional CDF, PY, evaluated at the realized outcome:

Ut+1 = PY(yt+1 | x1, x2, …, xt) = ∫_{−∞}^{yt+1} pY(y | x1, x2, …, xt) dy

yt+1 : realized value of the outcome (observed value of y)

x1, x2, …, xt : predictors (data)

pY(y | x1, x2, …, xt) : conditional density of y

PIT value: how likely is it to observe a value equal to or smaller than the actual outcome (yt+1), given the density forecast pY?

we don’t want this to be very small or very large most of the time

Probability integral transform: Example

Suppose that our prediction model is a GARCH(1,1)

yt+1 = β0 + β1xt + εt+1, εt+1 ∼ N(0, σ²_{t+1|t})

σ²_{t+1|t} = α0 + α1 ε²_t + β1 σ²_{t|t−1}

Then we have

pY(y | x1, x2, …, xt) = N(β0 + β1xt, σ²_{t+1|t})

and so

ut+1 = ∫_{−∞}^{yt+1} pY(y | x1, x2, …, xt) dy = Φ( (yt+1 − (β0 + β1xt)) / σ_{t+1|t} )

This is the standard cumulative normal function, Φ, evaluated at zt+1 = [yt+1 − (β0 + β1xt)]/σ_{t+1|t}

Understanding the PIT

By construction, PIT values lie between zero and one

If the density forecasting model pY(y | x1, x2, …, xt) is correctly specified, U will be uniformly distributed on [0, 1]

The sequence of PIT values û1, û2, …, ûT should be independently and identically distributed Uniform(0, 1)—they should not be serially correlated

If we apply the inverse Gaussian CDF, Φ⁻¹, to the ût values to get ẑt = Φ⁻¹(ût), we get a sequence of i.i.d. N(0, 1) variables

We can therefore test that the density forecasting model is correctly specified through simple regression tests:

ẑt = µ + εt, H0 : µ = 0

ẑt = µ + ρẑt−1 + εt, H0 : µ = ρ = 0

ẑ²t = µ + ρẑ²_{t−1} + εt, H0 : µ = 1, ρ = 0
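For Gaussian density forecasts, the PIT and its normal transform can be computed as below (a sketch; the function name is illustrative):

```python
import numpy as np
from scipy import stats

def pit_z_values(y, mu, sigma):
    """PIT values for Gaussian density forecasts N(mu_t, sigma_t^2), plus z_t = Phi^{-1}(u_t)."""
    u = stats.norm.cdf(y, loc=mu, scale=sigma)   # should be iid U(0,1) if the model is right
    z = stats.norm.ppf(u)                        # should be iid N(0,1)
    return u, z
```

The ẑt series can then be fed into the three regression tests listed above.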

Working in the PIT: graphical test


Tests of Equal Predictive Accuracy: Diebold-Mariano Test

Test if two forecasts generate the same average (MSE) loss

E[e²_{1t+1}] = E[e²_{2t+1}]

Diebold and Mariano (1995) propose a simple and elegant method that accounts for sampling uncertainty in average losses

Setup: two forecasts with associated losses MSE1, MSE2

MSE1 ∼ N(µ1, Ω11), MSE2 ∼ N(µ2, Ω22), Cov(MSE1, MSE2) = Ω12

The loss differential in period t + h (dt+h) is

dt+h = e²_{1t+h|t} − e²_{2t+h|t}

Tests of Equal Loss – Diebold Mariano Test

Suppose we observe samples of forecasts, forecast errors and forecast

differentials dt+h

These form the basis of a test of the null of equal predictive accuracy:

H0 : E [dt+h ] = 0

To test H0, regress the time series dt+h on a constant and conduct a t−test

on the constant, µ:

dt+h = µ+ εt+h

µ > 0 suggests that MSE1 > MSE2, so forecast 2 produces the smaller squared forecast errors and is best

µ < 0 suggests that MSE1 < MSE2, so forecast 1 produces the smaller squared forecast errors and is best

Comparing forecast methods - Giacomini and White (2006)

Consider two sets of forecasts, f̂1t+h|t(ω1) and f̂2t+h|t(ω2), each computed using a rolling estimation window of length ωi

Each forecast is a function of the data and parameter estimates

Different choices for the window length, ωi, used in the rolling regressions alter what is being tested

If we change the window length, ωi, we also change the forecasts being compared, {f̂1t+h|t, f̂2t+h|t}

Big models are more affected by estimation error

The idea is to compare forecasting methods—not just forecasting models

Conditional Tests - Giacomini and White (2006)

We can also test if one method performs better than another in different environments - regress the forecast differential on current information xt ∈ It:

dt+h = µ + βxt + εt+h

Analysts may be better at forecasting stock returns than econometric models in recessions or, say, in environments with low interest rates

Switch between forecasts?
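The unconditional Diebold-Mariano regression can be sketched as follows (a sketch for h = 1 with serially uncorrelated loss differentials; for h > 1 a HAC variance such as Newey-West should replace the simple standard error, and the function name is illustrative):

```python
import numpy as np
from scipy import stats

def dm_test(e1, e2):
    """Diebold-Mariano test of equal MSE: t-test on the mean of d_t = e1_t^2 - e2_t^2."""
    d = np.asarray(e1, dtype=float) ** 2 - np.asarray(e2, dtype=float) ** 2
    T = len(d)
    mu = d.mean()
    se = d.std(ddof=1) / np.sqrt(T)    # valid for h = 1; use a HAC variance for h > 1
    t = mu / se
    p = 2 * (1 - stats.norm.cdf(abs(t)))
    return t, p
```

A significantly positive t favors forecast 2; a significantly negative t favors forecast 1.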
Choose the forecast with the smallest conditional expected squared forecast error

Tests of Forecast Encompassing

Encompassing: one model contains all the information (relevant for forecasting) of another model plus some additional information

f1 encompasses f2 provided that for all values of ω

MSE(f1) ≤ MSE(ωf1 + (1 − ω)f2)

Equality only holds for ω = 1

One forecast (f1) encompasses another (f2) when the information in the second forecast does not help improve on the forecasting performance of the first forecast

Encompassing Tests

Under MSE loss, use OLS to regress the outcome on the two forecasts to test for forecast encompassing:

yt+1 = β1 f̂1t+1|t + β2 f̂2t+1|t + εt+1

Forecast 1 (f̂1t+1|t) encompasses (dominates) forecast 2 (f̂2t+1|t) if β1 = 1 and β2 = 0. If this holds, only use forecast 1

Equivalently, if β2 = 0 in the following regression, forecast 1 encompasses forecast 2:

ê1t+1 = β2 f̂2t+1|t + ε1t+1

Forecast 2 doesn’t explain model 1’s forecast error

If β1 = 0 in the following regression, forecast 2 encompasses forecast 1:

ê2t+1 = β1 f̂1t+1|t + ε2t+1

Forecast encompassing vs tests of equal predictive accuracy

Suppose we cannot reject a test of equal predictive accuracy for two forecasts:

E[e²_{1t+h|t}] = E[e²_{2t+h|t}]

Then it is optimal to use equal weights in a combined forecast, rather than use only one or the other forecast

Forecast encompassing tests if it is optimal to assign a weight of unity to one forecast and a weight of zero to the other

completely ignore one forecast

Tests of equal predictive accuracy and tests of forecast encompassing examine very different hypotheses about how useful the forecasts are

Comparing Two Forecast Methods

Three possible outcomes of model comparison:

One forecast method completely dominates another method

Encompassing; choose the dominant forecasting method

One forecast is best, but does not contain all useful information from the second model

Combine forecasts using non-equal weights

The forecasts have the same expected loss (MSE)

Combine forecasts using equal weights

Forecast evaluation: Conclusions

Forecast evaluation is very important - a health check of forecast models

A variety of diagnostic tests is available for

Optimality tests: point, interval and density forecasts

Sign/direction forecasts

Forecast comparisons

Lecture 10: Model Instability

UCSD, Winter 2017

Allan Timmermann

UC San Diego

Timmermann (UCSD) Breaks Winter, 2017 1 / 42

1 Forecasting under Model Instability: General Issues

2 How costly is it to Ignore Model Instability?

3 Limitations of Tests for Model Instability

Frequent, small versus rare and large changes

Using pre-break data: Trade-offs

4 Ad-hoc Methods for Dealing with Breaks

Intelligent ways of using a rolling window

5 Modeling the Instability Process

Time-varying parameters

Regime switching

Change point models

6 Real-time monitoring of forecasting performance

7 Conclusions and Practical Lessons

Model instability is everywhere...

Model instability affects a majority of macroeconomic and financial variables (Stock and Watson, 1996)

Great Moderation: Sharp drop in the volatility of macroeconomic variables around 1984

Zero lower bound for US interest rates

Stock and Watson (2003) conclude “... forecasts based on individual indicators are unstable. Finding an indicator that predicts well in one period is no guarantee that it will predict well in later periods.
It appears that instability of predictive relations based on asset prices (like many other candidate leading indicators) is the norm.”

Strong evidence of instability for prediction models fitted to stock market returns: Ang and Bekaert (2007), Paye and Timmermann (2006) and Rapach and Strauss (2006)

Maclean and Pontiff (2012)

"We investigate the out-of-sample and post-publication return predictability of 82 characteristics that have been shown to predict cross-sectional returns by academic publications in peer-reviewed journals... We estimate post-publication decay to be about 35%, and we can reject the hypothesis that there is no decay, and we can reject the hypothesis that the cross-sectional predictive ability disappears entirely. This finding is better explained by a discrete change in predictive ability, rather than a declining time-trend in predictive ability." (p. 24)

Sources of model instability

Model parameters may change over time due to

shifting market conditions (QE)

changing regulations and government policies (Dodd-Frank)

new technologies (fracking; iPhone)

mergers and acquisitions; spinoffs

shifts in behavior (market saturation; self-destruction of predictable patterns under market efficiency)

Strategies for dealing with model instability

Ignore it altogether

Test for large, discrete breaks and use only data after the most recent break to estimate the forecasting model

Ad-hoc approaches that discount past data

rolling window estimation

exponential discounting of past observations (risk-metrics)

adaptive approaches

Model the break process itself

If multiple breaks occurred in the past, we may want to model the possibility of future breaks, particularly for long forecast horizons

Forecast combination

Ignoring model instability

How costly is it to ignore breaks?

"All models are wrong but some are useful" (George Box)

Full-sample estimated parameters of forecasting models are an average of time-varying coefficients

these may or may not be useful for forecasting

fail to detect valuable predictors

wrongly include irrelevant predictors

Model instability can show up as a disparity between a forecasting model’s in-sample and out-of-sample performance, or in differences in the model’s forecasting performance across different historical subsamples

How do breaks affect forecasting performance?

Forecasting model

yt+1 = β′t xt + εt+1, t = 1, ...,T

yt+1 : outcome we are interested in predicting

βt : (time-varying) parameters of the data generating process

β̂t : parameter estimates

xt : predictors known at time t

T : present time (time where we generate our forecast of yT+1)

ŷT+1 = β̂′T xT : forecast

Failing to find genuine predictability

[Figure: time path of βt for yt+1 = βt xt + εt+1, with pre-break value β_pre-break, post-break value β_post-break, break date T_break, and full-sample estimate β̂T]

Wrongly identifying predictability

[Figure: time path of βt for yt+1 = βt xt + εt+1, with pre-break value β_pre-break, post-break value β_post-break, break date T_break, and full-sample estimate β̂T]

Breaks to forecast models

What happens when the parameters of the forecasting model change over time, so the full-sample estimates of the forecasting model provide poor guidance for the forecasting model at the end of the sample (T), where the forecast gets computed?
βt : regression parameters of the forecasting model

the ‘t’ subscript indicates that the parameters vary over time

Assuming that parameter changes are not random, we would prefer to construct forecasts using the parameters, βT, at the point of the forecast (T)

Instead, the estimator based on full-sample information, β̂T, will typically converge not to βT but to the average value of βt computed over the sample t = 1, ...,T

Take-aways: Careful with full-sample tests

A forecasting model that uses a good predictor might generate poor out-of-sample forecasts because the parameter estimates are an average of time-varying coefficients

Valuable predictor variables may appear to be uninformative because the full-sample parameter estimate β̂T is close to zero

Conversely, full-sample tests might indicate that certain predictors are useful even though, at the time of the forecast, βT is close to zero

Under model instability, all bets could be off!

Testing for parameter instability

There are many different ways to model and test for parameter instability

Tests for a single discrete break

Tests for multiple discrete breaks

Tests for random walk breaks (break every period)

Testing for breaks: Known break date

If the date of the possible break in the coefficients, tB, is known, the null of no break can be tested using a dummy variable interaction regression

Let Dt(tB) be a binary variable that equals zero before the break date, tB, and one afterwards:

Dt(tB) = { 0 if t < tB; 1 if t ≥ tB }

We can test for a break in the intercept:

yt+1 = β0 + β1Dt(tB) + β2xt + εt+1, H0 : β1 = 0 (no break)

Or we can test for a break in the slope:

yt+1 = β0 + β1xt + β2Dt(tB)xt + εt+1, H0 : β2 = 0 (no break)

The t-test for β1 or β2 is called a Chow test

Models with a single break I

What happens if the time and size of the break are unknown?

We can try to estimate the date and magnitude of the break

For each date in the sample we can compute the sum of squared residuals (SSR) associated with that choice of break date. Then choose the value t̂B that minimizes the SSR

SSR(tB) = ∑_{t=1}^{T−1} (yt+1 − βxt − d xt 1(t ≤ tB))²

We “trim” the data by searching only for breaks between lower and upper limits that exclude 10-15% of the data at both ends, to have some minimal amount of data for parameter estimation and evaluation

In a sample with T = 100 observations, test for a break at t = 16, ...., 85

Models with a single break II

Potentially we can compute a better forecast based on the post-break parameter values. For the linear regression

yt+1 = { (β + d)xt + εt+1, t ≤ t̂B; βxt + εt+1, t > t̂B }

we could use the forecast fT+1 = β̂xT where β̂ is estimated from a regression

that replaces the unknown break date with the estimated break date t̂B

In practice, estimates of the break date are often inaccurate

Should we always exclude pre-break data points?

Not if the break is small or the post-break data sample is short

Bias-variance trade-off

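The SSR search over candidate break dates described on the "Models with a single break" slides can be sketched as follows (function name, trimming fraction and simulation settings are illustrative):

```python
import numpy as np

def estimate_break_date(y, x, trim=0.15):
    """Estimate a single break date in the slope of y_t = beta*x_t + d*x_t*1(t <= tB) + e_t
    by minimizing the full-sample SSR over trimmed candidate break dates."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    T = len(y)
    lo, hi = int(trim * T), int((1 - trim) * T)
    best_tB, best_ssr = None, np.inf
    for tB in range(lo, hi):
        pre = np.where(np.arange(T) <= tB, x, 0.0)   # x_t * 1(t <= tB)
        X = np.column_stack([x, pre])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        ssr = ((y - X @ beta) ** 2).sum()
        if ssr < best_ssr:
            best_tB, best_ssr = tB, ssr
    return best_tB, best_ssr
```

With a large break the minimizer is sharply identified; with small breaks the estimated date can be very inaccurate, as the slides warn.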

Many small breaks versus few large ones

[Figure: time path of βt for yt+1 = βt xt + εt+1, with pre-break and post-break parameter values, break date T_break, and full-sample estimate β̂T]

[Figure: power of the qLL, Nyblom, SupF and AP break tests as a function of the break size δ, in four designs: a single break at 50% of the sample; two breaks at 40% and 60%; random breaks with probability 10%; and random breaks with probability 1]

Practical lessons from break point testing

Tests designed for one type of breaks (frequent, small ones) typically also

have the ability to detect other types of breaks (rare, large ones)

Rejections of a test for a particular break process do not imply that the break

process tested for is “correct”

Rather, it could be one of many processes

Imagine a medical test that can tell if the patient is sick or not but cannot

tell if the patient suffers from diabetes or coronary disease


Estimating the break date

We can attempt to estimate the date and magnitude of any breaks

In practice, estimates of the break date are often inaccurate

Costs of mistakes in estimation of the break date:

Too late: If the estimated break date falls after the true date, the resulting

parameter estimates are inefficient

Too early: If the estimated break date occurs prior to the true date, the

estimates will use pre-break data and hence be biased

If you know the time of a possible break, use this information

Introduction of Euro

Change in legislation


Should we only use post-break data?

The expected performance of the forecasting model can sometimes be

improved by including pre-break observations to estimate the parameters of

the forecasting model

Adding pre-break observations introduces a bias in the forecast but can also

reduce the variance of the estimator

If the size of the break is small (small bias) and the break occurs late in the

sample so the additional observations reduce the variance of the estimates by

a large amount, the performance of the forecasting model can be improved by

incorporating pre-break data points


Consequences of a small break

[Figure: time path of βt for yt+1 = βt xt + εt+1, with a small gap between the pre-break and post-break parameter values, break date T_break, and full-sample estimate β̂T]

Including pre-break data (Pesaran and Timmermann, 2007)

The optimal pre-break window that minimizes the mean squared (forecast)

error (MSE) is longer, the

smaller the R2 of the prediction model (noisy returns data)

smaller the size of the break (small bias)

shorter the post-break window (post-break-only estimates are very noisy)


Stopping rule for determining estimation window

First, estimate the time at which the most recent break occurred, T̂b

If no break is detected, use all the data

If a break is detected, estimate the MSE using only data after the break date,

t = T̂b + 1, …,T

Next, compute the MSE by including an additional observation, t = T̂b , …,T

If the new estimate reduces the MSE, continue by adding an additional data

point (T̂b − 1) to the sample and again compute the MSE

Repeat until the data suggest that including additional pre-break data no

longer reduces the MSE


Downweighting past observations

In situations where the form of the break process is unknown, we might use adaptive methods that do not depend directly on the nature of the breaks

A simple weighted least squares scheme puts greater weight on recent observations than on past observations by choosing parameters

β̂T = [ ∑_{t=0}^{T−1} ωt xt x′t ]⁻¹ [ ∑_{t=0}^{T−1} ωt xt yt ]

The forecast is β̂′T xT

Expanding window regressions set ωt = 1 for all t

Rolling regressions set ωt = I{T − ω̄ ≤ t ≤ T − 1} : use the last ω̄ obs.

Discounted least squares sets ωt = λ^{T−t} for λ ∈ (0, 1)
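The discounted least squares scheme above can be sketched as follows (the function name and λ value are illustrative):

```python
import numpy as np

def discounted_ls(y, X, lam=0.99):
    """Weighted least squares with exponentially discounted observations, w_t = lam**(T-t)."""
    y = np.asarray(y, dtype=float)
    X = np.atleast_2d(np.asarray(X, dtype=float))
    if X.shape[0] != len(y):       # accept a 1-D regressor passed as a row
        X = X.T
    T = len(y)
    w = lam ** (T - np.arange(1, T + 1))   # most recent observation gets weight 1
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)
```

Setting lam=1 reproduces the expanding-window OLS estimate; smaller λ discounts the past more aggressively.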

Careful with rolling regressions!

Rolling regressions employ an intuitive trade-off

Short estimation windows reduce the bias in the estimates due to the use of

stale data that come from a different “regime”

This bias reduction is achieved at the cost of a decreased precision in the

parameter estimates as less data get used

We hope that the bias reduction more than makes up for the increased

parameter estimation error

However, there does not exist a data generating process for which a rolling

window is optimal, so how do we choose the length of the estimation window?


Cross-validation and choice of estimation window

Treat the rolling window as a choice variable

If the last P observations are used for cross-validation, we can choose the length of the rolling estimation window, ω, to minimize the out-of-sample MSE criterion

MSE(ω) = P⁻¹ ∑_{t=T−P+1}^{T} (yt − x′_{t−1} β̂_{t−ω+1:t})²

β̂_{t−ω+1:t} : OLS estimate of β that uses observations [t − ω + 1 : t]

This method requires a sufficiently long evaluation window, P, to yield precise MSE estimates

not a good idea if the candidate break date, Tb, is close to T

Using model averaging to deal with breaks

A more robust approach uses model averaging to deal with the underlying uncertainty surrounding the selection of the estimation window

Combine forecasts associated with estimation windows ω ∈ [ω0, ω1]:

ŷT+1|T = [ ∑_{ω=ω0}^{ω1} (x′T β̂_{T−ω+1:T}) MSE(ω)⁻¹ ] / [ ∑_{ω=ω0}^{ω1} MSE(ω)⁻¹ ]

Example: daily data with ω0 = 200 days, ω1 = 500 days

If the break is very large, models that start the estimation sample after the break will receive greater weight (they have small MSE) than models that include pre-break data and thus get affected by a large (squared) bias term
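The inverse-MSE weighting above reduces to a simple weighted average of the window-specific forecasts (function name illustrative):

```python
import numpy as np

def combine_windows(forecasts, mses):
    """Inverse-MSE weighted combination of window-specific forecasts x'_T beta_{T-w+1:T}."""
    f = np.asarray(forecasts, dtype=float)
    w = 1.0 / np.asarray(mses, dtype=float)   # windows with small MSE get large weight
    return (f * w).sum() / w.sum()
```

With equal MSEs this is just the equal-weighted mean; a window with a quarter of the MSE of another receives four times its weight.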

Modeling the instability process

Several methods exist to capture the process giving rise to time variation in

the model parameters

Time-varying parameter model (random walk or mean-reverting)

Markov switching

Change point process

These are parametric approaches that assume a particular break process

Many small changes versus few large “breaks”


Time-varying parameter models

Small “breaks” every period

Time-varying parameter (TVP) model:

yt+1 = x′t βt+1 + εy,t+1

βt+1 − β̄ = κ(βt − β̄) + εβ,t+1

This model is in state space form, with yt being the observable process and βt the latent “state”

Use the Kalman filter or MCMC methods

Volatility may also be changing over time—stochastic volatility

Allowing too much time variation in the parameters (large σεβ) leads to poor forecasts (signal-to-noise issue)
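A minimal univariate Kalman filter sketch for the random-walk special case of the TVP model above (κ = 1, β̄ dropped; the function name and the variance arguments q, r are illustrative):

```python
import numpy as np

def tvp_kalman_filter(y, x, q, r, beta0=0.0, p0=1.0):
    """Kalman filter for a scalar random-walk TVP regression:
       y_t = x_t * beta_t + e_t,  e_t ~ N(0, r);  beta_t = beta_{t-1} + u_t,  u_t ~ N(0, q)."""
    beta, P = beta0, p0
    path = []
    for yt, xt in zip(y, x):
        P = P + q                         # predict: random-walk state, variance grows by q
        S = xt * P * xt + r               # forecast-error variance
        K = P * xt / S                    # Kalman gain
        beta = beta + K * (yt - xt * beta)  # update with the prediction error
        P = (1 - K * xt) * P
        path.append(beta)
    return np.array(path)
```

A large q relative to r lets βt move quickly but makes the filtered estimates noisy — the signal-to-noise issue flagged on the slide.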

Regime switching models: History repeats

Markov switching processes take the form

yt+1 = µ_{st+1} + εt+1, εt+1 ∼ N(0, Σ_{st+1})

Pr(st+1 = j | st = i) = pij, i, j ∈ {1, …,K}

st+1 : underlying state, st+1 ∈ {1, 2, …,K} for K ≥ 2

Key assumption: the same K states repeat

Is this realistic?

Use forward-looking information in the state transitions:

pij(zt) = Φ(αij + βij zt)

e.g., zt = ∆Leading Indicator_t

Markov Switching models: been there, seen that

[Figure: time path of βt for yt+1 = β_{st+1} xt + εt+1, switching back and forth between the two values β1 and β2]

Change point models: History doesn’t repeat

Change point models allow the number of states to increase over time and do not impose that the states are drawn repeatedly from the same set of values

Example: Assuming K breaks up to time T, for i = 0, …,K

yt+1 = µi + Σi εt+1, τi ≤ t ≤ τi+1

Assuming that the probability of remaining within a particular state is constant, but state-specific, the transitions for this class of models are

P =
p11  p12  0    · · ·  0
0    p22  p23  · · ·  0
⋮              ⋱      ⋮
0    · · ·  0  pKK    pK,K+1
0    0    · · ·  0    1

pi,i+1 = 1 − pii

The process either remains in the current state or moves to a new state

Change point models: a new era arises

[Figure: time path of βt for yt+1 = β_{st+1} xt + εt+1, moving through a sequence of new values β1, β2, β3, β4 without returning to earlier ones]

Monitoring stability of forecasting models


Monitoring the predictive accuracy

To study the real-time evolution in the accuracy of a model’s forecasts, plot the Cumulative Sum of Squared prediction Error Difference (CSSED) for some benchmark against the competitor model up to time t:

CSSED_{m,t} = ∑_{τ=1}^{t} ( e²_{Benmk,τ} − e²_{m,τ} )

e_{Benmk,τ} = yτ − ŷτ,Benmk : forecast error (benchmark)

e_{m,τ} = yτ − ŷτ,m : forecast error (model m)

Positive and rising values of CSSED indicate that the point forecasts generated by model m are more accurate than those produced by the benchmark
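The CSSED path above is a one-line cumulative sum (function name illustrative):

```python
import numpy as np

def cssed(e_bench, e_model):
    """Cumulative sum of squared prediction error differences, CSSED_{m,t}.

    Positive, rising values mean model m beats the benchmark in real time."""
    e_bench = np.asarray(e_bench, dtype=float)
    e_model = np.asarray(e_model, dtype=float)
    return np.cumsum(e_bench ** 2 - e_model ** 2)
```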

Comparison of break models

Pettenuzzo and Timmermann (2015)

Benchmark: Constant parameter linear model (LIN)

Competitors:

Time-varying parameter, stochastic volatility (TVP-SV)

Markov switching with K regimes (MSK )

Change point model with K regimes (CPK )


Data and models (US inflation)

Pt : quarterly price index for the GDP deflator

πt = 400 × ln(Pt/Pt−1) : annualized quarterly inflation rate

Prediction model: backward-looking Phillips curve

∆πt+1 = µ + β(L)ut + λ(L)∆πt + εt+1, εt+1 ∼ N(0, σ²ε)

∆πt+1 = πt+1 − πt : quarter-on-quarter change in the annualized inflation rate

ut : quarterly unemployment rate

Cumulative sum of squared forecast error differentials

(quarterly inflation, 1970-2012)


Challenges to forecasting with breaks

1 Tests for model instability can detect many different types of instability and

tend to be uninformative about the nature of the instability

2 The forecaster therefore often does not have a good idea of which specific

way to capture model instability

3 Future parameter values might change again over the forecast horizon if this

is long. Forecasting procedures require modeling both the probability and

magnitude of future breaks

1 Models with rare changes to the parameters have little or nothing to say about

the chance of a future break in the parameters

2 Think about forecasting the growth in the Chinese economy over the next 25

years. Many things could change—future “breaks” could occur


Conclusions: Practical Lessons

Model instability poses fundamental challenges to forecasting: All bets could

be off if ignored

Important to monitor model stability

Use model forecasts with more caution if models appear to be breaking down

Forecast combinations offer a promising tool to handle instability

Combine forecasts from models using short, medium and long estimation

windows

Combine different types of models, allowing combination weights to evolve

over time as new and better models replace old (stale) ones


Conclusion: Practical lessons (cont.)

Models that allow parameters to change are adaptive – they catch up with a

shift in the data generating process

Adaptive approaches can work well but have obvious limitations

Models that attempt to predict instability or breaks require forward-looking

information

use information in option prices (VIX) or in financial fragility indicators

Difficult to predict exact timing and magnitude of breaks, but the risk of a

break may be time-varying

Model instability is not just a nuisance but also poses an opportunity for

improved forecasting performance


Lecture 10: Data mining – Pitfalls in Forecasting

UCSD, Winter 2017

Allan Timmermann1

1UC San Diego

Timmermann (UCSD) Data mining Winter, 2017 1 / 23

1 Data mining

Opportunities and Challenges

Skill or Luck

Bonferroni Bound

2 Comparing Many Forecasts: Reality Check

3 Hal White’s Reality Check

Data snooping and technical trading rules


Data mining (Wikipedia)

“Data mining is the process of sorting through large amounts of data and

picking out relevant information. It is usually used by business intelligence

organizations, and financial analysts, but is increasingly being used in the sciences

to extract information from the enormous data sets generated by modern

experimental and observational methods. It has been described as “the nontrivial

extraction of implicit, previously unknown, and potentially useful information from

data”and “the science of extracting useful information from large data sets or

databases.

The term data mining is often used to apply to the two separate processes of

knowledge discovery and prediction. Knowledge discovery provides explicit

information that has a readable form and can be understood by a user (e.g.,

association rule mining). Forecasting, or predictive modeling provides

predictions of future events and may be transparent and readable in some

approaches (e.g., rule-based systems) and opaque in others such as neural

networks. Moreover, some data-mining systems such as neural networks

are inherently geared towards prediction and pattern recognition, rather

than knowledge discovery.”


Model selection and data mining

In the context of economic/financial forecasting, data mining is often used in

a negative sense as the practice of using a data set more than once for

purposes of selecting, estimating and testing a model

If you get to evaluate a model on the same data used to develop/estimate the

model, chances are you are overfitting the data

This practice is necessitated because we are limited to short (time-series)

samples which cannot be easily replicated/generated

We only have one history of quarterly US GDP, fund-manager performance etc.

We cannot use experiments to generate new data on such a large scale

If we have panel data with large cross-sections and no fixed effects, then we

can keep a large separate evaluation sample for model validation


Data mining as a source of new information

Statistical analysis guided strictly by theory may not discover unknown

relationships that have not yet been stipulated by theory

Before the emergence of germ theory, medical doctors didn’t understand why

some patients got infected, while others didn’t. It was the use of patient data

and a search for correlations that helped doctors find out that those who

washed their hands between patient visits infected fewer of them

Big data present a similar opportunity for finding new empirical relationships

which can then be interpreted by theorists

If we have a really big data set that is deep (multiple entries for each

variable) and wide (many variables), we can use machine learning to detect

novel and interesting patterns on one subset of data, then test it on another

(independent) or several other cross-validation samples

Man versus machine: testing if a theoretical model performs better than a

data mined model, using independent (novel) data

Can we automate the development of theories (hypotheses)?


Data mining as an overfitting problem

Data mining can cause problems for inference with small samples

“[David Leinweber, managing director of First Quadrant Corporation in

Pasadena, California] sifted through a United Nations CD-Rom and

discovered that historically, the single best prediction for the Standard &

Poor’s 500 stock index was butter production in Bangladesh.” (Coy,

1997, Business Week)

“Is it reasonable to use the standard t-statistic as a valid measure of

significance when the test is conducted on the same data used by many

earlier studies whose results influenced the choice of theory to be tested?”

(Merton, 1987)

“. . . given enough computer time, we are sure that we can find a mechanical

trading rule which “works” on a table of random numbers — provided that we

are allowed to test the rule on the same table of numbers which we used to

discover the rule.” (Jensen and Benington, 1970)


Skill or luck?

A student comes to the professor’s office and reports a stock market return

prediction model with a t−statistic of 3. Should the professor be impressed?

If the student only fitted a single model: Yes

What if the student experimented with hundreds of models?

A similar issue arises more broadly in the assessment of performance

Forecasting models in financial markets

Star mutual funds

How many mutual fund “stars” should we expect to find by random chance?

What if the answer is four and we only see one star in the actual data?

Lucky penny

Newsletter scam
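To make the “how many stars should we expect by chance” question concrete, here is a small back-of-the-envelope sketch (an illustration of my own, not from the lecture; the fund count and threshold are made-up numbers): if m funds have no skill and each clears a significance threshold with tail probability α by pure luck, we expect roughly m × α “stars”.

```python
import random

def expected_stars(num_funds, alpha):
    """Expected number of zero-skill funds that clear a significance
    threshold with tail probability alpha by luck alone."""
    return num_funds * alpha

def simulated_stars(num_funds, alpha, n_sims=2000, seed=42):
    """Monte Carlo check: average count of lucky 'stars' among
    skill-less funds across simulated samples."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_sims):
        # each skill-less fund clears the bar independently with prob alpha
        total += sum(rng.random() < alpha for _ in range(num_funds))
    return total / n_sims

# Hypothetical example: 500 skill-less funds at a 1% threshold
print(expected_stars(500, 0.01))            # 5.0
print(round(simulated_stars(500, 0.01), 1))
```

If the expected number of lucky stars is five and we observe only one in the actual data, that is evidence of something other than luck at work.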


Dealing with over-fitting

Report the full set (and number) of forecasting models that were considered

Good practice in all circumstances

The harder you had to work to find a ‘good’ model, the more skeptical you

should be that the model will produce good future forecasts

Problem: Difficult to keep track of all models

What about other forecasters whose work influenced your study? Do you know

how many models they considered? Collective data mining

Even if you keep track of all the models you studied, how do you account for

the correlation between the forecasts that they generate?


Dealing with over-fitting (cont.)

Use data from alternative sources

Seeking independent evidence

This is a way to ‘tie your hands’ by not initially looking at all possible data

This strategy works if you have access to similar and genuinely independent

data

Often such data are difficult to obtain

Example: Use European or Asian data to corroborate results found in the US

Problem: What if the data are correlated? US, European and Asian stock

market returns are highly correlated and so are not independent data sources


Dealing with over-fitting (cont.)

Reserve the last portion of your data for out-of-sample forecast evaluation

Problem: what if the world has changed?

Maybe the forecasting model truly performed well in a particular historical

sample (the “in-sample” period), but broke down in the subsequent sample

Example: Performance of small stocks

Small cap stocks have not systematically outperformed large cap stocks in the

35 years since the size effect was publicized in the early 1980s


Bonferroni bound

Suppose we are interested in testing if the best model among k = 1, …, m

models produces better forecasts than some benchmark forecasting model

Let pk be the p−value associated with the null hypothesis that model k does

not produce more accurate forecasts than the benchmark

This could be based on the t−statistic from a Diebold-Mariano test

The Bonferroni Bound says that the p-value for the null that none of the m

models is superior to the benchmark satisfies an upper bound

p ≤ min(m × min(p1, …, pm), 1)

The smallest of the p-values (which produces the strongest evidence against

the null that no model beats the benchmark) gets multiplied by the number of

tested models, m

Mindless data mining weakens the evidence!

Bonferroni bound holds for all possible correlations between test statistics


Bonferroni bound: example

Bonferroni bound: Probability of observing a p-value as small as pmin among

m forecasting models is less than or equal to m × pmin

Example: m = 10, pmin = 0.02

Bonferroni bound = min(10 × 0.02, 1) = 0.20

In a sample with 10 p-values, there is at most a 20% chance that the

smallest p-value is less than 0.02

Test is conservative (doesn’t reject as often as it should): Suppose you have

20 models whose tests have a correlation of 1. Effectively you only have one

(independent) forecast. However, even if p1 = p2 = … = p20 = 0.01, the

Bonferroni bound gives a value of

p ≤ 20× 0.01 = 0.20
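The bound is trivial to compute; a minimal sketch (the function name is my own) that reproduces both numbers from the slides:

```python
def bonferroni_pvalue(pvalues):
    """Bonferroni bound on the p-value of the null that no model beats
    the benchmark: m times the smallest individual p-value, capped at 1."""
    m = len(pvalues)
    return min(m * min(pvalues), 1.0)

# Slide example: 10 models, smallest p-value 0.02 -> bound of 0.20
print(bonferroni_pvalue([0.02] + [0.5] * 9))   # 0.2

# Conservativeness: 20 perfectly correlated tests, each with p = 0.01,
# are effectively one test, yet the bound is still 20 * 0.01 = 0.20
print(bonferroni_pvalue([0.01] * 20))          # 0.2
```

Note that the bound needs only the list of p-values, not their correlation structure, which is exactly why it holds for all possible correlations and why it is conservative.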


Reverse engineering the Bonferroni bound

Suppose a student reports a p-value of 0.001 (one in a thousand)

For this to correspond to a Bonferroni p-value of 0.05, at least 50 models

must have been considered since 50× 0.001 = 0.05

Is this likely?

What is a low p-value in a world with data snooping?

The conventional criterion that a variable is significant if its p-value falls

below 0.05 isn’t true anymore

Back to the example with a t−statistic of 3 reported by a student. How

many models would the student have had to look at?

prob(t ≥ 3) ≈ 0.00135

0.05/0.00135 ≈ 37

Answer: 37 models
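The arithmetic can be checked with the standard normal tail, using only the standard library (`normal_tail` is a helper name of my own):

```python
import math

def normal_tail(t):
    """One-sided upper tail probability of a standard normal, P(Z >= t)."""
    return 0.5 * math.erfc(t / math.sqrt(2))

# A t-statistic of 3 has a one-sided p-value of about 0.00135
p = normal_tail(3.0)
print(round(p, 4))        # 0.0013

# Reverse-engineering the Bonferroni bound: how many models could have
# been searched, with the reported p-value still significant at 5%?
print(round(0.05 / p))    # 37
```
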


Testing for Superior Predictive Ability

How confident can we be that the best forecast is genuinely better than some

benchmark, given that the best forecast is selected from a potentially large

set of forecasts?

Skill or luck? In a particular sample, a forecast model may produce a small

average loss even though in expectation (i.e., across all samples we could

have seen) the model would not have been so good

A search across multiple forecast models may result in the discovery of a

genuinely good model (skill), but it may also uncover a bad model that just

happens to perform well in a given sample (luck)

Tests used in model comparisons typically ignore any search that preceded

the selection of the prediction models


White (2000) Reality Check

Forecasts generated recursively using an expanding estimation window

m models used to compute m out-of-sample (average) losses

f_{0,t+1|t} : forecast from benchmark (model 0)

f_{k,t+1|t} : forecast from alternative model k, k = 1, …, m

d_{k,t+1} = (y_{t+1} − f_{0,t+1|t})² − (y_{t+1} − f_{k,t+1|t})² : MSE difference for the benchmark (model 0) relative to model k

d̄_k : sample average of d_{k,t+1}

d̄_k > 0 suggests model k outperformed benchmark (model 0)

d̄ = (d̄_1, …, d̄_m)′ : m × 1 vector of sample averages of MSE differences measured relative to the benchmark

d̄* = (d̄_1(β*_1), …, d̄_m(β*_m))′ : the same m × 1 vector of MSE differences, evaluated at the limiting parameter values

β*_i = plim_{t→∞} β̂_{i,t} : probability limit of β̂_{i,t}


White (2000) Reality Check (cont.)

White’s reality check tests the null hypothesis that the benchmark model is

not inferior to any of the m alternatives:

H0 : max_{k=1,…,m} E[d*_{k,t+1}] ≤ 0

If all models perform as well as the benchmark, d̄* = (d̄*_1, …, d̄*_m) has mean zero and so its maximum also has mean zero

Examining the maximum of d̄ is the same as searching for the best model

Alternative hypothesis: the best model outperforms the benchmark, i.e.,

there exists a superior model k such that E[d*_{k,t+1}] > 0


White (2000) Reality Check (cont.)

White shows conditions under which (⇒ means convergence in distribution)

max_{k=1,…,m} T^{1/2}(d̄_k − E[d̄*_k]) ⇒ max_{k=1,…,m} {Z_k}

Z_k, k = 1, …, m, are distributed as N(0, Ω)

Problem: We cannot easily determine the distribution of the maximum of Z

because the m × m covariance matrix Ω is unknown


White (2000) Reality Check (cont.)

Hal White (2000) developed a bootstrap for drawing the maximum from

N(0,Ω)

1 Draw values of d_{k,t+1} with replacement to generate bootstrap samples of d̄_k.

These draws use the same ‘t’ across all models to preserve the correlation

structure across the test statistics d̄

2 Compute the maximum value of the test statistic across these samples

3 Compare these to the actual values of the maximum test statistic, i.e.,

compare the best performance in the actual data to the quantiles from the

bootstrap to obtain White’s bootstrapped Reality Check p-value for the null

hypothesis that no model beats the benchmark

Important to impose the null hypothesis: recentering the bootstrap

distribution around d̄_k = 0
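The three steps can be sketched in code. This is a simplified illustration of my own, not White's implementation: it resamples time indices i.i.d. rather than with the stationary bootstrap described below, and `d` is a hypothetical T × m array of loss differentials d_{k,t+1}:

```python
import random

def reality_check_pvalue(d, n_boot=1000, seed=0):
    """Bootstrap p-value for the null that no model beats the benchmark.
    d[t][k] is the loss differential of model k vs. the benchmark at
    time t (positive = model k more accurate).  Simplified i.i.d.
    resampling; White (2000) uses the stationary bootstrap instead."""
    rng = random.Random(seed)
    T, m = len(d), len(d[0])
    dbar = [sum(d[t][k] for t in range(T)) / T for k in range(m)]
    stat = max(dbar)          # performance of the best model in the data
    count = 0
    for _ in range(n_boot):
        # same resampled dates for all k: preserves cross-model correlation
        idx = [rng.randrange(T) for _ in range(T)]
        # recentre around dbar[k] to impose the null hypothesis
        boot = [sum(d[t][k] for t in idx) / T - dbar[k] for k in range(m)]
        if max(boot) >= stat:
            count += 1
    return count / n_boot     # how often luck alone matches the best model

# Toy usage: five models whose loss differentials are pure noise
rng = random.Random(1)
noise = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(200)]
print(reality_check_pvalue(noise))  # p-value for pure-noise differentials
```

The recentering step is the crucial one: comparing the best observed performance to the bootstrapped distribution of the maximum under the null is what accounts for the search across all m models.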


White (2000) Reality Check: stationary bootstrap

Let τ_t be a random time index between R + 1 and T

R: beginning of evaluation sample

T : end of evaluation sample

For each bootstrap sample b, generate a sample estimate

d̄^b_k = ∑_{t=R+1}^{T} (y_{τ_t} − f_{k,τ_t|τ_t−1})²

as follows:

1 Set t = R + 1. Draw τ_{R+1} at random, independently, and uniformly from {R + 1, …, T}

2 Increase t by 1. If t > T, stop. Otherwise, draw a standard uniform random variable, U, independently of all other random variables

  1 If U < q, draw τ_t at random, independently, and uniformly, from {R + 1, …, T}
  2 If U ≥ q, expand the block by setting τ_t = τ_{t−1} + 1; if τ_t > T, reset τ_t = R + 1

3 Repeat step 2
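The index-drawing scheme above can be sketched as follows (a minimal illustration; the function name and interface are my own, and q is the block-restart probability):

```python
import random

def stationary_bootstrap_indices(R, T, q, seed=0):
    """Draw time indices tau_{R+1}, ..., tau_T following the stationary
    bootstrap: with probability q start a new block at a uniform random
    date in {R+1, ..., T}; otherwise extend the current block by one
    period, wrapping around from T back to R + 1.  Expected block
    length is 1/q."""
    rng = random.Random(seed)
    tau = rng.randint(R + 1, T)        # step 1: first index, drawn uniformly
    indices = [tau]
    for _ in range(R + 2, T + 1):      # step 2: remaining T - R - 1 indices
        if rng.random() < q:           # restart the block at a random date
            tau = rng.randint(R + 1, T)
        else:                          # extend the block; wrap past T
            tau = tau + 1 if tau < T else R + 1
        indices.append(tau)
    return indices

idx = stationary_bootstrap_indices(R=100, T=200, q=0.1, seed=7)
print(len(idx))                            # 100 indices, one per evaluation date
print(all(101 <= t <= 200 for t in idx))   # True
```

Random block lengths with geometric distribution (mean 1/q) are what make the resampled series stationary while still preserving local serial dependence in the data.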


Sullivan, Timmermann, and White, JF 1999

Investigates the performance of 7,846 technical trading rules applied to daily

data

filter rules, moving averages, support and resistance, channel breakouts,

on-balance volume averages

The “best” technical trading rule looks very good on its own

However, when accounting for the possibility that the best technical trading

rule was selected from a large set of candidate rules, it is no longer possible

to conclude that even the best rule significantly beats the benchmark

buy-and-hold strategy over the period with liquid trading


Sullivan, Timmermann, and White, JF 1999: Results



Conclusions

Data mining poses both an opportunity and a challenge to constructing

forecasting models in economics and finance

Less of a concern if we can generate new data or use cross-sectional holdout

data

More of a concern if we only have one (short) historical time-series

Methods exist for quantifying how data mining affects statistical tests
