Basics Simple regression Regression assumptions Multiple regression

BUSANA 7001 – Predictive and Visual Analytics for Business

Week 4: Predictive analytics using multiple regressions

£ius BUSANA 7001, Week 4 1/88


Simple regression

Regression assumptions

Multiple regression


Introduction

The purpose of quantitative analysis is to find or test certain relations.

Correlation coefficients shed some light on the direction of the linear relation:

• positive

• negative or

• no linear relation.


Let’s investigate the relation between car price and car length using a data set provided by SAS.

/* Creating data file: */
DATA work.car_data;
SET SAShelp.Cars;
RUN;

/* Correlation coefficient: */
proc corr data=work.car_data;
var invoice length;
RUN;


Example II


Example III

The correlation coefficient between car price and length is 0.16659 (p-value = 0.0005).

=⇒ The relation is positive and statistically significant.

However, the correlation coefficient does not let us estimate or predict the car price for a given car length:

• e.g., what is the approximate price of a 180-inch-long car?


Introduction II

One needs to use regression analysis in order to answer this question!

Regressions:

Simple regression: y = β0 + β1x

Multiple linear regression: y = β0 + β1x1 + β2x2 + · · · + βnxn

y is the dependent variable

x, x1, x2, ..., xn are independent variables

β0 is the intercept

β1, β2, ..., βn are slopes.


Simple regression

Let’s use SAS to regress car price on car length:

• INVOICE = f(LENGTH) = intercept + slope × LENGTH.

/* OLS regression: */
PROC REG DATA=work.car_data;
MODEL invoice=length;
RUN;
QUIT;


Simple regression II


Simple regression III


Simple regression IV

No obvious trends or patterns in the residuals.


Simple regression V


95% confidence interval vs. 95% prediction interval

Confidence intervals tell you how well you have determined the mean. Assume that the data are randomly sampled from a normal distribution and you are interested in determining the mean. If you sample many times, and calculate a confidence interval of the mean from each sample, you’d expect 95% of those intervals to include the true value of the population mean.

Prediction intervals tell you where you can expect to see the next data point sampled. Assume that the data are randomly sampled from a normal distribution. Collect a sample of data and calculate a prediction interval. Then sample one more value from the population. If you repeat this process many times, you’d expect the prediction interval to capture the individual value 95% of the time.

Source: https://www.graphpad.com/support/faqid/1506/.
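Both kinds of interval can be requested from PROC REG. The sketch below is an illustration rather than the code behind the slides: it assumes the SASHELP.CARS data set used earlier and copies it into a hypothetical table named work.car_ci. CLM asks for confidence limits for the mean predicted value and CLI asks for prediction limits for an individual car.

```sas
/* Sketch only: work.car_ci is a hypothetical copy of SASHELP.CARS */
DATA work.car_ci;
   SET sashelp.cars;
RUN;

PROC REG DATA=work.car_ci;
   /* CLM = 95% confidence limits for the mean prediction    */
   /* CLI = 95% prediction limits for an individual car      */
   MODEL invoice = length / CLM CLI ALPHA=0.05;
RUN;
QUIT;
```

For every observation, the prediction limits (CLI) should be wider than the confidence limits (CLM), because they also account for the scatter of individual cars around the regression line.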


Simple regression: Interpretation of the results

We obtained:

• intercept = −8131.76

• slope = 204.69

• INVOICE = −8131.76 + 204.69 × LENGTH.

The predicted car price increases by $204.69 for each additional inch of car length.


Simple regression: Predictions

Suppose we would like to estimate the price of a 180-inch-long car:

• INVOICE = −8131.76 + 204.69 × LENGTH

• INVOICE = −8131.76 + 204.69 × 180 ≈ $28,712.4
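The arithmetic can be checked in a short DATA step. This is a minimal sketch, not part of the original slides; it assumes the intercept is negative (−8131.76), which is the value the stated result of $28,712.4 implies. The variable names are hypothetical.

```sas
/* Sketch: plug a 180-inch length into the fitted line */
DATA _NULL_;
   intercept    = -8131.76;  /* assumed negative, consistent with the stated result */
   slope        = 204.69;
   car_length   = 180;
   invoice_pred = intercept + slope * car_length;
   PUT invoice_pred= DOLLAR12.2;  /* writes the prediction to the SAS log */
RUN;
```

The log should show a value of about $28,712.44.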


Simple regression: Predictions II

Let’s check the actual prices of 180-inch-long cars:

PROC PRINT DATA=work.car_data (obs=20);
var make model length invoice;
where length = 180;
RUN;


Simple regression: Predictions III

=⇒ most of the cars are more expensive than our predicted value.


R-squared is a goodness-of-fit (accuracy) measure.

The higher the R-squared, the better the model.

R-squared is the ratio of the variation explained to the total variation (of the dependent variable).

0 ≤ R-squared ≤ 1.


R-squared and correlation

• R-squared of the regression model = 0.0278

• Correlation coefficient = 0.16659.

Correlation² = R-squared: 0.16659² ≈ 0.0278
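In a simple regression with one independent variable, R-squared equals the squared correlation coefficient. A minimal check of the numbers above (variable names are hypothetical):

```sas
/* Sketch: square the reported correlation coefficient */
DATA _NULL_;
   corr      = 0.16659;
   r_squared = corr**2;
   PUT r_squared= 8.4;  /* writes the value, rounded to four decimals, to the log */
RUN;
```

The squared value is 0.02775..., which rounds to the reported 0.0278.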


R-squared and correlation II

R-squared of the regression model = 0.0278

=⇒ Car length can explain only 2.78% of the variation in car prices:

• the explanatory power of the model is weak

• there should be other factors that better explain car prices.


Predicted values and residuals

Predicted values of car prices (INVOICEpred) can be computed from:

• INVOICEpred = −8131.76 + 204.69 × LENGTH,

where LENGTH is the actual car length.

Residuals (INVOICEres) are then computed as:

• INVOICEres = INVOICE − INVOICEpred,

where INVOICE is the actual car price.


Predicted values and residuals II

Consider the following 180-inch-long cars:

• LX: $18,630

• Chevrolet Corvette 2dr: $39,068.

The predicted car prices are the same: $28,712.4.

The residuals are:

• LX: −$10,082.4

• Chevrolet Corvette 2dr: $10,355.6.
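The two residuals can be reproduced by hand. The sketch below is illustrative only: it types in the two invoice prices quoted above, assumes a negative intercept (−8131.76), and uses a hypothetical table name work.residual_check.

```sas
/* Sketch: residual = actual invoice minus predicted invoice */
DATA work.residual_check;
   INFILE DATALINES DLM=',';
   LENGTH car $ 30;
   INPUT car $ invoice;
   invoice_pred = -8131.76 + 204.69 * 180;  /* same prediction for both cars */
   invoice_res  = invoice - invoice_pred;   /* negative if the car is cheaper than predicted */
   DATALINES;
LX,18630
Chevrolet Corvette 2dr,39068
;
RUN;

PROC PRINT DATA=work.residual_check;
RUN;
```

The first car is cheaper than predicted, so its residual is negative; the second is more expensive, so its residual is positive.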


Predicted values and residuals III

We can compute predicted values and residuals manually, or we can ask SAS to do this:

PROC REG DATA=work.car_data;
MODEL invoice=length;
OUTPUT OUT=work.reg_results r=Res p=Pred;
RUN;
QUIT;

`Pred’ contains the predicted values and `Res’ contains the residuals.


Regression analysis using SAS Visual Analytics

Use the `Linear regression’ object (from the `SAS Visual Statistics’ list).


Assumptions of linear regression

1. Linear relationship

2. Normality

3. Independence

4. Homoscedasticity


1. Linear relationship

The relation between the independent variable (x) and the dependent variable (y) is linear.

Detecting a linear relationship is fairly simple.

In most cases, linearity is clear from the scatterplot.

Relevant SAS code:

/* Scatterplot: */
proc gplot data=work.car_data;
title 'Scatter plot of Invoice and length';
plot Invoice * Length=1;
RUN;
QUIT;


1. Linear relationship II

=⇒ no obvious non-linear relation found.


2. Normality

The dependent variable y is distributed normally for each value of the independent variable x.

Outliers are the main reason for the violation of this assumption.

To check for normality, one can use:

• scatterplots (y vs x)

• a histogram of standardized residuals

• a normal probability versus residual probability distribution plot (P-P plot).

=⇒ there are a few outliers.
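The histogram and P-P plot checks listed above could be produced along the following lines. This is a sketch under assumptions, not the code behind the slides: it reads SASHELP.CARS directly, stores studentized residuals in a hypothetical table work.reg_check, and uses PROC UNIVARIATE for the plots.

```sas
/* Sketch: save studentized residuals, then inspect their distribution */
PROC REG DATA=sashelp.cars NOPRINT;
   MODEL invoice = length;
   OUTPUT OUT=work.reg_check STUDENT=std_res;  /* residual divided by its standard error */
RUN;
QUIT;

PROC UNIVARIATE DATA=work.reg_check NORMAL;    /* NORMAL adds tests for normality */
   VAR std_res;
   HISTOGRAM std_res / NORMAL;                 /* histogram with a fitted normal curve */
   PPPLOT std_res / NORMAL;                    /* normal P-P plot */
RUN;
```

Points that stray far from the diagonal of the P-P plot, or isolated bars far from the centre of the histogram, point to outliers.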


3. Independence

The values of y should depend on the independent variables but not on their own previous values.

Violations of this assumption are observed mostly in time series data (e.g., gross domestic product (GDP)).

Autocorrelation coefficients for different lags can help detect dependencies:

• correlation between GDP and GDP lagged by 1 period

• correlation between GDP and GDP lagged by 2 periods

• correlation between GDP and GDP lagged by 3 periods

• correlation between GDP and GDP lagged by 4 periods.
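The lagged correlations listed above can be sketched as follows. The slides do not supply a GDP data set, so the example simulates a hypothetical series (work.gdp_sim, a random walk with drift) purely for illustration; with real data, only the second and third steps would be needed.

```sas
/* Sketch: a simulated series stands in for GDP (not real data) */
DATA work.gdp_sim;
   CALL STREAMINIT(7001);
   gdp = 100;
   DO period = 1 TO 80;
      gdp = gdp + RAND('NORMAL', 0.5, 1);  /* random walk with drift */
      OUTPUT;
   END;
RUN;

/* Create the series lagged by 1 to 4 periods */
DATA work.gdp_lags;
   SET work.gdp_sim;
   gdp_lag1 = LAG1(gdp);
   gdp_lag2 = LAG2(gdp);
   gdp_lag3 = LAG3(gdp);
   gdp_lag4 = LAG4(gdp);
RUN;

/* Correlation of the series with each of its lags */
PROC CORR DATA=work.gdp_lags;
   VAR gdp;
   WITH gdp_lag1 gdp_lag2 gdp_lag3 gdp_lag4;
RUN;
```

Large correlations at several lags suggest that the observations are not independent.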


4. Homoscedasticity

The variance in y is the same at each level of x.

There is no special segment or interval in x where the dispersion in y is distinct.

Scatterplots (y vs x) can be used to detect heteroscedasticity (the opposite of homoscedasticity).

In our example, the variance is higher when the car is 175-205 inches long, but statistical tests imply that the model does not suffer from heteroscedasticity.

Plots with the residual versus predicted values can also be used to detect heteroscedasticity.
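A residual-versus-predicted plot could be drawn as sketched below. The example is an assumption-laden illustration: it reads SASHELP.CARS directly, writes to a hypothetical table work.reg_plot, and uses PROC SGPLOT rather than any plot shown in the slides.

```sas
/* Sketch: plot residuals against predicted values */
PROC REG DATA=sashelp.cars NOPRINT;
   MODEL invoice = length;
   OUTPUT OUT=work.reg_plot R=Res P=Pred;
RUN;
QUIT;

PROC SGPLOT DATA=work.reg_plot;
   SCATTER X=Pred Y=Res;
   REFLINE 0 / AXIS=Y;  /* horizontal reference line at zero residual */
RUN;
```

A funnel shape (residuals spreading out or narrowing as the predicted value grows) would hint at heteroscedasticity.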


4. Homoscedasticity II

A simple way to deal with heteroscedasticity is to segment the data and build different regression lines for different intervals.

In general, if the first three assumptions are satisfied, then heteroscedasticity might not even exist.

As a rule of thumb, violations of the first three assumptions need to be fixed before attempting to fix heteroscedasticity.


4. Homoscedasticity III

To detect heteroscedasticity, one can use White’s and Breusch-Pagan tests.

Procedure MODEL (rather than REG) includes them.

Relevant SAS code (both procedures give the same results):

PROC REG DATA=work.car_data;
MODEL invoice=EngineSize;
RUN;
QUIT;

PROC MODEL DATA=work.car_data;
PARMS a1 b1;
invoice = a1 + b1 * EngineSize;
FIT invoice / WHITE PAGAN=(1 EngineSize);
RUN;
QUIT;


4. Homoscedasticity IV

Test results:

=⇒ H0 of no heteroscedasticity is rejected.


4. Homoscedasticity V

Solutions:

• adjust standard errors (a.k.a. heteroskedasticity-robust standard errors, White-Huber standard errors, etc.)

• transform non-normally distributed variables (e.g., using the natural logarithm).


Adjusted standard errors

Option ACOV adjusts standard errors:

PROC REG DATA=work.car_data;
MODEL invoice=EngineSize / ACOV;
RUN;
QUIT;

This option can be used in SAS procedure REG only.


Adjusted standard errors II

Robust standard errors can also be used under homoskedasticity.

In that case, the robust standard errors are approximately equal to the regular standard errors.


Log-transformed variables

Variables with positive values can be log-transformed.

DATA work.car_data;
SET work.car_data;
log_MSRP=log(MSRP);
log_length=log(length);
RUN;


Log-transformed variables II


Log-transformed variables III


Log-transformed variables IV

Let’s estimate 3 regression models and predict the MSRP of a 180-inch-long car.

/* Model 1: log-linear */
PROC REG DATA=work.car_data;
MODEL log_MSRP= length;
RUN;

/* Model 2: linear-log */
PROC REG DATA=work.car_data;
MODEL MSRP= log_length;
RUN;

/* Model 3: log-log */
PROC REG DATA=work.car_data;
MODEL log_MSRP= log_length;
RUN;
QUIT;


Presentation of results

SAS does not present regression results in a presentation-ready form.

We should make the tables manually.

The table descriptions should include:

• variable definitions

• a note that t-statistics based on standard errors are reported in brackets

• a note that ***, **, and * indicate significance at the 1%, 5%, and 10% levels, respectively.
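One possible way to collect the numbers needed for such a table is to capture the PROC REG output in data sets with ODS OUTPUT. This is a sketch under assumptions, not the author's workflow: the output table names work.reg_params and work.reg_fit are hypothetical, and the example reads SASHELP.CARS directly.

```sas
/* Sketch: save estimates and fit statistics as data sets */
ODS OUTPUT ParameterEstimates=work.reg_params
           FitStatistics=work.reg_fit;

PROC REG DATA=sashelp.cars;
   MODEL invoice = length;
RUN;
QUIT;

/* Coefficients, standard errors, t-values and p-values */
PROC PRINT DATA=work.reg_params;
RUN;

/* R-squared and adjusted R-squared, among other statistics */
PROC PRINT DATA=work.reg_fit;
RUN;
```

The saved values can then be copied into a manually formatted table with significance stars and a description.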


Log-transformed variables V

We get the following results:

                         Model 1       Model 2        Model 3
Dependent variable:      ln(MSRP)      MSRP           ln(MSRP)

Independent variables
LENGTH                   0.0095***
                         [6.05]
ln(LENGTH)                             44,467***      1.8039***
                                       [3.70]         [6.16]
Intercept                8.4922***     −199,551***    0.8446
                         [28.83]       [3.18]         [0.55]

R²                       0.079         0.031          0.082
Adjusted R²              0.077         0.029          0.080
Number of observations   428           428            428


Description for the previous table

Table 1: Determinants of car prices

This table presents the results of OLS regressions where the dependent variable is the manufacturer's suggested retail price (MSRP) or its natural logarithm. LENGTH is the car length in inches. The absolute values of t-statistics based on standard errors are reported in brackets. ***, **, and * indicate significance at the 1%, 5%, and 10% levels, respectively.

This description should be placed above the table.


Log-transformed variables VI

Let’s compute the predicted values of MSRP for a 180-inch-long car.

ln(180) ≈ 5.1930

Model 1: ln(MSRP) = 8.4922 + 0.0095 × 180 ≈ 10.2022.
MSRP = exp(10.2022) ≈ 26,962.44

Model 2: MSRP = −199,551 + 44,467 × 5.1930 ≈ 31,364.21

Model 3: ln(MSRP) = 0.8446 + 1.8039 × 5.1930 ≈ 10.2122.
MSRP = exp(10.2122) ≈ 27,232.73

Models 1 and 3 are preferred (one of the reasons is their higher R²). The difference between the predictions of Models 1 and 3 is 1%. However, the difference between the predictions of Models 2 and 3 is 15%.
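The three predictions and the percentage differences can be re-computed in a DATA step. This is a sketch with hypothetical variable names; it assumes the Model 2 intercept is negative (−199,551), which is what the stated prediction of about 31,364 implies. Because the coefficients are rounded, the results may differ from the figures above in the last few digits.

```sas
/* Sketch: predictions for a 180-inch car from the three models */
DATA _NULL_;
   car_length = 180;
   log_length = LOG(car_length);                 /* natural logarithm */
   msrp_m1 = EXP(8.4922 + 0.0095 * car_length);  /* Model 1: ln(MSRP) on LENGTH */
   msrp_m2 = -199551 + 44467 * log_length;       /* Model 2: MSRP on ln(LENGTH) */
   msrp_m3 = EXP(0.8446 + 1.8039 * log_length);  /* Model 3: ln(MSRP) on ln(LENGTH) */
   diff_1_3 = msrp_m3 / msrp_m1 - 1;             /* relative gap, Models 1 and 3 */
   diff_2_3 = msrp_m2 / msrp_m3 - 1;             /* relative gap, Models 2 and 3 */
   PUT msrp_m1= COMMA12.2 msrp_m2= COMMA12.2 msrp_m3= COMMA12.2;
   PUT diff_1_3= PERCENT8.1 diff_2_3= PERCENT8.1;
RUN;
```

Note that predictions from the log models must be transformed back with EXP before they are compared with a prediction in dollars.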


When linear regression can’t be used

If any of the 4 assumptions is not satisfied, linear regression should not be used.

Linear regression can’t be used when:

• the relation between y and x is nonlinear

• the errors are not normally distributed

• there is a dependency within the values of the dependent variable

• the variance pattern of y is not the same for the entire range of x.


Multiple regression

Let’s investigate the relation between the discount and car length.

It is likely that the discount is impacted by many other factors besides car length:

• car manufacturer (`make’)

• car model (`model’)

• car type (`type’)

• drivetrain (`drivetrain’)

• production place (`origin’)

• engine (`enginesize’, `cylinders’, `horsepower’)

• fuel efficiency (`MPG_City’, `MPG_Highway’)

• and maybe car weight (`weight’) and wheel base (`wheelbase’).


Multiple regression II

If our regression model omits any important independent variable, then the model suffers from omitted variable bias.

The OLS estimator is likely to be biased and inconsistent due to omitted variable bias.

=⇒ coefficient estimates might become unreliable.

One should include as many relevant variables as possible in the regression model if the data are available.

I found that some variables in the previous slide cannot be included in the model together with car length (due to multicollinearity).


Multicollinearity is a phenomenon due to high interdependency between the independent variables.

If we include highly correlated independent variables in the same regression model, then this could cause multicollinearity.

Implications of multicollinearity:

• it increases the variance of the coefficient estimates and makes the estimates very sensitive to minor changes in the model

• coefficient estimates might become unstable and difficult to interpret

• t-test results are not trustworthy, etc.
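A common diagnostic for multicollinearity, which the slides do not mention, is the variance inflation factor (VIF). The sketch below is an illustration under assumptions: it builds the discount variable from SASHELP.CARS in a hypothetical table work.car_vif and picks four size-related regressors only as an example.

```sas
/* Sketch: variance inflation factors for size-related regressors */
DATA work.car_vif;
   SET sashelp.cars;
   discount = msrp / invoice - 1;
RUN;

PROC REG DATA=work.car_vif;
   /* VIF = variance inflation factor, TOL = tolerance (1 / VIF) */
   MODEL discount = length weight wheelbase enginesize / VIF TOL;
RUN;
QUIT;
```

As a rough rule of thumb (thresholds vary by textbook), a VIF well above 5 to 10 suggests that a variable is highly correlated with the other regressors.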


Multiple regression III

First, let’s check the summary statistics of the discount.

/* Creating discount variable: */
DATA work.car_data;
SET work.car_data;
discount=msrp/invoice-1;
RUN;

/* Summary statistics and plots: */
proc univariate data=work.car_data plots;
var discount;
RUN;


Properties of DISCOUNT


Properties of DISCOUNT II


Properties of DISCOUNT III


Properties of DISCOUNT IV


Correlation matrix

Let’s look at the correlation matrix of discount and other numerical variables:

proc corr data=work.car_data;
var discount invoice enginesize cylinders horsepower
MPG_City MPG_Highway Length weight wheelbase;
RUN;


Correlation matrix II


Determinants of DISCOUNT

Let’s consider engine size as a potential determinant of the discount.

SAS code for scatterplot:

proc gplot data=work.car_data;
title 'Scatter plot of discount and engine size';
plot Discount * Enginesize=1;
RUN;
QUIT;


Determinants of DISCOUNT II
