# Department of Industrial Engineering & Operations Research

IEOR160 Operations Research I

Project Due Date: 05/01/2017

Optimization is an essential part of statistics and data analysis. In this project you will use nonlinear and integer optimization to answer questions for a "best subsets problem" common in regression analysis, applied here to diabetes progression.

## Problem description

For 442 diabetes patients, ten baseline variables were collected: age, sex, body mass index (BMI), average blood pressure (BP), and six blood serum measurements. Additionally, for each patient, a quantitative measure of disease progression one year after the baseline was also collected. The data can be found at http://web.stanford.edu/~hastie/Papers/LARS/diabetes.data. Standardized data is available at http://web.stanford.edu/~hastie/Papers/LARS/diabetes.sdata.txt.

You are asked to construct a regression model to predict the disease progression from the baseline observations. The model should be able to accurately predict disease progression for future patients, and also indicate which independent variables are important factors in disease progression.
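As a starting point, the data can be loaded and split into the in-sample and out-of-sample portions used throughout the project. The sketch below uses scikit-learn's bundled copy of this same (standardized) diabetes dataset to avoid a manual download; substitute the Stanford files above if exact values matter.

```python
# Sketch: load the diabetes data and split into the first 250 training
# points and the remaining 192 patients held out for out-of-sample tests.
# Assumption: scikit-learn's bundled copy stands in for the Stanford files.
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)  # 442 patients, 10 baseline variables
X_train, y_train = X[:250], y[:250]    # used for fitting (parts a-d, f, g)
X_test, y_test = X[250:], y[250:]      # held out for parts (e) and (h)

print(X_train.shape, X_test.shape)
```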

## Part 1

Assuming that the relationship between the dependent variable $y$ (measure of disease progression) and the independent variables $x_1, \ldots, x_{10}$ is linear, i.e.,

$$y = \sum_{j=1}^{10} \beta_j x_j + \epsilon,$$

where $\epsilon$ is a normally distributed error term, regression coefficients were fitted with the least-squared-error approach (using the first 250 data points), and the following results were observed.

| independent var. | β | std. error | t-stat | p-value |
| --- | ---: | ---: | ---: | ---: |
| Age | -59.6 | 80.4 | -0.7 | 0.459 |
| Sex | -241.6 | 84.6 | -2.9 | 0.005 |
| BMI | 535.1 | 95.0 | 5.6 | 0.000 |
| BP | 241.7 | 91.7 | 2.6 | 0.009 |
| S1 | -844.9 | 627.7 | -1.3 | 0.180 |
| S2 | 407.4 | 525.2 | 0.8 | 0.439 |
| S3 | -224.3 | 311.0 | -0.7 | 0.471 |
| S4 | 285.2 | 221.0 | 1.3 | 0.198 |
| S5 | 762.4 | 243.8 | 3.1 | 0.002 |
| S6 | 169.6 | 87.1 | 1.9 | 0.053 |
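A least-squares fit of the kind behind the table can be sketched as follows. This uses scikit-learn's bundled copy of the dataset (an assumption); since the table's numbers come from the Stanford files under a particular scaling, the fitted values here need not match digit for digit.

```python
# Sketch of an ordinary least-squares fit on the first 250 patients,
# solving min ||y - A c||^2 with an explicit intercept column.
import numpy as np
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)
X_train, y_train = X[:250], y[:250]

A = np.column_stack([np.ones(len(X_train)), X_train])  # prepend an intercept column
coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
intercept, beta = coef[0], coef[1:]
for name, b in zip(load_diabetes().feature_names, beta):
    print(f"{name:>4}  {b:9.1f}")
```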

In order to increase the predictive value and interpretability of the regression coefficients, doctors would like to have a model that uses only four independent variables.


For questions (a)–(d), use only the first 250 patients in the dataset.

(a) Based on the p-values of the regression coefficients given above, which four independent variables seem most important?

(b) In order to find the best combination of four independent variables, a possible method (implemented in most statistical software) is to fit a regression for all possible combinations of four variables and choose the "best" one (e.g., the one with the largest $R^2$). In this case, it involves fitting $\binom{10}{4} = 210$ regressions. Alternatively, heuristics are also used (e.g., stepwise selection). In this question, you need to use a simple heuristic: using only the independent variables corresponding to Sex, BMI, BP, S5, and S6, fit a regression for all possible subsets of four variables and select the one with the best $R^2$ value. Which variables are used? What are the regression coefficients? What is $R^2$?
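The heuristic in (b) can be sketched as a loop over the $\binom{5}{4} = 5$ candidate subsets. The column indices assume scikit-learn's ordering for its bundled copy of the dataset.

```python
# Sketch: enumerate all 4-variable subsets of {Sex, BMI, BP, S5, S6},
# fit OLS on the first 250 patients, keep the largest in-sample R^2.
from itertools import combinations

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
X_train, y_train = X[:250], y[:250]
names = load_diabetes().feature_names  # ['age', 'sex', 'bmi', 'bp', 's1', ..., 's6']
candidates = [names.index(v) for v in ("sex", "bmi", "bp", "s5", "s6")]

best = max(
    combinations(candidates, 4),
    key=lambda cols: LinearRegression()
    .fit(X_train[:, list(cols)], y_train)
    .score(X_train[:, list(cols)], y_train),
)
best = list(best)
model = LinearRegression().fit(X_train[:, best], y_train)
print([names[j] for j in best], model.coef_, model.score(X_train[:, best], y_train))
```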

(c) Use Lasso with regularization parameter λ ∈ {200, 220, 240, 260, 280, 300, 320, 340, 360, 380, 400}. Which value of λ would you use? What are the corresponding regression coefficients? What is the value of $R^2$?
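A sweep over the given λ values can be sketched with scikit-learn. Note that scikit-learn's `Lasso` minimizes $\frac{1}{2n}\lVert y - X\beta\rVert^2 + \alpha\lVert\beta\rVert_1$, so a λ stated for the unscaled squared error would map to $\alpha = \lambda / (2n)$; that mapping, and the use of the bundled dataset, are assumptions here.

```python
# Sketch: fit Lasso for each candidate lambda, report how many
# coefficients survive and the in-sample R^2.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)
X_train, y_train = X[:250], y[:250]
n = len(y_train)

for lam in range(200, 401, 20):
    model = Lasso(alpha=lam / (2 * n), max_iter=100_000).fit(X_train, y_train)
    n_nonzero = sum(abs(b) > 1e-8 for b in model.coef_)
    print(lam, n_nonzero, round(model.score(X_train, y_train), 3))
```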

(d) Use mixed-integer optimization to find the model with the best $R^2$ value that uses four independent variables (you may assume that there exists an optimal solution where $|\beta_j| \le 1000$ for $j = 1, \ldots, 10$). What are the regression coefficients? What is the value of $R^2$?
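One standard way to encode the cardinality restriction in (d) is a big-M formulation with binary indicator variables $z_j$; a sketch (the bound 1000 is the one given in the problem statement):

```latex
\min_{\beta,\, z} \;\; \sum_{i=1}^{250} \Big( y_i - \sum_{j=1}^{10} \beta_j x_{ij} \Big)^{\!2}
\quad \text{s.t.} \quad
-1000\, z_j \le \beta_j \le 1000\, z_j, \qquad
\sum_{j=1}^{10} z_j = 4, \qquad
z_j \in \{0, 1\}.
```

Since the total sum of squares of $y$ is fixed, maximizing the in-sample $R^2$ is equivalent to minimizing this residual sum of squares, so the problem is a mixed-integer quadratic program.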

(e) We now want to test the out-of-sample accuracy of the five methods used above. Using the regressions obtained in parts (b)–(d) and the regression coefficients presented in the table, which method results in a better prediction for the remaining patients in the dataset (i.e., patients 251 to 442)?
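The out-of-sample comparison in (e) can be sketched as follows; the full-OLS model stands in for any of the five methods. Out-of-sample $R^2$ is computed against the mean of the test responses here; comparing mean squared errors instead is a reasonable alternative.

```python
# Sketch: fit on patients 1-250, score on patients 251-442.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
X_train, y_train = X[:250], y[:250]
X_test, y_test = X[250:], y[250:]

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
mse = np.mean((y_test - pred) ** 2)
r2_out = 1 - np.sum((y_test - pred) ** 2) / np.sum((y_test - y_test.mean()) ** 2)
print(round(mse, 1), round(r2_out, 3))
```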

## Part 2

Now suppose the regression model includes all second-order interactions of the independent variables:

$$y = \sum_{j=1}^{10} \beta_j x_j + \sum_{j=1}^{10} \sum_{k=j}^{10} \beta_{jk}\, x_j x_k + \epsilon.$$

We are now interested in methods that use at most 10 independent variables (out of 65). Observe that in this case, approaches that enumerate all possible subsets need to compute $\binom{65}{10} \approx 1.7 \cdot 10^{11}$ regressions.
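The 65-column design matrix for Part 2 (the 10 baseline variables plus the 55 products $x_j x_k$ with $j \le k$, squares included) can be built with scikit-learn's `PolynomialFeatures`, which produces exactly these columns:

```python
# Sketch: expand the 10 baseline variables into all degree-2 terms.
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import PolynomialFeatures

X, y = load_diabetes(return_X_y=True)
poly = PolynomialFeatures(degree=2, include_bias=False)
X2 = poly.fit_transform(X)  # columns: x_1..x_10, then x_j * x_k for j <= k
print(X2.shape)             # -> (442, 65)
```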

For questions (f) and (g), use only the first 250 data points.

(f) Use Lasso to find regression coefficients that satisfy the restriction of having only 10 independent variables. Which value of the regularization parameter did you use? Why? What are the regression coefficients? What is $R^2$?
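One way to choose the penalty in (f) is a bisection on the Lasso parameter: larger penalties zero out more coefficients, so one can search for a value keeping about 10 variables. The bracket endpoints below are guesses, and Lasso may jump past "exactly 10", in which case the nearest achievable sparsity is reported; both are assumptions of this sketch.

```python
# Sketch: bisect on alpha so that roughly 10 of the 65 coefficients survive.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures

X, y = load_diabetes(return_X_y=True)
X2 = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
X_train, y_train = X2[:250], y[:250]

def n_nonzero(alpha):
    coef = Lasso(alpha=alpha, max_iter=100_000).fit(X_train, y_train).coef_
    return sum(abs(b) > 1e-8 for b in coef)

lo, hi = 1e-4, 10.0  # guessed bracket: lo keeps > 10 variables, hi keeps <= 10
for _ in range(30):
    mid = (lo + hi) / 2
    if n_nonzero(mid) > 10:
        lo = mid          # penalty too weak: too many variables survive
    else:
        hi = mid
print(hi, n_nonzero(hi))
```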

(g) Use mixed-integer optimization to find the model with the best $R^2$ value that uses 10 independent variables (you may assume that there exists an optimal solution where $|\beta_j| \le 1000$ for $j = 1, \ldots, 10$ and $|\beta_{jk}| \le 5000$ for $j = 1, \ldots, 10$, $k = j, \ldots, 10$). What are the regression coefficients? What is $R^2$?

(h) What is the out-of-sample accuracy of the methods used in parts (f)–(g)?
