# CS计算机代考程序代写 algorithm decision tree cache IEOR 142: Introduction to Machine Learning and Data Analytics, Spring 2021

IEOR 142: Introduction to Machine Learning and Data Analytics, Spring 2021

Name:

Instructions:

Practice Midterm Exam 2

March 2021

1. Answer the questions in the spaces provided on the question sheets. If you run out of room for an answer, continue on the back of the page.

2. You are allowed one (double sided) 8.5 x 11 inch note sheet and a simple pocket calculator. The use of any other note sheets, textbook, computer, cell phone, other electronic device besides a simple pocket calculator, or other study aid is not permitted.

3. You will have until 5:00PM to turn in the exam.

4. Whenever a question asks for a numerical answer (such as 2.7), you may write your answer as an

expression involving simple arithmetic operations (such as 2(1) + 1(0.7)).

5. Good luck!

1

IEOR 142 Practice Midterm Exam, Page 2 of 15 March 2021

1 True/False and Multiple Choice Questions – 48 Points

Instructions: Please circle exactly one response for each of the following 16 questions. Each question is worth 3 points. There will be no partial credit for these questions.

1. In logistic regression, it is assumed that the dependent variable Y corresponds to a probability value in the interval [0, 1].

A. True

B. False

2. Consider a previously trained logistic regression model for predicting whether images correspond to cats or dogs. If p denotes the model’s prediction of the probability that a new image is a dog, then it must be the case that 0 < p < 1.
A. True
B. False
3. The probability model underlying logistic regression includes an assumption that the feature vector X comes from a multivariate normal distribution.
A. True
B. False
4. The probability model underlying linear discriminant analysis (LDA) includes an assumption that, con- ditional on observing the value of the dependent variable (i.e., Y = k for some k ∈ {1, 2, . . . , K}), the feature vector X comes from a multivariate normal distribution.
A. True
B. False
5. If we use k-fold cross validation on a training set to select a final model, then there is no need to evaluate the performance of this model on a test set since it is impossible for this model to overfit the training set.
A. True
B. False
6. One of the main reasons that boosting is effective is because, at every iteration, the algorithm finds a new decision tree that is very large (i.e., its depth is very big).
A. True
B. False
7. Consider a training set with n = 1000 data points. Then, using 10-fold cross to select the cp parameter in CART is roughly 100 times cheaper in terms of total computation time than using leave-one-out cross validation.
A. True
B. False
IEOR 142 Practice Midterm Exam, Page 3 of 15 March 2021
8. Due to the random initialization, it is possible for the K-means algorithm to return different cluster assignments on the same dataset.
A. True
B. False
9. It is not possible to train an LDA (linear discriminant analysis) model on a dataset that includes one or more categorical features.
A. True
B. False
10. Consider a time series Xt for t = 1, 2, . . . T . Then the correct model equation for an auto-regressive model is Xt = β0 + β1t + Xt−1 + εt where β0 and β1 are unknown coefficients that must be trained from the data and εt is the error term.
A. True
B. False
11. The Zipf distribution of word frequencies has a light tail that decays exponentially fast. In other words, uncommon words make up a very small percentage of all of the text in a typical document corpus.
A. True
B. False
12. Consider a binary classification problem for predicting whether or not a student will pass a midterm exam based on how many hours the student has spent studying. Suppose that 90% of students will pass the exam, and 10% will fail. Given that a student will pass the exam, the number of hours spent studying is a continuous random variable with density function fpass(x). Likewise, given that a student will fail the exam, the number of hours spent studying is a continuous random variable with density function ffail(x). Which of the following is a correct expression for Pr(Pass | 10 hours spent studying), the probability that the student passes given 10 hours spent studying?
A. Pr(Pass | 10 hours spent studying) = fpass(10) fpass(10) + ffail(10)
B. Pr(Pass | 10 hours spent studying) = 0.9·fpass(10) 0.9·fpass(10) + 0.1·ffail(10)
C. Pr(Pass | 10 hours spent studying) = 0.9 · fpass(10)
D. Pr(Pass | 10 hours spent studying) = 0.9·fpass(10) fpass(10) + ffail(10)
IEOR 142 Practice Midterm Exam, Page 4 of 15 March 2021
13. Consider the CART algorithm for binary classification. A desirable property of an impurity function for the CART algorithm is that a split never increases the total impurity cost of the tree (i.e., using the notation from class we have ∆ ≥ 0). Figure 1 below depicts four potential impurity functions that might be used in CART, each as a function of the proportion p of observations with Y = 0 in the current bucket. Which of these impurity functions have the desirable property mentioned above?
A. All four
B. Only A, B, and C
C. Only B and C
D. Only C
Figure 1
14. In a binary classification problem, an effective method for building a model and generating a confidence interval for its TPR (true positive rate) is to first split the entire dataset into a training set and a test set and then:
A. Train the model on the training set, then use the bootstrap on the entire dataset B. Train the model on the training set, then use the bootstrap on the training set
C. Train the model on the training set, then use the bootstrap on the test set
D. Train the model on the entire dataset, then use the bootstrap on the test set
IEOR 142 Practice Midterm Exam, Page 5 of 15 March 2021
15. Consider the following ROC curves for two different models – the “dashed model” whose ROC curve is the dashed curve, and the “solid model” whose ROC curve is the solid curve. The baseline is also drawn for comparison. Which statement below is a correct interpretation of these ROC curves:
A. Since the AUC of the solid model is greater than the AUC of the dashed model, the solid model is always preferable over the dashed model
B. If we desire a model with FPR between 0.8 and 0.9, then the solid model is strictly preferable over the dashed model
C. If we desire a model with TRP between 0.8 and 0.9, then the solid model is strictly preferable over the dashed model
D. None of the above
16. A linear regression model was trained on a training set of size 145, and estimates of OSR2 (out-of- sample R2) and MAE (mean absolute error) were computed on a test set of size 63. The test set OSR2 is 0.732 and the test set MAE is 38.5. The bootstrap was then used in order to estimate the bias of the estimates of OSR2 and MAE. First, B = 10,000 bootstrap datasets were used, then a second analysis with B = 1, 000, 000 bootstrap datasets was done. The results are summarized in the table below.
A correct interpretation of these results is:1
A. Both the OSR2 estimate based on the test data and the MAE estimate based on the test data
are most likely unbiased.
B. Both the OSR2 estimate based on the test data and the MAE estimate based on the test data are most likely biased.
C. The MAE estimate based on the test data is most likely unbiased, but more bootstrap samples (B ≫ 1, 000, 000) are needed to determine if OSR2 is biased or not.
D. The MAE estimate based on the test data is most likely unbiased, whereas the OSR2 estimate based on the test data is most likely biased.
1Recall that “unbiased” means that the bias of an estimator is 0, and “biased” means that the bias is not equal to 0.
Bias of OSR2 (Bootstrap Estimate)
Bias of MAE (Bootstrap Estimate)
B = 10, 000
0.0300
0.223
B = 1, 000, 000
0.0297
0.00927
IEOR 142 Practice Midterm Exam, Page 6 of 15 March 2021
2 Short Answer Questions – 52 Points
Instructions: Please provide justification and/or show your work for all questions, but please try to keep your responses brief. Your grade will depend on the clarity of your answers, the reasoning you have used, as well as the correctness of your answers.
1. (20 points) This problem concerns a computer hardware dataset from the 1980s with n = 208 obser- vations, each corresponding to a different CPU (central processing unit) model. For each CPU model, the dataset includes a variable called PRP, which is a measure of the relative performance of that CPU model as compared to a particular baseline model as published by a magazine.2 We are interested in predicting the PRP value of a CPU model based on its basic attributes. Table 1 below describes these attributes in more detail.
Table 1: Description of the dataset.
Variable Description
MYCT Machine cycle time in nanoseconds MMIN Minimum main memory in kilobytes MMAX Maximum main memory in kilobytes CACH Cache memory in kilobytes
CHMIN Minimum channels in units CHMAX Maximum channels in units PRP Published relative performance
Summary statistics for the full dataset are provided below:
MYCT MMIN MMAX CACH
Min. : 17.0 Min. : 64 Min. : 64 Min. : 0.0
1st Qu.: 50.0 1st Qu.: 768 1st Qu.: 4000 1st Qu.: 0.0 1st Qu.: 1.000
Median : 110.0 Median : 2000 Median : 8000 Median : 8.0 Median : 2.000
Mean : 204.2
3rd Qu.: 225.0
Max. :1500.0
CHMAX
Min. : 0.00
Mean : 2881
3rd Qu.: 4000
Max. :32000
PRP
Min. : 6.0
Mean :11824
3rd Qu.:16000
Max. :64000
Mean : 24.1
3rd Qu.: 32.0
Max. :256.0
Mean : 4.644
3rd Qu.: 6.000
Max. :52.000
1st Qu.: 5.00 1st Qu.: 27.0
Median : 8.00 Median : 49.5
Mean : 17.74
3rd Qu.: 24.00
Max. :176.00
Mean : 105.2
3rd Qu.: 111.5
Max. :1150.0
CHMIN
Min. : 0.000
2For more details, see the paper “Attributes of the performance of central processing units: a relative performance prediction model” by Phillip Ein-Dor and Jacob Feldmesser, Communications of the ACM, Volume 30 Issue 4, April 1987.
IEOR 142 Practice Midterm Exam, Page 7 of 15 March 2021
The dataset was split into a training set with 145 observations and a test set with 63 observations, and a linear regression model was built, using the training data, to predict PRP based on the five independent variables. The output from R is given below.
> summary(mod1)

Call:

lm(formula = PRP ~ MYCT + MMIN + MMAX + CACH + CHMIN + CHMAX,

data = train.cpu)

Residuals:

Min 1Q Median 3Q Max

-168.674 -15.280 1.707 20.551 229.450

Coefficients:

Estimate Std. Error t value

Pr(>|t|)

0.00181 **

0.08437 .

(Intercept) -28.27

MYCT 0.02895

MMIN 0.01806

MMAX 0.002705

CACH 0.6985

CHMIN 2.906

CHMAX 0.3732

—

8.888105 -3.18

0.016653

0.001742

0.000695

0.139216

0.969768

0.276328

1.74

10.37 < 0.0000000000000002 ***
3.89
5.02
3.00
1.35
0.00015 ***
0.0000016 ***
0.00323 **
0.17897
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 48.67 on 138 degrees of freedom
Multiple R-squared: 0.88, Adjusted R-squared: 0.8748
F-statistic: 168.7 on 6 and 138 DF, p-value: < 0.00000000000000022
The training data was further used to compute a correlation table, the output of which is given below.
> cor(train.cpu)

MYCT MMIN MMAX CACH CHMIN CHMAX PRP

MYCT 1.0000000 -0.3248537 -0.4116224 -0.3358850 -0.3046321 -0.2948241 -0.3240128

MMIN -0.3248537 1.0000000 0.7968010 0.5795231 0.5764991 0.2821091 0.8907329

MMAX -0.4116224 0.7968010 1.0000000 0.5445642 0.5800799 0.3708809 0.8150132

CACH -0.3358850 0.5795231 0.5445642 1.0000000 0.5731102 0.3622669 0.6974907

CHMIN -0.3046321 0.5764991 0.5800799 0.5731102 1.0000000 0.5812387 0.6887875

CHMAX -0.2948241 0.2821091 0.3708809 0.3622669 0.5812387 1.0000000 0.4107311

PRP -0.3240128 0.8907329 0.8150132 0.6974907 0.6887875 0.4107311 1.0000000

Furthermore, variance inflation factors for the independent variables in the linear regression model were also computed.

> vif(mod1)

MYCT MMIN MMAX CACH CHMIN CHMAX

1.258289 3.139825 3.138405 1.777128 2.292270 1.589360

IEOR 142 Practice Midterm Exam, Page 8 of 15 March 2021

Please answer the following questions.

(a) (4 points) A CPU manufacturer is considering a new model with machine cycle time of 75 nanosec- onds, minimum main memory of 2,500 kilobytes, maximum main memory of 10,000 kilobytes, cache memory of 30 kilobytes, a minimum of 4 channels and a maximum of 8 channels. Use the R output on the previous pages to make a prediction for the published relative performance of the proposed CPU model.

Solution:

yˆ = −28.27 + 0.02895 ∗ (75) + 0.01806 ∗ (2500) + 0.002705 ∗ (10000)+ 0.6985 ∗ (30) + 2.906 ∗ (4) + 0.3732 ∗ (8)

= 81.67

(b) (4 points) Is there a high degree of multicollinearity present in the training set? On what have you based your answer?

Solution: No, there is not a high degree of multicollinearity present in the training set since the VIF values for all coefficients are relatively low, e.g., less than 5.

IEOR 142 Practice Midterm Exam, Page 9 of 15 March 2021 (c) (4 points) Based on the R output on the previous pages, is there enough evidence to conclude that

the true coefficient corresponding to CHMAX is not equal to 0? On what have you based your answer?

Solution: No, there is not enough evidence to conclude that this coefficient is not equal to 0 since the p-value associated with that coefficient, 0.17897, is too large. Indeed, at any reasonable significant level (e.g., 0.05 or 0.10) we would not be able to reject the null hypothesis that this coefficient is equal to 0.

(d) (4 points) Consider adding a new independent variable to the model called CHAVG, which is defined as CHAVG = (CHMIN + CHMAX)/2. Is it possible for this new variable to improve the linear regression model for predicting PRP? Explain your answer.

Solution: No, it is not possible for this new variable to improve the linear regression model since CHAVG is a linear combination of existing features. Any relationship between PRP and CHAVG is already captured by the existing model. Put another way, if we were to add CHAVG to the model then there would be perfect multicollinearity between CHAVG, CHMIN, and CHMAX and so removing any one of them would not hurt the model.

(e) (4 points) Consider adding a new independent variable to the model called CACHSQUARED, which is defined as CACHSQUARED = (CACH)2. Is it possible for this new variable to improve the linear regression model for predicting PRP? Explain your answer.

Solution: Yes, it is possible for CACHSQUARED to improve the linear regression model since CACHSQUARED is a nonlinear transformation of an existing feature. Indeed, if there is a strong nonlinear (in particular, quadratic) relationship between PRP and CACH, then this new variable will improve the model.

IEOR 142 Practice Midterm Exam, Page 10 of 15 March 2021

2. (10 points) Please answer the following two questions concerning Bagging and Random Forests.

(a) (5 points) Andrew has ambitiously tried to write his own code for Bagging (Bootstrap Aggregation) of CART models. Unfortunately, he has a bug in his code when generating the bootstrap datasets because he accidentally implemented sampling without replacement (instead of sampling with re- placement). You may assume that there are no other bugs in the Bagging code. Andrew is now preparing to run some experiments on several different datasets (both regression and classification problems) in order to compare the test set performance of his Bagging implementation with that of the basic CART method with cp = 0 and all other parameters set to their default values. You may also assume that these other parameters for CART are set to the same default values within the Bagging code. What do you expect the results of Andrew’s experiments to be? In other words, how do you expect the test set performance of Andrew’s Bagging implementation to compare to that of CART with cp = 0? Briefly explain your response.

Solution: With Andrew’s bug, Bagging will perform exactly the same as CART with cp = 0 on all datasets. The reason for this is that since we are sampling without replacement here, all bootstrap datasets will be the same as the original training set. Moreover, each CART model that we fit will be exactly the same, and the bagged model that is obtained by averaging the individual CART models will also be the same as each of the individual CART models. In summary, there is no advantage of Bagging here.

IEOR 142 Practice Midterm Exam, Page 11 of 15 March 2021

(b) (5 points) Meng has now decided to write her own code for Random Forests, using Andrew’s code with the same bug (sampling without replacement instead of sampling with replacement) as a starting point. You may assume that there are no other bugs in her Random Forests code and that mtry (a.k.a. m) is set to something less than the number of features p. Meng is now preparing to run some tests on several different datasets (both regression and classification problems) in order to compare the test set performance of her Random Forests implementation with that of the basic CART method with cp = 0. What do you expect the results of Meng’s experiments to be? In other words, how do you expect the test set performance of Meng’s Random Forests implementation to compare to that of CART with cp = 0? Briefly explain your response.

Solution: Now we would expect that Random Forests may sometimes perform better than CART, and will usually always perform at least as well as CART. The reason for this is that, although the bug still exists, Random Forests will produce different trees due to the random- ness of selecting which features to split on based on the value of mtry, which is smaller than p. Therefore, Random Forests may still produce a somewhat diverse set of trees which may outperform CART.

IEOR 142 Practice Midterm Exam, Page 12 of 15 March 2021

3. (22 points) In this problem, we will revisit the Lending Club loans.csv dataset from Lecture 3. Recall that we would like to build a model to predict the variable not.fully.paid (which is equal to 1 if the borrower defaults on the loan) based on the given six independent variables, which are summarized in Table 2 below.

Variable

installment

log.annual.inc

fico

revol.bal

inq.last.6mths

pub.rec

not.fully.paid

Table 2: Description of the dataset.

Description

Monthly loan installment in dollars Log(annual income)

FICO score

Revolving balance in thousands of dollars Number of inquiries in the past six months Number of deleterious public records

Equal to 1 if the borrower defaults on the loan

Recall that a positive observation corresponds to someone who defaults, i.e., not.fully.paid = 1. We will consider the same scenario as in class, whereby we lose $4,000 every time a borrower defaults on a loan, and we gain a profit of $1,000 every time a borrower does not default. This scenario is summarized by the decision tree shown in Figure 2 below.

Figure 2: Simple Lending Decision Tree

In this problem, we will consider applying CART and Bagging on this dataset.

IEOR 142 Practice Midterm Exam, Page 13 of 15 March 2021 Please answer the following questions.

(a) (6 points) Determine specific numerical values of LF P and LF N such that training a CART model in order to minimize

LF N (# of False Negatives) + LF P (# of False Positives)

is the same as training a CART model in order to maximize total profit, and explain why the two

are equivalent.

Solution: Let’s first consider the case of a positive observation, i.e., someone who is actually a defaulter. Then, the loss of a false negative is clearly the cost associated with funding the loan and having that person default, which is LFN = 4000. Now consider, the case of a negative observation, i.e., someone who will not default. If we incorrectly classify this person as a positive, then we end up not funding the loan and we lose out on $1000. Thus, the loss of a false negative can be regarded as the opportunity cost which is LF P = 1000.

A CART model was trained using the values of LF P and LF N determined in part (a) above. The tree corresponding to this model is displayed in Figure 3. Recall that a prediction of not.fully.paid = 1 corresponds to a “bad risk,” and a prediction of not.fully.paid = 0 corresponds to a “good risk.”

fico >= 740

yes

no

0

inq.last.6mths < 3
yes
no
fico >= 665

yes

01

no

1

Figure 3: CART Tree

Use the CART tree above to answer the following two questions.

IEOR 142 Practice Midterm Exam, Page 14 of 15 March 2021

(b) (4 points) Consider a new potential borrower with a FICO score of 650, and suppose that no other information is known about this borrower. Does the CART model above classify this borrower as a good risk or a bad risk? Explain your answer. If you do not have enough information about the borrower to answer this question precisely, then explain what additional information is needed and how this additional information affects the classification of the model.

Solution: The CART model will classify this borrower as a bad risk. To see this, we first start at the top of the tree. Since the FICO score is not greater than or equal to 740, we move to the right branch. The next step depends on whether or not the number of inquires is less than three; since this information is not known, we consider both cases separately. If the number of inquires is not less than three, then the model predicts a bad risk. On the other hand, if the number of inquires is less than three, then since the FICO score is not greater than or equal to 665, the model predicts a bad risk as well. In both cases, the model predicts a bad risk.

(c) (4 points) Consider a new potential borrower with 4 inquiries in the past six months, and suppose that no other information is known about this borrower. Does the CART model above classify this borrower as a good risk or a bad risk? Explain your answer. If you do not have enough information about the borrower to answer this question precisely, then explain what additional information is needed and how this additional information affects the classification of the model.

Solution: The answer to this depends on the FICO score. If the FICO score is greater than or equal to 740, then the model predicts a good risk based on the root node. On the other hand, if the FICO score is less than 740, then we move down the right branch and since the number of inquires is 4, which is not less than 3, we predict a bad risk.

IEOR 142 Practice Midterm Exam, Page 15 of 15 March 2021

(d) (4 points) Let us consider applying Bagging (Bootstrap Aggregation) of CART trees for this prob-

lem. Suppose that we construct B different bootstrap training sets, and for each bootstrap training

set we train a CART model using the standard choice of LF N = LF P = 1. In this case, the propor-

tion values in each bucket can be used to define probability estimates. Given a new feature vector

x corresponding to a new potential borrower, let fˆ(x) denote the prediction of the probability b

that this borrower will default by the bth CART model. Explain precisely how you would use the predicted probability values fˆ(x),fˆ(x),…,fˆ (x) to classify the new borrower as either a good

risk or a bad risk.

12B

Solution: Since in this case CART gives us probability estimates, we can average them together

to obtain an even better probability estimate (due to the variance reduction properties of

Bagging). That is, define pˆ := 1 B fˆ (x). Now, we can use the threshold value of p = 0.20 B b=1 b

derived in class to classify this potential borrower. In particular, if pˆ > 0.20, then we classify this borrower as a bad risk. Otherwise, if pˆ < 0.20, then we classify this borrower as a good risk.
(e) (4 points) Repeat part (d) instead under the assumption that each CART model is trained using
the values of LF P and LF N determined in part (a) (instead of LF N = LF P = 1). For this question,
you should also instead assume that fˆ (x) is the classification of the borrower as either a good risk b
or a bad risk. That is, fˆ (x) takes the value of either 0 or 1. b
Solution: Since each CART tree was already trained using the loss values in part (a), we can simply take the majority vote. That is, define pˆ := 1 B fˆ (x). Now, if pˆ > 0.50, then more

B b=1 b

of the individual models classify as a bad risk and so bad risk is our final classification. On the

other hand, if pˆ < 0.50, then more of the individual models classify as a good risk and so good risk is our final classification.