# IEOR 142: Introduction to Machine Learning and Data Analytics, Spring 2021

Practice Midterm Exam 2
March 2021
Name:
Instructions:
1. Answer the questions in the spaces provided on the question sheets. If you run out of room for an answer, continue on the back of the page.
2. You are allowed one (double sided) 8.5 x 11 inch note sheet and a simple pocket calculator. The use of any other note sheets, textbook, computer, cell phone, other electronic device besides a simple pocket calculator, or other study aid is not permitted.
3. You will have until 5:00PM to turn in the exam.
4. [...] expression involving simple arithmetic operations (such as 2(1) + 1(0.7)).
5. Good luck!

IEOR 142 Practice Midterm Exam, Page 2 of 15 March 2021
1 True/False and Multiple Choice Questions – 48 Points
Instructions: Please circle exactly one response for each of the following 16 questions. Each question is worth 3 points. There will be no partial credit for these questions.
1. In logistic regression, it is assumed that the dependent variable Y corresponds to a probability value in the interval [0, 1].
A. True B. False
Call:
lm(formula = PRP ~ MYCT + MMIN + MMAX + CACH + CHMIN + CHMAX,
    data = train.cpu)

Residuals:
     Min       1Q   Median       3Q      Max
-168.674  -15.280    1.707   20.551  229.450

Coefficients:
             Estimate  Std. Error  t value              Pr(>|t|)
(Intercept)  -28.27      8.888105    -3.18               0.00181 **
MYCT           0.02895   0.016653     1.74               0.08437 .
MMIN           0.01806   0.001742    10.37  < 0.0000000000000002 ***
MMAX           0.002705  0.000695     3.89               0.00015 ***
CACH           0.6985    0.139216     5.02             0.0000016 ***
CHMIN          2.906     0.969768     3.00               0.00323 **
CHMAX          0.3732    0.276328     1.35               0.17897

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 48.67 on 138 degrees of freedom
Multiple R-squared: 0.88, Adjusted R-squared: 0.8748
F-statistic: 168.7 on 6 and 138 DF, p-value: < 0.00000000000000022

The training data was further used to compute a correlation table, the output of which is given below.
> cor(train.cpu)
            MYCT       MMIN       MMAX       CACH      CHMIN      CHMAX        PRP
MYCT   1.0000000 -0.3248537 -0.4116224 -0.3358850 -0.3046321 -0.2948241 -0.3240128
MMIN  -0.3248537  1.0000000  0.7968010  0.5795231  0.5764991  0.2821091  0.8907329
MMAX  -0.4116224  0.7968010  1.0000000  0.5445642  0.5800799  0.3708809  0.8150132
CACH  -0.3358850  0.5795231  0.5445642  1.0000000  0.5731102  0.3622669  0.6974907
CHMIN -0.3046321  0.5764991  0.5800799  0.5731102  1.0000000  0.5812387  0.6887875
CHMAX -0.2948241  0.2821091  0.3708809  0.3622669  0.5812387  1.0000000  0.4107311
PRP   -0.3240128  0.8907329  0.8150132  0.6974907  0.6887875  0.4107311  1.0000000
Furthermore, variance inflation factors for the independent variables in the linear regression model were also computed.
> vif(mod1)
    MYCT     MMIN     MMAX     CACH    CHMIN    CHMAX
1.258289 3.139825 3.138405 1.777128 2.292270 1.589360

(a) (4 points) A CPU manufacturer is considering a new model with machine cycle time of 75 nanoseconds, minimum main memory of 2,500 kilobytes, maximum main memory of 10,000 kilobytes, cache memory of 30 kilobytes, a minimum of 4 channels and a maximum of 8 channels. Use the R output on the previous pages to make a prediction for the published relative performance of the proposed CPU model.
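The arithmetic behind part (a) can be sketched in Python (the exam output itself comes from R). This is only an illustration: the dictionary and function names are assumptions, and the coefficient estimates are copied from the lm() output for the model on the previous pages.

```python
# Coefficient estimates copied from the lm() output (Estimate column).
coefficients = {
    "(Intercept)": -28.27,
    "MYCT": 0.02895,
    "MMIN": 0.01806,
    "MMAX": 0.002705,
    "CACH": 0.6985,
    "CHMIN": 2.906,
    "CHMAX": 0.3732,
}

# Feature values of the proposed CPU model described in part (a).
new_cpu = {
    "MYCT": 75,      # machine cycle time, nanoseconds
    "MMIN": 2500,    # minimum main memory, kilobytes
    "MMAX": 10000,   # maximum main memory, kilobytes
    "CACH": 30,      # cache memory, kilobytes
    "CHMIN": 4,      # minimum number of channels
    "CHMAX": 8,      # maximum number of channels
}


def predict_prp(coefs, features):
    """Linear-regression prediction: intercept + sum of beta_j * x_j."""
    return coefs["(Intercept)"] + sum(
        coefs[name] * value for name, value in features.items()
    )


predicted_prp = predict_prp(coefficients, new_cpu)
print(round(predicted_prp, 2))  # 81.67
```

Written out by hand, the same sum is -28.27 + 0.02895(75) + 0.01806(2500) + 0.002705(10000) + 0.6985(30) + 2.906(4) + 0.3732(8).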
(b) (4 points) Is there a high degree of multicollinearity present in the training set? On what have you based your answer?
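One possible way to reason about part (b) is to compare the reported variance inflation factors against a cutoff. The sketch below assumes a cutoff of 5, which is a common rule of thumb rather than something stated in the exam; the variable names are illustrative.

```python
# VIF values copied from the vif(mod1) output.
vif = {
    "MYCT": 1.258289,
    "MMIN": 3.139825,
    "MMAX": 3.138405,
    "CACH": 1.777128,
    "CHMIN": 2.292270,
    "CHMAX": 1.589360,
}

VIF_THRESHOLD = 5.0  # assumed rule-of-thumb cutoff, not given in the exam

# Variables whose VIF exceeds the cutoff would signal strong multicollinearity.
flagged = [name for name, value in vif.items() if value > VIF_THRESHOLD]
largest = max(vif, key=vif.get)

# VIF_j = 1 / (1 - R_j^2), so the R^2 from regressing variable j on the
# other independent variables can be recovered from its VIF.
implied_r2 = {name: 1 - 1 / value for name, value in vif.items()}

print(flagged)
print(largest, round(implied_r2[largest], 3))
```

Under this assumed cutoff no variable is flagged, and the largest VIF belongs to MMIN.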

(c) (4 points) Based on the R output on the previous pages, is there enough evidence to conclude that the true coefficient corresponding to CHMAX is not equal to 0? On what have you based your answer?
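The test statistic relevant to part (c) can be recomputed from the CHMAX estimate and standard error in the lm() output. The sketch below uses a standard normal approximation for the two-sided p-value, which is an assumption made to stay within the Python standard library; R uses a t distribution with 138 degrees of freedom, so the two p-values differ slightly.

```python
from statistics import NormalDist

# CHMAX row of the coefficient table.
estimate = 0.3732
std_error = 0.276328

# t value = estimate / standard error (R reports 1.35).
t_value = estimate / std_error

# Large-sample approximation to the two-sided p-value (R reports 0.17897).
approx_p = 2 * (1 - NormalDist().cdf(abs(t_value)))

ALPHA = 0.05  # assumed significance level
reject_null = approx_p < ALPHA

print(round(t_value, 2), reject_null)
```

At the assumed 5% level the null hypothesis that the CHMAX coefficient equals 0 is not rejected, which agrees with the absence of a significance star on that row.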
(d) (4 points) Consider adding a new independent variable to the model called CHAVG, which is defined as CHAVG = (CHMIN + CHMAX)/2. Is it possible for this new variable to improve the linear regression model for predicting PRP? Explain your answer.
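A small numerical check may help with part (d). Because CHAVG is an exact linear combination of CHMIN and CHMAX, any prediction that uses a CHAVG coefficient can be reproduced by folding half of that coefficient into each of the other two. The coefficients below are hypothetical and chosen only for illustration; they are not fitted values.

```python
import math

# Hypothetical coefficients for CHMIN, CHMAX and CHAVG (illustration only).
B_MIN, B_MAX, B_AVG = 2.0, 0.5, 1.0


def with_chavg(chmin, chmax):
    """Contribution of the channel variables when CHAVG is in the model."""
    chavg = (chmin + chmax) / 2
    return B_MIN * chmin + B_MAX * chmax + B_AVG * chavg


def without_chavg(chmin, chmax):
    """Same contribution using only CHMIN and CHMAX with adjusted coefficients."""
    return (B_MIN + B_AVG / 2) * chmin + (B_MAX + B_AVG / 2) * chmax


channel_pairs = [(4, 8), (1, 6), (16, 32), (0, 128)]
all_match = all(
    math.isclose(with_chavg(lo, hi), without_chavg(lo, hi))
    for lo, hi in channel_pairs
)
print(all_match)  # True
```

Since the two parameterizations always agree, the set of achievable predictions is unchanged by adding CHAVG.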
(e) (4 points) Consider adding a new independent variable to the model called CACHSQUARED, which is defined as CACHSQUARED = (CACH)^2. Is it possible for this new variable to improve the linear regression model for predicting PRP? Explain your answer.

2. (10 points) Please answer the following two questions concerning Bagging and Random Forests.
(a) (5 points) Andrew has ambitiously tried to write his own code for Bagging (Bootstrap Aggregation) of CART models. Unfortunately, he has a bug in his code when generating the bootstrap datasets because he accidentally implemented sampling without replacement (instead of sampling with replacement). You may assume that there are no other bugs in the Bagging code. Andrew is now preparing to run some experiments on several different datasets (both regression and classification problems) in order to compare the test set performance of his Bagging implementation with that of the basic CART method with cp = 0 and all other parameters set to their default values. You may also assume that these other parameters for CART are set to the same default values within the Bagging code. What do you expect the results of Andrew’s experiments to be? In other words, how do you expect the test set performance of Andrew’s Bagging implementation to compare to that of CART with cp = 0? Briefly explain your response.
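The effect of the bug can be illustrated with a toy example. The sketch assumes that each bootstrap dataset has the same size n as the training set (the usual convention, though the exam does not state it) and uses row indices as a stand-in for training observations.

```python
import random

rng = random.Random(142)  # fixed seed so the sketch is reproducible

rows = list(range(10))    # stand-in for the indices of n = 10 training rows
n = len(rows)

# Buggy version: n draws WITHOUT replacement just shuffle the original rows.
without_replacement = rng.sample(rows, n)

# Intended version: n draws WITH replacement, so rows can repeat or be left out.
with_replacement = rng.choices(rows, k=n)

same_rows_as_original = sorted(without_replacement) == sorted(rows)
distinct_rows_in_bootstrap = len(set(with_replacement))

print(same_rows_as_original)               # True
print(distinct_rows_in_bootstrap, "of", n)
```

Under that assumption, every "bootstrap" dataset produced by the buggy code contains exactly the original training rows, only reordered.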

(b) (5 points) Meng has now decided to write her own code for Random Forests, using Andrew’s code with the same bug (sampling without replacement instead of sampling with replacement) as a starting point. You may assume that there are no other bugs in her Random Forests code and that mtry (a.k.a. m) is set to something less than the number of features p. Meng is now preparing to run some tests on several different datasets (both regression and classification problems) in order to compare the test set performance of her Random Forests implementation with that of the basic CART method with cp = 0. What do you expect the results of Meng’s experiments to be? In other words, how do you expect the test set performance of Meng’s Random Forests implementation to compare to that of CART with cp = 0? Briefly explain your response.

3. (22 points) In this problem, we will revisit the Lending Club loans.csv dataset from Lecture 3. Recall that we would like to build a model to predict the variable not.fully.paid (which is equal to 1 if the borrower defaults on the loan) based on the given six independent variables, which are summarized in Table 2 below.
Variable          Description
installment       Monthly loan installment in dollars
log.annual.inc    Log(annual income)
fico              FICO score
revol.bal         Revolving balance in thousands of dollars
inq.last.6mths    Number of inquiries in the past six months
pub.rec           Number of deleterious public records
not.fully.paid    Equal to 1 if the borrower defaults on the loan
Table 2: Description of the dataset.
Recall that a positive observation corresponds to someone who defaults, i.e., not.fully.paid = 1. We will consider the same scenario as in class, whereby we lose \$4,000 every time a borrower defaults on a loan, and we gain a profit of \$1,000 every time a borrower does not default. This scenario is summarized by the decision tree shown in Figure 2 below.
Figure 2: Simple Lending Decision Tree
In this problem, we will consider applying CART and Bagging on this dataset.

Please answer the following questions.
(a) (6 points) Determine specific numerical values of $L_{FP}$ and $L_{FN}$ such that training a CART model in order to minimize
$$L_{FN} \cdot (\text{\# of False Negatives}) + L_{FP} \cdot (\text{\# of False Positives})$$
is the same as training a CART model in order to maximize total profit, and explain why the two are equivalent.
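One way to explore part (a) numerically is to check whether profit plus weighted loss stays constant across different classifiers. The sketch assumes that a declined loan earns zero profit (Figure 2 is not reproduced here, so this is an assumption), tries the candidate penalties 4000 and 1000, and uses hypothetical counts of repaying and defaulting borrowers.

```python
GAIN_PER_REPAID_LOAN = 1000   # profit when a funded borrower repays
LOSS_PER_DEFAULT = 4000       # loss when a funded borrower defaults

# Candidate penalties to test (one possible choice, not the only one).
L_FN, L_FP = 4000, 1000

# Hypothetical dataset composition, for illustration only.
N_REPAY = 80     # actual negatives (not.fully.paid = 0)
N_DEFAULT = 20   # actual positives (not.fully.paid = 1)


def total_profit(fp, fn):
    """Profit when loans go only to borrowers predicted 0 (assumed policy)."""
    tn = N_REPAY - fp          # repaying borrowers who were funded
    return GAIN_PER_REPAID_LOAN * tn - LOSS_PER_DEFAULT * fn


def weighted_loss(fp, fn):
    return L_FN * fn + L_FP * fp


# A few hypothetical (false positive, false negative) outcomes.
outcomes = [(0, 20), (10, 12), (35, 5), (80, 0)]
sums = [total_profit(fp, fn) + weighted_loss(fp, fn) for fp, fn in outcomes]
print(sums)  # [80000, 80000, 80000, 80000]
```

Because the sum equals the constant 1000 × N_REPAY for every outcome, a lower weighted loss always corresponds to a higher profit under these assumptions. Any positive rescaling of the two penalties, such as 4 and 1, would rank classifiers the same way.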
A CART model was trained using the values of $L_{FP}$ and $L_{FN}$ determined in part (a) above. The tree corresponding to this model is displayed in Figure 3. Recall that a prediction of not.fully.paid = 1 corresponds to a “bad risk,” and a prediction of not.fully.paid = 0 corresponds to a “good risk.”
fico >= 740
  yes: 0
  no: inq.last.6mths < 3
    yes: fico >= 665
      yes: 0
      no: 1
    no: 1
Figure 3: CART Tree
Use the CART tree above to answer the following two questions.

(b) (4 points) Consider a new potential borrower with a FICO score of 650, and suppose that no other information is known about this borrower. Does the CART model above classify this borrower as a good risk or a bad risk? Explain your answer. If you do not have enough information about the borrower to answer this question precisely, then explain what additional information is needed and how this additional information affects the classification of the model.
(c) (4 points) Consider a new potential borrower with 4 inquiries in the past six months, and suppose that no other information is known about this borrower. Does the CART model above classify this borrower as a good risk or a bad risk? Explain your answer. If you do not have enough information about the borrower to answer this question precisely, then explain what additional information is needed and how this additional information affects the classification of the model.
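For parts (b) and (c), it can help to write the tree as a function and try several values of the unknown feature. The sketch below assumes one reading of Figure 3: the root split is fico >= 740 (yes predicts 0), the "no" branch splits on inq.last.6mths < 3, the "yes" side of that split splits on fico >= 665 (yes predicts 0, no predicts 1), and the "no" side predicts 1. The function name is illustrative.

```python
def classify(fico, inquiries):
    """Return 1 for a predicted bad risk and 0 for a predicted good risk.

    Encodes the assumed reading of the Figure 3 tree described above.
    """
    if fico >= 740:
        return 0
    if inquiries < 3:
        return 0 if fico >= 665 else 1
    return 1


# Part (b): FICO of 650, number of inquiries unknown, so try a range of values.
predictions_b = {classify(650, inq) for inq in range(0, 10)}

# Part (c): 4 inquiries, FICO unknown, so try scores on both sides of 740.
prediction_high_fico = classify(750, 4)
prediction_low_fico = classify(700, 4)

print(predictions_b)                              # {1}
print(prediction_high_fico, prediction_low_fico)  # 0 1
```

Under this reading, the prediction for a FICO score of 650 does not depend on the number of inquiries, whereas the prediction for a borrower with 4 inquiries depends on whether the FICO score is at least 740.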

(d) (4 points) Let us consider applying Bagging (Bootstrap Aggregation) of CART trees for this problem. Suppose that we construct B different bootstrap training sets, and for each bootstrap training set we train a CART model using the standard choice of $L_{FN} = L_{FP} = 1$. In this case, the proportion values in each bucket can be used to define probability estimates. Given a new feature vector x corresponding to a new potential borrower, let $\hat{f}_b(x)$ denote the prediction of the probability that this borrower will default by the bth CART model. Explain precisely how you would use the predicted probability values $\hat{f}_1(x), \hat{f}_2(x), \ldots, \hat{f}_B(x)$ to classify the new borrower as either a good