# IEOR 142: Introduction to Machine Learning and Data Analytics, Spring 2021
Practice Midterm Exam 1
March 2021

Name:

Instructions:
1. Answer the questions in the spaces provided on the question sheets. If you run out of room for an answer, continue on the back of the page.
2. You are allowed one (double sided) 8.5 x 11 inch note sheet and a simple pocket calculator. The use of any other note sheets, textbook, computer, cell phone, other electronic device besides a simple pocket calculator, or other study aid is not permitted.
3. You will have until 5:00PM to turn in the exam.
4. You do not need to simplify numerical answers; you may leave an answer as an expression involving simple arithmetic operations (such as 2(1) + 1(0.7)).
5. Good luck!

IEOR 142 Practice Midterm Exam, Page 2 of 16 March 2021
1 True/False and Multiple Choice Questions – 48 Points
Instructions: Please circle exactly one response for each of the following 12 questions. Each question is worth 4 points. There will be no partial credit for these questions.
1. Suppose that we train a classification model that has accuracy equal to 1 (i.e., perfect 100% accuracy) on the test set, and that the test set contains at least one positive observation and at least one negative observation. Then the TPR (true positive rate) of that model on the test set is also equal to 1.
A. True B. False
2. Suppose that we train a classification model that has accuracy equal to 0.99 on the test set, and that the test set contains at least one positive observation and at least one negative observation. Then, without any other information, the most definitive statement we can make about the TPR (true positive rate) of that model on the test set is:
A. The TPR is equal to 0.99
B. The TPR is equal to 1
C. The TPR is at least 0.90
D. The TPR is between 0 and 1
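As a small numeric illustration (a hypothetical 100-observation test set, not data from this exam) of why accuracy alone constrains the TPR so loosely: a model can be 99% accurate while missing every positive.

```python
# Hypothetical test set: 100 observations, 1 positive, 99 negatives.
# A model that predicts "negative" for everything:
tp, fn = 0, 1    # the lone positive is missed
tn, fp = 99, 0   # every negative is classified correctly

accuracy = (tp + tn) / (tp + tn + fp + fn)
tpr = tp / (tp + fn)   # true positive rate
```

Here accuracy is 0.99 while the TPR is 0, so high accuracy by itself pins the TPR down only to the interval [0, 1].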
3. Consider two linear regression models trained on the same training set. Model A uses 15 independent variables and has a training set R2 value of 0.79. Model B uses 10 independent variables and has a training set R2 value of 0.68. Then, when comparing the two models on the same test set, Model A must have a higher value of OSR2 than Model B.
A. True B. False
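For reference, R2 and OSR2 both compare a model's squared errors against a baseline, but on the test set the baseline mean comes from the training set, and the two quantities are computed on different data. A minimal sketch with made-up numbers (not data from this exam):

```python
def r_squared(y_true, y_pred, baseline_mean):
    """1 - SSE/SST, with SST measured against a supplied baseline mean."""
    sse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    sst = sum((yt - baseline_mean) ** 2 for yt in y_true)
    return 1 - sse / sst

y_train = [1.0, 2.0, 3.0, 4.0]
train_mean = sum(y_train) / len(y_train)   # baseline from the TRAINING set

# OSR2 is evaluated on test-set observations and predictions:
y_test = [2.0, 5.0]
y_hat = [2.5, 4.0]
osr2 = r_squared(y_test, y_hat, train_mean)
```

Because training-set fit and test-set fit involve different data, a higher training R2 does not by itself guarantee a higher OSR2.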
4. The main purpose of bagging (bootstrap aggregating) is to estimate the out-of-sample error.
A. True B. False
5. Boosting is inherently sequential since each new decision tree is trained in a way that uses information from the previously trained decision trees, whereas Random Forests is inherently parallelizable since each individual decision tree is trained independently of all the others.
A. True B. False
6. In multiple linear regression (p > 1), it is possible for a subset of the independent variables to all have large VIF values and at the same time have somewhat small pairwise correlation values with each other.
A. True B. False
7. Suppose that LFN = 2 and LFP = 1. Let p denote the probability that a given observation is a positive. Then, in order to minimize expected cost, an optimal policy is to assign an observation as a positive if and only if p is greater than 1/3.
A. True B. False
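The trade-off behind this question can be written out directly: labeling an observation negative risks a false negative (expected cost LFN·p), while labeling it positive risks a false positive (expected cost LFP·(1−p)). A sketch using the question's loss values:

```python
LFN, LFP = 2.0, 1.0  # losses from the question

def predict_positive(p):
    # Expected cost if predicted negative: LFN * p (false-negative risk)
    # Expected cost if predicted positive: LFP * (1 - p) (false-positive risk)
    # Predict positive exactly when that is the cheaper option.
    return LFP * (1 - p) < LFN * p

# Setting the two expected costs equal gives the break-even threshold:
threshold = LFP / (LFP + LFN)
```

With LFN = 2 and LFP = 1 the break-even threshold is 1/3.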

8. The Random Forests method tends to produce many uncorrelated trees (which are then averaged together) since:
A. Each individual tree is trained on a fresh bootstrap sample of the training set
B. When training each individual tree, only a randomly selected subset of the features are considered at each split
C. Both (a) and (b) are true
D. Both (a) and (b) are false
9. Suppose that we have a dataset consisting of n = 2,342 observation vectors xi. We are interested in constructing between five and ten different clusters to assign each observation to. If we use the K-means algorithm for this task, then to select the final number of clusters K:
A. We must run the K-means algorithm twice, with K = 5 and then with K = 10
B. We must run the K-means algorithm only once with K = 10
C. We must run the K-means algorithm six times with K = 5, 6, 7, 8, 9, 10
D. The K-means algorithm will automatically choose the number of clusters K for us
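Since K-means takes K as a fixed input, selecting K in a range means one run per candidate value, after which the within-cluster sums of squares can be compared (e.g., with an elbow plot). A minimal 1-D Lloyd's-algorithm sketch on toy data (not the exam's dataset):

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Minimal 1-D Lloyd's algorithm; returns within-cluster sum of squares."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k), key=lambda j: (x - centers[j]) ** 2)
            clusters[nearest].append(x)
        # Recompute each center; keep the old one if its cluster went empty.
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return sum(min((x - c) ** 2 for c in centers) for x in points)

# Toy 1-D data standing in for the n observation vectors.
data = [float(i % 7) + 0.1 * (i % 3) for i in range(60)]

# One run of K-means per candidate K -- six runs for K = 5, ..., 10.
wss_by_k = {k: kmeans_1d(data, k) for k in range(5, 11)}
```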
10. Consider the following ROC curve based on a logistic regression model for predicting lung cancer – here having lung cancer is a "positive outcome." The baseline is also drawn for comparison. Suppose that a doctor would like to minimize the number of times that she tells a patient that they do not have lung cancer when they actually do. At the same time, the doctor is only willing to incorrectly tell a patient that they have lung cancer when they actually do not at most 50% of the time. Then, which point on the ROC curve should the doctor use to determine the correct threshold value?
A. A B. B C. C D. D

Call:
glm(formula = Cancellation ~ VehicleModelId + OnlineBooking +
    MobileSiteBooking, family = "binomial", data = bookings.train)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.47911  -0.36831  -0.16594  -0.00007   3.00057

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)         -21.9830  7811.7349  -0.003    0.998
VehicleModelId12     17.7046  7811.7349   0.002    0.998
VehicleModelId17      0.3730 14433.6515   0.000    1.000
VehicleModelId23      0.3039  8411.4265   0.000    1.000
VehicleModelId24      0.2656  7939.7061   0.000    1.000
VehicleModelId28     16.6912  7811.7349   0.002    0.998
VehicleModelId64      0.6338 12587.2832   0.000    1.000
VehicleModelId65      0.4756  7995.5594   0.000    1.000
VehicleModelId85      0.3798  7904.2323   0.000    1.000
VehicleModelId86      1.1222  9541.5994   0.000    1.000
VehicleModelId87      0.1810  9248.4331   0.000    1.000
VehicleModelId89     17.4925  7811.7349   0.002    0.998
VehicleModelId90      0.1935  8331.4000   0.000    1.000
VehicleModelId91      0.6338 12587.2832   0.000    1.000
OnlineBooking         1.6218     0.3482   4.657 3.20e-06 ***
MobileSiteBooking     2.1716     0.4139   5.246 1.55e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 698.90 on 2361 degrees of freedom
Residual deviance: 612.88 on 2346 degrees of freedom
AIC: 644.88

Number of Fisher Scoring iterations: 19

IEOR 142 Practice Midterm Exam, Page 10 of 16 March 2021
(a) (4 points) Recall that there are 14 possible values for VehicleModelId in this dataset, and note that there are only 13 coefficients associated with VehicleModelId above. Briefly explain why there is no coefficient associated with VehicleModelId10 above.
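Background for this part: when a categorical variable is dummy-encoded for a regression, one level is conventionally held out as the reference and absorbed into the intercept. A hypothetical sketch of that encoding (the level subset here is illustrative, not the full set of 14):

```python
# Hypothetical dummy (one-hot) encoding with a reference level.
levels = ["10", "12", "17"]   # an illustrative subset of VehicleModelId values
reference = "10"              # the held-out reference level

def encode(value):
    # One indicator column per non-reference level;
    # the reference level is encoded as all zeros.
    return {f"VehicleModelId{lvl}": int(value == lvl)
            for lvl in levels if lvl != reference}
```

Under this scheme `encode("10")` produces all-zero indicators, which is why a fitted model shows no separate coefficient for the reference level.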

(b) (4 points) A second logistic regression model (with VehicleModelId removed) was built (using the training data) to predict Cancellation based on the features OnlineBooking and MobileSiteBooking. The corresponding R code and its output are shown below.
> log.mod2 <- glm(Cancellation ~ OnlineBooking + MobileSiteBooking,
+                 data = bookings.train,
+                 family = "binomial")
> summary(log.mod2)
Call:
glm(formula = Cancellation ~ OnlineBooking + MobileSiteBooking,
    family = "binomial", data = bookings.train)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.4290 -0.3157 -0.3157 -0.1371 3.0568
Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)        -4.6625     0.3177 -14.675  < 2e-16 ***
OnlineBooking       1.6883     0.3470   4.865 1.14e-06 ***
MobileSiteBooking   2.3231     0.4117   5.643 1.67e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 698.90 on 2361 degrees of freedom
Residual deviance: 653.63 on 2359 degrees of freedom
AIC: 659.63

Number of Fisher Scoring iterations: 7

Briefly explain the reasoning for removing VehicleModelId.

(c) (6 points) Suppose that a new rider makes a booking on the mobile site and that the assigned driver has a vehicle model with ID 65. Using the second logistic regression model from part (b), make a prediction for the probability that this driver will cancel the booking.

(d) (3 points) Figure 5 displays a confusion matrix, computed on the test set, for the logistic regression model. Here, the optimal threshold value pthresh derived in question (2) was used. Note that the columns correspond to whether or not the booking would have been reassigned according to the policy described in part (2). (As a reminder, the observations all occurred before YourCabs started considering the possibility of reassigning booking requests.)

Figure 5: Confusion Matrix for Logistic Regression Model With Threshold pthresh

                 Do Not Reassign Booking   Reassign Booking
No Cancellation                      909                 69
Cancellation                          26                  9

What is the accuracy, TPR, and FPR of the logistic regression model with threshold pthresh?

(e) (3 points) Using the information given in question (2) and part (d) above, what is the total test set profit of the policy implied by the logistic regression model with threshold equal to pthresh? For this problem, you may assume that whenever a booking request is reassigned, YourCabs deterministically makes a profit equal to the expected profit gained per reassignment.

4.
(10 points) Next, a CART model was built (using the training data) to predict Cancellation based on the features VehicleModelId, OnlineBooking, and MobileSiteBooking. The values of LFN and LFP were set to LFN = 11.34568 and LFP = 1, which correspond to values such that minimizing loss is equivalent to maximizing profit according to the decision tree in question (2). 10-fold cross-validation was used to select the cp parameter, and the output of the corresponding R code is shown below.

CART

2362 samples
   3 predictor
   2 classes: '0', '1'

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 2126, 2126, 2126, 2125, 2126, 2125, ...
Resampling results across tuning parameters:

  cp     AvgLoss    Accuracy
  0.000  0.3633967  0.9212705
  0.001  0.3617089  0.9229582
  0.002  0.3617089  0.9229582
  0.003  0.3617089  0.9229582
  0.004  0.3617089  0.9229582
  0.005  0.3617089  0.9229582
  0.006  0.3617089  0.9229582
  0.007  0.3617089  0.9229582
  0.008  0.3617089  0.9229582
  0.009  0.3617089  0.9229582
  0.010  0.3617089  0.9229582
  0.011  0.3617089  0.9229582
  0.012  0.3723178  0.9255006
  0.013  0.3723178  0.9255006
  0.014  0.3723178  0.9255006
  0.015  0.3723178  0.9255006
  0.016  0.3785430  0.9280430
  0.017  0.3892983  0.9348226
  0.018  0.3938286  0.9390599
  0.019  0.3938286  0.9390599
  0.020  0.3962402  0.9454159
  0.021  0.3962402  0.9454159
  0.022  0.3959629  0.9500769
  0.023  0.3959629  0.9500769
  0.024  0.3944210  0.9559841
  0.025  0.3944210  0.9559841
  0.026  0.3944210  0.9559841
  0.027  0.3944210  0.9559841
  0.028  0.3944210  0.9559841
  0.029  0.3944210  0.9559841
  0.030  0.3944210  0.9559841
  0.031  0.3944210  0.9559841
  0.032  0.3944210  0.9559841
  0.033  0.3944210  0.9559841
  0.034  0.3944210  0.9559841
  0.035  0.3944210  0.9559841
  0.036  0.3944210  0.9559841
  0.037  0.3944210  0.9559841
  0.038  0.3944210  0.9559841
  0.039  0.3944210  0.9559841
  0.040  0.3897600  0.9606451

Please answer the following questions concerning the CART model.
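One way to locate optima in a tuning table like this is to scan it mechanically; a sketch with a few (cp, AvgLoss, Accuracy) rows transcribed (abridged) from the output above:

```python
# (cp, AvgLoss, Accuracy) rows transcribed from the caret output (abridged).
rows = [
    (0.000, 0.3633967, 0.9212705),
    (0.001, 0.3617089, 0.9229582),
    (0.011, 0.3617089, 0.9229582),
    (0.016, 0.3785430, 0.9280430),
    (0.040, 0.3897600, 0.9606451),
]

best_by_loss = min(rows, key=lambda r: r[1])  # row with smallest average loss
best_by_acc = max(rows, key=lambda r: r[2])   # row with highest accuracy
```

Note that ties are common in such tables (a whole range of cp values can share the best loss), so `min`/`max` only return the first optimum encountered.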
(a) (3 points) Using the output above, select a value of cp based on the average loss criterion.

(b) (3 points) Using the output above, select a value of cp based on the accuracy criterion.

(c) (4 points) The tree corresponding to one of the previously selected models is displayed in Figure 6. Describe in words the type of bookings that the CART tree predicts to be cancellations.

Figure 6: CART Model. [Tree diagram: splits on VehicleModelId12 < 0.5 (yes/no), OnlineBooking < 0.5, and MobileSiteBooking < 0.5; leaves labeled 0 and 1.]

5. (8 points) So far, we have not used the columns BookingDateTime or TripDateTime in any of the models. Describe two distinct ways to construct numerical feature(s) based on BookingDateTime and/or TripDateTime that may be incorporated into any of the previous models. Note that for full credit, at least one of your feature generation methods should incorporate both BookingDateTime and TripDateTime.
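As a sketch of what such feature construction could look like in code (the timestamps below are made up; the column names are from the question):

```python
from datetime import datetime

# Hypothetical values of the two timestamp columns for one booking.
booking = datetime(2013, 1, 5, 14, 30)   # BookingDateTime
trip = datetime(2013, 1, 6, 9, 0)        # TripDateTime

# Example feature using BOTH columns: lead time between booking
# and trip, in hours.
lead_time_hours = (trip - booking).total_seconds() / 3600.0

# Example feature using one column: hour of day of the trip
# (day of week, month, etc. would work similarly).
trip_hour = trip.hour
```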