# End-of-year Examinations, 2019

STAT318 / STAT462-19S2 (C)

Family Name: _____________________

First Name: _____________________

Student Number: |__|__|__|__|__|__|__|__|

Venue: ____________________

Seat Number: ________

No electronic/communication devices are permitted. No exam materials may be removed from the exam room.

Mathematics and Statistics EXAMINATION

STAT318 / STAT462-19S2 (C) Data Mining

Examination Duration: 120 minutes

Exam Conditions:

Closed Book exam: Students may not bring in any written or printed materials. Calculators with a ‘UC’ sticker are approved.

Materials Permitted in the Exam Venue:

None.

Materials to be Supplied to Students:

1 x Standard 16-page UC answer book.

Instructions to Students:

Use black or blue ink only.

Answer all FOUR questions.

Write your answers in the answer booklet provided.

All questions carry equal marks.

Show all working.

1. (a) Suppose that we have the following 100 market basket transactions.

| Transaction | Frequency |
|---|---|
| {apple} | 10 |
| {apple, carrot} | 10 |
| {apple, banana, carrot} | 21 |
| {apple, banana, grape} | 27 |
| {apple, banana, carrot, orange} | 11 |
| {banana, grape} | 3 |
| {carrot, orange} | 11 |
| {apple, grape, orange} | 7 |
| Total | 100 |

For example, there are 10 transactions of the form {apple, carrot}.

i. Compute the support of {orange}, {apple, banana}, and {apple, banana, orange}.

ii. Compute the confidence of the association rules: {apple, banana} → {orange}; and {orange} → {apple, banana}.

Is confidence a symmetric measure? Justify your answer.

iii. Find the 3-itemset(s) with the largest support.

iv. If minsup = 0.1, is {carrot, orange} a maximal frequent itemset? Justify your answer.

v. Lift is defined as

$$\text{Lift}(X \to Y) = \frac{s(X \cup Y)}{s(X)\,s(Y)},$$

where s(·) denotes support. What does it mean if Lift(X → Y) = 1?
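The definitions used in part (a) can be sketched in a few lines of Python. This is only an illustration, not part of the examination: the helper names `support`, `confidence` and `lift` are hypothetical, and the counts are copied from the transaction table above.

```python
# Transaction counts copied from the table in Question 1(a).
BASKETS = {
    frozenset({"apple"}): 10,
    frozenset({"apple", "carrot"}): 10,
    frozenset({"apple", "banana", "carrot"}): 21,
    frozenset({"apple", "banana", "grape"}): 27,
    frozenset({"apple", "banana", "carrot", "orange"}): 11,
    frozenset({"banana", "grape"}): 3,
    frozenset({"carrot", "orange"}): 11,
    frozenset({"apple", "grape", "orange"}): 7,
}
TOTAL = sum(BASKETS.values())


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(n for basket, n in BASKETS.items() if items <= basket) / TOTAL


def confidence(lhs, rhs):
    """Confidence of the rule lhs -> rhs: s(lhs ∪ rhs) / s(lhs)."""
    return support(set(lhs) | set(rhs)) / support(lhs)


def lift(lhs, rhs):
    """Lift of the rule lhs -> rhs, following the definition in part v."""
    return support(set(lhs) | set(rhs)) / (support(lhs) * support(rhs))
```

For instance, `confidence({"apple", "banana"}, {"orange"})` and `confidence({"orange"}, {"apple", "banana"})` share a numerator but divide by different supports.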

(b) This question examines linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) for a 3-class classification problem.

i. Explain the difference between LDA and QDA.

ii. Briefly describe the Bayes classifier and the Bayes error rate.

iii. Under what conditions does the testing error rate for QDA equal the Bayes error rate?

2. (a) Describe two potential advantages of regression trees over other statistical learning methods.

(b) When growing a regression tree using CART, two types of splits are considered. Describe these splits and provide an example for each.

(c) A regression tree has three types of nodes: the root node, internal nodes and terminal nodes. Describe each node and explain how predictions are made using a regression tree.

(d) Large bushy regression trees tend to over-fit the training data. Briefly explain what is meant by over-fitting and under-fitting the training data using regression trees.

(e) The predictive performance of a single regression tree can be substantially improved by aggregating many decision trees.

i. Briefly explain the method of bagging regression trees.

ii. Explain the difference between bagging and random forest.

iii. Briefly explain two differences between boosted regression trees and random forest.
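As a rough illustration of the bagging idea in part (e), the sketch below averages the predictions of trees fitted to bootstrap samples. It is a simplification under stated assumptions: the base learner is a one-split regression stump on a single numeric predictor rather than a full CART tree, and the names `fit_stump` and `bagged_predict` are hypothetical.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree (a stump) by minimising the RSS."""
    best = None
    for cut in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < cut]
        right = [y for x, y in zip(xs, ys) if x >= cut]
        m_left, m_right = mean(left), mean(right)
        rss = (sum((y - m_left) ** 2 for y in left)
               + sum((y - m_right) ** 2 for y in right))
        if best is None or rss < best[0]:
            best = (rss, cut, m_left, m_right)
    if best is None:  # all x values identical: predict the overall mean
        overall = mean(ys)
        return lambda x: overall
    _, cut, m_left, m_right = best
    return lambda x: m_left if x < cut else m_right


def bagged_predict(xs, ys, x_new, n_trees=100, seed=0):
    """Average the predictions of stumps fitted to bootstrap samples."""
    rng = random.Random(seed)
    n = len(xs)
    predictions = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # n rows drawn with replacement
        stump = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        predictions.append(stump(x_new))
    return mean(predictions)
```

A random forest would additionally restrict each split to a random subset of the predictors, which this single-predictor sketch cannot show.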

3. (a) Using one or two sentences, explain the main difference between regression and classification problems.

(b) The expected test MSE, for a given x0, can be decomposed into the sum of three fundamental quantities:

$$E\left[y_0 - \hat{f}(x_0)\right]^2 = V\left(\hat{f}(x_0)\right) + \left[\text{Bias}\left(\hat{f}(x_0)\right)\right]^2 + V(\varepsilon).$$

Briefly explain each of these three quantities.
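The decomposition can be checked numerically with a small simulation. The sketch below is illustrative only and rests on assumptions that are not part of the question: a made-up true function f(x) = x², Gaussian noise with standard deviation 0.5, and a k-nearest-neighbour average as the fitted model. All function names are hypothetical.

```python
import random
from statistics import fmean, pvariance


def f(x):
    """Hypothetical true regression function, used only for this illustration."""
    return x * x


def knn_fit_predict(rng, x0, k, n=50, sigma=0.5):
    """Draw a fresh training set and predict at x0 from the k nearest responses."""
    xs = [rng.uniform(-2.0, 2.0) for _ in range(n)]
    ys = [f(x) + rng.gauss(0.0, sigma) for x in xs]
    nearest = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
    return fmean(ys[i] for i in nearest)


def decompose(x0=1.0, k=5, sigma=0.5, reps=2000, seed=1):
    """Monte Carlo estimates of the three terms and of the expected test MSE at x0."""
    rng = random.Random(seed)
    preds = [knn_fit_predict(rng, x0, k, sigma=sigma) for _ in range(reps)]
    y0s = [f(x0) + rng.gauss(0.0, sigma) for _ in range(reps)]
    return {
        "variance": pvariance(preds),               # V(f_hat(x0))
        "bias_sq": (fmean(preds) - f(x0)) ** 2,     # [Bias(f_hat(x0))]^2
        "noise": sigma ** 2,                        # V(epsilon)
        "test_mse": fmean((y - p) ** 2 for y, p in zip(y0s, preds)),
    }
```

Up to simulation error, `test_mse` should be close to the sum of the other three entries. Changing `k` alters the flexibility of the fit and therefore how the total splits between the variance and squared-bias terms.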

(c) Provide a sketch typical of training error, testing error, and the irreducible error, on a single plot, against the flexibility of a statistical learning method. The x-axis should represent the flexibility and the y-axis should represent the error. Make sure the plot is clearly labelled. Explain why each of the three curves has the shape displayed in your plot.

(d) Describe two situations where we would generally expect the testing MSE of an inflexible statistical learning method to be better than a flexible method.

(e) Would we generally expect the training MSE of a flexible statistical learning method to be better or worse than an inflexible method? Why?

4. (a) Using one or two sentences, explain the difference between supervised learning and unsupervised learning.

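(b) Suppose that we have five points, x1, …, x5, with the following dissimilarity matrix:

| | x1 | x2 | x3 | x4 | x5 |
|---|---|---|---|---|---|
| x1 | 0 | 0.90 | 0.59 | 0.45 | 0.65 |
| x2 | 0.90 | 0 | 0.36 | 0.53 | 0.02 |
| x3 | 0.59 | 0.36 | 0 | 0.56 | 0.15 |
| x4 | 0.45 | 0.53 | 0.56 | 0 | 0.24 |
| x5 | 0.65 | 0.02 | 0.15 | 0.24 | 0 |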

For example, the dissimilarity between x1 and x2 is 0.9 and the dissimilarity between x3 and x5 is 0.15.

i. Briefly explain the agglomerative hierarchical clustering algorithm.

ii. Using the dissimilarity matrix above, sketch the dendrogram that results from hierarchically clustering these points using single linkage. Clearly label your dendrogram and include all merging dissimilarities.

iii. Suppose we want a clustering with two clusters. Which points are in each cluster for single linkage?

iv. Repeat parts ii. and iii. using complete linkage.

v. Describe one disadvantage of agglomerative hierarchical clustering.
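The agglomerative procedure referred to in part (b) can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, ties are not handled specially, and the dissimilarities are copied from the matrix above.

```python
# Dissimilarities from the matrix in Question 4(b); keys are pairs of point indices.
D = {
    (1, 2): 0.90, (1, 3): 0.59, (1, 4): 0.45, (1, 5): 0.65,
    (2, 3): 0.36, (2, 4): 0.53, (2, 5): 0.02,
    (3, 4): 0.56, (3, 5): 0.15,
    (4, 5): 0.24,
}


def dissimilarity(i, j):
    """Look up the dissimilarity between points i and j (the matrix is symmetric)."""
    return D[(min(i, j), max(i, j))]


def agglomerate(points, linkage):
    """Merge the two closest clusters until one remains; return the merge history.

    `linkage` is `min` for single linkage or `max` for complete linkage.
    """
    clusters = [frozenset([p]) for p in points]
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage dissimilarity.
        height, a, b = min(
            ((linkage(dissimilarity(i, j) for i in a for j in b), a, b)
             for idx, a in enumerate(clusters) for b in clusters[idx + 1:]),
            key=lambda t: t[0],
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        history.append((sorted(a), sorted(b), height))
    return history
```

Calling `agglomerate(range(1, 6), min)` or `agglomerate(range(1, 6), max)` returns one `(cluster, cluster, height)` triple per merge, and the heights are the merging dissimilarities a dendrogram would display. The final triple lists the two clusters that remain just before the last merge.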

(c) Describe one advantage and one disadvantage of the k-means clustering algorithm.

End of Examination
