School of Computing and Information Systems The University of Melbourne

COMP90049 Introduction to Machine Learning (Semester 1, 2022)

Week 5: Sample Solutions

1. How is holdout evaluation different to cross-validation evaluation? What are some reasons we would prefer one strategy over the other?

Copyright By cscodehelp代写 加微信 cscodehelp

In a holdout evaluation strategy, we partition the data into a training set and a test set: we build the model on the former and evaluate on the latter.

In a cross-validation evaluation strategy, we do the same as above, but a number of times, where each iteration uses one partition of the data as a test set and the rest as a training set (and the partition is different each time).

Why we prefer cross-validation to holdout evaluation strategy? Because, holdout is subject to some random variation, depending on which instances are assigned to the training data, and which are assigned to the test data. Any instance that forms part of the model is excluded from testing, and vice versa. This could mean that our estimate of the performance of the model is way off or changes a lot from data set to data set.

While Cross–validation mostly solves this problem: we’re averaging over a bunch of values, so that one weird partition of the data won’t throw our estimate of performance completely off; also, each instance is used for testing, but also appears in the training data for the models built on the other partitions. It usually takes much longer to cross-validate, however, because we need to train a model for every test partition.

2. A confusion matrix is a summary of the performance of a (supervised) classifier over a set of development (“test”) data, by counting the various instances:

2 3 1 531 371 035

Classified

(i). Calculate the classification accuracy of the system. Find the error rate for the system.

In this context, Accuracy is defined as the fraction of correctly identified instances, out of all of the instances. In the case of a confusion matrix, the correct instances are the ones enumerated along the main diagonal (classified as a and actually a etc.):

𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦 = # of correctly identified instance 𝑡𝑜𝑡𝑎𝑙 # 𝑜𝑓 𝑖𝑛𝑠𝑡𝑎𝑛𝑐𝑒𝑠

= 10 + 5 + 7 + 5

10 + 2 + 3 + 1 + 2 + 5 + 3 + 1 + 1 + 3 + 7 + 1 + 3 + 0 + 3 + 5

= 27=54% 50

Error rate is just the complement of accuracy:

𝐸𝑟𝑟𝑜𝑟 𝑅𝑎𝑡𝑒 = # of incorrectly identified instance = 1 − 𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦 = 1 − 54% 𝑡𝑜𝑡𝑎𝑙 # 𝑜𝑓 𝑖𝑛𝑠𝑡𝑎𝑛𝑐𝑒𝑠

(ii). Calculate the precision, recall and F-score (where β = 1), for class d.

Precision for a given class is defined as the fraction of correctly identified instances of that class, from the times that class was attempted to be classified. We are interested in the true positives (TP) where we attempted to classify an item as an instance of said class (in this case, d) and it was actually of that class (d): in this case, there are 5 such instances. The false positives (FP) are those items that we attempted to classify as being of class d, but they were actually of some other class: there are 3 + 0 + 3 = 6 of those.

𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛= 𝑇𝑃 = 5 = 5 ≈45% 𝑇𝑃 + 𝐹𝑃 5 + 3 + 0 + 3 11

Recall for a given class is defined as the fraction of correctly identified instance of that class, from the times that class actually occurred. This time, we are interested in the true positives, and the false negatives (FN): those items that were actually of class d, but we classified as being of some other class; there are 1 + 1 + 1 = 3 of those.

𝑅𝑒𝑐𝑎𝑙𝑙 = 𝑇𝑃 = 5 = 5 ≈ 62% 𝑇𝑃 + 𝐹𝑁 5 + 1 + 1 + 1 8

F-score is a measure which attempts to combine Precision (P) and Recall (R) into a single score. In general, it is calculated as:

𝐹 =(1+𝛽”)𝑃.𝑅

By far, the most typical formulation is where the parameter 𝛽 is set to 1: this means that Precision and Recall are equally important to the score, and that the score is a harmonic mean.

In this case, we have calculated the Precision of class d to be 0.45 and the Recall to be 0.62. The F-score where (𝛽 = 1) of class d is then:

𝐹# =2𝑃.𝑅= 2×0.45×0.62≈53% 𝑃 + 𝑅 0.45 + 0.62

(iii). Why can’t we do this for the whole system? How can we consider the whole system?

The concept of precision and recall is defined per-class on the bases of one (interesting) class vs the rest of (not interesting) classes.

Since this system, similar to all other multiclass classifiers, considers all classes (a, b, c and d) as interesting, we need to calculate the precision and recall per each class (vs the rest) and then find the average for the whole system.

As covered in the lectures, there are multiple methods for calculating the average precision and recall for the whole system. We can use methods like Macro Averaging, Micro Averaging or Weighted Averaging.

Our choice of method depends on our goal and the domain of the system. In cases where we want to emphasis on identifying the behaviour of the system for small classes Macro averaging can be a better choice. While in situations that we want to

(𝛽”.𝑃) + 𝑅

evaluate the system mostly based on its behaviour in detecting the large classes Micro averaging would be a better option.

3. For the following dataset:

ID Outl Temp Humi Wind PLAY TRAINING INSTANCES

AshhFN BshhTN CohhFY DrmhFY ErcnFY FrcnTN

TEST INSTANCES GocnT?

(i). Classify the test instances using the method of 0-R.

0-R is the quintessentially baseline classifier: we throw away all of the attributes, other than the class labels, and just predict each test instance according to whichever label is most common in the training data. (Hence, it is also common called the “majority class classifier”.)

In this case, the two labels Y and N are equally common in the training data — so we are required to apply a tiebreaker. Remember that we’re simply choosing one label to be representative of the entire collection, and both labels seem equally good for that here: so, let’s say N.

Consequently, both test instances are classified as N.

(ii). Classify the test instances using the method of 1-R.

1-R is a slightly better baseline, which requires us to choose a single attribute to represent the entire decision–making process. For example, if Outl is our preferred attribute, then we base the classification of each test instance solely based on its value of Outl. (This is sometimes called a “Decision Stump”.)

Given our preferred attribute, we will label a test instance according to whichever label in the training data was most common, for training instances with the corresponding attribute value.

How do we decide which attribute to choose? Well, the most common method is simply by counting the errors made on the training instances.

Let’s say we chose Outl: first, we need to observe the predictions for the 3 values (s, o, and r), and then we will count up the errors that those predictions will make on the training data (it turns out that we can do this simultaneously):

• When Outl is s, there are two such instances in the training data. Both of these in- stances are labelled as N: our label for this attribute value will be N. We will therefore predict the label of both of these training instances correctly (we will predict N, and both of these instances are actually N).

• When Outl is o, there is just a single instance in the training data, labelled as Y: our label for this attribute value will be Y. We will therefore predict the label of this training instance correctly.

• When Outl is r, there are three such instances in the training data. Two of these instances are labelled as Y, the other is N: our label for this attribute value will be Y (as there are more Y instances than N instances — you can see that we’re applying the method of 0-R here). We will therefore make one error: the instance which was actually n.

In total, that is 1 error for Outl. Hopefully, you can see that this is a very wordy explanation of a very simple idea. (We will go through the other attributes more quickly.)

• When Temp is h, there are two N instances and one Y: we will make one

• When Temp is m, there is one Y instance: we won’t make any mistakes.

• When Temp is c, there is one N instance and one Y instance: we will make one mistake.

This is 2 errors in total for Temp, which means that it’s less good that Outl.

We’ll leave the other attributes as an exercise, and assume that Outl was the best

attribute (it’s actually tied with Wind): how do the test instances get classified?

• For test instance G, Outl is o — the 0-R classifier over the o instances from

the training data gives Y, so we predict Y for this instance.

• For test instance H, Outl = s — the 0-R classifier over the s instances from

the training data gives N, so we predict N for instance H.

Given the above dataset, we wished to perform feature selection on this dataset, where the

class is PLAY:

Which of Humi and Wind has the greatest Pointwise Mutual Information for the class Y? What about N?

To determine Pointwise Mutual Information (PMI), we compare the joint probability to the product of the prior probabilities as follows:

𝑃𝑀𝐼(𝐴, 𝐶) = 𝑙𝑜𝑔 𝑃(𝐴 ∩ 𝐶) ” 𝑃(𝐴)𝑃(𝐶)

Note that this formulation only really makes sense for binary attributes and binary classes (which we have here.)

𝑃𝑀𝐼(𝐻𝑢𝑚𝑖=h,𝑃𝑙𝑎𝑦=𝑌)=𝑙𝑜𝑔 𝑃(𝐻𝑢𝑚𝑖=h ∩𝑃𝑙𝑎𝑦=𝑌) ” 𝑃(𝐻𝑢𝑚𝑖 = h)𝑃(𝑃𝑙𝑎𝑦 = 𝑌)

=𝑙𝑜𝑔” 6 =𝑙𝑜𝑔”(1)=0 43

𝑃𝑀𝐼(𝑊𝑖𝑛𝑑=𝑇,𝑃𝑙𝑎𝑦=𝑌)=𝑙𝑜𝑔 𝑃(𝑊𝑖𝑛𝑑=𝑇∩𝑃𝑙𝑎𝑦=𝑌) ” 𝑃(𝑊𝑖𝑛𝑑 = 𝑇)𝑃(𝑃𝑙𝑎𝑦 = 𝑌)

= 𝑙𝑜𝑔” 6 = 𝑙𝑜𝑔”(0) = −∞ 23

So, we find that Wind=T is (perfectly) negatively correlated with PLAY=Y; whereas Humi=h is (perfectly) uncorrelated.

You should compare these results with the negative class PLAY=N, where Wind=T is positively correlated, but Humi=h is still uncorrelated.

(ii). Which of the attributes has the greatest Mutual Information for the class, as a whole? A general form of the Mutual Information (MI) is as follows:

𝑀𝐼(𝑋, 𝐶) = S S 𝑃(𝑥, 𝑐)𝑃𝑀𝐼(𝑥, 𝑐) +∈, $∈{‘,)}

Effectively, we’re going to consider the PMI of every possible attribute value–class combination, weighted by the proportion of instances that actually had that combination.

To handle cases like PMI(Wind) above, we are going to equate 0 log 0 ≡ 0 (which is true in the limit anyway).

For Outl, this is going to look like:

𝑀𝐼(𝑂𝑢𝑡𝑙) = 𝑃(𝑠, 𝑌)𝑃𝑀𝐼(𝑠, 𝑌) + 𝑃(𝑜, 𝑌)𝑃𝑀𝐼(𝑜, 𝑌) + 𝑃(𝑟, 𝑌)𝑃𝑀𝐼(𝑟, 𝑌) +

𝑃(𝑠, 𝑁)𝑃𝑀𝐼(𝑠, 𝑁) + 𝑃(𝑜, 𝑁)𝑃𝑀𝐼(𝑜, 𝑁) + 𝑃(𝑟, 𝑁)𝑃𝑀𝐼(𝑟, 𝑁)

= 𝑙𝑜𝑔” 6+ 𝑙𝑜𝑔” 6+ 𝑙𝑜𝑔” 6+ 623613633

𝑙𝑜𝑔” 6+ 𝑙𝑜𝑔” 6+ 𝑙𝑜𝑔” 6 623613633

≈ 0 𝑙𝑜𝑔”0 + (0.1667)(1) + (0.3333)(0.4150) + (0.3333)(1) + 0 𝑙𝑜𝑔”0 + (0.1667)(−0.5850)

It’s worth noting that while some individual terms can be negative, the sum must be at least zero. For Temp, this is going to look like:

𝑀𝐼(𝑇𝑒𝑚𝑝) = 𝑃(h, 𝑌)𝑃𝑀𝐼(h, 𝑌) + 𝑃(𝑚, 𝑌)𝑃𝑀𝐼(𝑚, 𝑌) + 𝑃(𝑐, 𝑌)𝑃𝑀𝐼(𝑐, 𝑌) + 𝑃(h, 𝑁)𝑃𝑀𝐼(h, 𝑁) + 𝑃(𝑚, 𝑁)𝑃𝑀𝐼(𝑚, 𝑁) + 𝑃(𝑐, 𝑁)𝑃𝑀𝐼(𝑐, 𝑁)

= 𝑙𝑜𝑔” 6+ 𝑙𝑜𝑔” 6+ 𝑙𝑜𝑔” 6+ 633613623

𝑙𝑜𝑔” 6+ 𝑙𝑜𝑔” 6+ 𝑙𝑜𝑔” 6 623613623

≈ (0.1667)(−0.5850) + (0.1667)(1) + (0.1667)(0) + (0.3333)(0.4150) + 0 𝑙𝑜𝑔”0 + (0.1667)(−0.5850)

We will leave the workings as an exercise, but the Mutual Information for Humi is 0, and for Wind is 0.459.

Consequently, Outl appears to be the best attribute (perhaps as we might expect), and Wind also seems quite good; whereas Temp is not very good, and Humi is completely unhelpful.

程序代写 CS代考 加微信: cscodehelp QQ: 2235208643 Email: kyit630461@163.com