# STAT318 — Data Mining

STAT318 — Data Mining
University of Canterbury, Christchurch
Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.
© University of Canterbury 2021

A Famous Clustering Example
John Snow, a physician, plotted the locations of cholera deaths on a map during an outbreak in the 1850s.
The cases were clustered around particular locations, and further investigation showed that the root of the problem was polluted water sources.
Clustering is still an important statistical tool with real-world applications:
- planning efficient fiber network nodes;
- choosing locations for services such as hospitals and supermarkets;
- electoral zoning; and so on.

Cluster Analysis
Cluster analysis is a broad class of unsupervised learning techniques for discovering unknown subgroups or clusters in data.
There is no associated response variable.
Cluster analysis is more subjective than supervised learning because there is no simple goal for the analysis.
We must define what it means to be similar or different; this is often a domain-specific consideration.
We need a similarity (or dissimilarity) measure.

Similarity Measures (qualitative features)
Similarity measures satisfy the following properties (dissimilarity is 1 − similarity):
1. 0 ≤ sim(x, z) ≤ 1;
2. sim(x, z) = 1 if and only if x = z;
3. sim(x, z) = sim(z, x) (symmetry).
Nominal features:
sim(x, z) = (number of feature matches) / (total number of features).
Ordinal features: replace the M feature values with (i − 0.5)/M for i = 1, 2, …, M,
in the prescribed order of their original values. Then treat them as quantitative features.
For example, let
x = (Blue, Male, 82071),
z = (Blue, Female, 82071).
Then sim(x, z) = 2/3.
If we have an ordinal feature with levels low, medium and high, for example, we could code the levels using
low = 1/6, medium = 3/6, high = 5/6,
and treat the feature as a quantitative feature.
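As a sketch of these two rules in plain Python (the function names here are my own, not part of any course library):

```python
def nominal_sim(x, z):
    """Similarity for nominal features: fraction of exact matches."""
    return sum(xi == zi for xi, zi in zip(x, z)) / len(x)

def ordinal_encode(level, ordered_levels):
    """Map the i-th of M ordered levels (i = 1, ..., M) to (i - 0.5)/M."""
    M = len(ordered_levels)
    i = ordered_levels.index(level) + 1  # 1-based position in the ordering
    return (i - 0.5) / M

x = ("Blue", "Male", 82071)
z = ("Blue", "Female", 82071)
print(nominal_sim(x, z))  # 2 of 3 features match, giving 2/3

scale = ["low", "medium", "high"]
print([ordinal_encode(v, scale) for v in scale])  # 1/6, 3/6, 5/6
```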
Gower similarity (for example) can be used to measure the similarity between observations with both qualitative and quantitative features (not covered here).

Similarity Measures (binary features)
Symmetric: the simple matching coefficient (SMC),
SMC(x, z) = (f00 + f11) / (f00 + f01 + f10 + f11),
where fab is the number of features with x_i = a and z_i = b.
Asymmetric: the Jaccard coefficient (JC),
JC(x, z) = f11 / (f01 + f10 + f11).
Example: calculate SMC(x, z) and JC(x, z) for
x = (1, 0, 0, 0, 0, 0, 0, 1, 0, 0),
z = (0, 0, 1, 0, 0, 0, 0, 0, 0, 1).
If 0 and 1 are equally important (e.g. male = 0 and female = 1), then the variables are called symmetric binary variables. If only 1 is important (e.g. item purchased = 1, 0 otherwise), then the variable is called asymmetric.
Example:
f00 = 6, f01 = 2, f10 = 2, f11 = 0, so
SMC(x, z) = 6/10 = 0.6,
JC(x, z) = 0/4 = 0.
If x and z are asymmetric, then JC gives a meaningful result (they have nothing in common). Otherwise, SMC shows the vectors are similar with six matching zeros.
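The worked example can be checked with a short Python sketch (a minimal illustration; the function names are my own):

```python
def binary_counts(x, z):
    """Count f00, f01, f10, f11 for two binary vectors."""
    f = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for a, b in zip(x, z):
        f[(a, b)] += 1
    return f

def smc(x, z):
    """Simple matching coefficient: all matches over all features."""
    f = binary_counts(x, z)
    return (f[(0, 0)] + f[(1, 1)]) / len(x)

def jaccard(x, z):
    """Jaccard coefficient: ignores 0-0 matches."""
    f = binary_counts(x, z)
    denom = f[(0, 1)] + f[(1, 0)] + f[(1, 1)]
    return f[(1, 1)] / denom if denom else 1.0

x = (1, 0, 0, 0, 0, 0, 0, 1, 0, 0)
z = (0, 0, 1, 0, 0, 0, 0, 0, 0, 1)
print(smc(x, z))      # 6/10 = 0.6
print(jaccard(x, z))  # 0/4 = 0.0
```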

Similarity Measures (document term vectors)
A document term vector is a vector containing frequencies of particular words from a document.
Cosine similarity: the cosine of the angle between x and z,
cos(x, z) = (x · z) / (‖x‖ ‖z‖).
[Figure: the angle between vectors x and z.]
Recall:
cos(x, z) = (x · z) / (‖x‖ ‖z‖) = (xᵀz) / (‖x‖ ‖z‖) = (Σ_{i=1}^p x_i z_i) / (‖x‖ ‖z‖).
The closer cos(x , z ) is to one, the more similar the vectors. This similarity measure ignores 0-0 matches (like Jaccard), but it can handle non-binary vectors. For example, let
x = (2,5,0,0,2) z = (4,1,0,0,3).
Then cos(x, z) ≈ 0.65. If we convert the data to binary data (present/absent), we get
x = (1,1,0,0,1) z = (1,1,0,0,1),
which has a Jaccard similarity of one (the differences between the vectors are lost).
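A minimal sketch of this comparison, assuming plain Python and my own function name:

```python
import math

def cosine_sim(x, z):
    """Cosine of the angle between vectors x and z."""
    dot = sum(a * b for a, b in zip(x, z))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_z = math.sqrt(sum(b * b for b in z))
    return dot / (norm_x * norm_z)

x = (2, 5, 0, 0, 2)
z = (4, 1, 0, 0, 3)
print(round(cosine_sim(x, z), 2))  # 0.65

# Binarising (present/absent) discards the frequency information:
xb = tuple(int(a > 0) for a in x)
zb = tuple(int(b > 0) for b in z)
print(xb == zb)  # True: both become (1, 1, 0, 0, 1)
```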

Distance Measures (quantitative features)
Manhattan distance (1-norm):
‖x − z‖_1 = Σ_{j=1}^p |x_j − z_j|.
Euclidean distance (2-norm):
‖x − z‖_2 = ( Σ_{j=1}^p |x_j − z_j|² )^{1/2}.
Supremum distance (∞-norm):
‖x − z‖_∞ = max{ |x_j − z_j| : j = 1, 2, …, p }.
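The three norms can be sketched directly from the definitions (plain Python; function names are my own):

```python
def manhattan(x, z):
    """1-norm: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, z))

def euclidean(x, z):
    """2-norm: square root of the sum of squared differences."""
    return sum((a - b) ** 2 for a, b in zip(x, z)) ** 0.5

def supremum(x, z):
    """Infinity-norm: largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, z))

# Illustrative points whose coordinate differences are (8, 4):
x, z = (2, 2), (10, 6)
print(manhattan(x, z))            # 12
print(round(euclidean(x, z), 2))  # 8.94
print(supremum(x, z))             # 8
```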
II x – z II = 8.94 II x – z II = 12 II x – z II = 8 2.1..
10 z 10 z 10 z …
2x2x2x IIIIII
262626 111

-1 (0,0) 1 -1 (0,0) 1 -1 (0,0) 1
-1 -1 -1