
Lecture 22: Decision trees CS 189 (CDSS offering)
2022/03/16

Today’s lecture

• Today, we will lay the foundations behind probably the second most intuitive model after nearest neighbor: decision trees
• A decision tree describes how to make a decision/prediction based on the input
• Decision tree learning concerns how to build decision trees from data
• Decision trees are widely used in real world applications, and we’ll discuss why
• One reason is that they can be ensembled to give very good performance — this is the topic of Friday’s lecture

What’s a decision tree?

Decision trees
• We are given a dataset (x1, y1), …, (xN, yN)
• A decision tree for this problem will repeatedly split on select features of x (at
each internal node of the tree) in order to predict y (at the leaf nodes of the tree)
• Each node contains the subset of the data that matches the splits made so far
• Therefore, ideally, the leaf nodes contain very little (or no) variability in y
• The prediction for a new point x′ proceeds down the tree to a leaf node, where
the prediction is a majority vote (for classification) or an average (for regression)
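As a concrete sketch of this prediction procedure (the `Node` class and its field names are illustrative, not from the lecture), we walk from the root to a leaf and take a majority vote over the leaf's labels:

```python
from collections import Counter

class Node:
    """Hypothetical tree node: internal nodes split one feature against a
    threshold; leaf nodes store the training labels that reached them."""
    def __init__(self, feature=None, threshold=None, left=None, right=None, labels=None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right = left, right
        self.labels = labels  # only set at leaves

def predict(node, x):
    # Walk down the tree until we reach a leaf...
    while node.labels is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    # ...then take a majority vote over the leaf's labels (classification)
    return Counter(node.labels).most_common(1)[0][0]

# Tiny example: a single split on feature 0 at threshold 0.5
leaf_a = Node(labels=[0, 0, 1])
leaf_b = Node(labels=[1, 1])
root = Node(feature=0, threshold=0.5, left=leaf_a, right=leaf_b)
print(predict(root, [0.2]))  # falls into leaf_a -> majority class 0
print(predict(root, [0.9]))  # falls into leaf_b -> majority class 1
```

For regression, the leaf would instead return the average of its training targets.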

Learning decision trees
• Learning optimal decision trees, for reasonable definitions of optimality, is known to be NP-complete (very, very hard)
• Algorithms for learning trees therefore only try to find “good” trees
• By far, the most common approaches for learning decision trees are greedy
algorithms that operate in a top down manner
• Top down means we start with a tree that is a single node with all the data, and we
continually “build out” the tree to get better and better leaf nodes
• Greedy means that, every time we split a node, we only consider how good the immediate split will be, rather than looking further ahead at potential future splits
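A minimal sketch of this greedy top-down procedure, under stated assumptions: the tuple-based tree representation is illustrative, and misclassification rate stands in for the splitting metrics discussed on the next slides (any impurity score where lower is better would work):

```python
def build_tree(X, y, score, max_depth=3):
    """Greedy top-down tree learning (sketch). `score` measures the
    impurity of a set of labels; lower is better. At each node we try
    every feature and every observed threshold, keep the split with the
    lowest size-weighted child impurity, and recurse on the children."""
    if max_depth == 0 or len(set(y)) == 1:
        return ('leaf', y)
    best = None
    for j in range(len(X[0])):
        for t in sorted({x[j] for x in X}):
            left = [i for i in range(len(X)) if X[i][j] <= t]
            right = [i for i in range(len(X)) if X[i][j] > t]
            if not left or not right:
                continue
            # Greedy: judge only this immediate split, no lookahead
            s = (len(left) * score([y[i] for i in left]) +
                 len(right) * score([y[i] for i in right])) / len(X)
            if best is None or s < best[0]:
                best = (s, j, t, left, right)
    if best is None:
        return ('leaf', y)
    _, j, t, left, right = best
    return ('split', j, t,
            build_tree([X[i] for i in left], [y[i] for i in left], score, max_depth - 1),
            build_tree([X[i] for i in right], [y[i] for i in right], score, max_depth - 1))

# Misclassification rate as a stand-in impurity (better metrics come next)
def misclass(labels):
    return 1 - max(labels.count(c) for c in set(labels)) / len(labels)

tree = build_tree([[0.1], [0.4], [0.8], [0.9]], [0, 0, 1, 1], misclass)
print(tree)  # splits on feature 0 at 0.4, giving two pure leaves
```

Note the cost: every node considers all features and all observed thresholds, which is exactly the expense mentioned on the next slide.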

Splitting metrics
• When we are learning the tree, at every node, we consider all features and all possible splitting values, and we try to figure out which will make the best split
• If this sounds expensive, that’s because it certainly can be — we’ll see next time that considering all possible splits for all features may not be necessary
• How do we define whether one split is better than another?
• Various metrics have been proposed to try and characterize the “goodness” of a
split, and they are pretty much all based on concepts from information theory

One metric: information gain
• Recall entropy: for a node whose class proportions are p1, …, pK,
H = −∑k pk log pk
• H = 0 when the node is pure, and H is largest when all classes are equally likely
• The information gain of a split is the entropy of the parent node minus the size-weighted average entropy of the child nodes; we pick the split with the largest gain
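These two quantities can be computed directly. A minimal sketch, assuming base-2 logarithms and labels given as plain lists:

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum_k p_k log2 p_k, where p_k are the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted average
    entropy of the children produced by a candidate split."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

# A split that perfectly separates two balanced classes gains a full bit:
print(information_gain([0, 0, 1, 1], [[0, 0], [1, 1]]))  # 1.0
# A split that leaves both children as mixed as the parent gains nothing:
print(information_gain([0, 1, 0, 1], [[0, 1], [0, 1]]))  # 0.0
```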

Another metric: Gini impurity
• If we randomly sample two points from the node (according to its class proportions), how often do they disagree?
• For two classes with proportions p and 1 − p: G = p(1 − p) + (1 − p)p = 2p(1 − p)
• In general, for K classes: G = ∑k pk(1 − pk) = 1 − ∑k pk²
• We can use G in place of H on the previous slide
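A minimal sketch of Gini impurity for a list of class labels:

```python
from collections import Counter

def gini(labels):
    """G = sum_k p_k (1 - p_k) = 1 - sum_k p_k^2: the probability that two
    points sampled (with replacement) from the node have different classes."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini([0, 0, 0, 0]))  # pure node: 0.0
print(gini([0, 0, 1, 1]))  # 2 * 0.5 * 0.5 = 0.5
```

Like entropy, it is zero for a pure node and largest when the classes are balanced, so it can be plugged into the same split-selection procedure.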

Metrics for continuous targets
• What if each yi ∈ ℝ?
• The prediction at a node with L points is given by the mean ȳ = (1/L) ∑i yi
• The quality of the node can then be measured by e.g. ∑i (yi − ȳ)²
• We could instead use absolute errors ∑i |yi − ȳ|, and maybe set the prediction to the median
• Again, this metric is used in the same way as H and G
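A sketch of the squared-error version for real-valued targets (the function names are illustrative):

```python
def node_mean(ys):
    """The node's prediction: the mean of its training targets."""
    return sum(ys) / len(ys)

def sse(ys):
    """Sum of squared errors around the node mean; plays the same
    role as H or G when the targets are real-valued."""
    m = node_mean(ys)
    return sum((y - m) ** 2 for y in ys)

ys = [1.0, 2.0, 3.0]
print(node_mean(ys))  # 2.0
print(sse(ys))        # (1-2)^2 + 0 + (3-2)^2 = 2.0
```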