# CSCI 544, lecture 2: Naïve Bayes


Ron Artstein

2022-01-12

These notes are not comprehensive, and do not cover the entire lecture. They are provided as an aid to students, but are not a replacement for watching the lecture video, taking notes, and participating in class discussions. Any distribution, posting or publication of these notes outside of class (for example, on a public web site) requires my prior approval.

Administrative notes

Please log in to Zoom using USC credentials

Due January 18: Quiz

Reading for current and next class

Textbook chapters

Due January 20: Written Assignment

Due January 27: Coding Assignment 1

Instructions on website now

Data on Blackboard now

Submission on Vocareum soon

Probability review

P(A, B) = P(A) · P(B|A) = P(B) · P(A|B)
(joint probability = prior probability · conditional probability)

Bayes’ theorem:

P(A|B) = P(A) · P(B|A) / P(B)
(posterior probability ∝ prior probability · conditional probability)

When P(B) is constant:

P(A|B) ∝ P(A) · P(B|A)

The importance of priors

Sometimes it is easy to determine conditional probabilities:

Infected → 1% false negative (99% sensitivity)

Not infected → 1% false positive (99% specificity)

Suppose a person tests positive. How likely are they infected?


It depends on the incidence! Suppose 1% of population infected.

Out of 10,000 people tested:

|              | Positive | Negative |
|--------------|----------|----------|
| Infected     | 99       | 1        |
| Not infected | 99       | 9801     |

Posterior is 99/198 = 50%: higher than the prior of 1%.
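This arithmetic is easy to check in code. Below is a minimal sketch of Bayes’ theorem for the test example (function and parameter names are mine, for illustration):

```python
def posterior(prior, sensitivity, specificity):
    """P(infected | positive test), by Bayes' theorem."""
    p_pos_given_infected = sensitivity      # P(positive | infected)
    p_pos_given_healthy = 1 - specificity   # P(positive | not infected)
    p_pos = prior * p_pos_given_infected + (1 - prior) * p_pos_given_healthy
    return prior * p_pos_given_infected / p_pos

print(posterior(prior=0.01, sensitivity=0.99, specificity=0.99))  # 0.5
```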

Classification

Classification task: assign labels (classes) to items.

Two kinds of squirrels in California:

Ground squirrel and Tree squirrel

Make observations about individual squirrels:

Tail: bushy or thin?

Ears: pointy or round?

Location: in a tree or on the ground?

For each individual squirrel, identify most likely class:

argmax_class P(class|observations)

Naïve Bayes model (Bayes part)

Often it is easier to model the observations given the class (e.g., how likely is a ground squirrel to have a bushy tail?)

P(observations|class)

Use Bayes’ theorem to turn one probability into the other!

P(class|observations) ∝ P(class) · P(observations|class)

This is a generative model

Need to know class priors

Don’t need observation priors: P(observations) is constant given the observations.

Naïve Bayes model (naïve part)

Modeling the probability of observations:

Observations represented as features: f1, f2, f3, . . . (e.g., Tail, Ears, Location, . . . )

P(observations|class)

= P(f1, f2, f3, . . . |class)

= P(f1|class) · P(f2|class, f1) · P(f3|class, f1, f2) · · ·

Naïve assumption:

observations are conditionally independent, given the class:

For all fx,fy,fz,…: P(fx|class,fy,fz,…) = P(fx|class)

P(observations|class)

= P(f1, f2, f3, . . . |class)

= P(f1|class) · P(f2|class, f1) · P(f3|class, f1, f2) · · ·

≈ P(f1|class) · P(f2|class) · P(f3|class) · · ·
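In code, the naïve assumption reduces classification to a product of lookups. A minimal sketch, assuming per-class priors and per-feature conditional probabilities are stored in dictionaries (the names here are illustrative):

```python
def nb_score(features, prior, cond):
    """P(class) times the product of P(f|class), under the naive assumption."""
    score = prior
    for f in features:
        score *= cond[f]
    return score

def classify(features, priors, conds):
    """argmax over classes of P(class|observations)."""
    return max(priors, key=lambda c: nb_score(features, priors[c], conds[c]))
```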

Naïve Bayes model: parameter estimation

We have a model with many parameters:

For each class, a prior probability

For each feature × class, a conditional probability

How do we estimate the parameters of our model?

Maximum Likelihood Estimate: choose the model that gives the highest probability to the observed data

Suppose I flip a coin, and it falls 6 times on heads, 4 times on tails. What’s the most likely bias of the coin?

If the coin is 50%–50%: P = 0.20508

If the coin is 60%–40%: P = 0.25082 ← maximum
If the coin is 70%–30%: P = 0.20012

MLE of a probability distribution = observed frequencies
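The three likelihoods come straight from the binomial formula, which a few lines of Python can reproduce:

```python
from math import comb

def likelihood(p, heads=6, tails=4):
    """Probability of 6 heads and 4 tails in 10 flips if P(heads) = p."""
    return comb(heads + tails, heads) * p**heads * (1 - p)**tails

for p in (0.5, 0.6, 0.7):
    print(p, round(likelihood(p), 5))  # 0.20508, 0.25082, 0.20012
```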

Naïve Bayes model: five squirrels

We found 5 squirrels (training data)

| Class  | Tail  | Ears   | Location |
|--------|-------|--------|----------|
| Ground | bushy | pointy | ground   |
| Ground | bushy | round  | ground   |
| Ground | thin  | round  | ground   |
| Tree   | bushy | pointy | ground   |
| Tree   | bushy | round  | tree     |

Maximum likelihood estimates (observed frequencies):

|               | Ground | Tree |
|---------------|--------|------|
| Class prior   | 3/5    | 2/5  |
| Bushy tail    | 2/3    | 2/2  |
| Thin tail     | 1/3    | 0/2  |
| Pointy ears   | 1/3    | 1/2  |
| Round ears    | 2/3    | 1/2  |
| On the ground | 3/3    | 1/2  |
| In a tree     | 0/3    | 1/2  |

Naïve Bayes model: new squirrel

New squirrel: Thin tail, round ears, in a tree. What does the model predict?

P(Ground|observations) ∝ 3/5 · 1/3 · 2/3 · 0/3 = 0

P(Tree|observations) ∝ 2/5 · 0/2 · 1/2 · 1/2 = 0

Zero probability of just one feature → product = zero!

Smoothing to get rid of zero probabilities

Naïve Bayes model: add-one smoothing

Add 1 to each count, and add the number of possible feature values (here 2) to each denominator:

|               | Ground | Tree |
|---------------|--------|------|
| Class prior   | 3/5    | 2/5  |
| Bushy tail    | 3/5    | 3/4  |
| Thin tail     | 2/5    | 1/4  |
| Pointy ears   | 2/5    | 2/4  |
| Round ears    | 3/5    | 2/4  |
| On the ground | 4/5    | 2/4  |
| In a tree     | 1/5    | 2/4  |

New squirrel: thin tail, round ears, in a tree. What does the model predict?

P(Ground|observations) ∝ 3/5 · 2/5 · 3/5 · 1/5 = 0.0288
P(Tree|observations) ∝ 2/5 · 1/4 · 2/4 · 2/4 = 0.0250

The model now predicts Ground.

What’s the most likely squirrel?

P(Ground, observations) = 3/5 · 3/5 · 3/5 · 4/5 = 0.1728
(Ground; bushy tail, round ears, on the ground)
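The whole worked example fits in a short script. Here is a sketch that reproduces the smoothed posteriors above (the data encoding is mine):

```python
from collections import Counter

# The five training squirrels: (class, tail, ears, location)
data = [("Ground", "bushy", "pointy", "ground"),
        ("Ground", "bushy", "round",  "ground"),
        ("Ground", "thin",  "round",  "ground"),
        ("Tree",   "bushy", "pointy", "ground"),
        ("Tree",   "bushy", "round",  "tree")]

class_counts = Counter(row[0] for row in data)

def p_feature(value, index, cls, n_values=2):
    """Add-one smoothed P(feature = value | class); each feature is binary."""
    count = sum(1 for row in data if row[0] == cls and row[1 + index] == value)
    return (count + 1) / (class_counts[cls] + n_values)

def score(observation, cls):
    s = class_counts[cls] / len(data)  # class prior, unsmoothed as in the slides
    for i, value in enumerate(observation):
        s *= p_feature(value, i, cls)
    return s

new = ("thin", "round", "tree")
print(score(new, "Ground"))  # 0.0288
print(score(new, "Tree"))    # 0.0250
```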

Data splits for machine learning

Training data: What our program learns from
Labeled (supervised learning)
Unlabeled (unsupervised learning)

Test data: Used for evaluating the program
Provides correct answers

Development data: Testing during development
Avoid “contaminating” the test data
Try different models
Tune parameters

Coding Assignment 1

Write a Naïve Bayes classifier from scratch

Classify hotel reviews on two dimensions:

Positive / Negative
Truthful / Deceptive

Words as features

Graded on performance

Programming in Python

Coding Assignment 1: directory structure

(Figure: directory tree of the data — each review category contains folds fold1 through fold4.)

Coding Assignment 1: programs

Training data → Learning program → Model file(s)

Model file(s) + Test data → Classification program → Output file

Output file + Answer key → Scoring program → Score

Coding Assignment 1: notes

Reserve data for development

Put training and development data in separate directories

Vocareum uses fold1 for development (see the sketch below)
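One way to set this up is sketched below with os.walk (it assumes the reviews are .txt files somewhere under the fold directories; check the actual layout of the assignment data):

```python
import os

def read_reviews(root):
    """Split review files into training and development sets by fold."""
    train, dev = [], []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".txt"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8") as f:
                text = f.read()
            # hold out fold1 for development, matching the Vocareum setup
            (dev if "fold1" in dirpath.split(os.sep) else train).append((path, text))
    return train, dev
```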

Problem formulation

Two binary classifiers, or one 4-class classifier?

Reference solution uses add-one smoothing

Reference solution ignores unknown test tokens

Tokenization

Experiment to see what works best!

What is a word?

One word or two?

handwriting ; hand writing

; Forty niner

San Francisco-based ; San Francisco–based

he’s ; it’ll

David’s book ; David’s happy

Tokenization: splitting the text into units for processing
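A first-pass tokenizer is often a single regular expression; different patterns make different choices on the examples above. One option (not the reference solution):

```python
import re

# words, contraction/possessive suffixes ('s, 'll, ...), or single punctuation marks
TOKEN = re.compile(r"\w+|'\w+|[^\w\s]")

print(TOKEN.findall("He's San Francisco-based."))
# ['He', "'s", 'San', 'Francisco', '-', 'based', '.']
```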

Token, stem, lemma

Token: atomic unit for processing

Anywhere on the scale character ←→ word

Stem: wordform stripped of some characters

cat, cats → cat

defense, defensible → defens

Lemma: the base (or citation) form of a word

cat, cats → cat
brought → bring
is, were → be

Lemmatizer = function: {wordform|token} → lemma
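NLTK ships both a stemmer and a lemmatizer. A minimal sketch (requires the nltk package and its WordNet data; exact outputs can vary by version):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # the lemmatizer needs WordNet

stem = PorterStemmer().stem
lemma = WordNetLemmatizer().lemmatize

print(stem("defensible"))         # 'defens' -- a stem, not necessarily a word
print(lemma("cats"))              # 'cat'
print(lemma("brought", pos="v"))  # 'bring' -- irregular forms need the POS
```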

Information lost (or gained)

Tokenization can be lossy:

Where were you born? → where be you bear
Where is your bear? → where be you bear

Is it important to be able to reconstruct the original?

Depends on the application

Byte-pair encoding (BPE): use lossless compression

Not necessarily linguistically meaningful

May be useful for some applications
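The core of BPE is simple to sketch: repeatedly find the most frequent adjacent pair of symbols and merge it into one new symbol. A toy version on a made-up corpus:

```python
from collections import Counter

def most_frequent_pair(vocab):
    """vocab maps a tuple of symbols (a word) to its corpus frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs.most_common(1)[0][0]

def merge(vocab, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

vocab = {tuple("lower"): 2, tuple("lowest"): 1, tuple("newer"): 3}
for _ in range(3):
    pair = most_frequent_pair(vocab)
    vocab = merge(vocab, pair)
    print(pair, vocab)  # first merge: ('w', 'e') -> 'we'
```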

Building a lemmatizer

Rules • Dictionary • Learn from corpus (learn what?)

What are some problems with using a dictionary?

Word form not in the dictionary (OOV):

Kardashians → Kardashian

Word form ambiguous:

I saw the chicken lay an egg → lay

I saw the chicken lay in the sun → lie

OOV and disambiguation: major issues in all of NLP