# Statistics 211 Project

Due Date: May 4, 2017

1 Project Background

Zipf’s law states that given some corpus of natural language utterances, the frequency of

any word is inversely proportional to its rank in the frequency table. Suppose we counted

the number of times each word was used in the written works by Shakespeare, Alexander

Hamilton, James Madison, or some other author with a substantial written record. Can we

say anything about the frequencies of the most common words?

We let fi be the rate per 1000 words of text for the ith most frequent word used. The

American linguist George Zipf (1902-1950) observed an interesting law-like relationship between rate and rank,

E[fi|i] = a / i^b, (1)

and further observed that the exponent b ≈ 1. Taking logarithms of both sides, we have

approximately,

E[Yi|i] = c − b log(i), (2)

where Yi = log(fi) and c = log(a).
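As a quick numerical check of the step from (1) to (2), with arbitrary illustrative values of a and b (not estimates from any data):

```r
# Check that log(a / i^b) = log(a) - b*log(i), i.e. (1) implies (2)
a <- 12.5; b <- 1.0                      # arbitrary illustrative constants
i <- 1:50                                # ranks
f <- a / i^b                             # rates under Zipf's law, equation (1)
all.equal(log(f), log(a) - b * log(i))   # TRUE
```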

Zipf’s law has been applied to frequencies of many other classes of objects besides words,

such as the frequency of visits to web pages on the internet and the frequencies of species of

insects in an ecosystem.

2 The Data

The data in words.txt give the frequencies of words in works from four different sources:

the political writings of eighteenth-century American political figures Alexander Hamilton,

James Madison, and John Jay, and the book Ulysses by twentieth-century Irish writer James

Joyce. The data are from Mosteller and Wallace (1964, Table 8.1-1), and give the frequencies

of 165 very common words. Several missing values occur in the data; these are really words

that were used so infrequently that their count was not reported in Mosteller and Wallace’s

table. The following table provides a description of the variables in the data set words.txt.

Word          The word
Hamilton      Rate per 1000 words of this word in the writings of Alexander Hamilton
HamiltonRank  Rank of this word in Hamilton's writings
Madison       Rate per 1000 words of this word in the writings of James Madison
MadisonRank   Rank of this word in Madison's writings
Jay           Rate per 1000 words of this word in the writings of John Jay
JayRank       Rank of this word in Jay's writings
Ulysses       Rate per 1000 words of this word in Ulysses by James Joyce
UlyssesRank   Rank of this word in Ulysses


3 Assignment

There are four parts to this assignment. You will complete this assignment in pairs and

submit one report.

Part 1: Using only the 50 most frequent words in Hamilton’s work (that is, using

only rows in the data for which HamiltonRank≤50), draw the appropriate summary graph,

estimate the mean function in (2), construct a 95% confidence interval for b, and summarize

your results.
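A possible R sketch for Part 1 (column names as in the table above; the plotting choices here are one reasonable approach, not the required one — note the fitted slope estimates −b, so negate it to recover b):

```r
words <- read.table("words.txt", header = TRUE, sep = "")

# Keep the 50 most frequent words in Hamilton's writings
ham <- subset(words, HamiltonRank <= 50)

# Summary graph: log(rate) against log(rank)
plot(log(ham$HamiltonRank), log(ham$Hamilton),
     xlab = "log(rank)", ylab = "log(rate per 1000 words)")

# Fit the mean function in (2): Y = c - b*log(i) + error
fit <- lm(log(Hamilton) ~ log(HamiltonRank), data = ham)
summary(fit)

# 95% confidence interval for the slope, which is -b;
# negate and reverse the endpoints to get an interval for b
confint(fit, "log(HamiltonRank)", level = 0.95)
```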

Part 2: Use the following residual bootstrap method to construct a confidence interval

for b. Let ĉ and b̂ be the least squares estimators for c and b in equation (2) respectively.

Compute the residuals as

êi = Yi − ĉ + b̂ log(i), 1 ≤ i ≤ 50. (3)

Draw 50 samples {ê*1, . . . , ê*50} with replacement from the residuals {ê1, . . . , ê50}. Given {ê*1, . . . , ê*50}, construct the bootstrap sample through

Y*i = ĉ − b̂ log(i) + ê*i, 1 ≤ i ≤ 50. (4)

Run a linear regression based on the bootstrap sample {(Y*i, log(i))}, 1 ≤ i ≤ 50, and let b̂* be the corresponding least squares estimator for b. Repeat the above procedure 1000 times and sort the bootstrap estimators for b from the smallest to the largest as

b̂*(1) ≤ b̂*(2) ≤ · · · ≤ b̂*(999) ≤ b̂*(1000).

Then, the 95% bootstrap confidence interval for b is defined as [b̂*(25), b̂*(975)]. Compare this interval with the confidence interval obtained in Part 1. Make a histogram of the b̂* values and include it in your report.
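One way to code this residual bootstrap in R (assuming `ham` and `fit` are the data subset and least squares fit from a Part 1 analysis; the variable names are illustrative):

```r
set.seed(211)                          # arbitrary seed, for reproducibility
logi <- log(ham$HamiltonRank)
chat <- coef(fit)[1]                   # c-hat (intercept)
bhat <- -coef(fit)[2]                  # b-hat (the fitted slope is -b)
ehat <- residuals(fit)                 # e-hat_i from equation (3)

B <- 1000
bstar <- numeric(B)
for (r in 1:B) {
  estar <- sample(ehat, length(ehat), replace = TRUE)  # resample residuals
  ystar <- chat - bhat * logi + estar                  # equation (4)
  bstar[r] <- -coef(lm(ystar ~ logi))[2]               # bootstrap estimate of b
}

bstar <- sort(bstar)
c(bstar[25], bstar[975])               # 95% bootstrap interval for b
hist(bstar, main = "Bootstrap estimates of b", xlab = "b*")
```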

Part 3: Repeat Part 1, but using words with rank 75 or less, and again using words with rank less than 100. For larger numbers of words, Zipf's law may break down. Does that seem to happen with these data?
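These refits can be scripted in one loop (a sketch, with `words` as read in Part 1; the exact cutoff comparisons, ≤ 75 versus < 100, should follow the assignment wording):

```r
# Refit the Zipf regression at several rank cutoffs and compare slopes
cutoffs <- c(50, 75, 100)
bhats <- sapply(cutoffs, function(k) {
  sub <- subset(words, HamiltonRank <= k)
  -coef(lm(log(Hamilton) ~ log(HamiltonRank), data = sub))[2]
})
cbind(cutoff = cutoffs, b_hat = bhats)   # does b-hat drift as the cutoff grows?
```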

Part 4: Recall that ĉ and b̂ are the least squares estimators in Part 2. Denote by σ̂2 the

sample variance of the residuals {ê1, . . . , ê50}. Generate simulated data from the following

model,

Ỹi = ĉ − b̂ log(i) + ei, (5)

where ei follows the normal distribution with mean zero and variance σ̂2. Based on the simulated data (Ỹi, log(i)), test the null hypothesis H0 : b = b̂ versus the alternative Ha : b ≠ b̂ at the 5% level. Note that the true slope is indeed equal to b̂ (i.e., the null hypothesis is true) according to the way the data are generated. Simulate 1000 data sets according to (5) and count the number of times that the true null hypothesis is rejected. Report your finding. Now if you test the null H′0 : b = 0.96 versus the alternative H′a : b ≠ 0.96, what is the number of rejections among the 1000 simulated data sets?
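The simulation could be organized as below (again assuming `ham` and `fit` from a Part 1 analysis; the 5% test shown is a two-sided t-test on the fitted slope, which estimates −b):

```r
set.seed(211)                          # arbitrary seed, for reproducibility
logi <- log(ham$HamiltonRank)
chat <- coef(fit)[1]
bhat <- -coef(fit)[2]
sig2 <- var(residuals(fit))            # sigma-hat^2

nsim <- 1000
crit <- qt(0.975, df = length(logi) - 2)
rej_true <- 0; rej_096 <- 0
for (r in 1:nsim) {
  y   <- chat - bhat * logi + rnorm(length(logi), 0, sqrt(sig2))  # model (5)
  f   <- lm(y ~ logi)
  est <- coef(f)[2]
  se  <- summary(f)$coefficients[2, 2]
  rej_true <- rej_true + (abs((est + bhat) / se) > crit)   # H0:  b = b-hat
  rej_096  <- rej_096  + (abs((est + 0.96) / se) > crit)   # H0': b = 0.96
}
c(rej_true, rej_096)   # rej_true should be near 50 (5% of 1000 true nulls)
```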


4 Submission and Grading

Your printed report (along with your R code) is due on May 4, 2017. Partners submit one printed report. No late reports will be accepted. Both partners in the pair will receive the same grade. The score is out of 100 points: 20 points for each part, plus an additional 20 points for the presentation of results and your R code. Make sure your report is professional, with correct punctuation and spelling. Omit needless words.

5 Some Useful R Functions

1. Read the data: read.table("words.txt", header = TRUE, sep = "")

2. Linear regression: fit <- lm(Y ~ X)

3. Summarize the output from the least squares fit: summary(fit)

4. Construct a 95% confidence interval: confint(fit, 'X', level = 0.95)

5. Sampling with replacement: sample(1:n, n, replace = TRUE)

6. Use a for loop to run a simulation: for(i in 1:1000){...}