# MATH10282 Introduction to Statistics

Semester 2, 2016-17

Coursework using R

The deadline for submitting this coursework is 4.00pm on Friday 28 April 2017. You should upload your report to the assessment entitled ‘R Coursework Submission’ on the MATH10282 Blackboard site by this time and date. Please note that a similarity report on your work will be generated by Turnitin to detect plagiarism.

This coursework comprises 10% of the overall marks for the module.

Instructions

(a) You should prepare your coursework report using Microsoft Word. A PDF report from another program is also acceptable. You can include code and numerical results from R by copying and pasting into your Word document. Comments and discussion of the results should be added as required. You can save any plots created in R, for example as a PDF, and import these into your final report. Include in your report all R commands used to generate results.

(b) To facilitate anonymous marking, please do not include your name in your report. You should include your Student ID in the title of your submission on Turnitin, and on the first page of your report.

(c) If you have any queries or problems, please contact me as soon as possible.

Tim Waite, March 2017

Question 1: Weather station data

The data¹ are contained in the files durham.txt and eastbourne.txt available on Blackboard. They comprise various measurements collected at two weather stations, Durham and Eastbourne. Each row corresponds to a different month.

• Column 1: year (labelled yyyy)

• Column 2: month (labelled mm), coded 1 (January), …, 12 (December).

• Column 3: average daily maximum temperature (for that month) (labelled tmax)

• Column 4: average daily minimum temperature (for that month) (labelled tmin)

(a) Read the data into R.

(b) When do records begin at the Eastbourne station? When do records begin at the Durham station?

(c) Plot a histogram of the distribution of the average daily minimum temperature at Durham. Comment on any special features of the shape of the distribution.

¹ Adapted from public data released under an Open Government Licence at https://data.gov.uk/dataset/historic-monthly-meteorological-station-data

(d) Plot a histogram of the distribution of the average daily minimum temperatures at Durham in January. Do you notice anything interesting about the shape of the distribution compared to part (c)?

(e) Draw box plots comparing the average daily minimum temperatures at Durham with those at Eastbourne. Put both box plots on the same axes. Comment on what your box plots indicate.

[Hint: try passing two data vectors as arguments to an appropriate function]

[12 marks]
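As a rough illustration of the R mechanics behind parts (a) and (e), the sketch below reads a whitespace-delimited text file with a header row and draws two box plots on the same axes. It assumes the file has a header row carrying the column labels and whitespace-separated columns, which the description above does not confirm. The file name toy.txt, the names station_a and station_b, and every number in the file are made up for illustration and are not the coursework data.

```r
# Toy file standing in for a station file; all values are invented for illustration.
toy_path <- file.path(tempdir(), "toy.txt")
writeLines(c("yyyy mm tmax tmin",
             "2000 1 6.1 0.8",
             "2000 2 7.4 1.2",
             "2000 3 9.9 2.5",
             "2000 4 12.3 4.1"),
           toy_path)

# Assumption: the first row holds the column labels and columns are whitespace-separated.
toy <- read.table(toy_path, header = TRUE)
str(toy)        # check column names and types
head(toy, 2)    # inspect the first rows

# Passing two vectors to boxplot() draws both box plots on the same axes.
station_a <- toy$tmin
station_b <- toy$tmin + 3   # hypothetical second station, shifted for illustration
bp <- boxplot(station_a, station_b,
              names = c("Station A", "Station B"),
              ylab = "Average daily minimum temperature")
```

If the real files use a different separator or have no header row, the arguments to read.table() would need adjusting accordingly.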

Question 2: Coverage of confidence intervals

In this question, you will investigate the coverage properties of confidence intervals using simulation. Suppose that $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ independently, and let $X = (X_1, \ldots, X_n)$. Define an interval estimator for $\mu$ via

$$I(X) = \left[ \bar{X} - \frac{cS}{\sqrt{n}},\ \bar{X} + \frac{cS}{\sqrt{n}} \right],$$

where $c$ is a constant to be chosen appropriately. Recall from lectures that the coverage probability of $I(X) = [a(X), b(X)]$ is defined as $P[a(X) \le \mu \le b(X)]$, i.e. the probability that the random end-points of the interval contain the true value of the parameter $\mu$.

(a) What value of $c$ should be chosen to ensure that the coverage probability of $I(X)$ is exactly $1 - \alpha$, in other words to ensure that $I(X)$ is an exact $100(1 - \alpha)\%$ confidence interval for $\mu$?

For the remainder of this question, suppose that the value $c = z_{\alpha/2}$ is used instead, where $z_{\alpha/2}$ denotes the upper $\alpha/2$ point of a $N(0, 1)$ distribution.
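For reference, the upper α/2 point of a standard normal distribution can be obtained in R with qnorm(). The snippet below is a small sketch using α = 0.05, the value that appears later in part (d); the variable name z_upper is chosen here for illustration only.

```r
alpha <- 0.05

# Upper alpha/2 point: the value z with P(Z > z) = alpha/2 for Z ~ N(0, 1).
z_upper <- qnorm(1 - alpha / 2)
z_upper   # approximately 1.96

# Equivalent form that asks for the upper tail directly.
qnorm(alpha / 2, lower.tail = FALSE)
```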

(b) What would you anticipate to be the approximate value of the coverage probability of $I(X)$ in this case? Explain why there might be a difference between your suggestion and the actual coverage probability if $n = 9$.

(c) Write additional code to complete the R function given below. Your completed function should do the following:

(i) simulate a number, nsims, of different datasets each of size $n$ from a $N(\mu, \sigma^2)$ distribution

(ii) for each simulated dataset $X$ compute the lower and upper endpoints of $I(X)$

(iii) create a logical vector called covered whose $i$th element records whether, for the $i$th simulated dataset, the interval $I(X)$ contains $\mu$

(iv) return the vector covered as output

(d) Use your completed R function to simulate 10,000 datasets each of size $n = 9$, with $\mu = 5$, $\sigma^2 = 2^2$, and calculate for each dataset whether $I(X)$ contains $\mu$, using $\alpha = 0.05$. [N.B. you do not need to record the individual data sets or confidence intervals, just whether or not $I(X)$ includes $\mu$].

(e) For what proportion of your simulated datasets does $I(X)$ contain the true value of $\mu$? Comment on your result in relation to your answer to part (b).

[8 marks]
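The logical vector described in part (c)(iii) and the proportion asked for in part (e) rely on two general R idioms, sketched here on made-up interval endpoints: a vectorised comparison yields a logical vector, and mean() of a logical vector gives the proportion of TRUE values. The endpoint values and the names mu_true, contains_mu and prop_covered are hypothetical and unrelated to the simulation itself.

```r
mu_true <- 5                       # hypothetical parameter value
lower <- c(3.8, 5.2, 4.1, 4.9)     # made-up lower endpoints
upper <- c(6.0, 7.1, 5.5, 6.3)     # made-up upper endpoints

# TRUE where the interval [lower, upper] contains mu_true.
contains_mu <- (lower <= mu_true) & (mu_true <= upper)
contains_mu      # TRUE FALSE TRUE TRUE

# TRUE counts as 1 and FALSE as 0, so the mean is the proportion of TRUE values.
prop_covered <- mean(contains_mu)
prop_covered     # 0.75
```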

    coverage <- function(n=9, mu=5, sigma=2, alpha=0.05, nsims=10000) {
      xi <- numeric(n)
      xbar <- numeric(nsims)
      covered <- numeric(nsims)
      # repeat the following nsims times
      for (i in 1:nsims) {
        xi <- rnorm(n, mean=mu, sd=sigma) # simulate a new dataset of size n
        xbar <-       # these lines need to be completed
        s <-          # .
        zalpha <-     # .
        upper <-      # .
        lower <-      # .
        covered[i] <- # .
      }
      return(covered)
    }

[END OF COURSEWORK QUESTIONS]