# 程序代写代做代考 Excel scheme STAT7001 STATISTICS FOR PRACTICAL COMPUTING —

STAT7001 STATISTICS FOR PRACTICAL COMPUTING —

ASSESSMENT 2 (2016/17 SESSION)

• Your solutions should be your own work and are to be submitted electronically to the

course Moodle page by 12 noon on MONDAY, 24TH APRIL 2017.

• Ensure that you electronically ‘sign’ the plagiarism declaration on the Moodle page

when submitting your work.

• Late submission will incur a penalty unless there are extenuating circumstances (e.g.

medical) supported by appropriate documentation and notified within one week of the

deadline above. Penalties, and the procedure in case of extenuating circumstances, are

set out in the latest editions of the Statistical Science Department student handbooks

which are available from the departmental web pages.

• Failure to submit this in-course assessment will mean that your overall examination

mark is recorded as “non-complete”, i.e. you will not obtain a pass for the course.

• Submitted work that exceeds the specified word count will be penalized. The penalties

are described in the detailed instructions below.

• Your solutions should be your own work. When uploading your scripts, you will be

required to electronically sign a statement confirming this, and that you have read the

Statistical Science department’s guidelines on plagiarism and collusion (see below).

• Any plagiarism or collusion can lead to serious penalties for all students involved,

and may also mean that your overall examination mark is recorded as non-complete.

Guidelines as to what constitutes plagiarism may be found in the departmental student

handbooks: the relevant extract is provided on the ‘In-course assessment 2’ tab on the

STAT7001 Moodle page. The Turn-It-In plagiarism detection system may be used to

scan your submission for evidence of plagiarism and collusion.

• You will receive feedback on your work via Moodle, and you will receive a provisional

grade. grades are provisional until confirmed by the Statistics Examiners’ Meeting in

June 2017.

Background and overview

On 23rd June 2016, a referendum was held in the UK to decide whether or not to remain

part of the European Union (EU). 72% of registered voters took part. Of those, 51.2% voted

to leave the EU, and 48.1% voted to remain.

This result was unexpected, and there has been extensive commentary on the reasons for

it. On 6th February 2017, the BBC News web site carried an article entitled Local voting

figures shed new light on EU referendum (the article is at http://www.bbc.co.uk/news/

uk-politics-38762034). The article is by Martin Rosenbaum, the BBC’s Freedom of

Information specialist. He obtained data from 1070 electoral wards,1 giving the numbers of

‘Leave’ and ‘Remain’ votes cast in each ward.

1An electoral ward is the smallest administrative division for election purposes in the UK, typically with

a population of around 5 500. There are almost 9 500 electoral wards in the UK.

http://www.bbc.co.uk/news/uk-politics-38762034

http://www.bbc.co.uk/news/uk-politics-38762034

In his article, Martin Rosenbaum goes on to calculate some statistical associations between

the proportion of ‘Leave’ votes in a ward, and some of its social, economic and demographic

characteristics according to the most recent UK census which was conducted in 2011. His

article looks at characteristics such as education, age and ethnicity taken individually. What

it does not do, however, is to investigate combinations of these characteristics. This could

be important, because there could be other variables that simultaneously influence (say)

education level and the propensity to vote ‘Leave’, and which thereby create the illusion of

a causal link between them.

The BBC web page provides the voting data that were used in Martin Rosenbaum’s analysis,

but not the census data. However, he has very kindly shared his census data with us — and,

for this in-course assessment, they have been supplemented with some additional information

as well.

The data are provided to you in the CSV file ReferendumResults.csv on the ‘In-course

assessment 2’ tab of the STAT7001 Moodle page. For each of the 1070 electoral wards, this

file provides values of around 45 variables that may be relevant in understanding why people

voted as they did (the Appendix to these instructions gives a full list of variables, along with

other metadata). In addition, for the first 803 wards in the data file, the numbers of ‘Leave’

votes are provided, as well as the total number of votes for ‘Leave’ and ‘Remain’ combined.

For the final 267 wards however, the numbers of ‘Leave’ votes are not provided to you: they

are given as ‘−1’ in the data file.

Your task in this assessment is to use the data on the first 803 wards, to build a statistical

model that will help you to:

• Understand the social, economic and demographic characteristics that are associated

with the voting outcome for a ward; and

• Estimate the proportion of ‘Leave’ votes in each of the 267 wards for which you don’t

have this information.

Detailed instructions

You may use either R or SAS for this assessment.

1. Read the data into your chosen software package, and carry out any necessary recoding

(e.g. to deal with the fact that ‘−1’ represents a missing value).

2. Carry out an exploratory analysis that will help you to start building a sensible sta-

tistical model to explain and predict the proportion of ‘Leave’ votes in a ward. This

analysis should aim to reduce the number of candidate variables to take into the subse-

quent modelling exercise, as well as to identify any important features of the data that

may have some implications for the modelling. You will need to consider the context

of the problem to guide your choice of exploratory analysis. See the ‘Hints’ below for

some ideas.

1

3. Using your exploratory analysis as a starting point, develop a statistical model that

enables you to predict the proportion of ‘Leave’ votes in a ward, based on (a subset of)

the ward characteristics. To be convincing, you will need to consider a range of models

and to use an appropriate suite of diagnostics to assess them. Ultimately however, you

are required to recommend a single model that is suitable for interpretation, and to

justify your recommendation. Your chosen model should be either a linear model, a

generalized linear model or a generalized additive model.

4. Use your chosen model to predict the proportion of ‘Leave’ votes for each of the 267

wards with missing voting data, and also to estimate the standard deviation of your

prediction errors.

Submission for this assessment is electronic, via the STAT7001 Moodle page. You are required

to submit three files, as follows:

• A report on your analysis, not exceeding 2 000 words of text (including titles, footnotes,

appendices etc.) plus two pages of graphs and / or tables. Your report should be in

three sections, as follows:

I Describe briefly what aspects of the problem context you considered at the out-

set, how you used these to start your exploratory analysis, and what were the

important points to emerge from this exploratory analysis.

II Describe briefly (without too many technical details) what models you considered

in step (3) and why you chose the model that you did.

III State your final model clearly, summarise what your model tells you about the

characteristics associated with the proportion of ‘Leave’ votes, and discuss any

potential limitations of the model.

Your report should not include any computer code. It should include some graphs and

/ or tables, but only those that support your main points.

You are free to choose your own file name for the electronic submission of your report.

• An R script or SAS program corresponding to your analysis and predictions. Your

script /program should run without user intervention on any computer with R or SAS

installed, providing the file ReferendumResults.csv is present in the current working

directory / current folder. When run, it should produce any results that are mentioned

in your report, together with the predictions and the associated standard deviations.

The script / program should be named ########_ICA2.r or ########_ICA2.sas as

appropriate, where ######## is your student ID number. For example, if your ID num-

ber is 150123456 and you are using R, your script should be named 150123456_ICA2.r.

• A text file containing your predictions for the 267 wards with missing voting data.

This file should be named ########_pred.dat, where ######## is your student ID

number. The file should contain three columns, separated by spaces and with no

header. The first column should be the ward identifier (corresponding to variable ID in

2

file ReferendumResults.csv; the second should be the predicted proportion of ‘Leave’

votes for that ward, and the third should be the standard deviation of your prediction

error.

Marking criteria

There are 75 marks for this exercise. These are broken down as follows:

Report: 40 marks. The marks here are for: displaying awareness of the context for the

problem and using this to inform the statistical analysis; good judgement in the choice

of exploratory analysis and in the model-building process; a clear and well-justified

argument; clear conclusions that are supported by the analysis; and appropriate choice

and presentation of graphs and / or tables. The mark breakdown is as follows:

Awareness of context: 5 marks.

Exploratory analysis: 10 marks. These marks are for (a) tackling the problem in

a sensible way that is justified by the context (b) carrying out analyses that are

designed to inform the subsequent modelling.

Model-building: 10 marks. The marks are for (a) starting in a sensible place that

is justified from the exploratory analysis (b) appropriate use of model output and

diagnostics to identify potential areas for improvement (c) awareness of different

modelling options and their advantages and disadvantages (d) consideration of

the social, economic and demographic context during the model-building process.

Quality of argument: 5 marks. The marks are for assembling a coherent ‘narra-

tive’, for example by drawing together the results of the exploratory analysis so

as to provide a clear starting point for model development, presenting the model-

building exercise in a structured and systematic way and, at each stage, linking

the development to what has gone before.

Clarity and validity of conclusions: 5 marks. These marks are for stating clearly

what you have learned about the social, economic and demographic characteris-

tics that are related to the voting outcome in a ward, and for ensuring that this

is supported by your analysis and modelling.

Graphs and / or tables: 5 marks. Graphs and / or tables need to be relevant,

clear and well presented (for example, with appropriate choices of symbols, line

types, captions, axis labels and so forth). There is a one-slide guide to ‘Using

graphics effectively’ in the slides / handouts for Lecture 1 of this year’s course.

Note that you will only receive credit for any graphs in your report if your

submitted script / program generates these graphs, appropriately labelled, when

it is run.

Note that you will be penalised if your report exceeds EITHER the specified 2 000-word

limit or the number of pages of graphs and / or tables. Following the UCL guidelines

at https: // www. ucl. ac. uk/ srs/ academic-manual/ c4/ failure/ word-count ,

3

https://www.ucl.ac.uk/srs/academic-manual/c4/failure/word-count

the maximum penalty is 7 marks, and no penalty will be imposed that takes the fi-

nal mark below 30/75 if it was originally higher. Subject to these conditions, penalties

are as follows:

• More than two pages of graphs and / or tables: zero marks for graphs and / or

tables, in the marking scheme given above.

• Exceeding the word count by 10% or less: mark reduced by 4.

• Exceeding the word count by more than 10%: mark reduced by 7.

In the event of disagreement between reported word counts on different software sys-

tems, the count used will be that from the examiner’s system.

Coding: 15 marks. There are 3 marks here for reading the data and handling missing

values correctly; 7 marks for effective use of your chosen software (e.g. programming

efficiently and correctly); and 5 marks for clarity of your code — commenting, layout,

choice of variable / object names and so forth.

Prediction quality: 20 marks. The remaining 20 marks are for the quality of your pre-

dictions. Note, however, that you will only receive credit for your predictions if they

correspond to results produced by your submitted script / program when it is run: if

this is not the case, your predictions will earn zero marks.

For these marks, you are competing against each other. Your predictions will be as-

sessed using the following score:

S =

267∑

i=1

[

log σi +

(Yi − p̂i)

2

2σ2i

,

]

where:

Yi is the actual proportion of ‘Leave’ votes for the ith prediction;

p̂i is your predicted proportion of ‘Leave’ votes for the ith prediction;

σi is your quoted prediction error standard deviation for the ith prediction.

The score S is an approximate version of a proper scoring rule, which is designed to

reward predictions that are close to the actual observation and are also accompanied by

an accurate assessment of uncertainty (this was discussed during the Week 10 lecture,

along with the rationale for using this score for the assessment). Low values are better.

The scores of all of the students in the class (and the lecturer) will be compared:

students with the lowest scores will receive all 20 marks, whereas those with the highest

scores will receive fewer marks. The precise allocation of marks will depend on the

distribution of scores in the class.

If you don’t supply standard deviations for your prediction errors, the value of σi will be

taken as 1/2 for all of your predictions: this is the largest possible standard deviation

for any random variable taking values between 0 and 1, and the value of S will be

correspondingly large so that you will receive few if any marks for your predictions.

4

STAT7001 Assessment 2 — Hints

1. There is not a single ‘right’ answer to this assignment. There is a huge range of options

available to you, and many of them will be sensible.

2. You are being assessed not only on your computing skills, but also on your ability to

carry out an informed statistical analysis: material from other statistics courses (in

particular STAT2002, for students who have taken it) will be relevant here. To earn

high marks, you need to take a structured and critical approach to the analysis and to

demonstrate appropriate judgement in your choice of material to present.

3. At first sight, the task will appear challenging to many of you. However, there is a lot

that we already know: Martin Rosenbaum’s article is an obvious starting point. You

may also want to search for other commentaries on the UK referendum result, to gain

some understanding of what kinds of relationships you might look for in the data.

4. When building your model, you have two main decisions to make. The first is: should

it be a linear, generalized linear or generalized additive model? The second is: which

covariates should you include? You might consider the following points:

Linear, generalized linear or generalized additive? This is best broken down into

two further questions, as follows:

• Conditional on the covariates, can the response variable be assumed to fol-

low a normal distribution with constant variance? In this assignment, the

response variable is a proportion and therefore cannot have exactly a normal

distribution. However, there are thousands of votes in each ward: the Central

Limit Theorem may apply, therefore, so that the response distribution has

approximately a normal distribution — in which case you may judge that

the approximation is adequate for your purposes. However, the ‘constant

variance’ assumption may be more problematic: given that the response is

a proportion, you might think that a binomial distribution would be appro-

priate, but the variance of a binomial proportion is p(1 − p)/n in an obvious

notation. Since this depends on p, and p varies between wards, the variance

cannot be constant. Whether this is a problem depends on how much the

‘Leave’ probability p varies: if it doesn’t vary much, then you may wish to

claim that the variance is approximately constant. If it varies a lot however,

then you could probably improve your predictions (and hence your score!) by

accounting for it. You might consider using your exploratory analysis to gain

some preliminary insights into this point.

• Are the covariate effects best represented parametrically or nonparametrically?

Again, your exploratory analysis can be used to gain some preliminary in-

sights into this. You may want to look at the material from week 6, for

examples of situations where a nonparametric approach is needed.

Which covariates? The data file contains around 45 potential covariates, and many

of them may be highly correlated. It will not be possible for you to carry out a

5

modelling exercise involving all of them, so you need to reduce this number before

you start. You have several options here. Some of them are as follows:

• Look at other published commentaries on the referendum result. What mea-

sures are considered useful? Can these be linked to covariates for which you

have information? Obviously, if you do this then you will need to acknowledge

your sources in your report.

• Define useful summary measures on contextual grounds, and work with these.

For example, 16 of the potential covariates in the data file are percentages of

the population in different age categories (0 to 4, 5 to 7, . . ., all the way up

to ‘90 plus’). You may decide just to work with ‘young voters’ (18 to 29 —

18 is the minimum voting age in the UK), ‘working age’ (30 to 64 say) and

‘retirement age’ (65 and above). Or, indeed, to adopt your own categories

— the results are unlikely to be sensitive to the precise definitions. Similar

comments apply to the potential covariates representing ethnicity, household

deprivation and so on.

• Define new variables based on the correlations between the existing variables,

and work with these. If several variables are highly correlated, then it is

difficult to disentangle their effects and it may be preferable to work with a

single ‘index’ that combines all of them. This is the basis of techniques such

as Principal Components Analysis, that were discussed during the Week 10

lecture (along with how to implement them in R and SAS).

You should not start to build any models until you have reduced the number

of candidate covariates to a manageable number. Your decisions here should be

guided by your exploratory analysis, as well as your understanding of the context.

5. Don’t forget to look for interactions! For example, one of the variables in the data set

is RegionName, which is a factor (i.e. categorical covariate) indicating the UK region

in which each ward is located. Possibly there is regional variation in the strength of

dependence between other characteristics and the proportion of ‘Leave’ votes. Look at

the analysis of the iris data from Workshop 2, for a similar kind of situation.

6. If you use a linear model, it is straightforward to obtain the standard deviations of

your prediction errors using either R or SAS (look at the material in Workshops 2 and 9

respectively, to find out how to do it). However, for generalized linear and generalized

additive models you need some additional computations. Specifically:

(a) Suppose p̂i is your predicted probability of voting ‘Leave’ for the ith ward, and

that Yi is the actual proportion of ‘Leave’ voters as above.

(b) Then your prediction error will be Yi − p̂i.

(c) Yi and p̂i are independent, because p̂i is computed using only information from

the first 803 wards and Yi relates to one of the ‘new’ wards.

(d) The variance of your prediction error is thus equal to Var (Yi) + Var (p̂i).

6

(e) You can calculate the standard error of p̂i in both R and SAS, when making

predictions for new observations — see Workshops 6 and 9. Squaring this standard

error gives you Var (p̂i).

(f) You can estimate Var (Yi) by plugging in the appropriate formula for your chosen

distribution — for example, if you’re using a binomial distribution then V̂ar (Yi) =

p̂i(1 − p̂i)/ni, where ni is the number of votes for the ith ward.

(g) Hence you can estimate the standard deviation of your prediction error as σ̂i =√

V̂ar (Yi) + Var (p̂i).

Appendix: the ReferendumResults.csv data set

Data sources

The data provided in ReferendumResults.csv are from several different sources, as follows:

Martin Rosenbaum’s BBC article (data source MR1)

The article at http://www.bbc.co.uk/news/uk-politics-38762034 provides an Excel spread-

sheet, containing localised voting data supplied to the BBC by councils which counted the

EU referendum. Results are provided for all individual wards where data were available at

this level of detail: there are 1 283 such wards, of a total of 9 291 wards in the UK. Reasons

for the figures not being available at the remaining wards are:

• Three councils did not respond to the BBC’s request.

• Some councils refused to give the information to the BBC.

• For some councils, ballot boxes were mixed before counting so it was not possible to

identify the precise numbers of votes in each ward.

Important caveat: in many wards, some postal votes were mixed in prior to counting.

The BBC spreadsheet states “Figures which include postal votes cannot be treated as exact.

However broad patterns can still be identified in the data.”

Martin Rosenbaum’s data set from the 2011 UK census (data source MR2)

The variables in this data set are those that form the basis for the analysis reported in Martin

Rosenbaum’s article. He provided the following information when supplying the data to us:

7

http://www.bbc.co.uk/news/uk-politics-38762034

‘All the 2011 census data was downloaded via selecting datasets at https: //

www. nomisweb. co. uk/ query/ select/ getdatasetbytheme. asp? opt= 3& theme=

&subgrp= . (I calculated adult mean age from the raw counts of adults of each

age in each ward). Please note that some areas have seen boundary changes to

wards since 2011, so some wards with referendum voting data do not figure in

this list.’

This spreadsheet contains data for 8 570 wards.

The UK Register of Geographic Codes (data source RGC)

Geographical information for each ward was obtained from the UK ‘Register of Geographic

Codes’, downloaded on 14th March 2017 and located by searching at http://geoportal.

statistics.gov.uk/.

UK age structure by ward, from the 2011 UK census (data source ASW)

Martin Rosenbaum’s census data contains information on the mean adult age in each ward,

but it is possible that more detailed information on age profiles would be useful. Percent-

agess of population in different age bands were obtained for the 2011 census, from https://

www.nomisweb.co.uk/query/select/getdatasetbytheme.asp?collapse=yes under Cen-

sus 2011, Key Statistics and then Age Structure. This provides information on the same 8 570

wards that are present in data source MR2.

Data processing

The data sources have been combined in the following way to create ReferendumResults.csv:

1. The spreadsheets from sources MR1, MR2 and ASW were merged using the nine-digit

ward identification code (identified as WardCode in MR1). The Remain variable, giving

the number of ‘Remain’ votes in each ward, was replaced by an NVotes column giving

the total number of ‘Leave’ and ‘Remain’ votes. There are 1 070 wards remaining after

this merge: this decrease from the original 1 283 wards in MR1 is due to the exclusion

of wards for which the boundaries changed between the 2011 census and the 2016

referendum.

2. Source RGC was used to identify the administrative area type and region name for

each ward, again based on its nine-digit identification code. Some ward codes were

found to be duplicated in source RGC, but in all cases the administrative area type

and region name were identical for the duplicates.

3. The rows of the data table were randomly shuffled, so that the order of wards no

longer corresponds to that in any of the data sources. This was done in order to

prevent ‘cheating’ when making predictions.

8

https://www.nomisweb.co.uk/query/select/getdatasetbytheme.asp?opt=3&theme=&subgrp=

https://www.nomisweb.co.uk/query/select/getdatasetbytheme.asp?opt=3&theme=&subgrp=

https://www.nomisweb.co.uk/query/select/getdatasetbytheme.asp?opt=3&theme=&subgrp=

http://geoportal.statistics.gov.uk/

http://geoportal.statistics.gov.uk/

https://www.nomisweb.co.uk/query/select/getdatasetbytheme.asp?collapse=yes

https://www.nomisweb.co.uk/query/select/getdatasetbytheme.asp?collapse=yes

4. A subset of 267 wards was identified for the ‘prediction’ part of the assessment. This

was done in such a way that the distributions of all of the covariates in this subset is

very similar to the distribution in the remaining 803 wards. Specifically:

(a) In each region, 25% of the wards were sampled at random as candidates for making

predictions. This sample will be referred to as ‘Group 2’ below, with ‘Group 1’

comprising the remaining wards.

(b) For each of the numeric covariates in the data set, a Kolmogorov-Smirnov test was

performed to test the null hypothesis that the underlying distributions in Groups

1 and 2 are the same.

(c) The prediction sample was accepted only if the p-values for all of the Kolmogorov-

Smirnov tests were greater than 0.5 (this is not a typo). Otherwise, a new candi-

date sample was drawn in step (a) and the procedure was repeated.

The Kolmogorov-Smirnov test is used here as a convenient way to measure whether two

distributions are similar: the use of a high p-value threshold is chosen to ensure that

the resulting Groups 1 and 2 are very well balanced with respect to all of the covariate

values. Note, however, that the voting numbers were not included in this balancing

exercise: this is because the performance of predictions would be artificially enhanced

if the voting numbers were included (for example, we would know that the distribution

of Group 2 voting proportions is similar to that of Group 1 proportions). Note also

that no attempt has been made to balance the groups in terms of combinations of the

covariates.

5. The ‘Group 2’ rows were placed at the end of the data table, with their ‘Leave’ vote

numbers set to −1; a new ID variable was created so that each ward has an ID number

between 1 and 1070; the original nine-digit ward identification code was removed so that

the wards cannot be identified easily using online information; a few other redundant

covariates were removed (e.g. where one variable was the sum of two or more others);

some variables were renamed for ease of interpretation; and the columns were re-ordered

for convenience.

6. Some (but not all) of the covariates were multiplied by random numbers close to 1. This

makes no difference to any models that you fit, because the corresponding regression

coefficients will scale correspondingly; but it makes it even more difficult for you to

identify the wards using online information.

Description of variables

This section gives a brief description of each of the variables in ReferendumResults.csv,

and an indication of which data source it came from. Descriptions are compiled on the basis

of correspondence with Martin Rosenbaum and online documentation, including the UK

Census User Guide (notably Part 4 of the ‘Variables and Classifications’ section, together

with the Glossary) at https://www.ons.gov.uk/census/2011census/2011censusdata/

2011censususerguide/.

9

https://www.ons.gov.uk/census/2011census/2011censusdata/2011censususerguide/

https://www.ons.gov.uk/census/2011census/2011censusdata/2011censususerguide/

Variable name Source Description

ID — Ward ID number, from 1 to 1 070

AreaType RGC

Type of administrative area in which the ward is situ-

ated. This takes one of four values:

E06: Unitary authorities

E07: Non-metropolitan districts

E08: Metropolitan districts

E09: London boroughs

This is included because E08 and E09 indicate a large

urban area.

RegionName RGC

Name of the region within which the ward is situ-

ated. The possible values are North East, North

West, Yorkshire and The Humber, East Midlands,

West Midlands, East of England, London, South

East and South West.

NVotes MR1 Total number of votes (‘Leave’ plus ‘Remain’)

Leave MR1 Number of ‘Leave’ votes

Postals MR1

Indicates whether postal votes were mixed in with the

data prior to counting. P indicates yes, NP indicates no.

Residents MR2 Number of permanent residents

Households MR2 Number of households

MeanAge ASW Mean age of permanent residents

AdultMeanAge MR2 Mean age of adult permanent residents

Age_0to4 ASW % of permanent residents aged 0–4

Age_5to7 ASW % of permanent residents aged 5–7

Age_8to9 ASW % of permanent residents aged 8–9

Age_10to14 ASW % of permanent residents aged 10–14

Age_15 ASW % of permanent residents aged 15

Age_16to17 ASW % of permanent residents aged 16–17

Age_18to19 ASW % of permanent residents aged 18–19

Age_20to24 ASW % of permanent residents aged 20–24

Age_25to29 ASW % of permanent residents aged 25–29

Age_30to44 ASW % of permanent residents aged 30–44

Continued on next page . . .

10

. . . continued from previous page

Variable name Source Description

Age_45to59 ASW % of permanent residents aged 45–59

Age_60to64 ASW % of permanent residents aged 60–64

Age_65to74 ASW % of permanent residents aged 65–74

Age_75to84 ASW % of permanent residents aged 75–84

Age_85to89 ASW % of permanent residents aged 85–89

Age_90plus ASW % of permanent residents aged 90 and above

White MR2 % of permanent residents self-identifying as white

Black MR2 % of permanent residents self-identifying as black

Asian MR2 % of permanent r