# STAT2008/4038/6038

Assignment1_2017.docx Page 1 of 4

RESEARCH SCHOOL OF FINANCE,

ACTUARIAL STUDIES AND STATISTICS

College of Business & Economics, The Australian National University

REGRESSION MODELLING

(STAT2008/STAT4038/STAT6038)

Assignment 1 for 2017

Instructions

• This assignment is worth either 15% or 20% of your overall marks for this course (for all

students, enrolled in STAT2008, STAT4038 or STAT6038). It will be worth only 15%

rather than 20%, if you attempt the optional mid-semester Wattle quiz in week 6, which is

worth 5%, and your mark on the quiz is better than your mark on this assignment.

• If you wish, you may work together with another student in doing the analyses and present

a single (joint) report. If you choose to do this then both of you will be awarded the same

total mark. Students enrolled under different course codes may work together. You may

NOT work in groups of more than two students and the usual ANU examination rules on

plagiarism still apply with respect to people not in your group.

• Research School of Finance, Actuarial Studies and Statistics (RSFAS) assignment cover

sheets are available on Wattle. Please complete and attach a copy of the cover sheet to the

front of your report. Remember to keep a copy of your assignment for your own records.

• Assignments should be written, typed or printed on sheets of A4 paper stapled together at

the top left-hand corner (do NOT submit the assignment in plastic covers or envelopes).

Your assignment may include some carefully edited computer output (e.g. graphs) showing

the results of your data analysis and a discussion of those results. Please be selective about

what you present – only include as many pages and as much computer output as necessary

to justify your solution and be concise in your discussion of the results. Clearly label each

part of your report with the question number and the part of the question that it refers to.

• Unless otherwise advised, use a significance level of 5%.

• Marks may be deducted if these instructions are not strictly adhered to, and marks will

certainly be deducted if the total report is of an unreasonable length, i.e. more than 12 pages

including graphs. You may include, as an appendix, any R commands you used to produce

your computer output. This appendix and the cover sheet are in addition to the above page

limits; but the appendix will generally not be marked, only checked if there is some question

about what you have actually done.

• Assignments will be marked by your tutor (or one of your two tutors, for joint assignments).

One copy of your assignment should be submitted to the box labelled with the name of your

tutor, located next to the RSFAS office, by 3 pm on Friday 31 March 2017. You may ask

any of the tutors or me (Ian McDermid) questions about this assignment, in person, up to

the deadline (3 pm on Friday 31 March 2017), after which we will NOT answer any further

questions about this assignment, until after the marked assignments have been returned to

students. Answers to written questions sent to me via e-mail or posted on Wattle will be

posted on Wattle, but such questions must be received no later than 12 noon on Thursday 30 March 2017.

• Late assignments will NOT be accepted after the deadline without an extension. Extensions

will usually be granted on medical or compassionate grounds on production of appropriate

evidence, but must have my permission by no later than 12 noon on Thursday 30 March

2017. Even with an extension, all assignments must be submitted reasonably close to the

original deadline to allow time for the marking to be completed prior to week 7 (which starts

on Tuesday 18 April 2017), when the assignment solutions will be released and discussed.


Data

The data for the first question (available in the file moorhen.csv on Wattle) is presumably from

an old consulting project (not one of mine, which might explain the poor documentation). The

original consultant hopefully got the permission of the owners to use it in teaching, as it has

been used for this purpose before. I am using it here as an example of a situation I have often

found myself in as a consultant: having to work with poorly described data and having to

speculate on aspects of the interpretation and the research question. Working

as a consultant is considerably easier when you are able to directly collaborate with the clients

on these issues.

Many of the projects I have worked on as a statistician have involved data that was considered

private (such as health data) or data to which access was restricted (for example, data

designated “commercial-in-confidence”). For these reasons, it is not always easy to source

realistic data for use in teaching statistics and so groups of statisticians maintain repositories

of examples of real data that are in the “public domain”. In many countries, there are Internet

repositories of data available for use in the teaching of introductory statistics.

The data for the second question comes from such a repository: the data archive associated

with the Journal of Statistics Education (JSE), a publication of the American Statistical

Association (www.amstat.org/publications/jse/jse_data_archive.htm).

Datasets in the JSE data archive are typically accompanied by a file which gives a description

of the variables included in the data (the “meta-data”) and are also often accompanied by an

associated article in the journal (and occasionally even by references to other sources). The fat

data, which we will be using in question 2 of this year’s assignments, includes both of the

above accompanying documents.

You can download a text file containing the fat data and the associated documents from the

JSE website (www.amstat.org/publications/jse/jse_data_archive.htm) or the data is also

available on Wattle in the file fat.csv, which includes a header row with the variable names. I

have also downloaded a copy of the meta-data text file (fat.txt), and made this file available on

Wattle.

Alternatively, the fat data are also stored in the UsingR package from the recommended text

by John Verzani (Using R for Introductory Statistics, 2nd Edn, Chapman & Hall/CRC, 2014).

The UsingR package is available from CRAN (the Comprehensive R Archive Network, the

original Australian mirror site for which is located here in Canberra at the CSIRO).

You can install the UsingR package by typing the following commands in R:

install.packages("UsingR") # installs the UsingR package

library(UsingR) # attaches both the UsingR and the HistData libraries to your search path

search()

ls(pos="package:UsingR") # lists the contents of the UsingR package

ls(pos="package:HistData") # lists the contents of the HistData package

help(fat) # there are brief help files on all of the datasets in the above libraries

fat # typing the name shows the contents of the dataset

attributes(fat) # check that the variable names match the description

summary(fat) # always a sensible bit of exploratory data analysis

attach(fat) # attaches the data sets to your search path, so you can reference the variables

Further details are available in the sections titled “External packages” and “Data sets” on

pages 15-18, towards the end of “Chapter 1. Getting started” in the Verzani text.
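
If you download fat.csv from Wattle instead of installing UsingR, the header row mentioned above lets read.csv() pick up the variable names. The following is only a minimal sketch: so that it runs anywhere, it first writes a tiny made-up file to a temporary location. The file contents, the two column names chosen and the object names demo_path and demo are illustrative assumptions, not the real fat data.

```r
# Hypothetical stand-in for fat.csv: three made-up rows, two illustrative columns.
demo_path <- tempfile(fileext = ".csv")
writeLines(c("body.fat,BMI",
             "12.5,23.6",
             "20.1,26.0",
             "31.7,31.2"),
           demo_path)

# header = TRUE (the default for read.csv) takes the variable names from row 1.
demo <- read.csv(demo_path, header = TRUE)

str(demo)      # check the variable names and types
summary(demo)  # quick exploratory summary

# Variables can also be referenced through the data frame,
# as an alternative to attach().
mean(demo$BMI)
```

With the real file, replacing demo_path by the location of your downloaded fat.csv should be all that is needed.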



Question 1 (20 marks)

Moorhens are those blue-purple-red water birds often seen down near Lake Burley Griffin in

Commonwealth Park. They are characterised by large, fleshy red shields that protrude from

their heads. Some scientists have collected various measurements on a group of 43 moorhens

in Commonwealth Park in the file moorhen.csv, which is available on Wattle. The scientists

have sent the data to you for analysis.

The e-mail accompanying the data is a little light on the details, but there is a suggestion that

moorhens form a fairly hierarchical society and that shield size is a relevant indicator of a

bird’s status within their group, so the variable of most interest (the response variable) is the

area of each bird’s Shield (units not specified, but presumably in mm²). An alternative

explanation might be that a bird’s status is more strongly related to their overall size (which

could be measured by the bird’s Weight, presumably in mg) and that bigger birds simply have

larger shields.

In this assignment, we will concentrate on the relationship between Weight and Shield (we will

investigate the other available variables in Assignment 2). Read the data into R and conduct

the following analyses:

(a) Plot Shield against Weight (which means that Shield should be the response variable on

the Y or vertical axis and Weight should be the explanatory variable on the X or

horizontal axis). Use the identify() function in R to identify any unusual data points on

the plot. Discuss why you chose these observation(s) as being unusual. (2 marks)

(b) Is there a significant correlation between Weight and Shield? Use the cor.test() function to

conduct a suitable hypothesis test. Clearly specify the hypotheses you are testing and

present and interpret the results. (2 marks)

(c) Experiment with applying natural log transformations (to the base e, which is the default

for the log() function in R) and square root transformations to one or both of Weight and

Shield, and repeat the analysis in parts (a) and (b). Do NOT show all of your results, just

pick whichever one you think is the best choice of scale for the two variables and show

and discuss the results for your chosen combination. (4 marks)

(d) Fit a simple linear regression (SLR) model with your chosen transformation of Shield as

the response variable and your chosen transformation of Weight as the predictor.

Construct a plot of the residuals against the fitted values, a normal Q-Q plot of the

residuals, a bar plot of the leverages for each observation and a bar plot of Cook’s

distances for each observation. Use these plots to comment on the model assumptions

and on any unusual data points. (3 marks)

(e) Produce the ANOVA (Analysis of Variance) table for the SLR model in part (d) and

interpret the results of the F test. What is the coefficient of determination for this model

and how should you interpret this summary measure? (3 marks)

(f) What are the estimated coefficients of the SLR model in part (d) and the standard errors

associated with these coefficients? Interpret the values of these estimated coefficients

and perform t-tests to test whether or not these coefficients differ significantly from

zero. What do you conclude as a result of these t-tests? (3 marks)

(g) Repeat part (a) and again plot Shield against Weight, but this time extend both X and Y

axes to include the origin. Now include the transformed SLR model from part (d) as a

curve on your plot and also include the untransformed SLR of Shield against Weight as a

line on the plot. Use different line types for the two curves and also include an

appropriate legend on the plot. What are your overall conclusions about the relationship

between Weight and Shield, and the broader research questions discussed in the second

paragraph of this question? (3 marks)
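
The R functions needed for parts (a) to (g) can be sketched on simulated data. This is an illustration of the mechanics only, under loud assumptions: the data frame birds is made up (it is NOT the moorhen data), the numbers are arbitrary, and the log-log scale is just one candidate, since part (c) asks you to make and justify your own choice.

```r
# Synthetic stand-in data (NOT the moorhen measurements): 43 made-up birds.
set.seed(1)
n <- 43
Weight <- runif(n, 300, 600)
Shield <- exp(-4 + 1.5 * log(Weight) + rnorm(n, sd = 0.15))
birds <- data.frame(Weight, Shield)

# (a) Scatterplot with the response on the vertical axis.
plot(Shield ~ Weight, data = birds)
# identify() needs an interactive graphics device, so guard the call.
if (interactive()) identify(birds$Weight, birds$Shield)

# (b) Test H0: rho = 0 against HA: rho != 0 (Pearson correlation by default).
ct <- cor.test(birds$Weight, birds$Shield)
ct$estimate
ct$p.value

# (c)/(d) One candidate scale (log-log), plus the untransformed fit for part (g).
fit_log <- lm(log(Shield) ~ log(Weight), data = birds)
fit_raw <- lm(Shield ~ Weight, data = birds)

# (d) Four diagnostic plots on one page.
op <- par(mfrow = c(2, 2))
plot(fitted(fit_log), residuals(fit_log),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
qqnorm(residuals(fit_log))
qqline(residuals(fit_log))
barplot(hatvalues(fit_log), main = "Leverage")
barplot(cooks.distance(fit_log), main = "Cook's distance")
par(op)

# (e)/(f) ANOVA table, coefficient of determination, coefficient table.
anova(fit_log)
summary(fit_log)$r.squared
coef(summary(fit_log))

# (g) Both fits on the original scale, with axes extended to the origin.
plot(Shield ~ Weight, data = birds,
     xlim = c(0, max(birds$Weight)), ylim = c(0, max(birds$Shield)))
w <- seq(1, max(birds$Weight), length.out = 200)  # start above 0: log(0) is -Inf
lines(w, exp(predict(fit_log, newdata = data.frame(Weight = w))), lty = 1)
abline(fit_raw, lty = 2)
legend("topleft",
       legend = c("log-log SLR (back-transformed)", "untransformed SLR"),
       lty = c(1, 2))
```

Back-transforming the fitted values with exp() is what turns the straight line on the log-log scale into a curve on the original scale. For a square root transformation of the response, the corresponding step would be squaring the predictions.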


Question 2 (20 marks)

The dataset fat contains estimates of the percentage of adipose tissue (body.fat) and other

related measurements taken on a sample of 252 adult men. The measurements include a

derived variable, BMI or body mass index, which is frequently used as a measure of obesity

and is based on simple weight and height measurements.

For this assignment, we are interested in whether or not BMI, which is relatively easy to

measure, can be used to predict the percentage of body.fat, which has to be estimated using an

underwater weighing technique.

(a) Plot body.fat against BMI. Describe the correlation shown in the plot. Would you expect

a simple linear regression model to be a reasonable model for the relationship shown in

the plot? (4 marks)

(b) Fit a simple linear regression (SLR) model with body.fat as the response variable and

BMI as the predictor. Construct a plot of the residuals against the fitted values, a Q-Q

plot of the residuals and a bar plot of Cook’s Distances for each observation. Comment

on the model assumptions and on any unusual data points. (4 marks)

(c) A natural log (to the base e) transformation (to one or both of the response and predictor

variables) is often used to adjust the scale of the variables prior to fitting an SLR model.

Now fit another SLR model with body.fat as the response variable and log(BMI) as the

predictor. What would be the problem with also applying a log transformation to the

response variable? Check the same plots you produced for the earlier model in part (b).

Are the same problems still apparent? (4 marks)

(d) Produce the ANOVA table and the table of the estimated coefficients for the

transformed SLR model in part (c). Interpret the values of the estimated coefficients for

this SLR model and the results of the overall F test and the t-tests on the estimated

coefficients. (4 marks)

(e) Body mass index values less than 18.5 are typically categorised as “underweight”, from

18.5 to 25 as “normal”, from 25 to 30 as “overweight” and over 30 as “obese”. Use the

transformed SLR model from part (c) to predict the body.fat percentage for groups of

males with typical BMI values 17.25 (“moderately underweight”), 21.75 (“normal”),

27.5 (“overweight”) and 32.5 (“moderately obese”), respectively. Find 95% confidence

intervals for these predictions. Do you think this SLR model is a good model for making

these predictions? (4 marks)
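
The mechanics of parts (c) to (e) can be sketched as follows, again on simulated data. The data frame men and its coefficients are invented for illustration (it is NOT the JSE fat data), and the use of interval = "confidence" reflects one reading of part (e), namely intervals for the mean response of a group, which you should check against your own interpretation.

```r
# Synthetic stand-in data (NOT the JSE fat data): 252 made-up men.
set.seed(2)
n <- 252
BMI <- runif(n, 18, 40)
body.fat <- -80 + 31 * log(BMI) + rnorm(n, sd = 5)
men <- data.frame(BMI, body.fat)

# (c) SLR with an untransformed response and a log-transformed predictor.
fit <- lm(body.fat ~ log(BMI), data = men)

# (d) ANOVA table and table of estimated coefficients.
anova(fit)
coef(summary(fit))

# (e) Predictions at the four BMI values given in the question.
new_bmi <- data.frame(BMI = c(17.25, 21.75, 27.5, 32.5))

# interval = "confidence" gives an interval for the mean response at each BMI;
# interval = "prediction" would give an interval for a single new individual.
pred <- predict(fit, newdata = new_bmi, interval = "confidence", level = 0.95)
cbind(new_bmi, pred)

# Compare the requested BMI values with the range the model was fitted on.
# In this simulated sample BMI is at least 18, so 17.25 is an extrapolation.
range(men$BMI)
```

Because the formula contains log(BMI), predict() applies the transformation itself, so newdata should hold BMI on its original scale.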
