# CS代考 Statistical Methods – cscodehelp代写

Statistical Methods

Computer Lab 1: Data summaries, plots and functions in R

1. Thedatainthistaskrefertotheamountofclaims(in£1000)arisinginageneralinsuranceportfolio.Youcan find the data in the file claimamount^data.txt on Vision under Learning Materials — Labs.

Read data into R

Save the file on your computer and then read the data into R:

claim.amounts = scan(file=HH:/claim_amounts_data.txtHJskip=2)

Remember that you need to tell R where to look for the file, so you must specify the full path of the file (here H:/claim_amounts_data.txt). Note the following:

• If you have saved the file in a different folder on your computer you need to make sure that you know where it is (e.g. on Windows based systems, if the full path is not displayed on the toolbar, use right-click — properties on the file).

• Also, in some cases the use of forward or backward slash for specifying the path is system-specific. If the file cannot be found, try file=”H: claim_amounts_data.txt” in the command above.

If you want to keep all your files in the same folder (e.g. for a particular project), it will be convenient to run R in the same folder so that you don’t need to repeatedly specify the folder. You can change the working folder in R at File — Change dir.

Notice also the skip=2 option in the scan() command above. Look at the data file and check ?scan to see what this option does.

If you are used to working with spreadsheets, R has a convenient data editor: try data.entry(claim.amounts) or de(claim.amounts).

Numerical summaries

For summary statistics, check how the commands summary。，mean(), median。， quantile(), var(), sd() etc. work. For example, the 3rd quartile of the claim amounts data can be obtained using

quantile(claim.amountsJprob=0.75)

Now, put the data in ascending order (use sort()) and try to determine which of the two definitions of the quartiles that we have given in the lectures R uses.

Let’s now compute the coefficient of symmetry 角=”払―/. R doesn’t have a function to do this automatically, but it is straightforward to write our own code. First, let’s look at how arrays work in R. The data claim.amounts are stored as a onedimensional array (vector) in R, with the individual values (elements) being indexed from 1 to 200. For example, you can access (and see) the value of the 5th amount

1

using claim.amounts[5], or the last 10 values using claim.amounts[191:200]. Note that the indexing corresponds (by default) to the order by which we entered the data. Now, for computing 京 we can exploit R’s array operations. Try:

sum.dev3 = sum( (claim.amounts – mean(claim.amounts))”3 )

betal = (sum.dev3/length(claim.amounts))/(sd(claim.amounts))”3 cat(“This gives:”, “

n, “beta_1 beta1, “

”)

Note that R is very good for working with arrays of data – see how we computed the value of 】i(xi — x)3 in the first line of the code above. The second line then computes (xi — x)3/n}/s3. Finally, the third line prints the answer on the screen; note that “

” starts a new line for printing.

Let’s now compute the sum of the cubed deviations for 歯 with the use of a for loop:

new.sum.dev3 = 0 # initialise variable to hold the sum dev3 = rep(0,length(claim.amounts)) # initialise array of cubed deviations for (i in 1:length(claim.amounts)){ # start loop

dev3[i] = (claim.amounts[i] – mean(claim.amounts)广 3 new.sum.dev3 = new.sum.dev3 + dev3[i]

} print(new.sum.dev3)

Again, make sure you understand how this works!

# compute sum # iteratively

Check if the answer is the same as with sum.dev3 above. Using loops will most times be slower than using array operations, but it often provides clearer understanding – and in some cases it might be the easier option for coding. Notice in the code above, that it is important to make sure that the variable new.sum.dev3 is equal to zero before the loop starts (line 1 of the code), and that dev3 has to be ‘defined’ as an array outside the loop (line 2 of code).

Graphical summaries

For producing various summary plots, see the use of commands hist(), stripchart(), boxplot() etc. on the tutorial sheets. Here, as an example, we consider how to show two plots on the same graph or scale (e.g. when we want to compare two data sets). Let’s assume that the claim amounts data can be divided to two different groups (e.g. for gender of policyholder) and we want separate boxplots per group on the same graph. We will first ‘assign’ each claim to group ‘1′ or ‘2′ (here completely at random by alternating group membership):

group.ind = rep(c(1:2),100); group.ind

group.1 = claim.amounts[group.ind == 1] group.2 = claim.amounts[group.ind == 2]

# create index by

# repeating(1,2)100times

# choose amounts with index = 1 # same for index = 2

2

We are now ready to draw the boxplots:

boxplot(group.1Jgroup.2Jhorizontal=TJcol=HorangenJnames=c(nGroup 1″,HGroup 2″)) title(main=HBoxplots of claim amounts per group”,xlab=”Amount (1000s pounds)”)

Note that an alternative way to do this would be:

boxplot(claim.amounts ~ group.ind,horizontal=TJcol=”orange”J names=c(“Group 1″,”Group 2″))

but the first command is more useful when we have two separate data sets.

We can also draw histograms for the two groups side-by-side. For comparison purposes we need to ensure that the two plots are drawn on the same scale (for the x and y axes). Try the following: par(mfrow=c(1,2)) # Specifies layout for plotting window amount.breaks = seq(0.9,2.3Jby=0.1)

hist(group.1, breaks=amount.breaks, main=”Claim amounts: Group breaks=amount.breaks, right=FALSE, amounts: Group 2″, xlab=”Amount

right=FALSE, col=”gray”, 1″, xlab=”Amount

col=”gray”, ylim=c(0,25),

ylim=c(0,25), xlim=c(0.9,2.3), hist(group.2, (1000)”)

main=”Claim xlim=c(0.9,2.3), (1000)”)

Notice how we have determined the cells for the histograms (both range and whether they are right-closed), and controlled the range of the axes.

It is important to know how you can export your graphs for use in other programs (e.g. when you are preparing reports). It is straightforward to copy and paste the graph on an editor – e.g. right-click on the R graphics window and choose Copy as. However, it is often useful to save the graph for later use. To save your graph as a PDF file, make the graphics window active (by clicking on it) and then choose File — Save as. Try to save your boxplot graph and then open it on your computer (make sure you remember where you saved the file!). Later you may also want to check the postscript() command which gives you more control over the saving options – e.g. size of the graph.

2. R is a programming environment and you can use it to write functions that can perform complex mathematical and statistical operations. Functions are a very useful and portable tool, especially for computations that you use repeatedly or when you want to write a longer piece of R code that can be broken down to smaller ‘tasks’. Here we will consider a simple task – writing a function for computing the symmetry coefficient Si which was also computed before without a user defined function.

The code for defining the function is given below. As with the for(…) command in task 1 (which is a system-defined function itself), note that the body of the function (included in curly brackets) must be entered in R as a block of commands.

betal.fn <- function(x){ # name and input of function sum.dev3 = sum( (x - mean(x)广 3 ) # see Lab 1 ...
bl = (sum.dev3/length(x))/(sd(x))"3 cat("
”,"Coefficient of symmetry for these data is: ",
"beta_1 =", bl, "
") }
Note that the input to this function is the array x, which is the data to be used. Also recall the use of the cat(...) 3
command from before (use ?cat for help).
Load the claim.amounts data on your R session again (use the scan(...) command as before if required), and call the betal.fn function using the claim amounts as its argument:
betal.fn(claim.amounts)
Check that you get the same answer as in Task 1.
Now change your data, for example by using
claim.amounts.new =0.6+ claim.amounts
or claim.amounts.new = 1.3 * claim.amounts
and call the betal.fn function again to compute the new value of 角. Comment on how the symmetry coefficient has changed with the new data.
4