# 程序代写代做代考 DNA flex Excel algorithm BIOL5274M – Tutorial problem sheet

BIOL5274M – Tutorial problem sheet

These problems represent 25% of the final mark for the module. For full marks, please indicate how you have worked out each answer. All the problems in section 1 can be done using tools in standard statistical packages such as R or Excel. Section 2 requires R and Bioconductor. Some of the questions have been covered in lecture and practical materials, but not all, so you will have to do some private study and reading in order to answer all the questions.

Section 1: Basic univariate statistics

1. A variable x is normally distributed with mean 10.4 and standard deviation 2.6. What is the probability of measuring a value of x more than three standard deviations from the mean? What is the 90% confidence interval for x? (2 marks).

2. During investigation of a type of cancer, an experimental worker measures the expression level of a single gene in 5 cancer tissue samples and for comparison in 5 samples of normal tissue. The following are the results where expression levels are expressed in arbitrary units, and it is assumed that the 10 measurements come from 10 different patients.

Normal 20.2 20.5 20.6 19.4 19.7

Cancer 20.1 20.3 20.3 19.2 19.5

Use an appropriate t-test to see if there is a significant difference in the mean expression of this gene between cancer and normal tissue. (3 marks).

3. Suppose now that you are given an extra piece of information. You are told that the measurements in question 4 above are structured so that measurements that lie directly above and below each other in the table are actually made and from normal and cancer tissues taken from the same patient, i.e. that the study involves only 5 patients with two samples taken from each. Does this change the t test you would apply, and does the p value change? Explain. (5 marks).

4. Another experimentalist repeats the experiments of question 5 on 20 further genes and obtains the following p values for the null hypothesis that expression level of the gene is no different between normal and cancer tissues

0.7,0.5,0.05,0.6,0.5,0.8,0.2,0.3,0.9,0.1,0.5,0.8,0.2,0.3,0.9,0.1,0.5,0.2,0.3,0.9

He uses these to assert that a single gene (the third in the list above) shows a significant difference in expression at the 5% level between the two tissue types. What is the flaw in this argument? (Hint: Bonferroni). Explain the Bonferroni correction for multiple testing. The Benjamini-Hochberg correction is sometimes used as an alternative to the Bonferroni: explain what this is and why it might be preferred. Discuss the significance of these considerations for genome scale (high-throughput) investigations (8 marks).

5. t tests are parametric. What does this mean, and what are the non-parametric alternatives? What is the advantage of a t test over a non-parametric alternative for appropriate data? (4 marks).

6. An agrochemicals company is investigating four closely related compounds which are equally effective as herbicides. Toxicity is a concern, and toxic response is thought to correlate with the induction of a particular gene. Below is a table showing gene expression measurement (arbitrary units) obtained from cell cultures treated with each molecule C1-C4 (5 experimental repetitions were done for each molecule). What is the appropriate test to see if the different compounds have a significant effect on gene expression, and does this test indicate a significant effect in this case? Why is it not appropriate to carry out t-tests on all possible pair wise comparisons of molecules? (4 marks)

C1

C2

C3

C4

4.5

4.5

3.5

9.6

4.3

4.6

3.5

8.7

4.2

4.1

3.1

8.8

4.4

4.2

3.7

9.1

4.4

4.8

3.2

9.2

7. For compound 4 above further investigations are carried out to determine the effect of dose on the induction of expression. The results are below (each dose was measured in triplicate).

Dose (mg/ml)

Rep1

Rep2

Rep3

0

4.4

4.3

4.4

0.05

7.1

7.3

7.4

0.1

10.1

10.4

9.6

How would you asses the trend for induction of expression to relate to the given dose? Is there are significant relationship between induction of expression and dose? (4 marks)

8. Epigenetic modifications of histone proteins change the structure of chromatin and mark genome regions in different functional states. For example tri-methylation of lysine 27 in histone H3 (H3K27me3) is associated with repression of transcription. ChIP-seq experiments can monitor such modifications over a complete genome using an antibody to the modified histone. They do this by fragmenting the chromatin and selecting the fragments that are bound to the antibody. The DNA in these fragments is first sequenced and then mapped back to where it matches in the genome to give a read out of the genome regions where the epigenetic modification is present. As in most high-throughput experiments the data are not perfect, and for a variety of reasons, including antibodies that are not perfectly selective, you can expect to find a low level of reads in the data set from all genomic regions irrespective of whether the modification is present.

Researchers are investigating a rare modification which is known to occur in relatively few genomic regions of typical size around 10kBases. They therefore choose to analyse their data set by dividing the genome into non-overlapping regions of 10kB and counting the number of sequences mapping to each window. They suggest that windows with high counts of mapped sequences are those where the modification is present, and those with low counts are those where it is absent.

(a) Explain why the Poisson distribution would be suitable as a statistical model for this data. The Poisson distribution has one parameter, what does it represent?

(b) Devise a suitable null hypothesis, using the Poisson distribution, to assess whether a high count of sequences in a particular window is statistically significant. Assume that the probability of finding sequence reads in genome regions without the modification is the same over the entire genome.

(c) Given that the total number of sequence reads mapping to the entire genome is Nr=10 million, and that the genome as whole is divided into 100 000 10kB windows, what is the p value for your null hypothesis for a single window with 160 mapped reads? Is this observation significant (remember to correct for multiple testing)?

(d) Suggest some biological reasons why the assumptions at (b) above might not be valid. How might you modify your method to account for this?

Section 2: Multivariate statistics

This section is a little different. The first part is a tutorial in multivariate statistics using the R software with simple data and also with genome scale transcriptomic data. The objective here is not necessarily to understand the details of the R code, because this is quite complex. (However, if you are interested in developing your skills in R and Bioconductor with a view to becoming a bioinformatician, then understanding this would be a good place to start.) The objective here is to study the results of applying multivariate methods like clustering and principal components analysis (PCA) to some real data sets. The questions afterwards then test your understanding of the methods and interpretation of the results.

First work through the tutorial using the R software and then answer the questions that follow.

2.1 Tutorial

1. Start up R and read in the data from the file “cluster.txt”. To do this, save the file in your own filespace and then use

data <- read.table(file="{path}/cluster.txt", sep=" ")
Now take a look at the data, by using “plot(data)”. You should see that the data is two dimensional and that there are 3 clear clusters in the data, of different shapes. First we will investigate k means clustering. Below are a sequence of instructions for doing k means clustering, and displaying the results
km <- kmeans(data,3,20)
plot(data)
km
plot(data,col=km$cluster)
Find out how k means clustering works. Note that you have to specify the number of clusters. Investigate what the algorithm does when you specify numbers of clusters between 2 and 10.
Hierarchical clustering is done by the set of commands below.
dm <- dist(data, method="euclidean")
hc <- hclust(dm, method="single")
plot(hc)
This time you have to calculate the distance matrix first (using the ‘dist’ command) as input to the clustering method. This matrix holds the Euclidean (geometric) distance between all pairs of points in the data. Do you think that the results make sense?
2. Download the data “pcdata.txt” and read it into R.
pcd <- read.table(file="{path}/pcdata.txt")
Make a scatter plot of the data. You should see that it is bivariate and that the two variables x1 and x2 are highly correlated. Here is the R command to do principal components analysis (PCA).
pc <- prcomp(pcd)
Now type “pc” to look at the result. You should find that there are 2 principal components (the data is bivariate). “summary(pc)” gives more information.
3. Download the set of ‘cel’ files from an Affymetrix experiment shoots1.cel,…,shoots7.cel in cel_files.zip. These are gene expression measurements from plants in a time course following the application of a stress (drought). There are 7 time points in the labelled order. Using the methods of practical 2, read in these data and use rma to get an eset object containing gene expression summary data. (Hint: ab <-ReadAffy(…), eset <- rma(ab)).
We are going to carry out a clustering analysis on these genes, gathering genes together into clusters in which all members will show similar patterns of expression. We will do this with hierarchical clustering and display the results as a heat map. First we will limit our analyses to the most interesting genes, this is because of about 22000 genes probed on the array, most have expression that is not affected at all by the stress and clustering these is uninteresting.
The R code below filters out genes whose coefficient of variation (standard deviation/mean) is less than 0.2, producing a set of just 147 genes showing the largest variation in expression in response to stress.
# R function to identify genes with coeff of variation > 0.2

myfil <- function(x) {
sd(x)/mean(x) > 0.2

}

ev <- exprs(eset)
dim(ev)
library(genefilter)
sub <- genefilter(ev,myfil)
sum(sub)
evs <- ev[sub,]
dim(evs)
We are now in a position to do some clustering. This can be done with the heatmap function in R. The command below creates the default heat map and clustering dendogram (clustering the genes – rows) by hierarchical clustering with ‘Euclidian’ distance measure. Note that we suppress clustering of the columns because it makes no sense – these are time points and we want them in time order.
heatmap(evs,Colv=NA)
Read “help(heatmap)” and see just how flexible this function is.
The default for ‘heatmap’ is to cluster using Euclidean distances, which is fine in many applications. However when clustering genes there are some advantages to using a different distance measure which is insensitive to the to the average expression level of the genes concerned. This can be achieved by basing the distance measure on the Pearson correlation of the expression patterns. The R code below first defines a new distance function (as 1-correlation), and then clusters according to this distance instead.
mydist <- function(x) {
xt <- t(x)
cm <- cor(xt)
dm <- as.dist(1-cm)
}
heatmap(evs,Colv=NA,distfun=mydist)
Try to interpret these results, guided by the dendogram, in terms of gene expression patterns of firstly two main clusters, and then 3-4 sub-clusters. Note that when interpreting the heatmaps, the colours can be interpreted as follows: red