程序代写代做代考 DNA flex Excel algorithm BIOL5274M – Tutorial problem sheet

BIOL5274M – Tutorial problem sheet
These problems represent 25% of the final mark for the module. For full marks, please indicate how you have worked out each answer. All the problems in section 1 can be done using tools in standard statistical packages such as R or Excel. Section 2 requires R and Bioconductor. Some of the questions have been covered in lecture and practical materials, but not all, so you will have to do some private study and reading in order to answer all the questions.
Section 1: Basic univariate statistics
1. A variable x is normally distributed with mean 10.4 and standard deviation 2.6. What is the probability of measuring a value of x more than three standard deviations from the mean? What is the 90% confidence interval for x? (2 marks).
2. During investigation of a type of cancer, an experimental worker measures the expression level of a single gene in 5 cancer tissue samples and for comparison in 5 samples of normal tissue. The following are the results where expression levels are expressed in arbitrary units, and it is assumed that the 10 measurements come from 10 different patients.
Normal 20.2 20.5 20.6 19.4 19.7
Cancer 20.1 20.3 20.3 19.2 19.5
Use an appropriate t-test to see if there is a significant difference in the mean expression of this gene between cancer and normal tissue. (3 marks).
3. Suppose now that you are given an extra piece of information. You are told that the measurements in question 4 above are structured so that measurements that lie directly above and below each other in the table are actually made and from normal and cancer tissues taken from the same patient, i.e. that the study involves only 5 patients with two samples taken from each. Does this change the t test you would apply, and does the p value change? Explain. (5 marks).
4. Another experimentalist repeats the experiments of question 5 on 20 further genes and obtains the following p values for the null hypothesis that expression level of the gene is no different between normal and cancer tissues
0.7,0.5,0.05,0.6,0.5,0.8,0.2,0.3,0.9,0.1,0.5,0.8,0.2,0.3,0.9,0.1,0.5,0.2,0.3,0.9
He uses these to assert that a single gene (the third in the list above) shows a significant difference in expression at the 5% level between the two tissue types. What is the flaw in this argument? (Hint: Bonferroni). Explain the Bonferroni correction for multiple testing. The Benjamini-Hochberg correction is sometimes used as an alternative to the Bonferroni: explain what this is and why it might be preferred. Discuss the significance of these considerations for genome scale (high-throughput) investigations (8 marks).
5. t tests are parametric. What does this mean, and what are the non-parametric alternatives? What is the advantage of a t test over a non-parametric alternative for appropriate data? (4 marks).
6. An agrochemicals company is investigating four closely related compounds which are equally effective as herbicides. Toxicity is a concern, and toxic response is thought to correlate with the induction of a particular gene. Below is a table showing gene expression measurement (arbitrary units) obtained from cell cultures treated with each molecule C1-C4 (5 experimental repetitions were done for each molecule). What is the appropriate test to see if the different compounds have a significant effect on gene expression, and does this test indicate a significant effect in this case? Why is it not appropriate to carry out t-tests on all possible pair wise comparisons of molecules? (4 marks)
C1
C2
C3
C4

4.5
4.5
3.5
9.6

4.3
4.6
3.5
8.7

4.2
4.1
3.1
8.8

4.4
4.2
3.7
9.1

4.4
4.8
3.2
9.2

7. For compound 4 above further investigations are carried out to determine the effect of dose on the induction of expression. The results are below (each dose was measured in triplicate).
Dose (mg/ml)
Rep1
Rep2
Rep3

0
4.4
4.3
4.4

0.05
7.1
7.3
7.4

0.1
10.1
10.4
9.6

How would you asses the trend for induction of expression to relate to the given dose? Is there are significant relationship between induction of expression and dose? (4 marks)
8. Epigenetic modifications of histone proteins change the structure of chromatin and mark genome regions in different functional states. For example tri-methylation of lysine 27 in histone H3 (H3K27me3) is associated with repression of transcription. ChIP-seq experiments can monitor such modifications over a complete genome using an antibody to the modified histone. They do this by fragmenting the chromatin and selecting the fragments that are bound to the antibody. The DNA in these fragments is first sequenced and then mapped back to where it matches in the genome to give a read out of the genome regions where the epigenetic modification is present. As in most high-throughput experiments the data are not perfect, and for a variety of reasons, including antibodies that are not perfectly selective, you can expect to find a low level of reads in the data set from all genomic regions irrespective of whether the modification is present.
Researchers are investigating a rare modification which is known to occur in relatively few genomic regions of typical size around 10kBases. They therefore choose to analyse their data set by dividing the genome into non-overlapping regions of 10kB and counting the number of sequences mapping to each window. They suggest that windows with high counts of mapped sequences are those where the modification is present, and those with low counts are those where it is absent.
(a) Explain why the Poisson distribution would be suitable as a statistical model for this data. The Poisson distribution has one parameter, what does it represent?
(b) Devise a suitable null hypothesis, using the Poisson distribution, to assess whether a high count of sequences in a particular window is statistically significant. Assume that the probability of finding sequence reads in genome regions without the modification is the same over the entire genome.
(c) Given that the total number of sequence reads mapping to the entire genome is Nr=10 million, and that the genome as whole is divided into 100 000 10kB windows, what is the p value for your null hypothesis for a single window with 160 mapped reads? Is this observation significant (remember to correct for multiple testing)?
(d) Suggest some biological reasons why the assumptions at (b) above might not be valid. How might you modify your method to account for this?

Section 2: Multivariate statistics
This section is a little different. The first part is a tutorial in multivariate statistics using the R software with simple data and also with genome scale transcriptomic data. The objective here is not necessarily to understand the details of the R code, because this is quite complex. (However, if you are interested in developing your skills in R and Bioconductor with a view to becoming a bioinformatician, then understanding this would be a good place to start.) The objective here is to study the results of applying multivariate methods like clustering and principal components analysis (PCA) to some real data sets. The questions afterwards then test your understanding of the methods and interpretation of the results.
First work through the tutorial using the R software and then answer the questions that follow.
2.1 Tutorial
1. Start up R and read in the data from the file “cluster.txt”. To do this, save the file in your own filespace and then use