STAT3006/STAT7305 Assignment 1—Probability and Optimization Theory

Due Date: 26th August 2022 Weighting: 20%

Instructions

• The assignment consists of four (4) problems, each problem is worth 25 marks, and each mark is equally weighted.


• The mathematical elements of the assignment can be completed by hand, in LaTeX (preferably), or in Word (or other typesetting software). The mathematical derivations and manipulations should be accompanied by clear explanations in English regarding necessary information required to interpret the mathematical exposition.

• Computation problems should be answered using programs in the R language.

• Computer generated plots and hand drawn graphs should be included together with the text

where problems are answered.

• Submission files should include the following (whichever applies to you):

– Scans of handwritten mathematical exposition.

– Typeset mathematical exposition, outputted as a pdf file.

– Typeset answers to computational problems, outputted as a pdf file.

– Program code/scripts that you wish to submit, outputted as a txt file.

• Mathematical problems should be answered with reference to results presented in the Main Text (refer to page numbers), Remarks, Exercises, and Corollaries/Lemmas/Propositions/Theorems from the Lecture Notes, if required. If a mathematical result is used that is not presented in the Lecture Notes, then its common name (e.g., “Bayes’ Theorem”, “Intermediate Value Theorem”, “Borel–Cantelli Lemma”, etc.) should be cited, or else a reference to a text containing the result should be provided (preferably a textbook).

• All submission files should be labeled with your name and student number and archived together in a zip file and submitted at the TurnItIn link on Blackboard. We suggest naming using the convention:

[LastName_FirstName/StudentNumber]_STAT3006A1_[AnythingElse].[FileExtension].

• As per my.uq.edu.au/information-and-services/manage-my-program/student-integrity-and-conduct/academic-integrity-and-student-conduct, what you submit should be your own work. Even where working from sources, you should endeavour to write in your own words. You should use consistent notation throughout your assignment and define whatever is required.

Problem 1 [25 Marks]

Let (Ω, F, P) be a probability space, and let (Xi (ω))i∈[n] be a sequence of independent and identically distributed (IID) random variables from (Ω, F) to (R, B(R)), each with the same image measure PX as the random variable X (ω).

(a) Assuming that X has expectation μX = E (X) and finite variance σX² = E (X − μX)² < ∞, prove that the sample mean,

X̄n (ω) = (1/n) Σ_{i=1}^{n} Xi (ω),

is a consistent estimator of μX in the sense that X̄n converges in probability to μX, as n → ∞. Such a result is usually referred to as a weak law of large numbers.

[10 Marks]

Recall that the cumulative distribution function (CDF) of X is defined as:

F (x) = P (ω : X (ω) ≤ x).

Using the sequence (Xi)i∈[n], estimate the CDF F (x) using the empirical CDF:

F̄n (x) = (1/n) Σ_{i=1}^{n} χ_(−∞,x) (Xi).

(b) For each fixed x ∈ R, show that

Iα,n (x) = { y ∈ R : F̄n (x) − 1/(2√(nα)) ≤ y ≤ F̄n (x) + 1/(2√(nα)) }

is a (1 − α) × 100% confidence interval for F (x) in the sense that

P (ω : F (x) ∈ Iα,n (x)) ≥ 1 − α.

[10 Marks]
Kolmogorov’s strong law of large numbers improves upon the result from (a), stating that if the expectation of X is finite (i.e., |μX| < ∞), then X̄n converges almost surely to μX, as n → ∞.
(c) Assuming that PX ≪ Leb, use the strong law of large numbers to prove the Glivenko–Cantelli Theorem; i.e., show that

sup_{x∈R} |F̄n (x) − F (x)|

converges to 0, almost surely, as n → ∞. [5 Marks]
This result, along with the laws of large numbers, is generally considered to be the fundamental law(s) of statistics in the sense that they provide a mechanism for probability distributions to be realized in reality; i.e., repeated observation of IID random variables and averaging the outcomes yields insight regarding the properties of the underlying probability measure P.
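The shrinking gap between the empirical and true CDFs can be seen numerically. The following R sketch (an illustration, not part of the assignment text) takes X to be standard normal, so that F = pnorm, and approximates the supremum over a finite grid; the grid is an arbitrary stand-in for the supremum over all of R.

```r
# Approximate sup_x |F_n(x) - F(x)| for standard normal data; the gap
# should shrink as the sample size n grows (Glivenko-Cantelli).
set.seed(1)
sup_gap <- function(n) {
  x <- rnorm(n)
  Fn <- ecdf(x)                  # empirical CDF, a right-continuous step function
  grid <- seq(-4, 4, by = 0.01)  # dense grid standing in for the sup over R
  max(abs(Fn(grid) - pnorm(grid)))
}
gaps <- sapply(c(100, 10000), sup_gap)
gaps[2] < gaps[1]                # typically TRUE: larger n, smaller gap
```

Note that evaluating on a grid slightly understates the true supremum (which is attained at the jump points of F̄n), but suffices to illustrate the convergence.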
Problem 2 [25 Marks]

Let (Ω, F, P) be a probability space, and let (Xi (ω))i∈[n] be a sequence of IID random variables from (Ω, F) to (R, B(R)), each with the same image measure PX as the random variable X (ω).
Suppose that PX ≪ Leb, and has Radon–Nikodym derivative (probability density function; PDF) pθ0 : R → R≥0, where θ0 ∈ R, the so-called data generating parameter corresponding to PX, is unknown to us. That is, we only know that PX has a PDF of the form pθ, for some value of θ ∈ R, but not which particular value of θ (i.e., θ0).
We wish to test the simple hypotheses that either the null hypothesis

H0 : θ0 = θ1∗

is true, or the alternative hypothesis

H1 : θ0 = θ2∗

is true, for a pair of predetermined values θ1∗, θ2∗ ∈ R, where θ1∗ ≠ θ2∗. To construct a test between the two hypotheses, we construct the so-called likelihood ratio statistic:

Ln (ω) = Π_{i=1}^{n} pθ2∗ (Xi (ω)) / Π_{i=1}^{n} pθ1∗ (Xi (ω)).
(a) Under the null hypothesis (i.e., we assume that θ0 = θ1∗ is true, and subsequently pθ0 = pθ1∗ ), show that
E [Ln (ω)] = 1.
You may use the fact that if (Yi (ω))i∈[n] is a sequence of independent random variables, then (fi ◦ Yi)i∈[n] is also a sequence of independent random variables, where fi : R → R is a function, for each i ∈ [n].
We say that a random variable E : Ω → R≥0 is an e-value if

E (E) ≤ 1,

and we say that a random variable P (ω) is a p-value if for any α ∈ [0, 1]:

P (ω : P (ω) ≤ α) ≤ α.

(b) Prove that P = min {1, 1/E} is a p-value.
We say that Rn, a {0, 1}-valued function of (Xi)i∈[n], is a rejection rule (or test) for the hypotheses H0 and H1, where we say that we reject H0 if Rn = 1 and we say that we accept H0 if Rn = 0. We further say that Rn has size (or significance level) α ∈ (0, 1) if

P (ω : Rn = 1) ≤ α,
under the assumption that H0 is true.
(c) Using the conclusions from (a) and (b), for any α ∈ (0, 1), construct a rejection rule
with size α for the test of H0: θ0 = θ1∗ versus H1: θ0 = θ2∗. [5 Marks]
In the R programming language, we can generate a sequence (Yi)i∈[n] of IID random variables, each with the same image probability measure PY, and PDF

pθ (y) = φθ (y) = (1/√(2π)) exp(−(1/2)(y − θ)²),
using the command rnorm (n, θ). We can also compute φθ (y) using the command dnorm (y, θ).
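As a quick illustration of these two commands (assuming, as in the density above, the unit-variance default sd = 1):

```r
# Draw from, and evaluate, the N(theta, 1) density; theta = 2 is arbitrary.
set.seed(1)
theta <- 2
y <- rnorm(5, theta)   # five IID draws with mean theta
dnorm(y, theta)        # phi_theta evaluated at each draw
# The density is maximized at its mean, where it equals 1/sqrt(2*pi):
abs(dnorm(theta, theta) - 1 / sqrt(2 * pi)) < 1e-12
```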
(d) Assuming that the image measure PX of X has a PDF of the form pθ0 (x) = φθ0 (x), use the commands above to program an R script that implements the rejection rule from (c) to test the hypotheses H0 : θ0 = 0 versus H1 : θ0 = θ2∗, and assess how often the test rejects H0 as θ2∗ increases. We say that a test is powerful if H0 is rejected with high probability when H1 is true. Via computational experiments or otherwise, comment on the power of the test as θ2∗ increases away from 0.
[10 Marks]
Problem 3 [25 Marks]

(a) For non-negative numbers u, v ∈ R≥0 and positive coefficients p, q ∈ R>0, such that 1/p + 1/q = 1, show that

uv ≤ u^p/p + v^q/q.

You may use the fact that x → log (x) is concave on R>0. [5 Marks]

Let X (ω) and Y (ω) be functions from the measure space (Ω, F, M) to (R, B(R)).

(b) Use the result from (a) to prove Hölder’s inequality. That is, show that if

∫Ω |X|^p dM < ∞ and ∫Ω |Y|^q dM < ∞,

for 1/p + 1/q = 1, then

∫Ω |XY| dM ≤ (∫Ω |X|^p dM)^{1/p} (∫Ω |Y|^q dM)^{1/q}.
[10 Marks]
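As an informal numerical sanity check of Hölder’s inequality in the counting-measure (vector) case, the following R sketch compares the two sides directly; the exponent p = 3 and the simulated data are arbitrary choices.

```r
# Check sum(|x*y|) <= (sum(|x|^p))^(1/p) * (sum(|y|^q))^(1/q) numerically.
set.seed(1)
p <- 3
q <- p / (p - 1)                     # conjugate exponent: 1/p + 1/q = 1
x <- rnorm(50)
y <- rnorm(50)
lhs <- sum(abs(x * y))
rhs <- sum(abs(x)^p)^(1/p) * sum(abs(y)^q)^(1/q)
lhs <= rhs                           # Hoelder guarantees TRUE
```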
(c) Use Hölder’s inequality (or otherwise) to prove Minkowski’s inequality. That is, show that if, for p ≥ 1,

∫Ω |X|^p dM < ∞ and ∫Ω |Y|^p dM < ∞,

then

(∫Ω |X + Y|^p dM)^{1/p} ≤ (∫Ω |X|^p dM)^{1/p} + (∫Ω |Y|^p dM)^{1/p}.
Let (Xi)i∈[n] be a sequence of functions from (Ω, F, M) to (R, B(R)).

(d) Consider a generalization of Hölder’s inequality of the form:

∫Ω Π_{i=1}^{n} |Xi| dM ≤ Π_{i=1}^{n} (∫Ω |Xi|^{qi} dM)^{1/qi}, (1)

where (qi)i∈[n] is a sequence of constants, such that qi ∈ R for each i ∈ [n]. Propose conditions on the values that (qi)i∈[n] can take so that (1) is true, and argue (formally or informally) why you believe that your suggested conditions are correct.
Problem 4 [25 Marks]

Let f : T → R be a function on the domain T ⊂ R^d, where T is said to be convex. We say that f is μ-strongly convex on T (with respect to the Euclidean norm ∥·∥) if there exists a μ > 0 such that for any θ, τ ∈ T and λ ∈ [0, 1], we have

f (λθ + (1 − λ) τ) ≤ λf (θ) + (1 − λ) f (τ) − (μ/2) λ (1 − λ) ∥θ − τ∥². (2)

This more structured notion of convexity allows for proofs of convergence and rates for many modern optimization algorithms, and problems relating to strongly convex functions are mathematically “easier” to solve than others.

(a) Characterization (2) is one of many characterizations of μ-strong convexity. Another useful characterization is that

g (θ) = f (θ) − (μ/2) ∥θ∥² (3)

is convex on T. That is, f is so convex that even when we remove a quadratic function from f (i.e., g) we still have a convex function. Prove that these two characterizations ((2) and (3)) of μ-strong convexity are equivalent.

(b) Using either (2) or (3), prove that if f is μ-strongly convex on T then it is also strictly convex on T, and propose a counterexample of a strictly convex function that is not μ-strongly convex for any μ > 0.

We now assume that f is differentiable on an open set T̄ ⊃ T, and thus is differentiable on T. Then, we can further characterize μ-strong convexity on T by the inequality

f (τ) ≥ f (θ) + (τ − θ)⊤ ∇f (θ) + (μ/2) ∥τ − θ∥². (4)

(c) Using (4), show that if f is μ-strongly convex on T and θ∗ ∈ T is a critical point of f in the sense that

∇f (θ∗) = 0,

then for all θ ∈ T we have

f (θ) ≥ f (θ∗) + (μ/2) ∥θ − θ∗∥²

and

f (θ∗) ≥ f (θ) − (1/(2μ)) ∥∇f (θ)∥².

[10 Marks]

For a differentiable function f, we say that it is β-smooth on T if

∥∇f (θ) − ∇f (τ )∥ ≤ β ∥θ − τ ∥ ,

for every θ, τ ∈ T.

(d) A typical algorithm for computing

θ∗ = argmin_{θ∈T} f (θ)

is the gradient descent method, where we generate a sequence (θt)t∈N such that

θt+1 = θt − η∇f (θt),

for some positive constant η > 0, starting from some initial value θ1 ∈ T. When f is convex and β-smooth on T, it is provable that after T ∈ N steps of the gradient descent algorithm, we have the inequality:

f (θT) − f (θ∗) ≤ 2β ∥θ1 − θ∗∥² / (T − 1), (5)

by taking η = 1/β. On the other hand, if f is also μ-strongly convex on T, then after T + 1 steps of gradient descent, we have

f (θT) − f (θ∗) ≤ exp(−4 (T − 1) / (β/μ)) (β/2) ∥θ1 − θ∗∥², (6)

when taking η = 2/(μ + β). Discuss the meanings of inequalities (5) and (6), and

argue whether or not gradient descent performs better when f is strongly convex. [5 Marks]
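The gradient descent recursion in (d) can be sketched on a toy strongly convex quadratic; the function, curvature constants, and starting point below are arbitrary illustrative choices, with the step size η = 2/(μ + β) used for (6).

```r
# Gradient descent on f(theta) = theta1^2 + 10*theta2^2, which is
# mu-strongly convex and beta-smooth with mu = 2 and beta = 20.
f      <- function(th) th[1]^2 + 10 * th[2]^2
grad_f <- function(th) c(2 * th[1], 20 * th[2])
mu <- 2; beta <- 20
eta <- 2 / (mu + beta)          # step size used in inequality (6)
th <- c(3, -2)                  # arbitrary starting value theta_1
for (t in 1:200) {
  th <- th - eta * grad_f(th)   # theta_{t+1} = theta_t - eta * grad f(theta_t)
}
f(th)                           # essentially 0, the minimum of f
```

Tracking f(th) across iterations (rather than only at the end) is one way to compare the observed decay against the rates suggested by (5) and (6).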
