Sections 3.1, 3.2, 3.3 in Härdle and Simar (2015).
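If $X = (X_1, \dots, X_p)^T$ is a p-dimensional random vector, we can compute the covariance between each pair $X_i$ and $X_j$ and collect all pairwise covariances in the p × p covariance matrix Σ:
$$
\Sigma =
\begin{pmatrix}
\sigma_{X_1X_1} & \cdots & \sigma_{X_1X_p} \\
\vdots & \ddots & \vdots \\
\sigma_{X_pX_1} & \cdots & \sigma_{X_pX_p}
\end{pmatrix}
=
\begin{pmatrix}
\sigma_{11} & \cdots & \sigma_{1p} \\
\vdots & \ddots & \vdots \\
\sigma_{p1} & \cdots & \sigma_{pp}
\end{pmatrix}.
$$
  – To highlight that it is the covariance of X, we can write $\Sigma_X$.
  – Σ is symmetric: $\Sigma = \Sigma^T$.
  – Σ is positive semi-definite: Σ ≥ 0.
  – In matrix notation,
$$
\Sigma = E\{(X - \mu)(X - \mu)^T\},
$$
where X and μ are written as column p-vectors.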
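• In practice Σ is mostly unknown.
• But we can estimate it from a sample $X_1, \dots, X_n$ by the sample covariance matrix
$$
S =
\begin{pmatrix}
S_{X_1X_1} & \cdots & S_{X_1X_p} \\
\vdots & \ddots & \vdots \\
S_{X_pX_1} & \cdots & S_{X_pX_p}
\end{pmatrix}
=
\begin{pmatrix}
S_{11} & \cdots & S_{1p} \\
\vdots & \ddots & \vdots \\
S_{p1} & \cdots & S_{pp}
\end{pmatrix},
$$
where, for $j, k = 1, \dots, p$,
$$
S_{X_jX_k} = S_{jk} = \frac{1}{n-1} \sum_{i=1}^{n} (X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k)
$$
is the sample covariance between $X_j$ and $X_k$.
  – Note that S is symmetric ($S = S^T$).
• In matrix notation,
$$
S = \frac{1}{n-1} (\mathcal{X} - 1_n \bar{X}^T)^T (\mathcal{X} - 1_n \bar{X}^T)
  = \frac{1}{n-1} \mathcal{X}^T \mathcal{X} - \frac{n}{n-1} \bar{X} \bar{X}^T,
$$
where $\mathcal{X}$ is the n × p data matrix, $1_n$ is the n-vector of ones and $\bar{X}$ is written as a column p-vector.
  – Note that S is also positive semi-definite.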
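The two expressions for S can be checked against each other numerically. The following is a minimal sketch in plain Python (lists of rows, no third-party packages); the function names and the toy data are illustrative assumptions, not part of the notes.

```python
def sample_covariance(data):
    """Sample covariance matrix S (divisor n - 1) from the elementwise definition."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    return [
        [
            sum((row[j] - means[j]) * (row[k] - means[k]) for row in data) / (n - 1)
            for k in range(p)
        ]
        for j in range(p)
    ]


def sample_covariance_matrix_form(data):
    """Same matrix via S = X^T X / (n - 1) - n * xbar xbar^T / (n - 1)."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    return [
        [
            (sum(row[j] * row[k] for row in data) - n * means[j] * means[k]) / (n - 1)
            for k in range(p)
        ]
        for j in range(p)
    ]


# Hypothetical toy data: n = 4 observations (rows) of p = 2 variables (columns).
data = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 4.0]]
S = sample_covariance(data)
S_alt = sample_covariance_matrix_form(data)
```

For this toy data both functions give the entries 5/3, 5/3 and 10/3 (up to rounding), and the matrix is symmetric as claimed.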
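• Problem with the covariance matrix: it is not unit invariant, i.e. if we change the units, the covariances change.
• Correlation: a measure of linear dependence which is unit invariant.
• The correlation matrix P of a random vector $X = (X_1, \dots, X_p)^T$ is the p × p matrix defined by
$$
P =
\begin{pmatrix}
1 & \rho_{12} & \cdots & \rho_{1p} \\
\rho_{21} & 1 & \cdots & \rho_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
\rho_{p1} & \rho_{p2} & \cdots & 1
\end{pmatrix},
$$
where
$$
\rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\sigma_{jj}}}
$$
is the correlation between $X_i$ and $X_j$.
  – We always have $-1 \le \rho_{ij} \le 1$.
  – $\rho_{ij}$ is a measure of the linear relationship between $X_i$ and $X_j$.
  – $|\rho_{ij}| = 1$ means a perfect linear relationship.
  – $\rho_{ij} = 0$ means absence of a linear relationship, but does not imply independence.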

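• In matrix notation,
$$
P = D^{-1/2} \Sigma D^{-1/2},
$$
where Σ is the p × p covariance matrix and
$$
D = \operatorname{diag}(\sigma_{11}, \dots, \sigma_{pp})
$$
is the p × p diagonal matrix of variances.
• In practice P is mostly unknown. We can estimate it from a sample $X_1, \dots, X_n$ by the sample correlation matrix
$$
R =
\begin{pmatrix}
R_{11} & \cdots & R_{1p} \\
\vdots & \ddots & \vdots \\
R_{p1} & \cdots & R_{pp}
\end{pmatrix},
$$
where, for $j, k = 1, \dots, p$,
$$
R_{jk} = \frac{s_{jk}}{\sqrt{s_{jj} s_{kk}}},
$$
with $s_{jk}$ the (j, k) entry of S, is the sample correlation between $X_j$ and $X_k$.
• In matrix notation we can write
$$
R = D^{-1/2} S D^{-1/2},
$$
where S is the p × p sample covariance matrix and, on this occasion, $D = \operatorname{diag}(s_{11}, \dots, s_{pp})$
is the p × p diagonal matrix of sample variances.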
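The formula $R = D^{-1/2} S D^{-1/2}$ and the unit invariance of the correlation can be illustrated with a few lines of plain Python. This is only a sketch: the function name and the two covariance matrices are made-up examples.

```python
import math


def correlation_from_covariance(S):
    """R = D^{-1/2} S D^{-1/2}, i.e. r_jk = s_jk / sqrt(s_jj * s_kk)."""
    p = len(S)
    sd = [math.sqrt(S[j][j]) for j in range(p)]
    return [[S[j][k] / (sd[j] * sd[k]) for k in range(p)] for j in range(p)]


# Hypothetical covariance matrix of two variables.
S = [[4.0, 3.0], [3.0, 9.0]]
R = correlation_from_covariance(S)

# Express the first variable in a unit 10 times smaller: S becomes A S A^T
# with A = diag(10, 1), so the covariances change ...
S_rescaled = [[400.0, 30.0], [30.0, 9.0]]
# ... but the correlation matrix does not.
R_rescaled = correlation_from_covariance(S_rescaled)
```

Both `R` and `R_rescaled` have ones on the diagonal and 0.5 off the diagonal, whereas the covariance 3 becomes 30 after the change of units.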
Let $X = (X_1, \dots, X_p)^T$ be a p-vector and let Y be the q-vector defined by
$$
Y = AX + b,
$$
where A is a q × p matrix and b is a q × 1 vector. Then we have
$$
E(Y) = A\,E(X) + b, \quad \Sigma_Y = A \Sigma_X A^T, \qquad \bar{Y} = A \bar{X} + b, \quad S_Y = A S_X A^T.
$$
  – The fact that $\Sigma_Y = A \Sigma_X A^T$ can be used to show that any covariance matrix must be positive semi-definite. (How? Make sure you know!)
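One possible answer to the question above, as a sketch: apply the transformation rule with q = 1, $A = a^T$ for an arbitrary fixed vector $a \in \mathbb{R}^p$ (the symbol a is introduced here only for illustration) and b = 0. Then $Y = a^T X$ is a scalar random variable, and a variance is never negative:

```latex
0 \le \operatorname{var}(a^T X) = \Sigma_{a^T X} = a^T \Sigma_X a
\qquad \text{for all } a \in \mathbb{R}^p .
```

This is exactly the statement $\Sigma_X \ge 0$. The same argument applied to the sample variance of $a^T X_1, \dots, a^T X_n$ shows that S is positive semi-definite.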
Sections 4.1, 4.2 in Härdle and Simar (2015).
Let $X = (X_1, \dots, X_p)^T$ be a random vector.
• For all $x = (x_1, \dots, x_p)^T \in \mathbb{R}^p$, the cumulative distribution function (cdf), or distribution function, of X is defined by
$$
F(x) = P(X \le x) = P(X_1 \le x_1, \dots, X_p \le x_p).
$$
• If X is continuous, the probability density function (pdf), or density, f of X is a nonnegative function defined through the equation
$$
F(x) = \int_{-\infty}^{x} f(u)\,du;
$$
it always satisfies
$$
\int_{-\infty}^{\infty} f(u)\,du = 1.
$$
  – The integrals are p-variate, with $u \in \mathbb{R}^p$ but $f(u) \in \mathbb{R}$:
$$
\int_{-\infty}^{x} f(u)\,du = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_p} f(u_1, \dots, u_p)\,du_1 \cdots du_p.
$$
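• The marginal cdf of a subset of X is obtained from the joint cdf of X evaluated at the values of that subset, letting the other arguments tend to infinity.
  – E.g. the marginal cdf of $X_1$ is
$$
F_{X_1}(x_1) = P(X_1 \le x_1) = P(X_1 \le x_1, X_2 \le \infty, \dots, X_p \le \infty) = F_X(x_1, \infty, \dots, \infty).
$$
  – E.g. the marginal cdf of $(X_1, X_3)$ is
$$
F_{X_1,X_3}(x_1, x_3) = P(X_1 \le x_1, X_3 \le x_3)
= P(X_1 \le x_1, X_2 \le \infty, X_3 \le x_3, X_4 \le \infty, \dots, X_p \le \infty)
= F_X(x_1, \infty, x_3, \infty, \dots, \infty).
$$
• For a continuous random vector X, the marginal density of a subset of X is obtained from the joint density f of X by integrating out the other components.
  – E.g. the marginal density of $X_1$ is
$$
f_{X_1}(x_1) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, u_2, \dots, u_p)\,du_2 \cdots du_p.
$$
  – E.g. the marginal density of $(X_1, X_3)$ is
$$
f_{X_1,X_3}(x_1, x_3) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, u_2, x_3, u_4, \dots, u_p)\,du_2\,du_4 \cdots du_p.
$$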
• For two continuous random variables $X_1$ and $X_2$, the conditional pdf of $X_2$ given $X_1 = x_1$ is
$$
f_{X_2|X_1}(x_2|x_1) = \frac{f(x_1, x_2)}{f_{X_1}(x_1)},
$$
defined only for values $x_1$ such that $f_{X_1}(x_1) > 0$.
• Two continuous random variables $X_1$ and $X_2$ are independent if and only if
$$
f(x_1, x_2) = f_{X_1}(x_1) f_{X_2}(x_2).
$$
  – If $X_1$ and $X_2$ are independent, then
$$
f_{X_2|X_1}(x_2|x_1) = \frac{f(x_1, x_2)}{f_{X_1}(x_1)} = \frac{f_{X_1}(x_1) f_{X_2}(x_2)}{f_{X_1}(x_1)} = f_{X_2}(x_2).
$$
(Knowing the value of $X_1$ does not change the probability assessments on $X_2$, and vice versa.)
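• The mean $\mu \in \mathbb{R}^p$ of $X = (X_1, \dots, X_p)^T$ is defined by
$$
\mu =
\begin{pmatrix} \mu_1 \\ \vdots \\ \mu_p \end{pmatrix}
=
\begin{pmatrix} E(X_1) \\ \vdots \\ E(X_p) \end{pmatrix}
=
\begin{pmatrix} \int x f_{X_1}(x)\,dx \\ \vdots \\ \int x f_{X_p}(x)\,dx \end{pmatrix}.
$$
  – If X and Y are two random p-vectors and α and β are constants, then
$$
E(\alpha X + \beta Y) = \alpha E(X) + \beta E(Y).
$$
  – If X is a p × 1 vector which is independent of the q × 1 vector Y, then
$$
E(XY^T) = E(X) E(Y^T).
$$
  – Hint: remember to always check that matrix dimensions are compatible.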
The conditional expectation $E(X_2|X_1 = x_1)$ is defined by
$$
E(X_2|X_1 = x_1) = \int x_2 f_{X_2|X_1}(x_2|x_1)\,dx_2,
$$
and the conditional (co)variance $\operatorname{var}(X_2|X_1 = x_1)$ is defined by
$$
\operatorname{var}(X_2|X_1 = x_1) = E(X_2 X_2^T|X_1 = x_1) - E(X_2|X_1 = x_1) E(X_2^T|X_1 = x_1),
$$
if $X_2$ is a column vector.
• As seen earlier, the covariance Σ of a vector X with mean μ is defined by
$$
\Sigma = E\{(X - \mu)(X - \mu)^T\}.
$$
We write
$$
X \sim (\mu, \Sigma)
$$
to denote a vector X with mean μ and covariance Σ.
• We can also define a covariance matrix between a p × 1 vector X with mean μ and a q × 1 vector Y with mean ν by
$$
\Sigma_{X,Y} = \operatorname{cov}(X, Y) = E\{(X - \mu)(Y - \nu)^T\} = E(XY^T) - E(X) E(Y^T).
$$
The elements of this matrix are the pairwise covariances between the components of X and those of Y.

  – We have
$$
\operatorname{cov}(X + Y, Z) = \operatorname{cov}(X, Z) + \operatorname{cov}(Y, Z).
$$
  – We have
$$
\operatorname{var}(X + Y) = \operatorname{var}(X) + \operatorname{cov}(X, Y) + \operatorname{cov}(Y, X) + \operatorname{var}(Y).
$$
  – For matrices A and B and random vectors X and Y such that the quantities below are well defined, we have
$$
\operatorname{cov}(AX, BY) = A \operatorname{cov}(X, Y) B^T.
$$
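The last identity follows from the definition of the covariance between two vectors and the linearity of the expectation. A short derivation, writing μ = E(X) and ν = E(Y) and assuming the dimensions of A and B match those of X and Y:

```latex
\begin{aligned}
\operatorname{cov}(AX, BY)
  &= E\{(AX - A\mu)(BY - B\nu)^T\} \\
  &= E\{A (X - \mu)(Y - \nu)^T B^T\} \\
  &= A \, E\{(X - \mu)(Y - \nu)^T\} \, B^T
   = A \operatorname{cov}(X, Y) B^T .
\end{aligned}
```

Taking Y = X and B = A gives the rule $\operatorname{var}(AX) = A \Sigma_X A^T$.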
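Sections 4.4, 4.5, 5.1 in Härdle and Simar (2015).
  – Arguably the most common distribution we encounter.
  – Recall that in the univariate case, the density of a $N(\mu, \sigma^2)$ is given by
$$
f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\{-(x - \mu)^2 / (2\sigma^2)\}.
$$
  – In the multivariate case, we need to deal with vectors and matrices.
  – Recall that
$$
\Sigma =
\begin{pmatrix}
\sigma_{11} & \cdots & \sigma_{1p} \\
\vdots & \ddots & \vdots \\
\sigma_{p1} & \cdots & \sigma_{pp}
\end{pmatrix}
=
\begin{pmatrix}
\sigma_1^2 & \cdots & \sigma_{1p} \\
\vdots & \ddots & \vdots \\
\sigma_{p1} & \cdots & \sigma_p^2
\end{pmatrix},
$$
where $\sigma_j^2 = \operatorname{var}(X_j)$.
  – The density of the multinormal (or simply normal) distribution with mean μ and covariance Σ is given by
$$
f(x) = |2\pi\Sigma|^{-1/2} \exp\left\{-\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right\}. \tag{1}
$$
  – If the p-vector X is normal with mean μ and covariance Σ, we write $X \sim N_p(\mu, \Sigma)$.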
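If the $X_i$'s are independent, then
$$
\Sigma =
\begin{pmatrix}
\sigma_1^2 & \cdots & 0 \\
\vdots & \ddots & \vdots \\
0 & \cdots & \sigma_p^2
\end{pmatrix}
= \operatorname{diag}(\sigma_1^2, \dots, \sigma_p^2).
$$
Thus
$$
|2\pi\Sigma| = |\operatorname{diag}(2\pi\sigma_1^2, \dots, 2\pi\sigma_p^2)| = (2\pi)^p \sigma_1^2 \cdots \sigma_p^2
$$
and
$$
\Sigma^{-1} = \operatorname{diag}(\sigma_1^{-2}, \dots, \sigma_p^{-2}),
$$
so that
$$
\begin{aligned}
f(x) &= \frac{1}{\sqrt{(2\pi)^p \prod_{j=1}^p \sigma_j^2}} \exp\left\{-\frac{1}{2} \sum_{j=1}^p (x_j - \mu_j)^2 / \sigma_j^2\right\} \\
&= \frac{1}{\sqrt{(2\pi)^p}\, \prod_{j=1}^p \sigma_j} \prod_{j=1}^p \exp\left\{-\frac{1}{2} (x_j - \mu_j)^2 / \sigma_j^2\right\} \\
&= \prod_{j=1}^p \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\{-(x_j - \mu_j)^2 / (2\sigma_j^2)\}
\end{aligned}
$$
is the product of the densities of p univariate $N(\mu_j, \sigma_j^2)$.
  – When we defined the multivariate normal distribution above, we implicitly assumed that Σ is non-singular, i.e. Σ > 0.
  – Strictly speaking, we can also define the normal distribution when Σ is only positive semi-definite, i.e. Σ ≥ 0. However, in this case, we end up with a degenerate normal distribution for which a density function cannot be defined on $\mathbb{R}^p$. (All probability mass lies in a lower-dimensional subspace of $\mathbb{R}^p$.)
  – In fact, even a constant $c \in \mathbb{R}$ is trivially a degenerate normal random variable with variance 0!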
We can see from (1) that the density of a $N_p(\mu, \Sigma)$ is constant when
$$
(x - \mu)^T \Sigma^{-1} (x - \mu)
$$
is constant. Now for a positive constant c,
$$
(x - \mu)^T \Sigma^{-1} (x - \mu) = c
$$
corresponds to an ellipsoid. The quantity
$$
\sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}
$$
is called the Mahalanobis distance between x and μ.
For example in p = 2 dimensions:
[Figure: Fig. 4.3 in Härdle and Simar (Sect. 4.4, The Multinormal Distribution). Scatterplot of a normal sample (panel "Normal sample") and contour ellipses (panel "Contour Ellipses"); horizontal axis X1.]
According to Theorem 2.7 in Sect. 2.6 of Härdle and Simar, the half-lengths of the axes of the contour ellipsoid $(x - \mu)^T \Sigma^{-1} (x - \mu) = d^2$ are $d\sqrt{\lambda_i}$, where the $\lambda_i$ are the eigenvalues of Σ. If Σ is a diagonal matrix, the rectangle circumscribing the contour ellipse has sides of length $2d\sigma_i$.
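For p = 2 the Mahalanobis distance and the axes of a contour ellipse can be computed with closed-form 2 × 2 formulas. The following is a minimal sketch in plain Python; the function names and the numerical values of μ and Σ are illustrative assumptions.

```python
import math


def mahalanobis_2d(x, mu, sigma):
    """Mahalanobis distance sqrt((x - mu)^T Sigma^{-1} (x - mu)) for p = 2."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    if a <= 0 or det <= 0:
        raise ValueError("Sigma must be positive definite")
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Inverse of a 2 x 2 matrix: (1 / det) * [[d, -b], [-c, a]].
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)


def eigenvalues_sym_2d(sigma):
    """Eigenvalues of a symmetric 2 x 2 matrix, largest first."""
    (a, b), (_, d) = sigma
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b * b)
    return mean + radius, mean - radius


# Illustrative values (assumptions, chosen only for this example).
mu = (3.0, 2.0)
sigma = ((1.0, -1.5), (-1.5, 4.0))

dist = mahalanobis_2d((4.0, 2.0), mu, sigma)
lam1, lam2 = eigenvalues_sym_2d(sigma)
# Half-lengths of the axes of the contour ellipse at Mahalanobis distance 1.
half_lengths = (math.sqrt(lam1), math.sqrt(lam2))
# With Sigma equal to the identity the distance is the Euclidean one.
euclid = mahalanobis_2d((3.0, 4.0), (0.0, 0.0), ((1.0, 0.0), (0.0, 1.0)))
```

The product of the two eigenvalues equals the determinant of Σ and their sum equals its trace, which gives a quick sanity check on the ellipse axes.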

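  – Let $X \sim N_p(\mu, \Sigma)$, A a q × p matrix and b a q × 1 vector. Then
$$
Y = AX + b \sim N_q(A\mu + b, A \Sigma A^T).
$$
  – Let $X = (X_1^T, X_2^T)^T \sim N_p(\mu, \Sigma)$ where
$$
\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix},
\qquad \operatorname{var}(X_1) = \Sigma_{11}, \quad \operatorname{var}(X_2) = \Sigma_{22}.
$$
Then
$$
\Sigma_{12} = 0 \iff X_1 \text{ and } X_2 \text{ are independent}.
$$
  – If $X \sim N_p(\mu, \Sigma)$ and A and B are matrices with p columns, then
$$
AX \text{ and } BX \text{ are independent} \iff A \Sigma B^T = 0. \tag{2}
$$
  – If $X_1, \dots, X_n$ are i.i.d. $\sim N_p(\mu, \Sigma)$, then
$$
\bar{X} \sim N_p(\mu, \Sigma / n). \tag{3}
$$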
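The parameters of Y = AX + b are just matrix products, so they are easy to compute explicitly. Below is a small sketch in plain Python; the helper names and the numerical values of μ, Σ, A and b are assumptions made for illustration.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    """Transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def affine_normal_params(A, b, mu, sigma):
    """Mean vector and covariance matrix of Y = A X + b when X ~ N_p(mu, sigma)."""
    mean = [sum(A[i][k] * mu[k] for k in range(len(mu))) + b[i] for i in range(len(A))]
    cov = matmul(matmul(A, sigma), transpose(A))
    return mean, cov


# Illustrative values (assumptions): Y = (X1 + X2, X1 - X2), so p = q = 2.
mu = [3.0, 2.0]
sigma = [[1.0, -1.5], [-1.5, 4.0]]
A = [[1.0, 1.0], [1.0, -1.0]]
b = [0.0, 0.0]
mean_y, cov_y = affine_normal_params(A, b, mu, sigma)
```

Here the off-diagonal entry of `cov_y` is var(X1) − var(X2) = −3, which is non-zero, so by (2) the sum and the difference of the two components are not independent for this Σ.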
  – If $Z_1, \dots, Z_n$ are independent N(0, 1), then
$$
X = \sum_{k=1}^{n} Z_k^2 \sim \chi^2_n
$$
is a chi-square with n degrees of freedom. (That is how chi-square distributions are defined.)
  – If $X \sim N_p(\mu, \Sigma)$ and Σ is invertible, then
$$
Y = (X - \mu)^T \Sigma^{-1} (X - \mu) \sim \chi^2_p. \tag{4}
$$
    – To show this: first write $\Sigma = \Sigma^{1/2} \Sigma^{1/2}$ using the spectral decomposition (How? Make sure you know!), so that $\Sigma^{-1} = \Sigma^{-1/2} \Sigma^{-1/2}$.
    – But then $\Sigma^{-1/2}(X - \mu) \equiv Z \sim N_p(0, I_p)$.
    – So $Y = Z^T Z \sim \chi^2_p$ by the definition of the chi-square distribution.
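A sketch of the first step, for readers who want to check their answer. The symbols Γ (orthogonal matrix of eigenvectors) and Λ (diagonal matrix of eigenvalues) are notation introduced here; the argument assumes Σ is symmetric and positive definite, so that all eigenvalues are strictly positive.

```latex
\begin{aligned}
\Sigma &= \Gamma \Lambda \Gamma^T, \qquad
  \Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_p), \quad
  \lambda_i > 0, \quad \Gamma^T \Gamma = I_p, \\
\Sigma^{1/2} &:= \Gamma \Lambda^{1/2} \Gamma^T, \qquad
  \Sigma^{-1/2} := \Gamma \Lambda^{-1/2} \Gamma^T, \\
\Sigma^{1/2} \Sigma^{1/2}
  &= \Gamma \Lambda^{1/2} \Gamma^T \Gamma \Lambda^{1/2} \Gamma^T
   = \Gamma \Lambda \Gamma^T = \Sigma, \\
Z = \Sigma^{-1/2} (X - \mu)
  &\sim N_p\!\left(0, \; \Sigma^{-1/2} \Sigma \, \Sigma^{-1/2}\right) = N_p(0, I_p).
\end{aligned}
```

The last line uses the rule for linear transformations of a normal vector with $A = \Sigma^{-1/2}$ and $b = -\Sigma^{-1/2}\mu$, together with the symmetry of $\Sigma^{-1/2}$.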
• The Wishart distribution is a generalisation of the chi-square distribution to multiple dimensions.
It depends on 3 parameters: the dimension p, a p × p scale matrix Σ and the number of degrees of freedom n:
$$
W_p(\Sigma, n).
$$
• If $\mathcal{M}$ is a p × n matrix whose columns are independent and all have a $N_p(0, \Sigma)$ distribution, then
$$
M = \mathcal{M} \mathcal{M}^T \sim W_p(\Sigma, n).
$$
• Note that in the definition above, Σ does not have to be strictly positive definite, nor is there any restriction on n and p.
• However, when M is Wishart-distributed, it must be non-negative definite.
(By definition, M can be represented as $\mathcal{M} \mathcal{M}^T$ for some $\mathcal{M}$ with independent normal columns. Hence $x^T M x = x^T \mathcal{M} \mathcal{M}^T x = \|\mathcal{M}^T x\|^2 \ge 0$ for every p-vector x.)
• Further, if M is non-singular (i.e. positive definite) with probability 1, it is said to have a non-singular Wishart distribution.
• The following result can be found in Proposition 8.2 of the lecture notes in CANVAS:
Suppose M is Wishart-distributed with parameters Σ, p, n. Then M has a non-singular Wishart distribution if and only if n ≥ p and Σ > 0, in which case M has the density
$$
f_{\Sigma,n}(M) = \frac{|M|^{(n-p-1)/2} \exp\left\{-\tfrac{1}{2} \operatorname{tr}(M \Sigma^{-1})\right\}}{2^{pn/2}\, \pi^{p(p-1)/4}\, |\Sigma|^{n/2} \prod_{i=1}^{p} \Gamma((n + 1 - i)/2)}.
$$
• When σ is a scalar, a $W_1(\sigma^2, n)$ is the same as $\sigma^2$ times a $\chi^2_n$.
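• If a p × p random matrix $Y \sim W_p(\Sigma, n)$ and B is a q × p matrix, then
$$
B Y B^T \sim W_q(B \Sigma B^T, n).
$$
• If a p × p random matrix $Y \sim W_p(\Sigma, n)$ and a is a p × 1 vector such that $a^T \Sigma a \ne 0$, then
$$
\frac{a^T Y a}{a^T \Sigma a} \sim \chi^2_n.
$$
(How to show it? Make sure you know!)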
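One way to see this, sketched here as a check rather than as the intended proof, combines the two preceding properties: apply the transformation rule with the 1 × p matrix $B = a^T$, then use the scalar case of the Wishart distribution.

```latex
\begin{aligned}
a^T Y a &\sim W_1(a^T \Sigma a, \, n)
  && \text{(rule for } B Y B^T \text{ with } B = a^T,\ q = 1\text{)} \\
        &= (a^T \Sigma a) \, \chi^2_n
  && \text{(scalar case with } \sigma^2 = a^T \Sigma a > 0\text{)} \\
\Longrightarrow \quad \frac{a^T Y a}{a^T \Sigma a} &\sim \chi^2_n .
\end{aligned}
```

Since Σ is positive semi-definite, the condition $a^T \Sigma a \ne 0$ is the same as $a^T \Sigma a > 0$, which is what makes the division legitimate.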
• Recall the unbiased sample covariance matrix
$$
S = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})^T.
$$
It can be proved that, for a normal sample,
$$
(n - 1) S \sim W_p(\Sigma, n - 1).
$$
• The above is essentially saying that if $X_1, \dots, X_n$ are i.i.d. $N_p(\mu, \Sigma)$ random vectors with sample mean $\bar{X}$, then
$$
\sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})^T
$$
is distributed as $\sum_{i=1}^{n-1} Z_i Z_i^T$, where the $Z_i$'s are i.i.d. $N_p(0, \Sigma)$ random vectors.
• See Theorem 3.3.2 in An Introduction to Multivariate Statistical Analysis by Anderson for a proof.
• The Hotelling distribution is a generalisation to multiple dimensions of the Student $t_n$ distribution with n degrees of freedom.
• In the univariate case a variable $X \sim t_n$ if it can be written as
$$
X = Y \sqrt{n / Z},
$$
where Y and Z are independent random variables, $Y \sim N(0, 1)$ and $Z \sim \chi^2_n$.
• Definition of Hotelling's $T^2(p, n)$ distribution: if $X \sim N_p(0, I_p)$ is independent of $M \sim W_p(I_p, n)$, then
$$
n X^T M^{-1} X \sim T^2_{p,n}.
$$
• Theorem (p. 193 of Härdle and Simar): if $X \sim N_p(\mu, \Sigma)$ is independent of $M \sim W_p(\Sigma, n)$ with M being non-singular, then
$$
n (X - \mu)^T M^{-1} (X - \mu) \sim T^2_{p,n}.
$$
Proof sketch: consider $Y \sim N_p(0, I_p)$; then $\Sigma^{1/2} Y \sim N_p(0, \Sigma)$. Note that $\Sigma^{1/2}$ can be obtained by spectral decomposition.
• E.g. if $X_1, \dots, X_n$ are i.i.d. $\sim N_p(\mu, \Sigma)$, then the sample mean vector $\bar{X}$ and the sample covariance matrix S are such that
$$
n (\bar{X} - \mu)^T S^{-1} (\bar{X} - \mu) \sim T^2_{p,n-1}.
$$
This follows from the theorem applied to $\sqrt{n}(\bar{X} - \mu) \sim N_p(0, \Sigma)$ and $(n - 1) S \sim W_p(\Sigma, n - 1)$, because S is independent of $\bar{X}$ by Cochran's theorem (Theorem 5.7 in Härdle and Simar).
• Hotelling's $T^2$ and the F-distribution are related by
$$
T^2_{p,n} = \frac{np}{n - p + 1} F_{p,n-p+1}.
$$
(Recall that the square of a univariate t-distribution with n degrees of freedom is the same as an $F_{1,n}$ distribution; this is the case p = 1.)
• Hotelling's $T^2$ test is typically used for the following hypothesis testing problem (see Section 7.1 of Härdle and Simar):
• Suppose $X_1, \dots, X_n$ is an i.i.d. random sample from the $N_p(\mu, \Sigma)$ population with Σ unknown. Test
$$
H_0 : \mu = \mu_0 \quad \text{versus} \quad H_1 : \text{no constraints}.
$$
(This is the multivariate version of the usual univariate testing problem tackled by the t statistic.)
• When $H_0$ is true, $n (\bar{X} - \mu_0)^T S^{-1} (\bar{X} - \mu_0) \sim T^2_{p,n-1}$. Naturally, we can use $n (\bar{X} - \mu_0)^T S^{-1} (\bar{X} - \mu_0)$ as the test statistic, and calibrate the cutoff threshold based on the $T^2_{p,n-1}$ (or $F_{p,n-p}$) distribution.
• It turns out that the same test can also be derived as the likelihood ratio test for this testing problem (again, see Section 7.1 of Härdle and Simar).
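To make the test statistic concrete, here is a minimal sketch in plain Python for the bivariate case p = 2, where the inverse of S has a closed form. The function name and the toy sample are assumptions made for illustration. The scaling to an F statistic uses the relation between $T^2$ and F with n replaced by n − 1, i.e. $T^2_{p,n-1} = \frac{(n-1)p}{n-p} F_{p,n-p}$; turning F into a p-value needs the F cdf from a statistics library and is left out here.

```python
def hotelling_t2_2d(sample, mu0):
    """Return (T2, F) for H0: mu = mu0 from a sample of 2-vectors.

    T2 = n (xbar - mu0)^T S^{-1} (xbar - mu0) and, with p = 2,
    F = (n - p) / (p (n - 1)) * T2, which is F_{p, n-p} distributed under H0.
    """
    n, p = len(sample), 2
    xbar = [sum(x[j] for x in sample) / n for j in range(p)]
    # Unbiased sample covariance matrix S (divisor n - 1).
    s = [
        [
            sum((x[j] - xbar[j]) * (x[k] - xbar[k]) for x in sample) / (n - 1)
            for k in range(p)
        ]
        for j in range(p)
    ]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    if det <= 0:
        raise ValueError("S must be positive definite (need n > p, non-degenerate data)")
    d0, d1 = xbar[0] - mu0[0], xbar[1] - mu0[1]
    # Quadratic form d^T S^{-1} d using the closed-form 2 x 2 inverse.
    quad = (s[1][1] * d0 * d0 - 2 * s[0][1] * d0 * d1 + s[0][0] * d1 * d1) / det
    t2 = n * quad
    f_stat = (n - p) / (p * (n - 1)) * t2
    return t2, f_stat


# Hypothetical toy sample of n = 4 bivariate observations.
sample = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 4.0]]
t2_at_mean, _ = hotelling_t2_2d(sample, [2.5, 3.0])  # mu0 equal to the sample mean
t2, f_stat = hotelling_t2_2d(sample, [2.0, 2.0])
```

When μ0 coincides with the sample mean the statistic is zero, and it grows as μ0 moves away from the sample mean in Mahalanobis distance with respect to S.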
