# Data Representation for High Dimensional Data

Lecture 6: Dimensionality Reduction


CMSC5741 Big Data Tech. & Apps.

Prof. Michael R. Lyu

Computer Science & Engineering Dept.

The Chinese University of Hong Kong

A Compression Example


Outline

Motivation

SVD

CUR

Application of SVD and CUR

PCA

Extension to robust PCA


Dimensionality Reduction Motivation

Assumption: data lie on or near a low-dimensional subspace

Axes of this subspace give an effective representation of the data

Dimensionality Reduction Motivation

Compress / reduce dimensionality:

10⁶ rows; 10³ columns; no updates

Random access to any cell(s); small error: OK

The above matrix is really “2-dimensional”: all rows can be reconstructed by scaling [1 1 1 0 0] or [0 0 0 1 1].


Rank of a Matrix

Q: What is the rank of a matrix A?

A: The number of linearly independent rows/columns of A

For example:

Matrix A = [1 2 1; -2 -3 1; 3 5 0] has rank r = 2

Why? The first two rows are linearly independent, so the rank is at least 2, but all three rows are linearly dependent (the first is equal to the sum of the second and third) so the rank must be less than 3.

Why do we care about low rank?

We can write the rows of A in terms of two “basis” vectors: [1 2 1] and [-2 -3 1]

And the new coordinates of the rows are: [1 0], [0 1], [1 -1]
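A quick numerical check of this example (a sketch using numpy; `lstsq` is just one convenient way to recover the coordinates):

```python
import numpy as np

# The example matrix: row 1 = row 2 + row 3, so the rank is 2
A = np.array([[ 1,  2, 1],
              [-2, -3, 1],
              [ 3,  5, 0]])
print(np.linalg.matrix_rank(A))          # 2

# Coordinates of each row in the basis {[1 2 1], [-2 -3 1]}
basis = np.array([[1, 2, 1], [-2, -3, 1]])
coords, *_ = np.linalg.lstsq(basis.T, A.T, rcond=None)
print(coords.T)                          # [[1 0], [0 1], [1 -1]]
```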


Rank is “Dimensionality”

Cloud of points in 3D space:

Think of the point positions as a matrix:

We can rewrite coordinates more efficiently!

Old basis vectors: [1 0 0] [0 1 0] [0 0 1]

New basis vectors: [1 2 1] [-2 -3 1]

Then A has new coordinates [1 0], B: [0 1], C: [1 -1]

Notice: We reduced the number of coordinates!

1 row per point: A = [1 2 1], B = [-2 -3 1], C = [3 5 0]

Dimensionality Reduction

The goal of dimensionality reduction is to discover the axes of the data!

Rather than representing every point with 2 coordinates, we represent each point with 1 coordinate (corresponding to the position of the point on the red line).

By doing this we incur a bit of error, as the points do not exactly lie on the line.

Why Reduce Dimensions?

Why reduce dimensions?

Discover hidden correlations/topics

Words that occur commonly together

Remove redundant and noisy features

Not all words are useful

Interpretation and visualization

Easier storage and processing of the data

SVD: Singular Value Decomposition

For an m × n matrix A, we can decompose it as A = UΣVᵀ, where

U is an m × m real or complex orthonormal matrix (i.e., UᵀU = UUᵀ = I),

Σ is an m × n rectangular diagonal matrix with non-negative real numbers on the diagonal, and

Vᵀ (the conjugate transpose of V, or simply the transpose of V if V is real) is an n × n real or complex orthonormal matrix.

When rank(A) = r:

A: input data matrix, an m × n matrix (e.g., m documents, n terms)

U: left singular vectors, an m × r matrix (documents, topics)

Σ: singular values, an r × r diagonal matrix (strength of each “topic”); r = rank of the matrix A

V: right singular vectors, an n × r matrix (terms, topics)

A = UΣVᵀ = σ₁u₁∘v₁ᵀ + σ₂u₂∘v₂ᵀ + ⋯ + σᵣuᵣ∘vᵣᵀ

σᵢ: scalar, uᵢ: (left singular) vector, vᵢ: (right singular) vector

SVD Properties

It is always possible to do SVD, i.e., decompose a matrix A into A = UΣVᵀ, where

U, Σ, V: unique

U, V: column orthonormal

UᵀU = I, VᵀV = I (I: identity matrix)

Σ: diagonal

Entries (singular values) are non-negative,

sorted in decreasing order (σ₁ ≥ σ₂ ≥ ⋯ ≥ σᵣ ≥ 0).

SVD Example

We give an example of a simple SVD decomposition; a numerical sketch follows.
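Since the slide’s matrices are not reproduced here, the following numpy sketch plays the same role: it decomposes a small ratings-style matrix (rows built from [1 1 1 0 0] and [0 0 0 1 1], echoing the “2-dimensional” matrix discussed earlier) and verifies the properties above:

```python
import numpy as np

A = np.array([[1., 1., 1., 0., 0.],
              [3., 3., 3., 0., 0.],
              [4., 4., 4., 0., 0.],
              [0., 0., 0., 4., 4.],
              [0., 0., 0., 5., 5.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(s)                                    # non-negative, sorted decreasingly
print(np.allclose(U.T @ U, np.eye(5)))      # True: U is column-orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(5)))    # True: V is column-orthonormal
print(np.allclose(U @ np.diag(s) @ Vt, A))  # True: A = U Σ Vᵀ
```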

SVD: Example – Users-to-Movies

A = UΣVᵀ – example: Users to Movies

“Topics” or “concepts”, aka latent dimensions, aka latent factors

(Figure: user-to-movie rating matrix; movies: The Avengers, Star Wars, Twilight, Matrix; rows split into SciFi and Romance user groups.)

In this example, there are two “concepts” underlying the movies: science fiction and romance.

SVD: Example – Users-to-Movies

A = UΣVᵀ – example

(Figure: the rating matrix factored as U × Σ × Vᵀ, exposing a SciFi-concept and a Romance-concept.)

The rows of M represent people and the columns of M represent movies.

SVD: Example – Users-to-Movies

A = UΣVᵀ – example

U is the “user-to-concept” similarity matrix: it connects users to concepts.

The value 0.24 in the first row and first column of U is smaller than some of the other entries in that column because, although the first person watches only science fiction, he does not rate those movies highly. The entry in the second column of the first row of U is nearly zero, because he does not rate romance movies at all.

SVD: Example – Users-to-Movies

A = UΣVᵀ – example

The matrix Σ gives the “strength” of each concept.

In this example, the strength of the science-fiction concept is 11.9, while the strength of the romance concept is 7.1. Intuitively, the science-fiction concept is stronger because the data provides more movies of that genre and more people who like them.

SVD: Example – Users-to-Movies

A = UΣVᵀ – example

V is the “movie-to-concept” similarity matrix.

Q: Does the movie “Twilight” relate to the concept “Romance”?

The matrix V relates movies to concepts. The 0.5x entries indicate that the first three movies are of the science-fiction genre, while a small absolute value in the romance column says that a movie does not partake much of the romance concept. The second row indicates that the movie Twilight is related to romance.

SVD: Interpretation #1

“movies”, “users” and “concepts”

U: user-to-concept similarity matrix

V: movie-to-concept similarity matrix

Σ: its diagonal elements give the “strength” of each concept

SVD: Interpretation #2

SVD gives ‘best’ axis to project on

‘best’ = minimal sum of squares of projection errors

In other words,

minimum reconstruction error

SVD: Interpretation #2

A = UΣVᵀ – example

U: user-to-concept matrix

V: movie-to-concept matrix

SVD: Interpretation #2

A = UΣVᵀ – example

(Figure: σ₁ captures the variance, or “spread”, of the data on the v₁ axis.)

SVD: Interpretation #2

A = UΣVᵀ – example

UΣ gives the coordinates of the points on the projection axes

(Figure: projection of the users on the “SciFi” axis.)

SVD: Interpretation #2

Q: How exactly is dimension reduction done?

A: Set the smallest singular values to zero

Approximate the original matrix by low-rank matrices; a sketch follows
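A minimal numpy sketch of this truncation, keeping the top-k singular triplets (function and variable names are mine, not the slides’):

```python
import numpy as np

def low_rank(A, k):
    """Best rank-k approximation: zero out all but the k largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.random.rand(100, 20)
B = low_rank(A, k=5)
print(np.linalg.norm(A - B, 'fro'))   # Frobenius-norm reconstruction error
```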

SVD: Best Low Rank Approximation

(Figure: A ≈ B.)

B is a rank-k matrix

B is the best rank-k approximation of the matrix A

SVD: Best Low Rank Approximation

Theorem: Let A = UΣVᵀ (rank(A) = r), and let B = USVᵀ, where S is an r × r diagonal matrix with sᵢ = σᵢ for i = 1, …, k and sᵢ = 0 for i > k.

Then B is the best rank-k approximation to A; equivalently,

B = argmin ‖A − B′‖F over all matrices B′ of rank at most k

Intuition (spectral decomposition)

Why is setting the small σᵢ to 0 the right thing to do?

The vectors uᵢ and vᵢ are of unit length, so σᵢ scales them.

Therefore, zeroing the small σᵢ introduces the least error.

SVD: Interpretation #2

Q: How many 𝜎𝑖 to keep?

A: Rule of thumb

Keep 80–90% of the “energy” (Σᵢ σᵢ²) in

A = σ₁u₁∘v₁ᵀ + σ₂u₂∘v₂ᵀ + ⋯

Assume: σ₁ ≥ σ₂ ≥ ⋯
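A small sketch of this rule of thumb. The first two values below are the concept strengths from the running example (11.9 and 7.1); the trailing values are hypothetical:

```python
import numpy as np

def choose_k(s, energy=0.90):
    """Smallest k whose top-k singular values retain `energy` of the total Σ σᵢ²."""
    frac = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(frac, energy)) + 1

s = np.array([11.9, 7.1, 1.3, 0.2])   # last two values are made up
print(choose_k(s))                    # 2: two concepts already carry >90% of the energy
```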

SVD: Complexity

SVD for a full m × n matrix costs O(mn · min(m, n))

But:

less work if we only want to compute the singular values,

or if we only want the first k singular vectors (thin SVD),

or if the matrix is sparse (sparse SVD).

Stable implementations

LINPACK, Matlab, Splus, Mathematica…

Available in most common languages

SVD: Conclusions so far

SVD: A = UΣVᵀ: unique

U: user-to-concept similarities

V: movie-to-concept similarities

Σ: strength of each concept

Dimensionality reduction

Keep the few largest singular values (80–90% of the “energy”)

SVD: picks up linear correlations

SVD: Relationship to Eigen-decomposition

SVD gives us: A = UΣVᵀ

Eigen-decomposition: A = XΛXᵀ

A is symmetric

U, V, X are orthonormal (UᵀU = I)

Λ, Σ are diagonal

SVD: Relationship to Eigen-decomposition

Eigen-decomposition of AᵀA and AAᵀ:

AᵀA = (UΣVᵀ)ᵀ(UΣVᵀ) = VΣᵀΣVᵀ = VΣ²Vᵀ, and similarly AAᵀ = UΣ²Uᵀ

That is, V is the matrix of eigenvectors of AᵀA, and

U is the matrix of eigenvectors of AAᵀ

This shows how to use eigen-decomposition to compute SVD

The singular values of A are the square roots of the corresponding eigenvalues of AᵀA

Note: AᵀA and AAᵀ are the dataset covariance matrices (up to scaling, when the data are centered)
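A quick numpy check of this relationship on random data:

```python
import numpy as np

A = np.random.rand(6, 4)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

lam, X = np.linalg.eigh(A.T @ A)               # eigh: AᵀA is symmetric
print(np.allclose(np.sort(lam)[::-1], s**2))   # eigenvalues of AᵀA = σᵢ²
```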


A Brief Review of Eigen-Decomposition

Eigenvalues and eigenvectors

Let A be an n × n matrix.

λ: eigenvalue of A, x: eigenvector of A; (λ, x): an eigenpair, satisfying Ax = λx.

Simple computational method for eigenvalues

Solve the equation det(A − λI) = 0

Example: for a concrete matrix A, expand det(A − λI) into a polynomial in λ, then solve det(A − λI) = 0 to obtain the eigenvalues.

A Brief Review of Eigen-Decomposition

Example (continued)

Solving det(A − λI) = 0, we get the eigenvalues.

Now we compute the eigenvectors.

For each eigenvalue λ, we need to find a non-zero x such that (A − λI)x = 0.

Since x needs to be a unit vector, we normalize the solution to unit length.

Similarly, we can compute the eigenvector for every other eigenvalue.
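The slide’s concrete matrix is not reproduced here, so take A = [[2, 1], [1, 2]] as a hypothetical stand-in and let numpy do the same computation:

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 2.]])

# det(A - λI) = (2 - λ)² - 1 = 0  →  λ = 1, 3
lam, X = np.linalg.eig(A)
print(lam)   # eigenvalues (order may vary): 3, 1
print(X)     # unit-norm eigenvectors in the columns
```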


Computing Eigenvalues: Power Method

Power method

Choose an arbitrary x₀

Iterate: xₖ₊₁ = Axₖ

Theorem: the sequence x₀, x₁, … converges (in direction) to the principal eigenvector (i.e., the eigenvector corresponding to the largest eigenvalue)

Normalized power method

Choose an arbitrary x₀

Iterate: xₖ₊₁ = Axₖ / ‖Axₖ‖

Theorem: the sequence converges to the principal eigenvector; a sketch follows.
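A minimal sketch of the normalized power method (the iteration count is fixed rather than tested for convergence, for brevity):

```python
import numpy as np

def power_method(A, iters=100, seed=0):
    x = np.random.default_rng(seed).random(A.shape[0])  # arbitrary x0
    for _ in range(iters):
        x = A @ x
        x /= np.linalg.norm(x)     # normalize at every step
    return x                       # ≈ principal eigenvector

A = np.array([[2., 1.], [1., 2.]])
v = power_method(A)
print(v, v @ A @ v)   # ≈ [0.707, 0.707]; Rayleigh quotient ≈ 3
```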


In-class Practice

Go to practice


SVD Case Study: How to Query?

Q: Find users that like “The Avengers”

A: Map query into a “concept space” – how?

(Figure: the rating matrix A = UΣVᵀ from the running example; movies: The Avengers, Star Wars, Twilight, Matrix; SciFi and Romance user groups.)

Case Study: How to Query?

Q: Find users that like “The Avengers”

A: Map the query into a “concept space” – how?

(Figure: the query vector q plotted in movie space (The Avengers, Star Wars, Twilight, Matrix), together with the concept axes v₁ and v₂.)

Project q into concept space: take the inner product of q with each concept vector vᵢ.

The new user is not one of the people represented by the original matrix, but he wants to use the system to find out which movies he would like. He has seen only one movie, Matrix, and rated it 5. Thus, we can represent him by the vector q = [5, 0, 0, 0], as if this were one of the rows of the original matrix.


Case Study: How to Query?

Compactly, we have:

qc = q V

where V is the movie-to-concept similarity matrix.

(Figure: q × V evaluated against the SciFi-concept and Romance-concept columns of V.)

qc is high in science-fiction interest, and shows nearly no interest in romance.

Case Study: How to Query?

How would a user d that rated (‘Star Wars’, ‘Matrix’) be handled?

dc = d V

where V is again the movie-to-concept similarity matrix.

dc is likewise high in science-fiction interest, and shows nearly no interest in romance.

Case Study: How to Query?

Observation: user d, who rated (‘Star Wars’), will be similar to user q, who rated (‘The Avengers’), although d and q have zero ratings in common!

In the original movie space, d and q have zero ratings in common, so their similarity is 0; in concept space, their cosine similarity is 0.99. A sketch of this computation follows.
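To make the mapping concrete, here is a hedged numpy sketch; the entries of V and the movie order (The Avengers, Star Wars, Twilight, Matrix) are illustrative assumptions, not the slides’ exact numbers:

```python
import numpy as np

V = np.array([[0.58, 0.00],    # The Avengers  (SciFi, Romance)
              [0.58, 0.00],    # Star Wars
              [0.00, 0.71],    # Twilight
              [0.58, 0.00]])   # Matrix

q = np.array([0, 0, 0, 5])     # q rated only 'Matrix'
d = np.array([0, 5, 0, 0])     # d rated only 'Star Wars'

qc, dc = q @ V, d @ V          # map both users into concept space
cos = qc @ dc / (np.linalg.norm(qc) * np.linalg.norm(dc))
print(qc, dc, cos)             # both point at SciFi; cosine ≈ 1 here (0.99 with the slides' V)
```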

SVD: Drawbacks

Optimal low-rank approximation

in terms of the Euclidean (Frobenius) norm

Interpretability problem:

A singular vector specifies a linear combination of all input columns or rows

Lack of sparsity:

Singular vectors are dense!

(Figure: A = U Σ Vᵀ with dense U and Vᵀ.)

CUR Decomposition

Goal: express A as a product of matrices C, U, R: A ≈ C U R

Minimize ‖A − CUR‖F

Constraints on C and R: C is a set of actual columns of A, and R is a set of actual rows of A

CUR: Good Approximation to SVD

Let Aₖ be the best rank-k approximation of A (obtained by SVD)

Theorem

The CUR algorithm, in time linear in the number of entries of A, achieves

‖A − CUR‖F ≤ ‖A − Aₖ‖F + ε‖A‖F

with probability at least 1 − δ, by picking

enough columns and

enough rows (both counts growing polynomially in k, 1/ε, and log(1/δ))

(in practice, choose O(k) columns/rows)

CUR: How it Works

Sampling columns (similarly for rows):

1. Compute the total squared mass of A: f = Σᵢⱼ A(i, j)²

2. For each column j, compute the sampling probability P(j) = Σᵢ A(i, j)² / f

3. Sample c columns independently (with replacement), picking column j with probability P(j)

4. Scale each selected column by dividing its elements by √(c · P(j))

For the scaling step: having selected each of the c columns of A, we scale each column by dividing its elements by the square root of the expected number of times this column would be picked, i.e., by √(c · P(j)). A code sketch follows.
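A sketch of the column-sampling step in numpy (sampling with replacement; the function and variable names are mine):

```python
import numpy as np

def sample_columns(A, c, seed=0):
    rng = np.random.default_rng(seed)
    p = (A**2).sum(axis=0) / (A**2).sum()    # P(j) ∝ squared column norm
    idx = rng.choice(A.shape[1], size=c, p=p)
    C = A[:, idx] / np.sqrt(c * p[idx])      # rescale by sqrt(c * P(j))
    return C, idx
```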

CUR: Computing U

Let W be the “intersection” of the sampled columns C and rows R

Let the SVD of W be W = XZYᵀ

Then: U = W⁺ = Y Z⁺ Xᵀ, where

W⁺ is the “Moore-Penrose pseudo-inverse”, and

Z⁺ takes reciprocals of the non-zero diagonal entries of Z: zᵢᵢ⁺ = 1/zᵢᵢ if zᵢᵢ ≠ 0, and 0 otherwise.
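Continuing the sketch, U can be computed from the sampled row/column indices via the pseudo-inverse of W (equivalently, `np.linalg.pinv(W)` computes the same W⁺ directly):

```python
import numpy as np

def cur_U(A, row_idx, col_idx, tol=1e-10):
    W = A[np.ix_(row_idx, col_idx)]              # intersection of C and R
    X, z, Yt = np.linalg.svd(W, full_matrices=False)
    z_plus = np.where(z > tol, 1.0 / z, 0.0)     # reciprocals of non-zero entries
    return Yt.T @ np.diag(z_plus) @ X.T          # U = W⁺ = Y Z⁺ Xᵀ
```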


CUR: Pros & Cons

+ easy interpretation

the basis vectors are actual columns and rows

– duplicate columns and rows

columns of large norms will be sampled many times


CUR: Duplicate Columns

If we want to get rid of the duplicates

Throw them away

Scale the columns/rows by the square root of the number of duplicates


SVD vs CUR

Question: which factors are large or small, dense or sparse? (For a large sparse A, SVD produces large dense U and Vᵀ with a small diagonal Σ; CUR produces large but sparse C and R with a small dense U.)

SVD & CUR: Simple Experiments

DBLP data

author-to-conference matrix

very sparse

Entry (i, j): the number of papers published by author i at conference j

428k authors (rows)

3659 conferences (columns)

Dimensionality reduction

Running time?

Space?

Reconstruction error?

Results: DBLP

Accuracy: 1 − relative sum of squared errors

Space ratio: # output non-zero matrix entries / # input non-zero matrix entries

Courtesy: Sun and Faloutsos, “Less is More: Compact Matrix Decomposition for Large Sparse Graphs”, SDM ’07

The Linearity Assumption

SVD is limited to linear projections

Data lies on a low-dimensional linear space

Non-linear methods: Isomap

Data lies on a low-dimensional manifold

Non-linear

How?

Build adjacency graph

SVD the graph adjacency matrix

Further reading: wikipage of Isomap


PCA: An Application of SVD

PCA = Principal Component Analysis

Motivation

Visualization

PCA: Data Visualization

Example:

Given 53 blood samples (features) from 65 people (data items, or instances)

How can we visualize the samples?

PCA: Data Visualization

How can we visualize the other variables???

… difficult to see in 4 or higher dimensional spaces …


PCA: Data Visualization

Is there a representation better than the coordinate axes?

Is it really necessary to show all 53 dimensions?

What if there are strong correlations between the features?

How could we find the smallest subspace of the 53-D space that keeps the most information about the original data?

A solution: Principal Component Analysis

An application of SVD.


PCA: Definition and Algorithms

PCA

Orthogonal projection of the data onto a lower-dimensional linear space that

maximizes the variance of the projected data (purple line), and

minimizes the mean squared distance between each data point and its projection (sum of the blue lines)

Look at the data from a (literally) different angle.

PCA: Idea

Given data points in a d-dimensional space, project them into a lower dimensional space while preserving as much information as possible.

Find best planar approximation to 3D data

Find the best 12-D approximation to 10⁴-D data

In particular, choose the projection that minimizes the squared error in reconstructing the original data.

Implement through SVD

PCA

PCA vectors originate from the center of mass.

Principal component #1: points in the direction of the largest variance.

Each subsequent principal component

is orthogonal to the previous ones, and

points in the directions of the largest variance of the residual subspace

PCA: 2D Gaussian dataset

1st PCA axis

2nd PCA axis

PCA: Algorithm

Given centered data x₁, …, xₙ, compute the principal vectors:

1st principal vector w₁:

maximizes the variance of the projection of x, i.e., w₁ = argmax over ‖w‖ = 1 of Σᵢ (wᵀxᵢ)²

PCA: Algorithm by SVD

SVD of the centered data matrix A (an n × d matrix): A = UΣVᵀ

n: number of instances

d: dimension

PCA: Algorithm by SVD

Columns of V

are exactly the principal vectors

orthogonal and of unit norm

Matrix Σ

diagonal

strength of each eigenvector

Columns of UΣ

coefficients for reconstructing the samples; a sketch follows
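Putting the pieces together, a minimal PCA-via-SVD sketch (rows of X are instances; names are mine, not the slides’):

```python
import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)             # center the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                 # top-k principal vectors (rows)
    coords = Xc @ components.T          # per-sample coefficients (= U[:, :k] * s[:k])
    return components, coords

X = np.random.rand(65, 53)              # e.g., 65 people x 53 features
comps, Z = pca(X, k=2)
print(Z.shape)                          # (65, 2): ready for a 2-D scatter plot
```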


Application: Face Recognition

Want to identify specific person, based on facial image

Can’t just use the given 256 x 256 pixels


Applying PCA

Method A: Build a PCA subspace for each person and check which subspace can reconstruct the test image the best

Method B: Build one PCA database for the whole dataset and then classify based on the weights.

Example data set: Images of faces

Each face is 256 × 256 = 65,536 values

Each face is converted to a 256 × 256 = 65,536-dimensional vector (not a matrix).

Principal Components (Method B)

Reconstructing … (Method B)

Happiness Subspace (Method A)

Suppose we have a set of faces labeled as happy faces (i.e., we have a label for each face). Then we can compute the principal vectors of this subset of faces; these principal vectors are shown here.

Similarly, we can compute the principal vectors of the disgust faces.

Therefore, to predict the label of a test face, we can use these two sets of principal vectors to reconstruct the test face and see which set reconstructs it better (Method A).

Disgust Subspace (Method A)

Image Compression

Divide the original 372×492 image into patches:

Each patch is an instance that contains 12×12 pixels on a grid

View each as a 144-D vector
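A sketch of this pipeline, assuming a grayscale image whose sides are multiples of 12 (the 372×492 image from the slide qualifies; function names are mine):

```python
import numpy as np

def compress(img, k, p=12):
    """Patch-wise PCA compression: keep k coefficients per 12x12 patch."""
    hp, wp = img.shape[0] // p, img.shape[1] // p
    patches = (img[:hp*p, :wp*p].reshape(hp, p, wp, p)
                                .swapaxes(1, 2).reshape(-1, p*p))
    mean = patches.mean(axis=0)
    _, _, Vt = np.linalg.svd(patches - mean, full_matrices=False)
    coeffs = (patches - mean) @ Vt[:k].T          # k numbers per patch
    recon = coeffs @ Vt[:k] + mean                # back to 144-D patches
    return recon.reshape(hp, wp, p, p).swapaxes(1, 2).reshape(hp*p, wp*p)

img = np.random.rand(372, 492)    # stand-in for the slides' image
print(compress(img, k=16).shape)  # (372, 492)
```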


PCA Compression: 144D => 60D

PCA Compression: 144D => 16D

PCA Compression: 144D => 6D

6 Most Important Eigenvectors

PCA Compression: 144D => 3D

3 Most Important Eigenvectors

Noisy Image

Denoised Image using 15 PCA Components

PCA: Shortcomings

PCA cannot capture non-linear structure

Similar to SVD in this respect

PCA: Shortcomings

PCA does not know labels

PCA: Conclusions

PCA

find orthonormal basis for data

sort dimensions in order of “strength”

discard low significance dimensions

Uses

Get compact description

Ignore noise

Improve classification (hopefully)

Not magic:

Doesn’t know class labels

Can only capture linear variations

One of many tricks to reduce dimensionality!


Extra: Compute PCA Using Eigen-Decomposition

Given centered data x₁, …, xₙ (stacked as the rows of A), compute the covariance matrix C = (1/n) AᵀA

Top PCA components = top eigenvectors of C

Equivalence between eigen-decomposition and SVD

SVD decomposition of A: A = UΣVᵀ, so AᵀA = VΣ²Vᵀ

→ SVD-based algorithm for PCA

Eigen-decomposition of C: C = XΛXᵀ

→ eigen-based algorithm for PCA

The equivalence gives X = V and Λ = Σ²/n.
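A quick numerical check of this equivalence (assuming the 1/n covariance convention used above):

```python
import numpy as np

n, d = 200, 5
A = np.random.rand(n, d)
A -= A.mean(axis=0)                          # center the data

U, s, Vt = np.linalg.svd(A, full_matrices=False)
lam, X = np.linalg.eigh(A.T @ A / n)         # C = (1/n) AᵀA

print(np.allclose(np.sort(lam)[::-1], s**2 / n))   # True: Λ = Σ²/n
```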


One-slide Takeaway

Dimensionality reduction

compress/reduce dimension

reconstruct the original matrix by two or more smaller matrices

Singular value decomposition (SVD)

decompose a matrix into A = UΣVᵀ

U, V: column-orthonormal; Σ: diagonal matrix

CUR decomposition

C: a set of columns of A; R: a set of rows of A

Principal component analysis (PCA)

reconstruct the data matrix with a smaller number of eigenvectors

view the data from a (literally) different angle


In-class Practice

1. Describe briefly (informally or formally) the relationship between singular value decomposition and eigenvalue decomposition.

2.1 Compute the eigenvalues and eigenvectors of matrix

2.2 Let It is easy to check that . What are the singular values of ?

2.3 Obtain the SVD for A, where A = UΣVᵀ

