IR H/M Course

(Recall) Bag of Words Representation

• Simple strategy for representing documents

• Count how many times each term occurs – Binary mode uses only 0 & 1

Copyright By cscodehelp代写 加微信 cscodehelp

• A ‘term’ is any lexical item that you chose such as:

– A word (delimited by ‘white space’ or punctuation)

– Some conflated ‘root form’ of each word (e.g. a stem)

– An n-gram (a sequence of any consecutive n chars)

• Doesn’t consider the ordering of words in a document

– John is quicker than Mary and Mary is quicker than John have the same

representation

– This could be a set back: positional information allows to distinguish these 2 docs

• For now: Bag of Words Model (BoW)

Vector Space Model

LET’S LOOK AT THIS PROCESS DIFFERENTLY ….

What%is%a%bag%of%word%%representation?

IR H/M Course

Document Vectors One location for each word

diet film fur galaxy heat h’wood nova role

“Nova” occurs 10 times in text A

“Galaxy” occurs 5 times in text A 10 10 “Heat” occurs 3 times in text A 9 10

(Blank means 0 occurrences.)

H 6 10 2 8 I7513

Document Vectors One location for each word

diet film fur galaxy heat h’wood nova role

F G579 H 6 10 2 8

10 5 3 5 10

“Hollywood” o1c0curs 78times7in text I “Film” occu9rs 5 tim10es in 5text I

“Diet” occurs 1 time in text I 10 10 “Fur” occurs 3 times in text I 9 10

IR H/M Course

Document ids

galaxy heat

Document Vectors

One vector for each document

A B C D E F G H I

Vector Space Model

• Documents are also treated as a bag of words or terms

• Each document is represented as a vector in a t-dimensional vector space (t is the number of index terms)

• Each term weight is computed based on some variations of TF or TF-IDF scheme

IR H/M Course

Document ids

diet film fur

galaxy heat

TF-IDF Vectors

A B C D E F G H I

Sparse’Matrix

More Formally ….

• Documents and queries are represented by

vectors of term weights

• A collection is represented by a matrix of term weights

So,$we$have$docs$represented$ as$vectors

How$can$we$use$this$in$retrieval?

t”is”the”number”of”index”terms”(words,”stems,”etc)

IR H/M Course

Retrieval Model

An IR Model defines

– a model for document representation

– a model for query representation

– a mechanism for estimating the relevance of a

document for a given query

Progress’in’retrieval’models’has’corresponded’with’ improvements’in’effectiveness

Relevance Estimation

Retrieval in Vector Space Model • Vector space model represents both query and

documents using term sets (term vectors)

• Documents and queries are represented in a high

dimensional space (Bag of Words)

– Each dimension of the space corresponds to a term in

the document collection (t-dimensional vector space) • Relevance Estimation is performed by identifying

documents similar to the query

– Relevance of di to q:”Compare the similarity of query q and document di

What%is%a%retrieval model?

What%is%its%purpose?

IR H/M Course

Geometrically: Vector Space Model

Assumption: Documents that are “close together” in vector space “talk about” the same things

NB:$3D#diagrams#useful,#but#can#be#misleading#for# high6dimensional#space

Geometrically: Vector Space Model

Assumption: Documents that are “close together” in vector space “talk about” the same things

Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”)

IR H/M Course

Vector Space • X = (t1,t2, …, tt)

– The number ti is called the i-th component of the vector

– Magnitude: is defined by the square root of the sum of the squares of the components

• that is, ∑ti2

– If ||X|| =1 then X is a unit vector • Concept of length normalization

Summary: Document Vectors

• Documents are represented as “bags of words”

• Represented as vectors when used computationally

– A vector is like an array of floating point

– Has direction and magnitude

– Each vector holds a (unique) place for every term in the collection

– Therefore, most vectors are sparse

IR H/M Course

Plotting the Vectors … & Intuition

Doc about astronomy

Doc about movie stars

Doc about mammal behavior Diet

Vector Space Intuition

– Books from a domain are organised at the same

place/ shelf/ nearby shelves

– Human organisation – librarian

What’is’the’intuition’behind The’vector4space’model?

IR H/M Course

Vector Space Model

• The relevant documents for a query are expected to be those represented by the vectors closest to the query

• Documents ranked by distance between points representing query and documents

– Similarity measure more common than a distance or dissimilarity measure

Cosine Measure

In(IR(we(consider(only(the(similarity(range(from(0(to(1 Why?(Why(not(-1(to(1?

• It measures cosine of the angle between the vectors

• Cosine ranges from 1 for vectors pointing in the same direction over zero for orthogonal vectors and -1 for vectors pointing in opposite directions

• If Cosine is applied to normalised (unit) vectors it gives the same ranking as Euclidean distance does

Cos(0’(=(1 Cos(90’(=(0 Cos(180’(=(-1 C

IR H/M Course

Similarity Calculation

– Consider two documents D1, D2 and a query Q

• D1 = (0.5, 0.8, 0.3), D2 = (0.9, 0.4, 0.2), Q = (1.5, 1.0, 0)

How$could$we$implement$a$cosine$similarity3based$ measure$using$inverted$index?

What$role$the$denominator$plays?

Algorithm (Reminder)

For each document I, Score(I) =0; I = 1 to N For each query term tk

– Search the vocabulary list

– Pull out the postings list

– For each document J in the list,

• Score(J) =Score(J) + wkj

IR H/M Course

• D1 = (T1 => 12 ,T2=> 23 , T3=>3)

• D2 = (T1 => 3 , T2 => 2 , T3 => 1)

• Q = (T1 => 0 ,T2=> 0, T3=>2)

• Sim(D1,Q) = 12*0 + 23*0 + 3*2 =6

• Sim(D2,Q) = 3*0+3*0+1*2 = 2

sim

Matching Coefficient (Coordination Level)

• Simply counts the number of dimensions on which both vectors are non-zero

• |X ! Y| ” #xi * yi

• Number of shared index terms (binary vectors)

• Does not take into account the sizes of the vectors

If#you#are#using#an#inverted#index? 3*2=6

Role#of#Document#length?

Is#this#familiar?

IR H/M Course

Some Problems …

• Normalisation …

• Consider a single word query and a single word document (In Binary mode…)

– If that matches • Coefficient is 1

• Same query against a thousand word document

– If that matches • Coefficient is 1

Dice Coefficient

• 2 |X!Y|/(|X|+|Y|)

• Normalises for length by dividing by the total number of non-zero entries.

• We multiply by 2 so that we get a measure that ranges from 0 to 1.0

Justify(the(need(for(vector length(normalization

What%is%a%dice%coefficient?

IR H/M Course

Query Term Weighting • Boolean representation

– Just have a weight of zero or 1

• Short queries

– Typical of web searches

– Multiple keyword occurrences are rare

• Wkq = idfk

• Long queries

– Result of relevance feedback (will talk about it later)

• Wkq =fkq .idfk

Advantages and Disadvantages of a Vector-space Model

Advantages

− Simple geometric interpretation of retrieval readily comprehensible to non- specialist and a uniform basis for wide range of operations

− Easy to compute measure (any similarity measure can be used)

− Easy to adapt to various weighting schemes

− Provision for ranked output

Disadvantages

− High dimensionality − Term independence

assumption

− Adhoc similarity metric: Cosine, Dice, etc. (which one to use?)

−Adhoc term weighting (not theoretically founded)

−No guidance on when to stop ranking

Discuss&three&query&term& weighting&strategies!

程序代写 CS代考 加微信: cscodehelp QQ: 2235208643 Email: kyit630461@163.com