CE306/CE706 Information Retrieval
Indexing and Information Retrieval Models
Dr Alba García Seco de Herrera
Brief Module Outline (Reminder)
§ Motivation + introduction
§ Processing pipeline / indexing + query processing
§ Large-scale open-source search tools
§ Information Retrieval models
§ Evaluation
§ Log analysis
§ User profiles / personalisation / contextualisation
§ IR applications, e.g. enterprise search
Performing retrieval
Indexing and Query Processing
§ We looked at how a document gets turned into an index via a pre-processing pipeline
§ Next, we will see two types of indices and how they can be used to retrieve documents
§ Bitmap index vs. variable-length inverted-list index
§ In particular, we’ll focus on how they can be used to evaluate Boolean queries quickly
§ Both produce the same output
§ However, they go about it in different ways
Binary Full-text Representation (bitmap index)

         a    aardvark  abacus  abba  able  …  zoom
doc_1    1       0         0     0     0    …    1
doc_2    0       0         0     0     1    …    1
 ::      ::      ::        ::    ::    ::   …    ::
doc_m    0       0         1     1     0    …    0
§ 1 = the word appears in the document at least once
§ 0 = the word does not appear in the document
§ Does not represent word frequency, order, or location information
Binary Full-text Representation (bitmap index)
§ This type of document representation is known as a bag-of-words representation
§ Term location information is lost
§ dog bites man = man bites dog
§ Simplistic, but surprisingly effective for search
Binary Full-text Representation (bitmap index)
§ Every indexed term is associated with an inverted list
§ Inverted list: marks the docs where the term appears at least once
§ This type of inverted list is called a bit-vector
§ In a bitmap index, all inverted lists (or bit-vectors) have equal length
Processing a Boolean Query
§ Query: Jack AND Jill
docid   text
doc_1   Jack and Jill went up the hill
doc_2   To fetch a pail of water.
doc_3   Jack fell down and broke his crown,
doc_4   And Jill came tumbling after.
doc_5   Up Jack got, and home did trot,
doc_6   As fast as he could caper,
doc_7   To old Dame Dob, who patched his nob
doc_8   With vinegar and brown paper.
Processing a Boolean Query
§ Query: Jack AND Jill
docid   text                                    Jack  Jill
doc_1   Jack and Jill went up the hill           1     1
doc_2   To fetch a pail of water.                0     0
doc_3   Jack fell down and broke his crown,      1     0
doc_4   And Jill came tumbling after.            0     1
doc_5   Up Jack got, and home did trot,          1     0
doc_6   As fast as he could caper,               0     0
doc_7   To old Dame Dob, who patched his nob     0     0
doc_8   With vinegar and brown paper.            0     0
Processing a Boolean Query
§ Query: Jack AND Jill
docid   Jack  Jill  Jack AND Jill
doc_1    1     1        1
doc_2    0     0        0
doc_3    1     0        0
doc_4    0     1        0
doc_5    1     0        0
doc_6    0     0        0
doc_7    0     0        0
doc_8    0     0        0
Processing a Boolean Query
§ Query: Jack OR Jill
docid   Jack  Jill  Jack OR Jill
doc_1    1     1        1
doc_2    0     0        0
doc_3    1     0        1
doc_4    0     1        1
doc_5    1     0        1
doc_6    0     0        0
doc_7    0     0        0
doc_8    0     0        0
Processing a Boolean Query
§ Query: Jack AND (up OR down)

docid   up  down  up OR down  Jack  Jack AND (up OR down)
doc_1    1    0       1        1             1
doc_2    0    0       0        0             0
doc_3    0    1       1        1             1
doc_4    0    0       0        0             0
doc_5    1    0       1        1             1
doc_6    0    0       0        0             0
doc_7    0    0       0        0             0
doc_8    0    0       0        0             0
Processing a Boolean Query
§ Query: Jack AND NOT Jill
docid   Jack  Jill  NOT Jill  Jack AND NOT Jill
doc_1    1     1       0             0
doc_2    0     0       1             0
doc_3    1     0       1             1
doc_4    0     1       0             0
doc_5    1     0       1             1
doc_6    0     0       1             0
doc_7    0     0       1             0
doc_8    0     0       1             0
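The three operators above reduce to bitwise operations on the bit-vectors. A minimal sketch in Python (the vectors are the Jack/Jill ones from the slides; the helper names are illustrative):

```python
# Bit-vectors over doc_1 .. doc_8, taken from the tables above.
jack = [1, 0, 1, 0, 1, 0, 0, 0]
jill = [1, 0, 0, 1, 0, 0, 0, 0]

def AND(x, y):
    return [a & b for a, b in zip(x, y)]

def OR(x, y):
    return [a | b for a, b in zip(x, y)]

def NOT(x):
    return [1 - a for a in x]

print(AND(jack, jill))       # [1, 0, 0, 0, 0, 0, 0, 0] -> doc_1
print(OR(jack, jill))        # [1, 0, 1, 1, 1, 0, 0, 0] -> doc_1, doc_3, doc_4, doc_5
print(AND(jack, NOT(jill)))  # [0, 0, 1, 0, 1, 0, 0, 0] -> doc_3, doc_5
```

In a real engine the vectors would be packed into machine words, so each bitwise operation processes many documents at a time.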
Binary Full-text Representation
         a    aardvark  abacus  abba  able  …  zoom
doc_1    1       0         0     0     0    …    1
doc_2    0       0         0     0     1    …    1
 ::      ::      ::        ::    ::    ::   …    ::
doc_m    0       0         1     1     0    …    0
§ These are fixed-length inverted lists, each of size m (the number of documents in the collection)
§ Are these inverted lists efficient in terms of storage?
Statistical Properties of Text (Examples)
§ IMDB collection (movies, artist/role, plot descriptions)
§ number of documents: 230,721
§ number of unique terms (= types): 424,035
§ number of term occurrences (= tokens): 36,989,629
Statistical Properties of Text (Examples)
§ Term Statistics (remember Zipf’s Law!)
§ Most terms occur very infrequently
§ 44% of all terms occur only once
§ 77% occur 5 times or less
§ 85% occur 10 times or less
§ Only 6% occur 50 times or more
Sparse Representation of an Inverted List
§ Most terms appear in only a few documents
§ Most bit-vectors have many 0’s and only a few 1’s
Final step
§ A bitmap index is very inefficient
§ Alternative: represent only the 1’s
§ bit-vector: aardvark: 00101011…
§ docid list: aardvark: df = 18; 33, 56, 86, …, 1011
§ df = number of documents in which the term appears at least once
§ Each document has a unique identifier (docid)
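A minimal sketch of this sparse representation in Python (the toy collection and the dictionary-of-lists layout are illustrative assumptions):

```python
# Build a term -> sorted docid list index; df is just each list's length.
docs = {
    1: "jack and jill went up the hill",
    3: "jack fell down and broke his crown",
    5: "up jack got and home did trot",
}

index = {}
for docid in sorted(docs):                 # ascending docids keep lists sorted
    for term in set(docs[docid].split()):  # presence only, as in the bitmap
        index.setdefault(term, []).append(docid)

for term in sorted(index):
    postings = index[term]
    print(f"{term}: df={len(postings)}; {postings}")
# e.g. jack: df=3; [1, 3, 5]
```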
Inverted Index Full-text Representation
[Figure: one variable-length, docid-sorted inverted list per term]
§ Variable-length inverted lists
§ Each document has a unique identifier (docid)
§ Why are the inverted lists sorted by docid?
§ Why do we store the df’s in the index?
term       df
a          3421
aardvark   22
abacus     19
abba       2
able       44
…
zoom       1
Merging (Variable-Length) Inverted Lists: AND
§ Query: Jack AND and

Jack (df=3): 1, 3, 5
and  (df=5): 1, 3, 4, 5, 8

1. If the docids are equal, add the docid to the results and increment both pointers
2. If the docids are not equal, increment the pointer with the lowest docid
3. Repeat until the end of one list is reached

Jack AND and: 1, 3, 5

§ If the inverted list for “and” were longer, would it make sense to continue? Why or why not?
§ This is why the inverted lists are sorted in ascending order of docid! (A code sketch follows below.)
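A minimal sketch of the AND merge in Python (the list contents follow the Jack/“and” example; the function name is illustrative):

```python
def merge_and(xs, ys):
    """Intersect two docid-sorted inverted lists with two pointers."""
    results, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:        # step 1: equal docids -> add, advance both
            results.append(xs[i])
            i += 1
            j += 1
        elif xs[i] < ys[j]:       # step 2: advance the pointer with the lower docid
            i += 1
        else:
            j += 1
    return results                # step 3: stop once either list is exhausted

print(merge_and([1, 3, 5], [1, 3, 4, 5, 8]))   # [1, 3, 5]
```

Because both lists are sorted, each pointer only ever moves forward, so the merge costs O(df1 + df2).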
Merging (Variable-Length) Inverted Lists: OR
§ Query: Jack OR and

Jack (df=3): 1, 3, 5
and  (df=5): 1, 3, 4, 5, 8

1. If the docids are equal, add the docid to the results and increment both pointers
2. If the docids are not equal, add the lowest docid to the results and increment its pointer
3. Repeat until the end of both lists is reached

Jack OR and: 1, 3, 4, 5, 8 (count = 5)

§ Which is more expensive (on average), AND or OR? (See the sketch below.)
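A matching sketch of the OR merge (same illustrative lists):

```python
def merge_or(xs, ys):
    """Union two docid-sorted inverted lists with two pointers."""
    results, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:        # equal -> add the docid once, advance both
            results.append(xs[i])
            i += 1
            j += 1
        elif xs[i] < ys[j]:       # not equal -> add the lower docid, advance it
            results.append(xs[i])
            i += 1
        else:
            results.append(ys[j])
            j += 1
    return results + xs[i:] + ys[j:]  # unlike AND, drain whatever remains

print(merge_or([1, 3, 5], [1, 3, 4, 5, 8]))   # [1, 3, 4, 5, 8] -> count = 5
```

This hints at the cost question: OR must touch every posting of both lists, while AND can stop as soon as the shorter list is exhausted.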
Merging (Variable-Length) Inverted Lists
§ In some cases, the search engine has a choice in the order of operations
§ Query: Abraham AND Lincoln AND President
§ option 1: ( Abraham AND Lincoln ) AND President
§ option 2: Abraham AND ( Lincoln AND President )
§ option 3: ( Abraham AND President ) AND Lincoln
§ Which is probably the most effective order of operations?
Merging (Variable-Length) Inverted Lists
§ Which is probably the most effective order of operations?
abraham (df=302)   lincoln (df=45)   president (df=5)

§ Start with the two shortest lists: merging Lincoln AND President first keeps the intermediate result small (see the sketch below)
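A minimal sketch of df-ordered query processing (the postings below are made up; only the idea of processing lists in ascending df order comes from the slide):

```python
postings = {
    "abraham":   [3, 9, 14, 27, 51],   # stands in for df = 302
    "lincoln":   [3, 14, 27],          # stands in for df = 45
    "president": [14, 27],             # stands in for df = 5
}

def intersect(xs, ys):
    # two-pointer AND merge over docid-sorted lists
    out, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            out.append(xs[i]); i += 1; j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return out

result = None
for term in sorted(postings, key=lambda t: len(postings[t])):  # rarest first
    result = postings[term] if result is None else intersect(result, postings[term])
print(result)   # [14, 27]
```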
Brief Module Outline (Reminder)
§ Motivation + introduction
§ Processing pipeline / indexing + query processing
§ Large-scale open-source search tools
§ Information Retrieval models
§ Evaluation
§ Log analysis
§ User profiles / personalisation / contextualisation
§ IR applications, e.g. enterprise search
Performing retrieval
What is a Retrieval Model?
§ A formal method that predicts the degree of relevance of a document to a query
Basic IR Process
The retrieval model is responsible for performing this comparison and retrieving objects that are likely to satisfy the user.
Retrieval Models
§ Older models
§ Boolean model
§ Vector space model
§ Probabilistic models
§ BM25
§ Language models
§ Combining evidence
§ Inference networks
§ Learning to rank
Boolean Retrieval
§ X AND Y
§ X OR Y
§ X AND NOT Y
Boolean Model
§ The users describe their information need using Boolean constraints (e.g., AND, OR, and AND NOT)
§ Based on set theory and Boolean algebra
§ Retrieves the set of documents that match the Boolean query (an “exact-match” retrieval model)
§ Similar to a DBMS call
§ Returns results in no particular order (documents are either relevant or not relevant)
§ This is problematic with large collections
Boolean Retrieval Models: Advantages
§ Clean formalism
§ Easy for the system
§ Users get transparency:
§ it is easy to understand why a document was or was not retrieved
§ Users get control:
§ easy to determine whether the query is too specific (few results) or too broad (many results)
Boolean Retrieval Models: Disadvantages
§ Exact matching only
§ All index terms of equal weight
§ It is difficult to express natural-language queries as Boolean expressions
§ The burden is on the user to formulate an effective query
But: there are use cases for this approach…
Boolean Models: Systematic Review Use Case
But what about your typical search case?
Can we do Better?
§ Need to have a way of weighting different terms
§ e.g. the higher the value, the more important the term
§ Compare query and document to get a measure of similarity
§ All this leads directly to the Vector Space Model
§ But let’s look at relevance first
Relevance
§ Many factors affect whether a document satisfies a particular user’s information need
§ Topicality, novelty, freshness, authority, formatting, reading level, assumed level of prior knowledge/expertise
§ Topical relevance:
§ the document is on the same topic as the query
§ User relevance:
§ everything else!
Introduction to Best-Match Retrieval Models
§ So far, we’ve discussed ‘exact-match’ models
§ Now we start discussing ‘best-match’ models
§ Best-match models predict the degree to which a document is relevant to a query
§ Ideally, this would be expressed as RELEVANT(q,d)
§ In practice, it is expressed as SIMILAR(q,d)
§ How might you compute the similarity between q and d?
Vector Space Model
What is a Vector Space?
§ Formally, a vector space is defined by a set of linearly independent basis vectors
§ The basis vectors correspond to the dimensions or directions of the vector space
What is a Vector?
§ A vector is a point in a vector space and has length (from the origin to the point) and direction
What is a Vector?
§ A 2-dimensional vector can be written as [x,y]
§ A 3-dimensional vector can be written as [x,y,z]
What is a Vector?
§ The basis vectors are linearly independent because knowing a vector’s value along one dimension doesn’t say anything about its value along another dimension
Binary Text Representation
         a    aardvark  abacus  abba  able  …  zoom
doc_1    1       0         0     0     0    …    1
doc_2    0       0         0     0     1    …    1
 ::      ::      ::        ::    ::    ::   …    ::
doc_m    0       0         1     1     0    …    0
§ 1 = the word appears in the document
§ 0 = the word does not appear in the document
§ Does not represent word frequency, word location, or word order information
Vector Space Representation
§ Let V denote the size of the indexed vocabulary
§ V = the number of unique terms,
§ V = the number of unique terms excluding stopwords,
§ V = the number of unique stems, etc…
§ Any arbitrary span of text (i.e., a document, or a query) can be represented as a vector in V-dimensional space
§ For simplicity, let’s assume three index terms: dog, bite, man (i.e., V=3)
§ Why? Because it’s easy to visualize 3-D space
Vector Space Representation with binary weights
§ 1 = the term appears at least once
§ 0 = the term does not appear

        dog  man  bite
doc_1    1    1    1
doc_2    1    0    1
doc_3    0    1    1
Vector Space Representation with binary weights
§ What span(s) of text does this vector represent?
Vector Space Representation
§ Any span of text is a vector in V-dimensional space, where V is the size of the vocabulary
Vector Space Representation
§ A query is a vector in V-dimensional space, where V is the number of terms in the vocabulary
Vector Space Similarity
§ The vector space model ranks documents based on the vector-space similarity between the query vector and the document vector
§ There are many ways to compute the similarity between two vectors
§ One way is to compute the inner product: $\sum_{i=1}^{V} x_i \times y_i$
The Inner Product
§ Multiply corresponding components and then sum those products
$\sum_{i=1}^{V} x_i \times y_i$

term       x_i   y_i   x_i × y_i
a           1     1        1
aardvark    0     1        0
abacus      1     1        1
abba        1     0        0
able        0     1        0
::          ::    ::       ::
zoom        0     0        0

inner product => 2
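A one-line sketch of this computation in Python (the vectors are the x and y columns above; the function name is illustrative):

```python
def inner_product(x, y):
    # multiply corresponding components, then sum the products
    return sum(xi * yi for xi, yi in zip(x, y))

x = [1, 0, 1, 1, 0, 0]   # a, aardvark, abacus, abba, able, zoom
y = [1, 1, 1, 0, 1, 0]
print(inner_product(x, y))   # 2
```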
The Inner Product
§ What does the inner product (with a binary representation) correspond to?
The Inner Product
§ When using 0’s and 1’s, this is just the number of unique terms in common between the query and the document
The Inner Product
§ Multiply corresponding components and then sum those products
§ Using a binary representation, the inner product corresponds to the number of terms appearing (at least once) in both spans of text
§ Scoring documents based on their inner-product with the query has one major issue. Any ideas?
The Inner Product
§ What is more relevant to a query?
§ A 50-word document which contains 3 of the query-terms?
§ A 100-word document which contains 3 of the query-terms?
§ All things being equal, longer documents are more likely to have the query-terms
§ The inner-product doesn’t account for the fact that documents have widely varying lengths
§ So, the inner-product favours long documents
The Cosine Similarity
§ The numerator is the inner product
§ The denominator is the product of the two vector-lengths
§ Ranges from 0 to 1 (equals 1 if the vectors are identical)
§ The cosine of the angle between the two vectors
§ 0 if the angle is 90 degrees
Vector Space Model
cosine similarity example (binary weights)
$\text{cosine}(x, y) = \frac{\sum_{i=1}^{V} x_i \times y_i}{\sqrt{\sum_{i=1}^{V} x_i^2} \times \sqrt{\sum_{i=1}^{V} y_i^2}}$

cosine( [1,0,1] , [1,1,0] ) = 1 / (√2 × √2) = 0.5
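A minimal sketch of the formula in Python (the function name is illustrative):

```python
import math

def cosine(x, y):
    num = sum(xi * yi for xi, yi in zip(x, y))       # inner product
    den = (math.sqrt(sum(xi * xi for xi in x)) *
           math.sqrt(sum(yi * yi for yi in y)))      # product of vector lengths
    return num / den if den else 0.0

print(cosine([1, 0, 1], [1, 1, 0]))   # 1 / (sqrt(2) * sqrt(2)) = 0.5
```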
Homework…
$\text{cosine}(x, y) = \frac{\sum_{i=1}^{V} x_i \times y_i}{\sqrt{\sum_{i=1}^{V} x_i^2} \times \sqrt{\sum_{i=1}^{V} y_i^2}}$
§ For each document, compute the inner- product and cosine similarity score for the query: Jill
docid   text
doc_1   Jack and Jill went up the hill
doc_2   To fetch a pail of water.
doc_3   Jack fell down and broke his crown,
doc_4   And Jill came tumbling after.
doc_5   Up Jack got, and home did trot,
doc_6   As fast as he could caper,
doc_7   To old Dame Dob, who patched his nob
doc_8   With vinegar and brown paper.
More Homework…
$\text{cosine}(x, y) = \frac{\sum_{i=1}^{V} x_i \times y_i}{\sqrt{\sum_{i=1}^{V} x_i^2} \times \sqrt{\sum_{i=1}^{V} y_i^2}}$
§ For each document, compute the inner- product and cosine similarity score for the query: Jack
(use the same doc_1 … doc_8 collection as above)
Vector Space Representation
§ So far, we’ve assumed binary vectors
§ 0’s and 1’s indicate whether the term occurs (at least once) in the document/query
§ Let’s explore a more sophisticated representation
Term-Weighting
what are the most important terms?
§ Movie: Rocky (1976)
§ Plot: Rocky Balboa is a struggling boxer trying to make the big time. Working in a meat factory in
Philadelphia for a pittance, he also earns extra cash as a debt collector. When heavyweight champion Apollo Creed visits Philadelphia, his managers want to set up an exhibition match between Creed and a struggling boxer, touting the fight as a chance for a “nobody” to become a “somebody”. The match is supposed to be easily won by Creed, but someone forgot to tell Rocky, who sees this as his only shot at the big time. Rocky Balboa is a small-time boxer who lives in an apartment in Philadelphia, Pennsylvania, and his career has so far not gotten off the canvas. Rocky earns a living by collecting debts for a loan shark named Gazzo, but Gazzo doesn’t think Rocky has the viciousness it takes to beat up deadbeats. Rocky still boxes every once in a while to keep his boxing skills sharp, and his ex-trainer, Mickey, believes he could’ve made it to the top if he was willing to work for it. Rocky, goes to a pet store that sells pet supplies, and this is where he meets a young woman named Adrian, who is extremely shy, with no ability to talk to men. Rocky befriends her. Adrain later surprised Rocky with a dog from the pet shop that Rocky had befriended. Adrian’s brother Paulie, who works for a meat packing company, is thrilled that someone has become interested in Adrian, and Adrian spends Thanksgiving with Rocky. Later, they go to Rocky’s apartment, where Adrian explains that she has never been in a man’s apartment before. Rocky sets her mind at ease, and they become lovers. Current world heavyweight boxing champion Apollo Creed comes up with the idea of giving an unknown a shot at the title. Apollo checks out the Philadelphia boxing scene, and chooses Rocky. Fight promoter Jergens gets things in gear, and Rocky starts training with Mickey. After a lot of training, Rocky is ready for the match, and he wants to prove that he can go the distance with Apollo. The ‘Italian Stallion’, Rocky Balboa, is an aspiring boxer in downtown Philadelphia. His one chance to make a better life for himself is through his boxing and Adrian, a girl who works in the local pet store. Through a publicity stunt, Rocky is set up to fight Apollo Creed, the current heavyweight champion who is already set to win. But Rocky really needs to triumph, against all the
odds…
Term-Weighting: how important is a term?

rank  term      freq.    rank  term          freq.
1     a          22      16    creed          5
2     rocky      19      17    philadelphia   5
3     to         18      18    has            4
4     the        17      19    pet            4
5     is         11      20    boxing         4
6     and        10      21    up             4
7     in         10      22    an             4
8     for         7      23    boxer          4
9     his         7      24    s              3
10    he          6      25    balboa         3
11    adrian      6      26    it             3
12    with        6      27    heavyweight    3
13    who         6      28    champion       3
14    that        5      29    fight          3
15    apollo      5      30    become         3
Inverse Document Frequency (IDF) how important is a term?
$idf_t = \log\left(\frac{N}{df_t}\right)$
§ N = number of documents in the collection
§ dft = number of documents in which term t appears
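A quick sanity check of the formula against the IMDB statistics quoted earlier; the tf.idf numbers later in the lecture only come out right if log is the natural logarithm, so that is assumed here:

```python
import math

N = 230721                   # documents in the IMDB collection
print(math.log(N / 1420))    # idf("rocky")        ~ 5.09
print(math.log(N / 473))     # idf("philadelphia") ~ 6.19
```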
Inverse Document Frequency (IDF): how important is a term?

rank  term          idf      rank  term          idf
1     doesn         11.66    16    creed          6.84
2     adrain        10.96    17    paulie         6.82
3     viciousness    9.95    18    packing        6.81
4     deadbeats      9.86    19    boxes          6.75
5     touting        9.64    20    forgot         6.72
6     jergens        9.35    21    ease           6.53
7     gazzo          9.21    22    thanksgiving   6.52
8     pittance       9.05    23    earns          6.51
9     balboa         8.61    24    pennsylvania   6.50
10    heavyweight    7.18    25    promoter       6.43
11    stallion       7.17    26    befriended     6.38
12    canvas         7.10    27    exhibition     6.31
13    ve             6.96    28    collecting     6.23
14    managers       6.88    29    philadelphia   6.19
15    apollo         6.84    30    gear           6.18
TF.IDF: how important is a term?

$tf.idf_t = tf_t \times \log\left(\frac{N}{df_t}\right)$

term          tf    N        df      idf   tf.idf
rocky         19    230721   1420    5.09  96.72
philadelphia   5    230721   473     6.19  30.95
boxer          4    230721   900     5.55  22.19
fight          3    230721   8170    3.34  10.02
mickey         2    230721   2621    4.48   8.96
for            7    230721   117137  0.68   4.75
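A minimal sketch reproducing the tf.idf column (the tf and df values are copied from the table above):

```python
import math

N = 230721
stats = {                        # term: (tf, df)
    "rocky": (19, 1420), "philadelphia": (5, 473), "boxer": (4, 900),
    "fight": (3, 8170), "mickey": (2, 2621), "for": (7, 117137),
}
for term, (tf, df) in stats.items():
    print(term, round(tf * math.log(N / df), 2))
# rocky 96.72, philadelphia 30.95, boxer 22.19, fight 10.02, mickey 8.96, for 4.75
```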
TF.IDF: how important is a term?

rank  term          tf.idf   rank  term          tf.idf
1     rocky         96.72    16    meat          11.76
2     apollo        34.20    17    doesn         11.66
3     creed         34.18    18    adrain        10.96
4     philadelphia  30.95    19    fight         10.02
5     adrian        26.44    20    viciousness    9.95
6     balboa        25.83    21    deadbeats      9.86
7     boxing        22.37    22    touting        9.64
8     boxer         22.19    23    current        9.57
9     heavyweight   21.54    24    jergens        9.35
10    pet           21.17    25    s              9.29
11    gazzo         18.43    26    struggling     9.21
12    champion      15.08    27    training       9.17
13    match         13.96    28    pittance       9.05
14    earns         13.01    29    become         8.96
15    apartment     11.82    30    mickey         8.96
TF.IDF/Caricature Analogy
§ TF.IDF: accentuates terms that are frequent in the document, but not frequent in general
§ Caricature: exaggerates traits that are characteristic of the person (compared to the average)
Queries as TF.IDF Vectors
§ Terms tend to appear only once in the query
§ TF usually equals 1
§ IDF is computed using the collection statistics: $idf_t = \log\left(\frac{N}{df_t}\right)$
§ Terms appearing in fewer documents get a higher weight
Vector Space Model
cosine similarity example (tf.idf weights)
§ Rank documents based on cosine similarity to the query
Vector Space Model
cosine similarity example (tf.idf weights)
$\text{cosine}(x, y) = \frac{\sum_{i=1}^{V} x_i \times y_i}{\sqrt{\sum_{i=1}^{V} x_i^2} \times \sqrt{\sum_{i=1}^{V} y_i^2}}$

cosine( [2.3, 0.0, 1.5] , [5.5, 2.0, 0.0] )
= ((2.3 × 5.5) + (0.0 × 2.0) + (1.5 × 0.0)) / (√(2.3² + 0.0² + 1.5²) × √(5.5² + 2.0² + 0.0²)) ≈ 0.79
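Putting the pieces together, a minimal end-to-end sketch of vector space ranking (the toy collection and function names are illustrative assumptions, not part of the slides):

```python
import math
from collections import Counter

docs = {
    "doc_1": "jack and jill went up the hill",
    "doc_2": "to fetch a pail of water",
    "doc_3": "jack fell down and broke his crown",
    "doc_4": "and jill came tumbling after",
}
N = len(docs)
tfs = {d: Counter(text.split()) for d, text in docs.items()}
df = Counter(term for tf in tfs.values() for term in tf)

def tfidf(tf):
    # sparse tf.idf vector; query terms outside the vocabulary are dropped
    return {t: n * math.log(N / df[t]) for t, n in tf.items() if t in df}

def cosine(x, y):
    num = sum(x[t] * y.get(t, 0.0) for t in x)
    den = (math.sqrt(sum(v * v for v in x.values())) *
           math.sqrt(sum(v * v for v in y.values())))
    return num / den if den else 0.0

query = tfidf(Counter("jack jill".split()))
for score, d in sorted(((cosine(query, tfidf(tf)), d) for d, tf in tfs.items()),
                       reverse=True):
    print(d, round(score, 3))   # doc_1 ranks first: it contains both query terms
```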
TF.IDF
§ Many variants of this formula have been proposed
§ However, they all have two components in common:
§ TF: favours terms that are frequent in the document
§ IDF: favours terms that do not occur in many documents
$tf.idf_t = tf_t \times \log\left(\frac{N}{df_t}\right)$
Independence Assumption
§ The basis vectors (X, Y, Z) are linearly independent because knowing a vector’s value on one dimension doesn’t say anything about its value along another dimension
does this hold true for natural language text?
Independence Assumption
§ The vector space model assumes that terms are independent
§ The fact that one occurs says nothing about another one occurring
§ This is viewed as a limitation
§ However, the implications of this limitation are still debated
§ Despite it, the vector space model remains a very popular solution
Vector Space Model
§ Any text can be seen as a vector in V-dimensional space:
§ a document
§ a query
§ a sentence
§ a word
§ an entire encyclopedia
§ Rank documents based on their cosine similarity to the query
§ If a document is similar to the query, it is likely to be relevant (remember: topical relevance!)
Vector Space Representation
§ A power tool!
§ A lot of problems in IR can be cast as:
§ Find me _____ that is similar to _____ !
§ As long as _____ and _____ are associated with text, one potential solution is:
§ represent these items as tf.idf term-weight vectors and compute their cosine similarity
§ return the items with the highest similarity
Vector Space Representation
§ Find documents that are similar to this query
§ Find ads that are similar to these results
§ Find queries that are similar to this query
§ Find ads that are similar to this document
§ Find documents (with a known category assignment) that are similar to this document
Summary
§ We have seen how documents are processed, resulting in a set of index terms and, for each term, the documents that match it
§ In this process we treat documents as a `bag of words’
§ The output goes into an index table for fast search
§ Two types of index structures, both fairly simple (e.g., they only record the presence of a term)
§ Next we will see how different IR models match a query against the index
Summary
§ Any text can be seen as a vector in V-dimensional space:
§ a document
§ a query
§ a sentence
§ a word
§ an entire encyclopedia
§ Rank documents based on their cosine similarity to the query
§ If a document is similar to the query, it is likely to be relevant (remember: topical relevance!)
Reading
§ Chapter 5.3
§ Chapter 7.1
§ Next week:
§ Chapter 8
§ Fuhr, N. “Some Common Mistakes In IR Evaluation, And How They Can Be Avoided”. SIGIR Forum 51(3), December 2017.
Cake with WCSEE
International Day of Women and Girls in Science
9th February, 4pm
@wcsee_essex @womenincsee wcsee-info@essex.ac.uk
Acknowledgements
§ These slides again contain substantial material prepared by Udo Kruschwitz and Jaime Arguello
§ Additional material as provided by Croft et al. (2015)