# BM25 to reduce computational overhead

The following documents have been processed by an IR system where stemming is not applied:

DocID Text
Doc1 france is world champion 1998 france won
Doc2 croatia and france played each other in the semifinal
Doc3 croatia was in the semifinal 1998
Doc4 croatia won the other semifinal in russia 2018
Assume that the following terms are stopwords: and, in, is, the, was. Construct an inverted file for these documents, showing clearly the dictionary and posting list components. Your inverted file needs to store sufficient information for computing a simple tf*idf term weight, where wij = tfij*log2(N/dfi).
Compute the term weights of the two terms "champion" and "1998" in Doc1. Show your working.
Assuming the use of a best match ranking algorithm, rank all documents using their relevance scores for the following query:
1998 croatia
Show your working. Note that log2(0.75) = -0.4150 and log2(1.3333) = 0.4150.
In Web search, explain why the use of raw term frequency (TF) counts in scoring documents can hurt the effectiveness of the search engine. Suggest a solution to alleviate the problem, and show through examples how it might work. Explain through examples how modern term weighting models in IR control the raw term frequency counts.
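On the question of raw term frequency counts, the following Python sketch compares a raw count with two common ways of damping it: logarithmic scaling and a BM25-style saturation function. It is an illustration only. The function names are hypothetical, the value k1 = 1.2 is an assumed (commonly used) setting, and document length normalisation is deliberately left out (equivalent to b = 0 in BM25).

```python
import math


def raw_tf(tf):
    """Raw count: the contribution grows linearly with repetitions."""
    return float(tf)


def log_tf(tf):
    """Sublinear scaling: 1 + log2(tf) for tf > 0, otherwise 0."""
    return 1.0 + math.log2(tf) if tf > 0 else 0.0


def bm25_tf(tf, k1=1.2):
    """BM25-style saturation without length normalisation (assumes b = 0).

    The value approaches, but never reaches, k1 + 1 as tf grows.
    """
    return tf * (k1 + 1.0) / (tf + k1)


# A page that repeats a term 1000 times gains a lot under raw TF,
# but very little under the two damped variants.
for tf in (1, 2, 10, 100, 1000):
    print(f"tf={tf:>4}  raw={raw_tf(tf):7.1f}  log={log_tf(tf):5.2f}  bm25={bm25_tf(tf):5.3f}")
```

With raw TF, repeating a term 1000 times multiplies its contribution by 1000, which keyword-stuffed Web pages can exploit. Under the saturating form, the contribution stays below k1 + 1 however often the term is repeated.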
April- 1 Continued Overleaf
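For the four-document collection Doc1 to Doc4 and the weight wij = tfij*log2(N/dfi), the Python sketch below shows one possible shape for the inverted file: a dictionary of terms, each pointing to a posting list of (document, tf) entries, with df taken as the posting-list length. The function names are hypothetical. Scoring a query by summing the weights of the matching terms is one assumed reading of "best match ranking", not necessarily the intended one.

```python
import math
from collections import Counter, defaultdict

DOCS = {
    "Doc1": "france is world champion 1998 france won",
    "Doc2": "croatia and france played each other in the semifinal",
    "Doc3": "croatia was in the semifinal 1998",
    "Doc4": "croatia won the other semifinal in russia 2018",
}
STOPWORDS = {"and", "in", "is", "the", "was"}


def build_inverted_file(docs, stopwords):
    """Return {term: {doc_id: tf}}; df is the length of each posting list."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        tokens = [t for t in text.split() if t not in stopwords]
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return dict(index)


def weight(index, n_docs, term, doc_id):
    """w_ij = tf_ij * log2(N / df_i); 0.0 if the term is absent from the document."""
    postings = index.get(term, {})
    if doc_id not in postings:
        return 0.0
    return postings[doc_id] * math.log2(n_docs / len(postings))


def rank(index, docs, query):
    """Assumed scoring: sum the document weights of the query terms."""
    scores = {
        doc_id: sum(weight(index, len(docs), term, doc_id) for term in query.split())
        for doc_id in docs
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


index = build_inverted_file(DOCS, STOPWORDS)
for term in sorted(index):
    print(term, "df =", len(index[term]), "postings =", index[term])
print(rank(index, DOCS, "1998 croatia"))
```

Under these assumptions, `weight(index, 4, "champion", "Doc1")` evaluates to 2.0 and `weight(index, 4, "1998", "Doc1")` to 1.0. Documents with equal scores keep the order in which they were listed.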

(c) Assume that you have decided to modify the approach you use to rank the documents of your collection. You have developed a new Web ranking approach that makes use of recent advances in neural networks. All other components of the system remain the same. Explain in detail the steps you need to undertake to determine whether your new Web ranking approach produces a better retrieval performance than the original ranking approach.
(a) Consider a corpus of documents C written in English, where the frequency distribution of words approximately follows Zipf's law r * p(wr|C) = 0.1, where r = 1, 2, …, n is the rank of a word by decreasing order of frequency. wr is the word at rank r, and p(wr|C) is the probability of occurrence of word wr in the corpus C.
Compute the probability of occurrence of the most frequent word in C. Compute the probability of occurrence of the 2nd most frequent word in C. Justify your answers.
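The stated relation r * p(wr|C) = 0.1 can be rearranged to p(wr|C) = 0.1 / r. The short Python sketch below evaluates that rearrangement for a given rank. The function name is hypothetical.

```python
ZIPF_CONSTANT = 0.1  # from the stated relation r * p(w_r|C) = 0.1


def zipf_probability(rank):
    """p(w_r|C) = 0.1 / r under the stated form of Zipf's law."""
    if rank < 1:
        raise ValueError("rank starts at 1")
    return ZIPF_CONSTANT / rank


print(zipf_probability(1))  # rank 1: 0.1
print(zipf_probability(2))  # rank 2: 0.05
```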
(b) Consider the query "michael jackson music" and the following term frequencies for the three documents D1, D2 and D3, where the search engine is using raw term frequency (TF) but no IDF:

     indiana  jackson  life  michael  music  pop  really
D1   0        4        1     3        0      6    1
D2   4        0        3     4        1      0    2
D3   0        4        0     5        4      4    0

Assume that the system has returned the following ranking: D2, D3, D1. The user judges D3 to be relevant and both D1 and D2 to be non-relevant.
Show the original query vector, clearly stating the dimensions of the vector.
Use Rocchio's relevance feedback algorithm (with α=β=γ=1) to provide a revised query vector for "michael jackson music". Terms in the revised query that have negative weights can be dropped, i.e. their weights can be changed back to 0. Show all your calculations.
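The Rocchio update can be sketched in Python as follows, using the term frequencies given for D1, D2 and D3 and the dimension order indiana, jackson, life, michael, music, pop, really. The sketch assumes the standard centroid form q' = α*q + β*centroid(relevant) - γ*centroid(non-relevant) and a binary original query vector. The function names are hypothetical.

```python
TERMS = ["indiana", "jackson", "life", "michael", "music", "pop", "really"]
DOC_VECTORS = {
    "D1": [0, 4, 1, 3, 0, 6, 1],
    "D2": [4, 0, 3, 4, 1, 0, 2],
    "D3": [0, 4, 0, 5, 4, 4, 0],
}


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    if not vectors:
        return [0.0] * len(TERMS)
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=1.0, gamma=1.0):
    """q' = alpha*q + beta*centroid(relevant) - gamma*centroid(non_relevant).

    Negative weights are set back to 0, as the question allows.
    """
    rel_c = centroid(relevant)
    non_c = centroid(non_relevant)
    revised = [alpha * q + beta * r - gamma * n for q, r, n in zip(query, rel_c, non_c)]
    return [max(0.0, w) for w in revised]


# Original query vector: 1 for each query term, 0 elsewhere (assumed binary weights).
query = [1 if term in {"michael", "jackson", "music"} else 0 for term in TERMS]
revised = rocchio(query, [DOC_VECTORS["D3"]], [DOC_VECTORS["D1"], DOC_VECTORS["D2"]])
print(dict(zip(TERMS, revised)))
```

Here D3 is the only relevant document, and the non-relevant centroid is the mean of D1 and D2.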

(c) Suppose we have a corpus of documents with a dictionary of 6 words w1, …, w6. Consider the table below, which provides the estimated language model p(w|C) using the entire corpus of documents C (second column) as well as the word counts for doc1 (third column) and doc2 (fourth column), where ct(w, doci) is the count of word w (i.e. its term frequency) in document doci. Let the query q be the following:
Word   p(w|C)   ct(w, doc1)   ct(w, doc2)
SUM    1.0      10            7
Assume that we do not apply any smoothing technique to the language model for doc1 and doc2. Calculate the query likelihood for both doc1 and doc2, i.e. p(q|doc1) and p(q|doc2) (Do not compute the log-likelihood; i.e. do not apply any log scaling). Show your calculations. Provide the resulting ranking of documents and state the document that would be ranked the highest.
Suppose we now smooth the language model for doc1 and doc2 using Jelinek-Mercer Smoothing with λ = 0.1. Recalculate the likelihood of the query for both doc1 and doc2, i.e., p(q|doc1) and p(q|doc2) (Do not compute the log-likelihood; i.e. do not apply any log scaling). Show your calculations. Provide the resulting ranking of documents and state the document that would be ranked the highest.
Explain which document you think should reasonably be ranked higher (doc1 or doc2), and why.
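The effect of smoothing on query likelihood can be sketched in Python. The word counts, collection model and query below are hypothetical values chosen for illustration and are not the values from the question's table. The sketch also assumes the convention p(w|d) = (1 - λ) * ct(w, d)/|d| + λ * p(w|C); some texts swap the roles of λ and 1 - λ.

```python
def query_likelihood(query, doc_counts, collection_model, lam=0.0):
    """p(q|d) as a product over query words.

    Each factor is (1 - lam) * ct(w, d) / |d| + lam * p(w|C) (assumed convention).
    lam = 0.0 gives the unsmoothed maximum-likelihood model.
    """
    doc_length = sum(doc_counts.values())
    likelihood = 1.0
    for word in query:
        p_ml = doc_counts.get(word, 0) / doc_length
        likelihood *= (1.0 - lam) * p_ml + lam * collection_model[word]
    return likelihood


# Hypothetical data for illustration only; NOT the exam's table.
collection_model = {"a": 0.5, "b": 0.3, "c": 0.2}
doc_x = {"a": 3, "b": 1, "c": 0}  # length 4, never mentions "c"
doc_y = {"a": 1, "b": 1, "c": 2}  # length 4
query = ["a", "c"]

for lam in (0.0, 0.1):
    print(
        f"lambda={lam}",
        "p(q|doc_x) =", query_likelihood(query, doc_x, collection_model, lam),
        "p(q|doc_y) =", query_likelihood(query, doc_y, collection_model, lam),
    )
```

Without smoothing, a single query word missing from a document drives its likelihood to zero. With smoothing, the collection model supplies a small non-zero probability, so such a document can still be ranked.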

(a) How would the IDF score of a word w change (i.e., increase, decrease or stay the same) in each of the following cases: (1) adding the word w to a document; (2) making each document twice as long as its original length by concatenating the document with itself; (3) adding some documents to the collection. You must suitably justify your answers.
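The three cases can be explored with a small Python sketch, assuming IDF is defined as log2(N/df), the form used in the tf*idf weight earlier in this paper. The toy collection and the function name are hypothetical. The outcome of cases (1) and (3) depends on details the question leaves open, as the comments note.

```python
import math


def idf(term, docs):
    """idf = log2(N / df), where docs is a list of token lists."""
    df = sum(1 for tokens in docs if term in tokens)
    if df == 0:
        raise ValueError(f"{term!r} does not occur in the collection")
    return math.log2(len(docs) / df)


# Hypothetical toy collection: "w" occurs in one of four documents.
docs = [["w", "x"], ["x", "y"], ["y", "z"], ["z", "x"]]
print("baseline:", idf("w", docs))

# (1) Add w to a document that did not contain it: df rises, N is unchanged.
#     If the document already contained w, df and the IDF would not change.
case1 = [docs[0], docs[1] + ["w"], docs[2], docs[3]]
print("case 1:", idf("w", case1))

# (2) Concatenate every document with itself: neither N nor df changes.
case2 = [tokens + tokens for tokens in docs]
print("case 2:", idf("w", case2))

# (3) Add documents that do not contain w: N rises, df is unchanged.
#     If the new documents did contain w, the result would depend on the mix.
case3 = docs + [["x"], ["y"], ["z"], ["x", "y"]]
print("case 3:", idf("w", case3))
```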