The following documents have been processed by an IR system where stemming is not applied:

1998 croatia

Show your working. Note that log2(0.75)= -0.4150 and log2(1.3333)= 0.4150.

In Web search, explain why the use of raw term frequency (TF) counts in scoring documents can hurt the effectiveness of the search engine.


DocID  Text
Doc1   france is world champion 1998 france won
Doc2   croatia and france played each other in the semifinal
Doc3   croatia was in the semifinal 1998
Doc4   croatia won the other semifinal in russia 2018

Assume that the following terms are stopwords: and, in, is, the, was. Construct an inverted file for these documents, showing clearly the dictionary and posting list components. Your inverted file needs to store sufficient information for computing a simple tf*idf term weight, where w_ij = tf_ij * log2(N / df_i).
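As an illustration only, the inverted file could be sketched in Python as below. The layout (a dictionary mapping each term to its document frequency and a posting list of (DocID, tf) pairs) is one assumed representation, and the name `build_inverted_file` is hypothetical.

```python
from collections import Counter, defaultdict

DOCS = {
    "Doc1": "france is world champion 1998 france won",
    "Doc2": "croatia and france played each other in the semifinal",
    "Doc3": "croatia was in the semifinal 1998",
    "Doc4": "croatia won the other semifinal in russia 2018",
}
STOPWORDS = {"and", "in", "is", "the", "was"}


def build_inverted_file(docs, stopwords):
    """Return {term: (df, [(doc_id, tf), ...])}; no stemming is applied."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        # Count term frequencies after removing stopwords.
        counts = Counter(t for t in docs[doc_id].split() if t not in stopwords)
        for term, tf in counts.items():
            postings[term].append((doc_id, tf))
    # df is the length of the posting list; terms are kept in sorted order.
    return {term: (len(plist), plist) for term, plist in sorted(postings.items())}


inverted_file = build_inverted_file(DOCS, STOPWORDS)
for term, (df, plist) in inverted_file.items():
    print(f"{term:<10} df={df}  {plist}")
```

Storing tf in each posting and df in the dictionary entry, together with the collection size N = 4, is enough to compute w_ij for any term and document.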

Suggest a solution to alleviate the problem, and show through examples how it might work. Explain through examples how modern term weighting models in IR control the raw term frequency counts.
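A minimal sketch of two common ways to dampen raw TF is given below, assuming a logarithmic scaling and a BM25-style saturation without length normalisation. The function names and the value k1 = 1.2 are assumptions for illustration, not part of the question.

```python
import math


def log_tf(tf):
    """Logarithmic dampening: 1 + log2(tf) for tf > 0, else 0."""
    return 1 + math.log2(tf) if tf > 0 else 0.0


def saturating_tf(tf, k1=1.2):
    """BM25-style saturation (length normalisation omitted); k1 is an assumed value."""
    return tf * (k1 + 1) / (tf + k1)


# Compare how each variant grows as the raw count increases.
for tf in (1, 2, 10, 100):
    print(tf, round(log_tf(tf), 2), round(saturating_tf(tf), 2))
```

In both variants a page that repeats a term 100 times gains far less than 100 times the credit of a page that uses it once, and the saturating form can never exceed k1 + 1.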

Compute the term weights of the two terms champion and 1998 in Doc1. Show your working.
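The arithmetic can be checked with a short sketch, assuming the tf and df values read off the four documents above (champion occurs once, in Doc1 only; 1998 occurs once in Doc1 and also appears in Doc3). The helper name `tfidf` is hypothetical.

```python
import math

N = 4  # number of documents in the collection


def tfidf(tf, df, n_docs=N):
    """w_ij = tf_ij * log2(N / df_i)."""
    return tf * math.log2(n_docs / df)


w_champion = tfidf(tf=1, df=1)  # 1 * log2(4/1)
w_1998 = tfidf(tf=1, df=2)      # 1 * log2(4/2)
print(w_champion, w_1998)
```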

Assuming the use of a best match ranking algorithm, rank all documents using their relevance scores for the following query:
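One possible sketch of the ranking is shown below. It assumes that the query is the "1998 croatia" string printed elsewhere in this paper, that a document's score is the sum of its tf*idf weights over the query terms, and that ties are broken by DocID; all three are assumptions rather than requirements of the question.

```python
import math
from collections import Counter

DOCS = {
    "Doc1": "france is world champion 1998 france won",
    "Doc2": "croatia and france played each other in the semifinal",
    "Doc3": "croatia was in the semifinal 1998",
    "Doc4": "croatia won the other semifinal in russia 2018",
}
STOPWORDS = {"and", "in", "is", "the", "was"}
QUERY = ["1998", "croatia"]  # assumed query

# Per-document term frequencies and collection-wide document frequencies.
tf = {d: Counter(t for t in text.split() if t not in STOPWORDS)
      for d, text in DOCS.items()}
N = len(DOCS)
df = Counter(term for counts in tf.values() for term in counts)


def score(doc_id):
    """Sum of tf * log2(N / df) over the query terms present in the collection."""
    return sum(tf[doc_id][t] * math.log2(N / df[t]) for t in QUERY if df[t])


# Highest score first; ties broken by DocID (an assumed convention).
ranking = sorted(DOCS, key=lambda d: (-score(d), d))
for d in ranking:
    print(d, round(score(d), 4))
```

Note that 2018 in Doc4 does not match the query term 1998, so Doc4 scores on croatia alone.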

April- 1 Continued Overleaf

(c) Assume that you have decided to modify the approach you use to rank the documents of your collection. You have developed a new Web ranking approach that makes use of recent advances in neural networks. All other components of the system remain the same. Explain in detail the steps you need to undertake to determine whether your new Web ranking approach produces a better retrieval performance than the original ranking approach.

(a) Consider a corpus of documents C written in English, where the frequency distribution of words approximately follows Zipf's law r * p(w_r|C) = 0.1, where r = 1, 2, ..., n is the rank of a word by decreasing order of frequency. w_r is the word at rank r, and p(w_r|C) is the probability of occurrence of word w_r in the corpus C.

Compute the probability of occurrence of the most frequent word in C. Compute the probability of occurrence of the 2nd most frequent word in C. Justify your answers.
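Rearranging the stated law gives p(w_r|C) = 0.1 / r, which the following short sketch evaluates for the first two ranks. The function name `zipf_probability` is hypothetical.

```python
ZIPF_CONSTANT = 0.1  # from r * p(w_r|C) = 0.1


def zipf_probability(rank):
    """p(w_r|C) = 0.1 / r for a word at the given rank."""
    return ZIPF_CONSTANT / rank


p1 = zipf_probability(1)  # most frequent word
p2 = zipf_probability(2)  # 2nd most frequent word
print(p1, p2)
```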

(b) Consider the query michael jackson music and the following term frequencies for the three documents D1, D2 and D3, where the search engine is using raw term frequency (TF) but no IDF:

     indiana  jackson  life  michael  music  pop  really
D1   0        4        1     3        0      6    1
D2   4        0        3     4        1      0    2
D3   0        4        0     5        4      4    0

Assume that the system has returned the following ranking: D2, D3, D1. The user judges D3 to be relevant and both D1 and D2 to be non-relevant.

Show the original query vector, clearly stating the dimensions of the vector.

Use Rocchio's relevance feedback algorithm (with α = β = γ = 1) to provide a revised query vector for michael jackson music. Terms in the revised query that have negative weights can be dropped, i.e. their weights can be changed back to 0. Show all your calculations.
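The calculation can be sketched as follows, assuming the common centroid form of Rocchio (the revised query is α times the original query, plus β times the mean of the relevant documents, minus γ times the mean of the non-relevant documents), all three parameters equal to 1, and a binary original query vector over the seven table columns. The function names are hypothetical.

```python
TERMS = ["indiana", "jackson", "life", "michael", "music", "pop", "really"]
D1 = [0, 4, 1, 3, 0, 6, 1]
D2 = [4, 0, 3, 4, 1, 0, 2]
D3 = [0, 4, 0, 5, 4, 4, 0]

# Original query vector: 1 for each query term, 0 elsewhere (assumed binary weights).
QUERY = [1 if t in {"michael", "jackson", "music"} else 0 for t in TERMS]


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=1.0, gamma=1.0):
    rel = centroid(relevant)
    non = centroid(non_relevant)
    raw = [alpha * q + beta * r - gamma * n for q, r, n in zip(query, rel, non)]
    # Negative weights are dropped, i.e. reset to 0.
    return [max(0.0, w) for w in raw]


revised = rocchio(QUERY, relevant=[D3], non_relevant=[D1, D2])
print(dict(zip(TERMS, revised)))
```

Under these assumptions the term pop enters the revised query with a positive weight even though it was not in the original query.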


(c) Suppose we have a corpus of documents with a dictionary of 6 words w1, …, w6. Consider the table below, which provides the estimated language model p(w|C) using the entire corpus of documents C (second column) as well as the word counts for doc1 (third column) and doc2 (fourth column), where ct(w, doci) is the count of word w (i.e. its term frequency) in document doci. Let the query q be the following:

Word   p(w|C)   ct(w, doc1)   ct(w, doc2)
SUM    1.0      10            7


Assume that we do not apply any smoothing technique to the language model for doc1 and doc2. Calculate the query likelihood for both doc1 and doc2, i.e. p(q|doc1) and p(q|doc2) (Do not compute the log-likelihood; i.e. do not apply any log scaling). Show your calculations. Provide the resulting ranking of documents and state the document that would be ranked the highest.

Suppose we now smooth the language model for doc1 and doc2 using Jelinek-Mercer smoothing with λ = 0.1. Recalculate the likelihood of the query for both doc1 and doc2, i.e., p(q|doc1) and p(q|doc2) (Do not compute the log-likelihood; i.e. do not apply any log scaling). Show your calculations. Provide the resulting ranking of documents and state the document that would be ranked the highest.
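The sketch below shows how the two query likelihoods might be computed. Every number in it is hypothetical: the per-word values of the table and the query are not reproduced here, so the collection model, the counts and the query are made up, chosen only so that the document lengths are 10 and 7. It also assumes the convention in which λ weights the collection model, p(w|d) = (1 - λ) * ct(w, d)/|d| + λ * p(w|C); some texts swap the roles of λ and 1 - λ.

```python
# Hypothetical data: NOT the values from the exam table.
P_COLLECTION = {"w1": 0.3, "w2": 0.2, "w3": 0.2, "w4": 0.1, "w5": 0.1, "w6": 0.1}
DOC1 = {"w1": 4, "w2": 3, "w3": 2, "w4": 1}  # length 10
DOC2 = {"w1": 2, "w2": 1, "w5": 4}           # length 7
QUERY = ["w1", "w5"]                          # hypothetical query


def query_likelihood(query, doc_counts, lam):
    """Product over query words of the Jelinek-Mercer smoothed p(w|d).

    lam = 0 gives the unsmoothed maximum-likelihood model.
    """
    doc_len = sum(doc_counts.values())
    likelihood = 1.0
    for w in query:
        p_doc = doc_counts.get(w, 0) / doc_len
        likelihood *= (1 - lam) * p_doc + lam * P_COLLECTION[w]
    return likelihood


for name, doc in (("doc1", DOC1), ("doc2", DOC2)):
    print(name,
          "unsmoothed:", query_likelihood(QUERY, doc, lam=0.0),
          "smoothed:", query_likelihood(QUERY, doc, lam=0.1))
```

With these made-up counts, doc1 has an unsmoothed likelihood of exactly 0 because it lacks one query word, while smoothing gives it a small non-zero score.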

Explain which document you think should be reasonably ranked higher (doc1 or doc2) and why?

(a) How would the IDF score of a word w change (i.e., increase, decrease or stay the same) in each of the following cases: (1) adding the word w to a document; (2) making each document twice as long as its original length by concatenating the document with itself; (3) adding some documents to the collection. You must suitably justify your answers.
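The effect of each case can be explored with a small experiment, assuming IDF is defined as log2(N / df) and using a made-up four-document collection. The outcome of cases (1) and (3) depends on whether the affected documents already contain w, which the sketch illustrates with two variants.

```python
import math


def idf(term, docs):
    """log2(N / df); assumes the term occurs in at least one document."""
    df = sum(1 for d in docs if term in d.split())
    return math.log2(len(docs) / df)


docs = ["a b", "b c", "c d", "d a"]  # hypothetical collection; w = "a"
base = idf("a", docs)

# (1) Add w to a document that already contains it, or to one that does not.
case1_already_present = idf("a", ["a b a", "b c", "c d", "d a"])
case1_newly_present = idf("a", ["a b", "b c a", "c d", "d a"])

# (2) Concatenate every document with itself: N and df are unchanged.
case2_doubled = idf("a", [d + " " + d for d in docs])

# (3) Add documents that do not contain w, or documents that all contain w.
case3_without_w = idf("a", docs + ["e f", "g h"])
case3_with_w = idf("a", docs + ["a e", "a f"])

print(base, case1_already_present, case1_newly_present,
      case2_doubled, case3_without_w, case3_with_w)
```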

[5] (d) Consider a query q, which returns all webpages shown in the hyperlink structure below.

(b) Explain in detail why positive feedback is likely to be more useful than negative feedback to an information retrieval system. Illustrate your answer using an example from a suitable search scenario.

(c) Neural retrieval models often use a re-ranking strategy over BM25 to reduce computational overhead.

Explain the key limitation of this strategy. Describe in sufficient detail an approach that you might use to overcome this problem.

Write the adjacency matrix A for the above graph.

[1]

Using the iterative HITS algorithm, provide the hub and authority scores for all the webpages of the above graph after a complete single iteration of the algorithm. Show your workings.
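A single HITS iteration can be sketched as below. The graph is hypothetical, since the hyperlink structure is not reproduced here; the sketch also assumes hub scores initialised to 1, authorities updated before hubs, and normalisation by the sum of scores (other texts normalise by the Euclidean norm or the maximum).

```python
# Hypothetical 3-page graph: A[i][j] = 1 if page i links to page j.
A = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]


def normalise(v):
    """Scale scores to sum to 1 (an assumed normalisation choice)."""
    total = sum(v)
    return [x / total for x in v] if total else v


def hits_iteration(adj, hubs):
    n = len(adj)
    # Authority update: a_j = sum of hub scores of pages linking to j (A^T h).
    auth = [sum(adj[i][j] * hubs[i] for i in range(n)) for j in range(n)]
    # Hub update: h_i = sum of authority scores of pages that i links to (A a).
    new_hubs = [sum(adj[i][j] * auth[j] for j in range(n)) for i in range(n)]
    return normalise(auth), normalise(new_hubs)


authorities, hubs = hits_iteration(A, hubs=[1.0, 1.0, 1.0])
print("authorities:", authorities)
print("hubs:", hubs)
```

Before normalisation, the first authority update with unit hub scores simply yields each page's in-degree.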

Describe in sufficient detail an alternative approach to compute the hub and authority scores for the above graph. You need to show all required steps to generate the scores, but you do not need to actually compute the final scores.

April- 4 /End
