Tuesday, January 26, 2010

(blog comments welcome) Another way to view the approximation of R(d|Q,U,{d1...dk}) in the traditional IR

[Sorry, the message accidentally got sent before it was completed... here is the full version]

Comments on the following are welcome

One possibly frustrating part of today's class is the impedance mismatch between two facts: the class is focused on how traditional IR is done, and while you may not know much about traditional IR, you do know about the ways in which it has been extended/used in search engines.

Here is a slightly alternative view to interpret how we get the traditional IR model from the "true" relevance model.

We agreed that Relevance R is a function of D, Q, U and the already shown docs {d1...dk}

The next question is how do we represent (or ignore representing) the various components:

document D:
 --Can be represented in terms of "meaning" (too hard ;-)
 --Can be represented in terms of just the words it has
     --Can be seen as a "set" of key words; "bag" of keywords; or "shingles" of keywords
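
To make the three keyword-level views concrete, here is a minimal sketch (not anything from the class materials). The helper names `tokenize`, `as_set`, `as_bag` and `as_shingles` are made up for illustration, and splitting on whitespace is a simplifying assumption; a real system would also do stemming, stop-word removal, etc.

```python
from collections import Counter

def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    return text.lower().split()

def as_set(text):
    # "Set" of keywords: only presence/absence of each word is kept.
    return set(tokenize(text))

def as_bag(text):
    # "Bag" of keywords: word counts are kept, word order is lost.
    return Counter(tokenize(text))

def as_shingles(text, k=2):
    # "Shingles": every run of k consecutive words, so some order survives.
    words = tokenize(text)
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

doc = "the cat sat on the mat"
print(as_set(doc))
print(as_bag(doc))
print(as_shingles(doc, k=2))
```

Note how the set forgets that "the" occurs twice, the bag remembers the count but not the positions, and the shingles keep local word order.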

User U
  -- can be represented in terms of the "properties" of the user
     --interests of the user
     --features (e.g. age, salary, domicile etc) of the user
  -- can be "ignored"

Query Q
  -- can be represented in terms of the full context of the query
  -- can be represented in terms of the keywords (i.e., a mini-document)
       and "context" properties (e.g. location of the query, current context of the query)

Traditional IR, as we shall see, uses a keyword representation for D and Q, and largely ignores U.

Given this representation, it makes perfect sense to think of relevance as the "similarity" between D and Q
(and to assess residual relevance in terms of dissimilarity between D and {d1...dk}).
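
As a rough illustration of this view, here is a small sketch that treats D and Q as bags of keywords and uses cosine similarity as the "similarity". The choice of cosine, the helper names (`bag`, `cosine`, `residual_relevance`), and especially the way similarity to Q is discounted by the closest already-shown document are all assumptions made for the example; this is one possible instantiation, not the definition.

```python
import math
from collections import Counter

def bag(text):
    # Bag-of-keywords representation (whitespace tokenization is an assumption).
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two bags of keywords; 0.0 if either is empty.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def residual_relevance(doc, query, shown):
    # Hypothetical combination: similarity to Q, discounted by how similar
    # the document is to the most similar already-shown document.
    d = bag(doc)
    sim_to_query = cosine(d, bag(query))
    redundancy = max((cosine(d, bag(s)) for s in shown), default=0.0)
    return sim_to_query * (1.0 - redundancy)

query = "jaguar speed"
shown = ["jaguar top speed on land"]
print(residual_relevance("jaguar top speed on land", query, shown))  # near-duplicate of a shown doc
print(residual_relevance("jaguar car speed record", query, shown))   # partly new content
```

With nothing shown yet ({d1...dk} empty), the score reduces to plain similarity between D and Q, which is the traditional IR picture.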


ps: Regarding the "accuracy"/"trustworthiness" aspect, note that traditional IR assumes that the query is being run on a curated corpus, so this doesn't need to be modeled. Also, while there is a need to model it on the web, my own preference is to separate this from relevance. This is because trust/accuracy is a badge that the user has no way of verifying, while relevance is something they can judge locally.


  1. I am not quite clear about so-called personalized search. If it can be implemented, does that mean the search engines have to keep all the users' information?

  2. One more question: PageRank derives the weight of web pages from the links. How are the pages connected to the query string?

