Tuesday, January 23, 2007

clarification+discussion qn+links


I mentioned in the class that it matters less what similarity metric you are using
and more whether it satisfies some reasonable desiderata. I really should have put up a
slide on that before mentioning any similarity metrics. I did go back and put one up into the class slides.
You can also look at just that one slide by looking at the link



I said that most search engines try to rank the documents by computing the similarity between the
document and the query and outputting the documents in the order of their measured similarity.

It is clear that this is the best approach if you are returning (a) either just the top relevant document
or (b) all relevant documents. Do you think you need to change something if you are returning the top-10
relevant documents (as search engines do)? 

Consider scenario (1) There are near duplicate documents in the corpus (2) the query is ambiguous--e.g.
"bush". Use these scenarios to talk about what the search engine should be optimizing in picking the top-10

Does your favorite search engine(s) take these considerations into account?



1. Here is the link to the Lincoln play by Woody Allen that I mentioned in the class:


2. The click-bias study I discussed today (where users wind up being
biased  in their relevance judgements by what google says is the most 
relevant document) is described in:


I would recommend that you read this paper as it also gives how a
carefully constructed experimental study would look like. (More over
this is a paper from 2005 SIGIR conference--the top conference on
information retrieval)--so you get to glimpse recent state of the art.

3. Finally, in my continuing quest to show you realworld applications of the stuff
you are learning, I bring you now a potentially useful paper that I expect
most of you to read very carefully ;-)

"High precision document classification in Over-Abundant Domains:
A case study in web pornography search"


They even had a user study where the users were paid
"$3.99 for the first minute or fraction thereof and $1.99 for each
additional minute"

"like Reduced Shakespeare Company: The better you know the original,
the funnier it gets"

--LA Times review of "Dave Barry Slept here, A
sort of history of United States"

1 comment:

Aravind Krishna K said...

The interesting fact in link 3) is, it is not that the volunteers were paid, but the volunteers actually paid to be part of user study.. (also in future directions, they say this, "income from volunteers has provided seed money for the proj" ... &, "positive word-of-mouth allowed them a hike of fees to $5.99 now.. )