Tuesday, January 30, 2007

Clarification regarding Qn 3 in the homework

As some of you noticed, there is an error in the tf/idf representation of the database example
shown in table 14.3 (which is on one of the slides shown in the class). The tf/idf rep of document
d7 is mysteriously missing from the table.

Confusing as this is, nothing you need to do is affected by this elision.


Interesting Tech Talk by Luis Von Ahn


When Dr. Rao mentioned about the ESP game, I was inquisitive to know more about that. This video is a very interesting talk by Prof. Luis Von Ahn in which he explains about his inventions that involve novel techniques for utilizing the computational abilities of humans.


It was amazing. I am pretty much sure most of the students would have seen this already, but for those who havent, it is something that should not be missed.


Information + new discussion topic (on IR for other languages)... 1/30


 --> Thursday's class is going to be one of the more technical (and critical) classes for the semester.
      Please look at the slides before coming to the class to maximize absorption rate/chance

 -->Note that I added new slides to last week's lecture to capture the class discussion
      about (1) marginal relevance (2) the point that relevance/marginal relevance are being
      assessed in terms of similarity (3) the idea of reducing the lossyness of bag of words model
      by looking at shingles/noun-phrases and (4) the digression about doing plagiarism/duplicate detection
      using similarity metrics.

Discussion topic (the two students with whom I discussed this already are exempted ;-)

 Difficult as it might be to believe, English is not the only language in the world. To what extent
are the information retrieval techniques that we have discussed until now also apply to
languages other than english? (Which work without change, which need to be modified a bit, and
which don't make any sense at all?)

(Of course, to answer this question, it helps if you actually know a language other than English..
So "Americans" as defined by the smart-alec saying "There are three types of people in the world:
Bilinguals, Polyglots and Americans" are excused ;-))

My first impulse was to add a qualification to the parenthetical remark after "English.." saying
"Java/C++ don't count". Then again, I decided, how about retrieving relevant code fragments
to help programmers? How effective are the information retrieval models we are discussing for this

Get cracking..


Friday, January 26, 2007

Readings for next week (LSI and Correlation Clutering)

The readings for next week are:

1. The LSI paper (by Dumais et al) on the readings list

2. You may also want to brush up on the theory of singular value decomposition.
Here is a good reference:


3. I will try to scan and send you a couple of pages on association/correlation clusters



Thursday, January 25, 2007

Clustering search results to increase their diversity..

I added a slide summarizing the discussion on why top-k results
should simultaneously satisfy two constraints (be as close to the query
as possible while being as far from each other as possible).

As I said one way to increase result diversity is to get results in terms of their
proximity to the query, and then cluster them and give cluster representatives.

Vivisimo.com is such a seach engine.

Here is an example:


(you can try other queries..)


Discussion Topic: (mis)intuitions about High dimensional space..

Here are two interesting facts.

1. High dimensional apples are all peel and no core. Specifically, consider spherical apples of
n dimensions. Suppose you were to peel off an epsilon-think apple skin off the apple, what is the
fraction of the apple that is left? If you consider unit radius apples (spheres) you can show that
the percentage of apple left is 97% in 3-D, and 0.004% in 1000-D (i.e., thousand dimensional space). (You can
assume that the volume of an n-dimensional sphere is O( r^n) where r is the radius.

2. In high-dimensions, any randomly chosen pair of vectors are, with very high probability, perpendicular
to each other.

You are welcome--on the blog--to (a) try to prove them
and/or (b) discuss why any of these are relevant to information retrieval using vector space


ps: If intellectual adventure/stimulation is not enough of an incentive to post on the blog,
    note that blog participation is viewed as class participation--and part of your grade for this course will be participation based.

Added two more questions to the homework. It is now due on 2/1 (feb 1st)

I added questions on precision recall and vector space similarity ranking and closed the
homework socket. It will be due next thursday--in class.


Tuesday, January 23, 2007

NYT has a bag-of-words analysis of Bush's state-of-union speeches..


It is always nice to have NYT working in tandem with the course ;-)


clarification+discussion qn+links


I mentioned in the class that it matters less what similarity metric you are using
and more whether it satisfies some reasonable desiderata. I really should have put up a
slide on that before mentioning any similarity metrics. I did go back and put one up into the class slides.
You can also look at just that one slide by looking at the link



I said that most search engines try to rank the documents by computing the similarity between the
document and the query and outputting the documents in the order of their measured similarity.

It is clear that this is the best approach if you are returning (a) either just the top relevant document
or (b) all relevant documents. Do you think you need to change something if you are returning the top-10
relevant documents (as search engines do)? 

Consider scenario (1) There are near duplicate documents in the corpus (2) the query is ambiguous--e.g.
"bush". Use these scenarios to talk about what the search engine should be optimizing in picking the top-10

Does your favorite search engine(s) take these considerations into account?



1. Here is the link to the Lincoln play by Woody Allen that I mentioned in the class:


2. The click-bias study I discussed today (where users wind up being
biased  in their relevance judgements by what google says is the most 
relevant document) is described in:


I would recommend that you read this paper as it also gives how a
carefully constructed experimental study would look like. (More over
this is a paper from 2005 SIGIR conference--the top conference on
information retrieval)--so you get to glimpse recent state of the art.

3. Finally, in my continuing quest to show you realworld applications of the stuff
you are learning, I bring you now a potentially useful paper that I expect
most of you to read very carefully ;-)

"High precision document classification in Over-Abundant Domains:
A case study in web pornography search"


They even had a user study where the users were paid
"$3.99 for the first minute or fraction thereof and $1.99 for each
additional minute"

"like Reduced Shakespeare Company: The better you know the original,
the funnier it gets"

--LA Times review of "Dave Barry Slept here, A
sort of history of United States"

Monday, January 22, 2007

TA office hours

I am keeping my office hour on Wednesday from 10 - 11 am. For this week I will be at BYENG 557BB during my office hour. From next week I may change it to CSE open lab. I will let you know if I change to CSE open lab.

Friday, January 19, 2007

Some interesting readings + a discussion topic

Here are two interesting (not-fully-technical) readings:

 (This talks about the long tail phenomena on the web..)

 (this applies the ideas of precision and recall to conference paper reviewing
  and in so doing strengthens your intuitions about precision/recall)


Here is something to think about.  We noticed that information retrieval uses
precision/recall measures to evaluate systems.

Database systems, on the other hand, do not care about precision/recall.

Why don't they? 

Should they?

When should they?

Feel free to add your comments to the blog.


Semantic Search on Pictures.

In context of first day of discussion in class on tagging picuturs this looks interesting, Semantic search on multimedia content, including pictures and videos, It is not perfect but still its good. MARVEL can be downloaded and tried.
URL: http://www.alphaworks.ibm.com/tech/marvel

Class mailing list and blog set up..

The class mailing list and blog have been set up.

You should already have got an invitation to join the class blog.

Here are the usage guidelines:

--> The class mailing list is mainly used by me to get information out to you

Everything sent to the mailing list--including this mail-- is
(a) sent to you via email
(b) archived on the mailing list archive (available from the homepage)
and (c) added to the class blog

--> Class blog is for all discussions relevant to the class. You can initiate discussions
as well as respond to them. Additionally, I will sometimes initiate discussions on the blog.