Tuesday, January 30, 2007
Information + new discussion topic (on IR for other languages)... 1/30
--> Thursday's class is going to be one of the more technical (and critical) classes of the semester.
Please look at the slides before coming to class to maximize absorption rate/chance.
--> Note that I added new slides to last week's lecture to capture the class discussion
about (1) marginal relevance, (2) the point that relevance/marginal relevance are being
assessed in terms of similarity, (3) the idea of reducing the lossiness of the bag-of-words model
by looking at shingles/noun-phrases, and (4) the digression about doing plagiarism/duplicate detection
using similarity metrics.
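For those who want something concrete for points (3) and (4), here is a minimal sketch in Python. It is not taken from the slides: the function names `shingles` and `jaccard`, the choice of 3-word shingles, and the toy sentences are all my own assumptions for illustration. It compares documents by the overlap (Jaccard similarity) of their word shingles, which is one common way to do near-duplicate detection.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word windows) in text.

    Hypothetical helper for illustration; k=3 is an arbitrary choice.
    """
    words = text.lower().split()
    if len(words) < k:
        # Too short for a full window: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)


# Toy documents (made up for this example).
doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over the lazy dog"  # one word changed
doc3 = "dog lazy the over jumps fox brown quick the"  # doc1 with word order reversed

near_duplicate_score = jaccard(shingles(doc1), shingles(doc2))
scrambled_score = jaccard(shingles(doc1), shingles(doc3))

# A pure bag-of-words view cannot tell doc1 and doc3 apart.
same_bag_of_words = sorted(doc1.split()) == sorted(doc3.split())
```

The one-word edit in doc2 still leaves a sizable shingle overlap with doc1, while the scrambled doc3 shares no 3-word shingle with doc1 even though its bag of words is identical. That is the sense in which shingles recover some of the word-order information that the bag-of-words model throws away.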
Discussion topic (the two students with whom I have already discussed this are exempted ;-))
Difficult as it might be to believe, English is not the only language in the world. To what extent
do the information retrieval techniques that we have discussed until now also apply to
languages other than English? (Which work without change, which need to be modified a bit, and
which don't make any sense at all?)
(Of course, to answer this question, it helps if you actually know a language other than English..
So "Americans" as defined by the smart-alec saying "There are three types of people in the world:
Bilinguals, Polyglots and Americans" are excused ;-))
My first impulse was to add a qualification to the parenthetical remark after "English.." saying
"Java/C++ don't count". Then again, I decided, how about retrieving relevant code fragments
to help programmers? How effective are the information retrieval models we are discussing for this