Tuesday, January 30, 2007

Information + new discussion topic (on IR for other languages)... 1/30


 --> Thursday's class is going to be one of the more technical (and critical) classes for the semester.
      Please look at the slides before coming to the class to maximize absorption rate/chance

 -->Note that I added new slides to last week's lecture to capture the class discussion
      about (1) marginal relevance (2) the point that relevance/marginal relevance are being
      assessed in terms of similarity (3) the idea of reducing the lossiness of the bag-of-words model
      by looking at shingles/noun-phrases and (4) the digression about doing plagiarism/duplicate detection
      using similarity metrics.

Discussion topic (the two students with whom I discussed this already are exempted ;-))

 Difficult as it might be to believe, English is not the only language in the world. To what extent
do the information retrieval techniques that we have discussed until now also apply to
languages other than English? (Which work without change, which need to be modified a bit, and
which don't make any sense at all?)

(Of course, to answer this question, it helps if you actually know a language other than English..
So "Americans" as defined by the smart-alec saying "There are three types of people in the world:
Bilinguals, Polyglots and Americans" are excused ;-))

My first impulse was to add a qualification to the parenthetical remark after "English.." saying
"Java/C++ don't count". Then again, I decided, how about retrieving relevant code fragments
to help programmers? How effective are the information retrieval models we are discussing for this?

Get cracking..



Raju said...

The bag-of-words model will work as is in other languages for which the permutation of words is not as important as the combination of words. All the natural languages I know seem to have this property, along with a "working" vocabulary of similar (and still large) size. Details such as stemming algorithms and stop-word lists must be changed.
For programming languages it will not work, since a limited vocabulary is used in different permutations to express a large number of concepts. For example, a=b is semantically very different from b=a, though their bags of words are exactly the same.
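
The a=b versus b=a point can be checked directly. The following is only a minimal sketch: the `bag_of_words` helper and its tokenizing regular expression are assumptions made for illustration, not something from the class slides.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count tokens while ignoring their order.

    A token is either an identifier-like run or a single
    non-space symbol (an assumed, deliberately crude tokenizer).
    """
    return Counter(re.findall(r"[A-Za-z_]\w*|\S", text))


# The two assignments mean different things, but their bags are equal.
print(bag_of_words("a = b") == bag_of_words("b = a"))  # True
```

Any model that sees only these counts cannot tell the two statements apart, which is the loss of ordering information described above.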

Zheshen(Jessie) said...

For Chinese, the idea of “bags of terms” still works, but we have to segment the text into words first, because Chinese has no explicit word delimiter like the space used in English. We should still remove “stopwords” and do “term weighting”, but we don’t need to deal with “stemming”, “converting upper-case characters to lower-case”, or “removing special symbols” (such as hyphens and punctuation marks inside words). Chinese inevitably has “synonym” and “polysemy” problems like English, and they may be much worse, since the “character”, rather than the “word”, is the smallest unit of meaning in Chinese. As a result, “synonyms” and “polysemy” can cause big problems even in the first step, separating words, because in many cases one character can combine with either the preceding characters or the following ones to form a word, and the length of a word (the number of characters it contains) is not fixed. (Every candidate split may be meaningful as “a word”; without context, it is very difficult to decide which one is correct.)

For code retrieval, there are now many search engines for code, such as “sourceforge”, “the code project”, “google code”, etc. I think all the ones I have seen so far are still based on text retrieval (e.g., “labels”, “keywords” or “text descriptions”). I think it may be possible to treat “Java/C++” as human languages and apply similar methods to do code retrieval.

Subbarao Kambhampati said...

The comment regarding Chinese is interesting. So what are the segmentation techniques used by the current best search engine for Chinese? (I understand that Baidu, rather than Google, is the frontrunner in China.)

Segmenting strings into words is also an issue in languages such as German (and Sanskrit), where many words can be joined to make long "compound words", but the problem seems easier there than it is for Chinese.

Zheshen(Jessie) said...

Basically, there are three ways to segment strings into words for Chinese: the matching-based method, the understanding-based method, and the statistics-based method.

1) Matching-based method: the system scans the string and matches its sub-strings (possible words) against the words in a dictionary.
2) Understanding-based method: the system does grammar and syntax analysis so as to eliminate ambiguities (“synonyms” and “polysemy”) as much as possible.
3) Statistics-based method: the system computes the probability that neighboring characters co-occur and treats the combinations with high probability as words.
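
As a rough illustration of the matching-based method (1), here is a minimal sketch of greedy forward maximum matching, one common dictionary-matching strategy. The function name, the `max_len` parameter, and the toy dictionary are made up for this example; it is not a description of what any particular search engine does.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation.

    At each position, take the longest substring (up to max_len
    characters) found in the dictionary; if nothing matches, emit
    a single character and move on.
    """
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


# Toy dictionary (assumed): study, graduate student, life, origin.
toy_dictionary = {"研究", "研究生", "生命", "起源"}

# Intended reading: 研究 / 生命 / 起源 ("study the origin of life").
print(forward_max_match("研究生命起源", toy_dictionary))
# ['研究生', '命', '起源']
```

The greedy matcher picks 研究生 ("graduate student") because it is the longest match, leaving 命 stranded. This is the kind of case where one character can attach to either its left or its right neighbors, and where methods (2) and (3) are meant to help.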

Most current Chinese search engines are based on one or more of the above techniques. Besides “synonyms” and “polysemy”, another tough problem in Chinese word segmentation is identifying unknown words, such as names, addresses, etc.

Nanan said...

In my opinion, there might be a problem with stopwords in Chinese, since Chinese words consist of characters. The stopwords would be better called stopcharacters, since they are normally just a single character rather than a full word. It is therefore possible for a stopcharacter to appear inside a meaningful word, and if we delete it, problems will occur during the segmentation process.
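
A small sketch of this concern, assuming a hypothetical one-entry stop list containing 的 (a very frequent particle) and a naive filter that runs on the raw string before segmentation:

```python
# Hypothetical stop list with a single character, for illustration only.
STOP_CHARACTERS = {"的"}


def strip_stop_characters(text, stop_chars=STOP_CHARACTERS):
    """Drop every stop character from the raw, unsegmented string."""
    return "".join(ch for ch in text if ch not in stop_chars)


# "我的目的" ("my goal") segments as 我 / 的 / 目的.
# The second 的 is part of the word 目的, not a particle.
print(strip_stop_characters("我的目的"))  # 我目
```

Filtering the raw string removes both occurrences of 的 and destroys the word 目的. One way around this, under the same assumptions, is to segment first and only then drop tokens that consist solely of a stop character.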