Monday, April 30, 2007

Current grade register snapshot...

  I am enclosing a snapshot of the current cumulatives, so that you know your relative standing and can check that
all your marks have been properly entered.

Please note that the exact weights for the various parts are subject to change. As of now, I have reckoned
21 pts for the three homeworks
20 pts for midterm
36 pts for all projects (including the third+demo)
15 pts for the final homework+presentation
8 pts for participation
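
As a quick sanity check on the weights above, here is a minimal sketch of how a weighted cumulative could be computed. The component names and the sample student are made up for illustration; only the point values come from the list above, and they may still change.

```python
# Weights as listed above (points out of 100); these are subject to change.
WEIGHTS = {
    "homeworks_1_to_3": 21,
    "midterm": 20,
    "projects": 36,
    "final_homework_presentation": 15,
    "participation": 8,
}


def weighted_total(fractions):
    """Return the cumulative score given the fraction (0..1) earned on each part."""
    return sum(WEIGHTS[name] * fractions[name] for name in WEIGHTS)


total_weight = sum(WEIGHTS.values())

# Hypothetical student, not taken from the grade register.
sample = {
    "homeworks_1_to_3": 0.9,
    "midterm": 0.8,
    "projects": 0.75,
    "final_homework_presentation": 1.0,
    "participation": 0.5,
}

print("weights add up to", total_weight)
print("sample cumulative:", round(weighted_total(sample), 2))
```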

I am surely going to have second thoughts on the relative weights (and you are welcome to express your views--either by mail or

The last column (in blue), titled relative percentage, is just your current cumulative expressed as a percentage of the top
cumulative in your category (i.e., the top student in each category determines 100%).
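
The relative percentage described above can be sketched roughly as follows. The student labels and scores are hypothetical, and the function is assumed to be applied to one category at a time.

```python
def relative_percentages(cumulatives):
    """Scale each cumulative so that the top score in the category maps to 100%."""
    top = max(cumulatives.values())
    return {student: 100.0 * score / top for student, score in cumulatives.items()}


# Hypothetical cumulatives for one category (not real register data).
category = {"student_a": 60.0, "student_b": 45.0, "student_c": 30.0}
relative = relative_percentages(category)

for student, pct in relative.items():
    print(student, round(pct, 1))
```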

The participation grade will be determined from your attendance, class questions and blog participation (you will be filling out a data sheet tomorrow--
that will ask you to estimate the number of classes you missed, the number of times you may have asked/answered questions in the class, and the
number of times you wrote a discussion response on the blog--not counting the times you were required to post your reviews).




Sunday, April 29, 2007

Trawling the web for answers.... a real-life example of information integration


As soon as I put up the new homework question on data integration on the web, the following hits were made on the website:

----->from my referer_log
8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a -> /cse494/notes/f02-exam2.pdf

-->from the access_log
71-35-60-??? - - [29/Apr/2007:11:45:51 -0700] "GET /cse494/notes/f02-exam2.pdf HTTP/1.1" 200 55917

(I am suppressing the actual IP number by putting ???)
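
For anyone curious how such an access_log entry breaks down, here is a small sketch that pulls the fields out of the line above. It assumes the server writes the Common Log Format; the regular expression and field names are my own illustration, not anything the server provides.

```python
import re
from datetime import datetime

# Assumed layout: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

line = ('71-35-60-??? - - [29/Apr/2007:11:45:51 -0700] '
        '"GET /cse494/notes/f02-exam2.pdf HTTP/1.1" 200 55917')

match = LOG_PATTERN.match(line)
entry = match.groupdict()

# Parse the timestamp, including the -0700 offset.
when = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")

print(entry["method"], entry["path"], entry["status"], entry["size"])
print(when.isoformat())
```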

Of course, I am pretty sure this is all coincidental and there was no actual plagiaristic instinct behind this purely academic endeavor of doing a manual mashup for answering homework problems.

Nevertheless, may I remind y'all of the rule I articulated at the beginning of the semester--I know you are all smart enough to find the answer to any question I can give. I would rather you try to find it in your head than on the web or with your friends.


Another question (on data integration) added to Homework 4

Important: Announcement re: mini-WWW2007 during the final exam period class of 5/8/2007

As you know, we will be meeting from 2:40pm--4:30pm on May 8th, Tuesday.

My general idea was to convert that meeting into a discussion-cum-wrap-up of the course.

Given that you are all reading a WWW 2007 paper and writing a review, it occurred to me that the best way to
make this happen is to have each of you make a 5-minute presentation on your paper.

We have a total of 110 minutes for the class period. At 5min per slot, the presentations should take about 100min, leaving us
10min of buffer time for any general discussions.
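
The time budget above can be checked with a short sketch. It assumes the presentations run back to back from the 2:40pm start, which works out to 20 slots; the function name and the resulting schedule are illustrative only.

```python
from datetime import datetime, timedelta

CLASS_START = datetime(2007, 5, 8, 14, 40)  # 2:40pm on May 8th
CLASS_MINUTES = 110
SLOT_MINUTES = 5
PRESENTATION_MINUTES = 100


def presentation_slots(start, total_minutes, slot_minutes):
    """Return (start, end) pairs for back-to-back slots of equal length."""
    count = total_minutes // slot_minutes
    step = timedelta(minutes=slot_minutes)
    return [(start + i * step, start + (i + 1) * step) for i in range(count)]


slots = presentation_slots(CLASS_START, PRESENTATION_MINUTES, SLOT_MINUTES)
buffer_minutes = CLASS_MINUTES - len(slots) * SLOT_MINUTES

print(len(slots), "slots; the last one ends at", slots[-1][1].strftime("%H:%M"))
print("buffer:", buffer_minutes, "minutes")
```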

Here is what you need to do:

1. *Pick* the paper you plan to read and review--post the title to the blog as I had asked (9 of you have done it)

2. In addition to the 1-page review you will be submitting as part of homework 4, you should also make a 4-slide summary of the paper, plus a fifth quad-chart slide:

 Slide 1: What is the problem the paper is attempting to solve? Why is it interesting?
 Slide 2: What is the solution the authors propose?
 Slide 3: What is your criticism of the solution?
 Slide 4: How is this paper related to anything we discussed in the course?
 Slide 5: A quad-chart version of slides 1-4 (i.e., copy each of the slides and paste it onto the 5th slide at 1/4th size)

3. On the 8th, be ready to present your slides.


Friday, April 27, 2007

New question added: WWW 2007 Paper selection + Review

Folks, I added a new question to the homework (reproduced below).

When you pick your paper, put a comment on the blog w.r.t. this post (so others know what papers
have already been picked).

Question 3. WWW 2007 Paper Review. The proceedings of WWW 2007 are now available online at . (The conference is still 2 weeks away).
  1. Browse the proceedings and pick a paper (preferably from the main track; but semantic web track is also okay)
  2. Send a note to the class blog saying you picked it
  3. Read the paper and write a critical one-page review. The purpose of your review is to give someone who took CSE494 an idea of what the paper's contribution is. You should thus try to connect the ideas in the paper to what we discussed in the class.
  4. Upload your review to the class blog (under the appropriate discussion topic to be opened)
  5. Be prepared to discuss it briefly in the class

Option 3 is the winner for final..

Homework 4 will be used in lieu of a separate final examination.

There will be a mandatory class during the final exam period-- 2:40--4:30pm May 8th.

(so make sure you don't make other commitments for that slot)


Monday, April 23, 2007

A paper on learning taxonomies from Wikipedia categories (that is a good companion to Semtag/Seeker)

 You might want to (optionally) read the following paper. This makes a good companion paper to Semtag/Seeker.
Semtag/Seeker _assumes_ an ontology (TAP), and tags pages. This paper learns an ontology using Wikipedia categories.

This is a paper to be published at AAAI 2007 (talk about bleeding edge ;-)


Sunday, April 22, 2007

*Important*: Change in the Project 3 demo schedule

Based on the discussions in the class, we made a change to the demo
schedule for project 3.

1. All projects, along with the analysis, are due in class on the last day of
classes (May 1st).

2. The demos are scheduled on 2nd and 3rd May (i.e., after turning in the
project reports). You need to demo all the tasks (i.e., Vector Space,
Authority/Hub, PageRank, clustering algorithms and anything else that you
have implemented). The TA will see the demo in the CSE open lab (BYENG 214). You
can either bring your laptop or show it on the computer in the open lab.

A sheet with demo slots will be circulated in the class.


In this approach, everyone gets equal time for completing the project
and turning it in. The version demonstrated is expected to be the one
that is turned in on May 1st.

[Apr 22, 2007]

Homework 3 solutions posted

Saturday, April 21, 2007

Reminder: Mandatory reading for next class: Semtag/Seeker paper

Here is the link:

(also listed in the readings of information extraction)

As usual, you will write a one page review/questions on the paper, which is made part of homework 4.


Thursday, April 19, 2007

statistics for homework 3

The following grading scheme was used for homework 3:
Qn1. K Means - 12 pts
Qn2. Hierarchical - 8 pts
Qn3. Text Classification - 10 pts
Qn4. Collaborative Filtering - 10 pts
Qn5. Paper Discussion - 10 pts
The statistics for homework 3 are:
Total - 50
Max - 50
Min - 21
Mean - 44.74
Std. Dev. - 7.36
Please let me know if you have any questions regarding the grading.

Wednesday, April 18, 2007

(required) Discussion Topic for the blog: Critique the following interview by Tim Berners-Lee on the Semantic Web..


 Here is a fairly high-level interview on the Semantic Web that Tim Berners-Lee gave to Business Week this week.
Critique this interview (agree or disagree) in the context of the discussion in the class and your understanding.
Post your comments on the blog.


CEO Guide to Technology April 9, 2007, 12:01AM EST

Q&A with Tim Berners-Lee

The inventor of the Web explains how the new Semantic Web could have profound effects on the growth of knowledge and innovation

Tim Berners-Lee is far from finished with the World Wide Web. Having invented the Web in 1989, he's now working on ways to make it a whole lot smarter.

For the last decade or so, as director of the World Wide Web Consortium (W3C), Berners-Lee has been working on an effort he's dubbed the "Semantic Web." At the heart of the Semantic Web is technology that makes it easier for people to find and correlate the information they need, whether that data resides on a Web site, in a corporate database, or in desktop software.

The Semantic Web, as Berners-Lee envisions it, represents a change so profound that it's not always easy for others to grasp. This isn't the first time he's encountered that problem. "It was really hard explaining the Web before people just got used to it because they didn't even have words like click and jump and page," Berners-Lee says. In a recent conversation with writer Rachael King, Berners-Lee discussed his vision for the Semantic Web and how it can alter the way companies operate. Edited excerpts follow.

It seems one of the problems the Semantic Web can solve is helping unlock information in various silos, in different software applications, and different places that currently cannot be connected easily.

Exactly. When you use the word "silos," that's the word we hear when somebody in the enterprise talks about the stovepipe problem. Different words for the same problem: that business information inside the company is managed by different sorts of software, and you have to go to a different person and learn a different program to see it. Any enterprise CEO really ought to be able to ask a question that involves connecting data across the organization, be able to run a company effectively, and especially to be able to respond to unexpected events. Most organizations are missing this ability to connect all the data together.

Even outside data can be integrated, as I understand it.

Absolutely. Anybody making real decisions uses data from many sources, produced by many sorts of organizations, and we're stymied. We tend to have to use backs of envelopes to do this and people have to put data in spreadsheets, which they painfully prepare. In a way, the Semantic Web is a bit like having all the databases out there as one big database. It's difficult to imagine the power that you're going to have when so many different sorts of data are available.

It seems to me that we're overwhelmed with data and this might be a good way to help us find the data we need.

When you can treat something as data, your querying can be much more powerful.

In your speech at Princeton last year, you said that maybe you had made a mistake in naming it the Semantic Web. Do you think the name confuses some people?

I don't think it's a very good name but we're stuck with it now. The word semantics is used by different groups to mean different things. But now people understand that the Semantic Web is the Data Web. I think we could have called it the Data Web. It would have been simpler. I got in a lot of trouble for calling the World Wide Web "www" because it was so long and difficult to pronounce. At the end, when people understand what it is, they understand that it connects all applications together or gives them access to data across the company when they see a few general Semantic Web applications.

Some of the early work with the Semantic Web seems to have been done by government agencies such as the Defense Advanced Research Projects Agency and the National Aeronautics & Space Administration. Why do you think the government has been an early adopter of this technology?

I understand that DARPA had its own serious problems with huge amounts of data from all different sources about all sorts of things. So, they saw the Semantic Web rightly as something that was aimed directly at solving the problems they had on a large scale. I know that DARPA then funded some of the early development.

You have touched on the idea that the Semantic Web will make it easier to discover cures for diseases. How will it do that?

Well, when a drug company looks at a disease, they take the specific symptoms that are connected with specific proteins inside a human cell which might lead to those symptoms. So the art of finding the drug is to find the chemical that will interfere with the bad things happening and encourage the good things happening inside the cell, which involves understanding the genetics and all the connections between the proteins and the symptoms of the disease.

It also requires looking at all the other connections, whether there are federal regulations about the use of the protein and how it's been used before. We've got government regulatory information, clinical trial data, the genomics data, and the proteomics data that are all in different departments and different pieces of software. A scientist who is going through that creative process of brainstorming to find something that could possibly solve the disease has to somehow keep everything in their head at the same time or be able to explore all these different axes in a connected way. The Semantic Web is a technology designed to specifically do that—to open up the boundaries between the silos, to allow scientists to explore hypotheses, to look at how things connect in new combinations that have never before been dreamt of.

The Semantic Web makes it so much easier to find and correlate information about nearly anything, including people. What happens if that information gets into the wrong hands? Is there anything that can be done to safeguard privacy?

Here at [MIT], we are doing research and building systems that are aware of the social issues. They are aware of privacy constraints, of the appropriate uses of information. We think it's important to build systems that help you do the right thing, but also we're building systems that, when they take data from many, many sources and combine it and allow you to come to a conclusion, are transparent in the sense that you can ask them what they based their decision on and they can go back and you can check if these are things that are appropriate to use and that you feel are trustworthy.

Developing Semantic Web standards has taken years. Has it taken a long time because the Semantic Web is so complex?

The Semantic Web isn't inherently complex. The Semantic Web language, at its heart, is very, very simple. It's just about the relationships between things.

Tuesday, April 17, 2007

Adding a summary of the document to the displayed results (project part C)...

How is project part C going? Any difficulties?
In project parts A and B you were displaying the score of the document as well as the url as part of the result. You can display the summary of the document along with the url to give more information about the document.
The summary of the document can be obtained using
if you are using the following statement to get the url of the document.
You need to add the summary of the document in project part C while displaying the results.

Monday, April 16, 2007

XQuery in XML syntax...


On 4/15/07, Brandon Mechtley <> wrote:
Prof. Rao,

I was working through the homework, and a question popped into my head:
Perhaps it is totally irrelevant, but is there any particular reason that the XQuery syntax itself doesn't leverage the structure of XML? With some clever use of oid's, I have a feeling queries could be just as expressive without requiring a non-XML grammar. Maybe that's where XML Schema comes into play, though . . .

No--XQuery is allowed to have an XML syntax, and there is a W3C recommendation called XQueryX for this:

Here is an excerpt from the introduction that is self-explanatory:


The [XML Query 1.0 Requirements] states that "The XML Query Language MAY have more than one syntax binding. One query language syntax MUST be convenient for humans to read and write. One query language syntax MUST be expressed in XML in a way that reflects the underlying structure of the query."

XQueryX is an XML representation of an XQuery. It was created by mapping the productions of the XQuery grammar into XML productions. The result is not particularly convenient for humans to read and write, but it is easy for programs to parse, and because XQueryX is represented in XML, standard XML tools can be used to create, interpret, or modify queries.


Notice the part about human readability of the current XQuery syntax. This "certification" was no doubt given by geeks who think SQL is more readable than English ;-)
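
To make the "standard XML tools" point in the excerpt concrete, here is a toy sketch. The element and attribute names below are invented for illustration and are NOT the real XQueryX vocabulary; the idea is only that a query held as XML can be edited with an ordinary XML library and then turned back into the human-readable form.

```python
import xml.etree.ElementTree as ET

# Toy XML encoding of a FLWOR query. The tag names are hypothetical,
# not the actual XQueryX schema.
QUERY_XML = """
<flwor>
  <for var="b" in='doc("books.xml")//book'/>
  <where op="gt">
    <path>$b/price</path>
    <literal>30</literal>
  </where>
  <return><path>$b/title</path></return>
</flwor>
"""

OPERATORS = {"gt": ">", "lt": "<", "eq": "="}


def to_text(root):
    """Render the toy XML query in the usual human-readable syntax."""
    for_el = root.find("for")
    where_el = root.find("where")
    return (
        f'for ${for_el.get("var")} in {for_el.get("in")} '
        f'where {where_el.find("path").text} {OPERATORS[where_el.get("op")]} '
        f'{where_el.find("literal").text} '
        f'return {root.find("return/path").text}'
    )


root = ET.fromstring(QUERY_XML)
original = to_text(root)

# Modify the query with a generic XML tool: raise the price threshold.
root.find("where/literal").text = "50"
modified = to_text(root)

print(original)
print(modified)
```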


Sunday, April 15, 2007

Required reading for Tuesday's class on Information extraction

For Tuesday's class, please make sure to read this paper (also added to the reading list).

This gives an easy-to-read introduction to the information extraction problem.


Tuesday, April 10, 2007

Qn on the perceived usefulness of the XML lectures.


I am wondering if I am adding much value to what you may already know about XML and its standards.
I have been going by some assumptions about the usual misconceptions regarding XML standards and am spending
time in the lectures trying my best to dispel these misconceptions. However, it occurred to me that perhaps most of you are already
free of these misconceptions and I may have been over-elaborating (or is it over-belaboring)
some of the issues (sort of like Elaine's over-compensation of exclamation marks; see ).

Any comments, especially by people aware of XML, on the utility of these lectures will be useful in helping me
fine tune the later lectures
(of course anonymously via )


Monday, April 9, 2007

Required Reading for Tomorrow: XML and XQUERY tutorials from the readings..

Please look at the XML and XQuery tutorials from the readings page in preparation for tomorrow's class.


Friday, April 6, 2007

I added one question to homework 3 and closed the socket. It is now due next Thursday

I added one question to homework 3 (on collaborative filtering) and closed the socket. It is now due next Thursday.

Wednesday, April 4, 2007

Summary of Tipping Point...

I mentioned Tipping Point in class yesterday. You can find a summary of the main ideas
of the book at

(The book is quite well written--although its hypotheses are probably not as scientifically sound
as the author wants us to believe.)


Sunday, April 1, 2007

An easy way to get hard copies of all class emails

Those of you who miss having hard-copy handouts can now use the Gmail paper mail facility