Wednesday, February 28, 2007

project 1 statistics

Hi Everyone,
I have used the following grading scheme for the project:
Algorithm description and analysis - 15
Program                            - 15
Results                            - 10
Total                              - 40
The statistics for project 1 are:
Mean               - 35.78
Standard Deviation -  3.58
Max                - 40.00
Min                - 28.00
Here are some suggestions:
1. Some students implemented vector space ranking by iterating over all the documents and checking whether each document contains a query term. Instead, an inverted index can be used, which directly gives the set of documents containing each query term.
2. You can provide the time and space complexity of your algorithm; it is not mandatory.
3. Only terms having "contents" as a field should be used from the index.
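To illustrate suggestion 1 concretely, here is a minimal Python sketch (toy documents and a simple tf-idf weighting made up for illustration; the project itself uses Java/Lucene): ranking walks only the postings lists of the query terms instead of scanning every document.

```python
import math
from collections import defaultdict

docs = {
    0: "information retrieval ranks documents",
    1: "vector space model ranks documents by similarity",
    2: "cooking recipes for pasta",
}

# Build an inverted index: term -> {doc_id: term frequency}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term in text.split():
        index[term][doc_id] = index[term].get(doc_id, 0) + 1

def rank(query):
    """Score only documents containing at least one query term,
    instead of iterating over every document in the collection."""
    n = len(docs)
    scores = defaultdict(float)
    for term in query.split():
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(rank("ranks documents"))
```

Note that the cooking document is never touched when ranking "ranks documents"; with the iterate-over-all-documents approach it would be examined for every query.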
For project 2 you need to compare the results of authority/hub, pagerank and vector space ranking. You also need to analyze the effect of changing the parameters of pagerank and authority/hub (e.g., the damping factor for pagerank, the root set size for authority/hub, ...).
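As a concrete starting point for the pagerank part, here is a minimal Python power-iteration sketch on a made-up three-page graph (not the project code, which works on the "Hashedlinks" file); running it with different damping factors is exactly the kind of parameter analysis asked for above.

```python
def pagerank(links, d=0.85, iters=50):
    """Power iteration for PageRank. links[u] lists the pages u points to;
    d is the damping factor (probability of following a link vs. jumping)."""
    pages = sorted(set(links) | {v for out in links.values() for v in out})
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for u in pages:
            out = links.get(u, [])
            if not out:
                # dangling page: spread its rank uniformly over all pages
                for p in pages:
                    new[p] += d * pr[u] / n
            else:
                for v in out:
                    new[v] += d * pr[u] / len(out)
        pr = new
    return pr

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
for d in (0.5, 0.85):   # a smaller damping factor flattens the ranking
    print(d, pagerank(links, d=d))
```

The scores always sum to 1, and page c (pointed to by both a and b) outranks b at any damping factor; how much it outranks b depends on d.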

Tuesday, February 27, 2007

An additional paper for discussion on Thursday..

As long as we are going to continue with the discussion of the anatomy-of-a-search-engine paper, I would
also suggest that you read the following paper on the challenges faced by search engines:

(This is a 2002 paper written by one of the Google research scientists that talks about all the fun ways people try to
fool search engines... and some ideas for overcoming them.)

(This is a mandatory reading)


ps: If you are curious about the searchking pagefarming case, search for "searchking google" in google ;-)

ps2: Information about google coop is at  (an example of coop supporting google queries can be found
by issuing a query like "Lymphoma" or "Diabetes"  --see the "query refinement suggestions" before the top 10 results ). We will talk about
Google coop and the idea of using folksonomies and annotations to improve search in the second part of the course.

Monday, February 26, 2007

Project part B assigned (due 3/20/07)

Project part B is assigned (you can look at the project specifications under projects). In this project you need to implement Authority/Hub and Pagerank algorithms.
The crawl and the index are the same as in project part A. The "Hashedlinks" file provided as part of this project contains the links among the crawled pages. You need to use this file for the authority/hub and pagerank computations.
You can also create a GUI as an extra-credit task. Creating a GUI will be useful, as the end-of-semester demo of the project requires one.
Once again, you need to submit the code, results and analysis of the algorithms as described on the project page.
Email me if you have any questions regarding the project.

Sunday, February 25, 2007

Blinkx and searching videos by transcribing the audio of the video into text and tagging the video with it..

Here is an interesting article from the NYTimes about video search (just as with image search, effective video search
needs to be done in terms of textual descriptions. Unlike images, videos do have a sound track that sometimes provides
auditory annotations. Blinkx claims that it does speech recognition on these sound tracks, transcribes the recognized words as
tags on the video--and uses them to retrieve the video).


February 25, 2007

Millions of Videos, and Now a Way to Search Inside Them

THE World Wide Web is awash in digital video, but too often we can't find the videos we want or browse for what we might like.

That's a loss, because if we could search for Internet videos, they might become the content of a global television station, just as the Web's hypertext, once it was organized and tamed by search, became the stuff of a universal library.

What we need, says Suranga Chandratillake, a co-founder of Blinkx, a start-up in San Francisco, is a remote control for the Web's videos, a kind of electronic TV Guide. He's got just the thing.

Videos have multiplied on social networks like YouTube and MySpace as well as on news and entertainment sites because of the emergence of video-sharing, user-generated video, free digital storage and broadband and Wi-Fi networks.

Today, owing to the proliferation of large video files, video accounts for more than 60 percent of the traffic on the Internet, according to CacheLogic, a company in Cambridge, England, that sells "media delivery systems" to Internet service providers. "I imagine that within two years it will be 98 percent," says Hui Zhang, a computer scientist at Carnegie Mellon University in Pittsburgh.

But search engines — like Google — that were developed during the first, text-based era of the Web do a poor job of searching through this rising sea of video. That's because they don't search the videos themselves, but rather things associated with them, including the text of a Web page, the "metadata" that computers use to display or understand pages (like keywords or the semantic tags that describe different content), video-file suffixes (like .mpeg or .avi), or captions or subtitles.

None of these methods are very satisfactory. Many Internet videos have little or obscure text, and clips often have no or misleading metadata. Modern video players do not reveal video-file suffixes, and captions and subtitles imperfectly capture the spoken words in a video.

The difficulties of knowing which videos are where challenge the growth of Internet video. "If there are going to be hundreds of millions of hours of video content online," Mr. Chandratillake said, "we need to have an efficient, scalable way to search through it."

Mr. Chandratillake's history is unusual for Silicon Valley. He was born in Sri Lanka in 1977 and divided his childhood among England and various countries in South Asia where his father, a professor of nuclear chemistry, worked. Then he studied distributed processing at Kings College, Cambridge, before becoming the chief technology officer of Autonomy, a company that specializes in something called "meaning-based computing." This background possibly suggested an original approach to search when he founded Blinkx in 2004.

Mr. Chandratillake's solution does not reject any existing video search methods, but supplements them by transcribing the words uttered in a video, and searching them. This is an achievement: effective speech recognition is a "nontrivial problem," in the language of computer scientists.

Blinkx's speech-recognition technology employs neural networks and machine learning using "hidden Markov models," a method of statistical analysis in which the hidden characteristics of a thing are guessed from what is known.

Mr. Chandratillake calls this method "contextual search," and he says it works so well because the meanings of the sounds of speech are unclear when considered by themselves. "Consider the phrase 'recognize speech,' " he wrote in an e-mail message. "Its phonemes ('rek-un-nise-peach') are incredibly similar to those contained in the phrase 'wreck a nice beach.' Our systems use our knowledge of which words typically appear in which contexts and everything we know about a given clip to improve our ability to guess what each phoneme actually means."

While neural networks and machine learning are not new, their application to video search is unique to Blinkx, and very clever.

How good is blinkx search? When you visit, the first thing you see is the "video wall," 25 small, shimmering tiles, each displaying a popular video clip, indexed that hour. (The wall provides a powerful sense of the collective mind of our popular culture.)

To experiment, I typed in the phrase "Chronic — WHAT — cles of Narnia," the shout-out in the "Saturday Night Live" digital short called "Lazy Sunday," a rap parody of two New York slackers. I wanted a phrase that a Web surfer would know more readily than the real title of a video. I also knew that "Lazy Sunday," for all its cultish fame, would be hard to find: NBC Universal had freely released the rap parody on the Internet after broadcasting it in December 2005, but last month the company insisted that YouTube pull it.

Nonetheless, Blinkx found eight instances of "Lazy Sunday" when I tried it last week. By contrast, Google Video found none. Typing "Lazy Sunday" into the keyword search box on Google's home page produced hundreds of results — but many were commentaries about the video, and many had nothing to do with "Saturday Night Live."

Blinkx, which has raised more than $12.5 million from angel investors, earns money by licensing its technology to other sites. Although Blinkx has more than 80 such partners, including Microsoft, Playboy, Reuters and MTV, it rarely discloses the terms of its deals. Mr. Chandratillake said some licensees pay Blinkx directly while others share revenue and some do both. Blinkx has revealed the details of one deal: ITN, a British news broadcaster, will share the revenue generated by advertising inserted in its videos.

For all of Blinkx's coolness, there are at least three obvious obstacles to the company's success.

First, just because Google Video is not much good now doesn't mean it won't get better: after all, when Blinkx was founded, it first applied machine learning to searching the desktops of personal computers, a project that was abandoned when Google and Microsoft released their own desktop search bars.

Second, even if Google improbably fails to develop effective video search, the field will still be crowded: TruVeo, Flurl, ClipBlast and other start-ups are all at work on different subsets of the market.

Finally, Blinkx might not go far enough in searching the content of videos: the company searches their sounds, but not their images.

THIS last objection is the most serious.

"Because Blinkx emphasizes speech recognition, there is a great amount of multimedia content that they cannot address, like photographs," said John R. Smith, a senior manager in the intelligent information management department of I.B.M.'s T. J. Watson Research Center in Hawthorne, N.Y. "But what's worse, speech is not a very good indicator of what's being shown in a video."

Mr. Smith says he has been working on an experimental video search engine called Marvel, which also uses machine learning but organizes visual information as well as speech.

Still, at least for now, Blinkx leads video search: it searches more than seven million hours of video and is the largest repository of digital video on the Web.

"Search is our navigation, our interface to the Internet," said John Battelle, chief of Federated Media Publishing and author of "The Search," an account of the rise of Google. With Blinkx, we may have such an interface for digital video, and be a little closer to Mr. Chandratillake's vision of a universal remote control.

Thursday, February 22, 2007

Another try on answering Zheshen's qn on why have Authorities defined in terms of Hubs and vice versa..

Implicit in my discussion of kings and king-makers is the point that "kings" normally don't point to others (they are kings); Kingmakers do.. (if you want
a more operatic and gender-neutral metaphor, primadonnas don't necessarily point to other primadonnas. Impresarios point to primadonnas.)

In other words, pages with high authority don't *necessarily* point to other pages with high authority. This is why we find it better to define authorities
in terms of hubs and vice versa.

(Note that if an authority u happens to be too gracious for its own good
and does deign to point to another authority v, the A/H model will still catch it--u will have both a high authority and a high hub value.)

This is not surprising. In science you value a single paper both for its "originality" (authority) and its "graciousness" in citing all relevant prior work
(hub). Just because you are not gracious doesn't mean you are not an authority--you are just an ungracious authority.

As you probably know, for the longest time Newton held a grudge against Leibniz who, for all practical purposes, had invented calculus independently and
had even given it a much more natural notation--the one we use today. Newton did invent calculus (what he called the theory of "fluxions") much earlier--but never
bothered to publicize it (or even start a company..). He hated Leibniz's claim to calculus so much that he ran a highly organized scientific vendetta against Leibniz,
both while Leibniz was living and after his death.. (in one particularly flagrant incident, Newton, as the president of the Royal Society, started an "impartial"--wink wink--scientific committee
to investigate who deserved the credit for calculus--he or Leibniz, staffed the committee with his friends, and, not wanting to take a chance, even *wrote* the
committee's final report!). Newton is thus a classic case of an ungracious authority... (he also had a long-standing rivalry with Hooke--the guy who analyzed elasticity).

A sweet contrast is Gauss, who too discovered and opened up more areas of math than he ever had the time to publish. The story goes that when Jacobi went to him
to show him his great discoveries in group theory, Gauss went to the back of his room, opened some dusty folders, and showed him 20-year-old manuscripts written by him (Gauss) which basically did all the work. He, however, didn't begrudge Jacobi his fame...

---end anecdote--
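Coming back to the technical point: the mutually recursive authority/hub definition can be computed by a simple power iteration. Here is a minimal Python sketch on a made-up "kingmaker" graph (an illustration, not the project code):

```python
def hits(links, iters=50):
    """Authority/hub (HITS) computation. links[u] is the set of pages u points to."""
    pages = sorted(set(links) | {v for out in links.values() for v in out})
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iters):
        # authority of v = sum of hub scores of the pages pointing to v
        auth = {v: sum(hub[u] for u in pages if v in links.get(u, set()))
                for v in pages}
        # hub of u = sum of authority scores of the pages u points to
        hub = {u: sum(auth[v] for v in links.get(u, set())) for u in pages}
        # normalize so the scores don't blow up
        na = sum(a * a for a in auth.values()) ** 0.5 or 1.0
        nh = sum(h * h for h in hub.values()) ** 0.5 or 1.0
        auth = {p: a / na for p, a in auth.items()}
        hub = {p: h / nh for p, h in hub.items()}
    return auth, hub

# "kingmaker" k points to two "kings" x and y; the kings point to no one
auth, hub = hits({"k": {"x", "y"}})
print(auth, hub)
```

On this graph the kingmaker k gets the top hub score while the kings x and y get the top authority scores, even though x and y have no outgoing links at all--which is exactly why authorities are defined in terms of hubs rather than other authorities.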


*REQUIRED READING FOR NEXT CLASS*--Anatomy of hyper-textual search engine..


 A large part of next Tuesday's class will be devoted to a discussion of the paper

(this is the top one in the "overall search engine" category).

You should all (1) read it and (2) bring to class an at-most-one-page review with questions and/or observations about
the paper, given all that you know about how search engines could work. Focus on which aspects of the architecture surprised you
and/or differ from what we discussed in class.

The class itself will consist of us discussing the questions you bring up.

This is a mandatory class and your participation is mandatory..

(I will probably use part of the time to talk about page rank variations such as
topic-specific page rank and trust-rank; but most of the class will be for discussing the whole search
engine anatomy).


Wednesday, February 21, 2007

Fwd: Discussion Topic: What factors *should* affect page importance?

  We listed the following as some of the factors that should affect page importance, and
some desirable characteristics of an importance metric.
If you can think of other factors that should also be taken into consideration when deciding page
importance, please do make a comment:

It should be sensitive to:
  - the query (or at least the topic of the query)
  - the user (or at least the user population)
  - the link structure of the web
  - the amount of accesses the page gets
It should be stable w.r.t. small random changes in the network link structure.
It shouldn't be easy to subvert with intentional changes to the link structure.

Tuesday, February 20, 2007

CSE494/598 Spring 2007 Blog

I want to participate in discussions on the blog, but my invite to this blog has expired. Could you please send me a new invite?
Thank you,
Sindhuja Sadayandi

Saturday, February 17, 2007

Readings for next week..

I am assuming that people are looking at the readings list for the appropriate readings for the class.

The discussion on link analysis (authorities and hubs; page rank) will be guided by the search engine
technology paper, with the other indented papers serving as backup references.

I have also added another "general interest" non-technical paper on Web 2.0 to the readings list
(right next to the long tail paper)


Thursday, February 15, 2007

Please find below the Matlab code for creating a long-tailed curve, as discussed in class today. It works like this: if we use a criterion for increasing values that is satisfied by a large fraction of the samples, we get a flatter curve (bottom image; hardnes = 1, 50% satisfy), and we get a long-tailed curve if we use a criterion satisfied by a smaller fraction of the samples (top image; hardnes = 10, 10% satisfy).

% To build a power-law (long-tailed) distribution

powDist = zeros(1,1000);         % preallocate 1000 samples
for i = 1:1000
    powDist(1,i) = rand;         % start from uniform random values
end

for j = 1:1000                   % growth rounds
    hardnes = 10;                % hardnes = 10: ~10% satisfy (long tail); hardnes = 1: ~50% satisfy (flatter)
    s = sum(powDist) * hardnes;
    for i = 1:1000
        % a sample grows only if it is already large relative to the threshold
        if (rand * s / 1000 < powDist(1,i))
            powDist(1,i) = powDist(1,i) + rand;
        end
    end
end

plot(-1*sort(-powDist));         % plot the values in descending order

Monday, February 12, 2007

An academic paper on whether google (information extraction) invades your privacy

I mentioned an anecdote last class about how Google's cache inadvertently outed
a colleague's as-yet-not-public plans to move from his home institution.

Here is a paper on whether information extraction of the sort exemplified by Adwords
can be seen to be invading your privacy.

I mentioned this paper to a couple of you. Here it is for everyone. We will read this
formally at some point of time--but until then, you might enjoy the arguments
(the one I like best is the argument that as the sophistication of NLP techniques used in the
information extraction increases, and as search engines/portals extract and keep large volumes of
text tucked away in their cluster farms, they can be *held liable* for not acting on information they have
if for example, people use email to hatch plots and then carry them through.. This is one reason why
increasingly many institutions now have a standing policy of removing all email every year..)

ps: Sergei related an anecdote where Google started giving out addresses of various people because
it had access to a large library of resumes and did some elementary address extraction..

blog in the news ;-)

In case you missed your morning paper, cse494 blog is in the news today..

You can check it out at


Thursday, February 8, 2007

A good source of textbookish readings for IR material + links for/from today's class (including why "miserable failure" didn't work)..

First off, here is an NYT article explaining that, as of the end of January 2007, Google decided to
quit acting all innocent and hand-removed the "miserable failure" google bomb link to

On a more serious front, I added a new uber-link in the readings list to a draft textbook
on information retrieval that is freely available on the web.

You might consider consulting it as a good additional source of readings on topics discussed in the class
(note: some chapters are more drafty than others)

You can find discussion on how to statistically estimate relative index sizes of search engines in chapter
19 (section 19.5). The specific link is:

That whole chapter is a quick overview of web search issues/challenges.

A discussion of distributed indexing and crawling can be found in the next chapter: ch 20.


Tuesday, February 6, 2007

[discussion/aggregation topic]: Please add your answers to Qn 0 on the homework as comments to this blog entry

Homework 1; Qn 0:

Think of and list 3 queries (or activities) that you would like to do on the
Web that the current-day search engines (e.g. Google) don't quite support.

A quote to get you inspired:

"Some people see things as they are and say why? I dream things that
never were and say why not?"
-(Mis)attributed to Robert Kennedy
who paraphrased from Bernard Shaw

Sunday, February 4, 2007

Brain cycle stealing (to find Jim Gray's missing boat)...

At the beginning of the semester, I mentioned that one of the important directions on the web is to try and steal the brain cycles of the vast web population. One application was "image labelling"--and I mentioned the Mechanical Turk effort by Amazon and the ESP game and related efforts (which have since become part of the Google image labelling service).

You may have heard that Jim Gray, the noted computer scientist, has been missing since last weekend (along with his yacht). After the Coast Guard gave up searching for him, the various technorati have stepped in. Google made the most recent satellite imagery of the area available, and Amazon then took it and put it on its Mechanical Turk site--to allow people to look for possible missing boats in the mass of cloud cover and ocean waves.

Here is the site:


ps: Jim Gray is the head of Microsoft Research's Bay Area facility (which essentially was opened just so that Jim Gray could be with Microsoft and be in the Bay Area). Can you guess why Amazon engineers are so interested in helping him? (Granted, he is a great computer scientist--but what
is the connection between his contributions and Amazon?)

Friday, February 2, 2007

project 1

How is project 1 going? Any difficulties in understanding the code?
I wanted to let you know something regarding the structure of the index.
You need to read the entire index to precompute the norms of the documents.
The index contains terms with fields such as "contents", "title", "modified", "uid", etc., but you need to consider only the "contents" terms. You can find out the field of a term using termval.field(), as in the following code:
   Term termval = termenum.term();
   System.out.println("Type: " + termval.field() + "  The Term: " + termval.text() + " Frequency: " + termenum.docFreq());

   // To retrieve the <docNo, freq> pair for each term, call:
   TermDocs termdocs = reader.termDocs(termval);

   // To retrieve the <docNo, freq, <pos1, ..., posn>> entries, call:
   TermPositions termpositions = reader.termPositions(termval);
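To make the norm precomputation concrete, here is a small Python sketch (the toy postings data is made up for illustration; in the project you would get the same <docNo, freq> pairs from the Lucene TermEnum/TermDocs calls above). A single pass over the whole index accumulates each document's squared tf-idf norm:

```python
import math
from collections import defaultdict

# Toy postings in the shape the index exposes: term -> list of (doc_id, freq)
postings = {
    "web":    [(0, 3), (1, 1)],
    "search": [(0, 1), (2, 2)],
    "engine": [(1, 2), (2, 1)],
}
n_docs = 3

# One pass over the entire index accumulates each document's squared norm
norm_sq = defaultdict(float)
for term, plist in postings.items():
    idf = math.log(n_docs / len(plist))
    for doc_id, tf in plist:
        w = tf * idf
        norm_sq[doc_id] += w * w

norms = {d: math.sqrt(s) for d, s in norm_sq.items()}
print(norms)
```

The key point is the iteration order: you walk term by term (as the TermEnum does), not document by document, which is why the whole index has to be read once up front.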



Thursday, February 1, 2007

What is the difference between PCA and SVD?

I'm a bit confused with PCA/LSI and SVD.

In PCA, we do eigenvalue decomposition of the "covariance matrix" of the original matrix M. M does not have to be square or symmetric. (And LSI is the special name for PCA when it is applied to documents.)

In SVD, for the original matrix M, we do eigenvalue decomposition of M^T M and M M^T (M^T denotes the transpose of M) respectively, and use their eigenvalues and the two sets of eigenvectors as the result of the SVD. And M does not have to be square or symmetric either.

What is the difference between PCA and SVD when they are used for dimensionality reduction?
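One concrete way to see the relationship (a pure-Python sketch with made-up 2-column data): PCA is the eigendecomposition of the covariance matrix of *mean-centered* M, while the right singular vectors of the SVD are the eigenvectors of M^T M. On centered data the two coincide, since the covariance matrix is just C^T C / (n-1) and scaling a matrix does not change its eigenvectors.

```python
def top_eigvec(S, iters=200):
    """Power iteration for the leading eigenvector of a symmetric 2x2 matrix S."""
    v = [1.0, 0.3]
    for _ in range(iters):
        w = [S[0][0] * v[0] + S[0][1] * v[1],
             S[1][0] * v[0] + S[1][1] * v[1]]
        norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
        v = [w[0] / norm, w[1] / norm]
    return v

def gram(A):
    """A^T A for an n x 2 matrix A given as a list of rows."""
    return [[sum(r[i] * r[j] for r in A) for j in range(2)] for i in range(2)]

M = [[2.0, 0.0], [3.0, 1.1], [4.0, 1.9], [5.0, 3.0]]  # rows = samples (made up)
n = len(M)
means = [sum(r[j] for r in M) / n for j in range(2)]
C = [[r[j] - means[j] for j in range(2)] for r in M]  # mean-centered data

# PCA direction: leading eigenvector of the covariance matrix C^T C / (n-1)
pca_dir = top_eigvec([[x / (n - 1) for x in row] for row in gram(C)])
# SVD direction: leading right singular vector of C = eigenvector of C^T C
svd_dir = top_eigvec(gram(C))
print(pca_dir, svd_dir)
```

So for dimensionality reduction the difference comes down to centering: PCA implicitly subtracts the mean first, while SVD applied to the raw (uncentered) M can pick a different leading direction.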

Homework 2 socket opened and three questions added



Additions to the slides+ next week's reading

1. I added three slides to the end of LSI to summarize the discussion we did connecting
LSI to LDA, feature selection and non-linear dimensionality reduction. I also added a slide
showing the reconstructed database/statistics example matrix for k=2 and 4.

2. For the next week, please read the first reference in Search Engine Technology.
Our plan over the next two weeks is to
1. discuss what it means to do IR in the context of Web pages.
2. notice that web pages--in addition to being text pages--also form a social network in terms of their hyperlinks
3. discuss social networks
4. apply social network analysis to web pages (authorities/hubs & page rank)