Folks
Here is what you do with the comment/review sheet that you prepared for the class:
1. Enter it as a comment to this blog post (so y'all can see each other's comments)
2. Submit it as part of your hard-copy Homework 2 submission
thanks
rao
COMMENTS AND OBSERVATIONS
One can classify the ranking criteria into three categories:
1. Link-structure-based popularity, on which the whole PageRank algorithm was built.
2. Characteristics of the page, e.g., word frequency, anchor text, headings, title, font size, color, etc.
3. Content analysis, which is the most difficult of the three.
One of the interesting things that Google seems to do efficiently and effectively is the ranking of web pages, along with its virtual immunity to spamming, at least in those days.
Apart from PageRank, the paper also briefly mentions the extensive use of proximity search and the tracking of visual presentation details such as font size.
One of the important things the PageRank algorithm exploits is the link structure of the web, as contained in HTML pages. This was true until about the late 1990s, but nowadays the situation has changed drastically. No longer is all information readily available via HTML pages; there is a lot of embedded content, and data stored in databases may be out of reach of a search engine. So to some extent PageRank, though a good algorithm, might seem a little dated in the present context.
Another observation: with the PageRank algorithm, Google associates a rank with each web page that clearly depends on the number of in-links to that page. But this does not truly take relevance into account. For example, when Google crawls a relatively new page and assigns it a PageRank, the page may not yet have many in-links, and therefore it will not have a high PageRank even if it is very relevant. This is an important issue on today's web, where information changes very quickly.
Paper Review 1: “The Anatomy of a Large-Scale Hypertextual Web Search Engine”
Overview:
This paper describes the overall architecture of Google, as well as technical details of each aspect. The main processing steps and their related challenges and technologies are as follows:
1. Crawling:
Challenges: huge scale, high-reliability requirements, social issues, virtually impossible to test, unpredictable problems, …
Technologies: a distributed crawling system, a separate DNS cache, zlib for compression, …
2. Indexing
Challenges: a huge array of possible errors in parsing, sharing the lexicon among parallelized indexers, …
Technologies: using flex to generate a lexical analyzer, indexing documents into forward barrels and producing inverted barrels by sorting, keeping a log of extra words not in the base lexicon, distributed barrels, parallelized indexers and sorters, …
3. Searching
Challenges: huge query volume but limited response time, ranking, …
Technologies: PageRank, anchor text, different hit types, proximity, a mixture of many weights, feedback, …
Observations and Questions:
1. PageRank: 1) the “damping factor” can be applied to a single page or a group of pages, which allows for personalization and can make it nearly impossible to deliberately mislead the system in order to get a higher ranking; 2) the “importance” of a page is recursively propagated through the link structure of the web.
2. Anchor Text: a link's text is associated not only with the page the link is on, but also with the page it points to. In this way the search engine can even return a page that was never actually crawled but has hyperlinks pointing to it. In Part 5, the paper mentions that the first result in Figure 4 has no title because it was not crawled; instead, Google relied on anchor text to determine that it was a good answer to the query. But how is its PageRank calculated without crawling it? And without PageRank, how are dead links avoided?
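A rough sketch of the anchor-text idea, with invented URLs and queries: anchor text collected while crawling is attached to the target URL, so a page can match a query even if it was never fetched. (As to the question above, PageRank needs only the link graph, not page contents, so an uncrawled page can still accumulate rank from its in-links.)

```python
from collections import defaultdict

# anchor_index: target URL -> anchor texts seen for it while crawling other pages
anchor_index = defaultdict(list)

def record_link(anchor_text, target_url):
    anchor_index[target_url].append(anchor_text)

def search_anchors(query):
    q = query.lower()
    # a page can match, and be returned, even if it was never crawled itself
    return [url for url, texts in anchor_index.items()
            if any(q in t.lower() for t in texts)]

record_link("Bill Clinton", "http://whitehouse.example/")  # hypothetical crawled link
print(search_anchors("clinton"))  # ['http://whitehouse.example/']
```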
3. Will the performance, efficiency, and reliability of the system always be able to keep up with the unbelievable growth in web pages and users (queries)?
4. Is it possible to collect and keep the “accumulated time of being viewed”? From a qualitative rather than purely quantitative point of view, it may describe the importance of a page better than just the number of accesses. It may also help avoid the problem of “popular pages becoming more popular”.
5. How about giving a “trust weight” to each page? For example, for a query about a disease, results from a hospital's URL could have a higher “trust weight” than other non-professional (but perhaps more popular, in PageRank terms) websites.
Paper Review 2: “Challenges in Web Search Engines”
Overview:
This paper discusses several important problems in enhancing the quality of search engines, including Spam, Content Quality, Quality Evaluation, Web Conventions, Duplicate Hosts and Vaguely-Structured Data.
Observations:
1. Spam: text spam, link spam, and cloaking.
Here, “link farms” and “doorway pages” are very interesting. A link farm is a collection of pages in which each page links to every other page on that site, or indeed to any site the author controls. “Doorway pages” seem more evil: they consist entirely of links and are not intended to be viewed by humans. Spammers use so-called “cloaking” to distinguish search-engine crawlers' visits from ordinary users' visits and serve the crawlers content crafted to score highly. Cloaking can only be discovered by crawling a website twice: once with an HTTP client the cloaker believes is a search engine, and once with a client the cloaker believes is not.
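A minimal sketch of that double-fetch check, assuming a page whose markup is otherwise stable between fetches (in practice, dynamic content, ads, and timestamps would cause false positives):

```python
import urllib.request

def fetch(url, user_agent):
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read()

def looks_cloaked(url):
    # once posing as a crawler, once posing as an ordinary browser
    as_bot = fetch(url, "Googlebot/2.1 (+http://www.google.com/bot.html)")
    as_user = fetch(url, "Mozilla/5.0")
    return as_bot != as_user  # differing responses suggest cloaking
```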
2. Content quality and its evaluation.
I think this is one of the most important areas for current search engines to improve. Current search engines basically take only quantitative link analysis, e.g. authority/hub scores and PageRank, into consideration. They do not take into account the reliability of the source, the accuracy of the contents, and so on, which could be captured as a “trust” weight in my opinion.
As mentioned in the paper, there are some ways to evaluate the “quality” of links: direct feedback, and implicit feedback from log data, such as the position of clicks for a search and the time spent on a click.
I think search engines should also include something like “social credit/fame” in the calculation of page importance. For example, for political news, content from a governmental website should have a much higher “social credit/fame” than content from a random website. When it comes to diseases, professional medical websites should hold a higher “social credit/fame”. Registration information for the domains, or even the domain names themselves (semantic knowledge), could be used to set the “social credit/fame” of a page for a given topic. (I extend my ideas here from my last paper review.)
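A toy sketch of this “social credit/fame” proposal; the domain suffixes and numeric weights are invented purely for illustration, not anything from either paper:

```python
from urllib.parse import urlparse

# hypothetical trust table; suffixes and weights are made up for illustration
TRUST = {".gov": 1.5, ".edu": 1.3, "hospital.org": 1.4}

def trust_weight(url):
    host = urlparse(url).netloc
    for suffix, weight in TRUST.items():
        if host.endswith(suffix):
            return weight
    return 1.0  # unknown hosts get a neutral weight

def adjusted_score(url, link_score):
    # fold "social credit/fame" into the purely link-based score
    return link_score * trust_weight(url)

print(adjusted_score("http://www.cdc.gov/flu", 0.4))          # boosted
print(adjusted_score("http://random-blog.example/flu", 0.4))  # unchanged
```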
3. Web Conventions: “anchor text”, “hyperlinks”, and “META tags”
I think the critical point here is that search engines assume adherence to these conventions in order to improve search results. However, to enhance their performance further, search engines should pay more and more attention to those who contravene the established web conventions.
4. Duplicate Hosts
I think the essential point here is that “the solution of detecting duplicate hosts needs to be much less expensive than the brute-force approach that compares every pair of hosts.”
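A sketch of one cheap alternative to all-pairs comparison: fingerprint each host from a small sample of its pages and bucket hosts by fingerprint, which is linear in the number of hosts. (The sampling and hashing scheme here is an assumption, not the paper's method.)

```python
import hashlib
from collections import defaultdict

def host_fingerprint(sample_pages):
    """Hash a small, sorted sample of a host's page bodies into one digest."""
    h = hashlib.sha1()
    for body in sorted(sample_pages):
        h.update(hashlib.sha1(body.encode()).digest())
    return h.hexdigest()

def duplicate_host_groups(hosts):
    # hosts: {hostname: [sampled page bodies]}; cost is linear, not quadratic
    buckets = defaultdict(list)
    for host, pages in hosts.items():
        buckets[host_fingerprint(pages)].append(host)
    return [group for group in buckets.values() if len(group) > 1]

print(duplicate_host_groups({
    "www.example.com": ["home", "about"],
    "example.com":     ["home", "about"],  # mirror of the same site
    "other.org":       ["different"],
}))
```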
5. Vaguely-Structured Data
This part discusses the structure problem of web data. Important concepts here: “unstructured data”, “structured data”, “semi-structured data”, “vaguely-structured data”, and “layout” (page description) vs. “semantic structure” (document description).
1. “PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))” is a little different from what we learned in class, but I think the latter is more reasonable.
2. “The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links which are pairs of docIDs. The links database is used to compute PageRanks for all the documents.” Why does URLresolver need to put anchor text into the forward index associated with the docID that the anchor points to?
3. “A plain hit consists of a capitalization bit, font size, and 12 bits of word position in a document (all positions higher than 4095 are labeled 4096).” How can the accurate position of a word be found if it is higher than 4095?
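It presumably cannot: the 12-bit field deliberately collapses all large positions into one overflow label, trading exact positions deep in a document for space. A sketch of that packing, under the assumption of a 16-bit hit with 1 capitalization bit, 3 font-size bits, and 12 position bits:

```python
def pack_plain_hit(capitalized, font_size, position):
    # 1 bit capitalization | 3 bits font size | 12 bits position
    assert 0 <= font_size < 8
    pos = min(position, 0xFFF)  # positions past the 12-bit range collapse together
    return (int(capitalized) << 15) | (font_size << 12) | pos

def unpack_plain_hit(hit):
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF

print(unpack_plain_hit(pack_plain_hit(True, 2, 9000)))  # (True, 2, 4095)
```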
4. “We use font size relative to the rest of the document because when searching, you do not want to rank otherwise identical documents differently just because one of the documents is in a larger font.” That is quite reasonable.
5. “Google counts the number of hits of each type in the hit list. Then every count is converted into a count-weight. Count-weights increase linearly with counts at first but quickly taper off so that more than a certain count will not help.” How is each count converted into a count-weight? And why do the weights follow such a curve? This does not seem to be based on the vector space model.
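The paper does not give the conversion, but any saturating curve matches the description: roughly linear for small counts, flat beyond some point, so repeating a word hundreds of times stops helping. A sketch with one assumed curve (similar in spirit to the term-frequency saturation later popularized by BM25):

```python
def count_weight(count, k=3.0):
    # saturating curve: near-linear for small counts, asymptotes to 1.0
    return count / (count + k)

for c in (1, 2, 5, 20, 100):
    print(c, round(count_weight(c), 3))  # 0.25, 0.4, 0.625, 0.87, 0.971
```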
6. “The hits from the multiple hit lists are matched up so that nearby hits are matched together.” How? And does Google only consider documents that contain all the words in the query? If not, the paper's description of multi-word search may not be clear enough.
Review: Google Paper
Google uses the following techniques to implement its search:
Page Rank: It utilizes citations and backlinks, giving unequal weights to links from different pages and normalizing by the number of links on a page.
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
d: damping factor; T1 … Tn: pages that point to page A; C(T): number of links going out of page T.
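A minimal sketch of the iterative computation this formula implies; the toy graph, damping factor, and iteration count are made up for illustration:

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over the pages T linking to A."""
    pages = set(links) | {t for outs in links.values() for t in outs}
    pr = {p: 1.0 for p in pages}          # start every page at rank 1
    for _ in range(iterations):
        nxt = {p: 1.0 - d for p in pages}
        for page, outs in links.items():
            for target in outs:
                nxt[target] += d * pr[page] / len(outs)  # len(outs) is C(T)
        pr = nxt
    return pr

# toy graph: B and C both cite A, and A cites B
print(pagerank({"B": ["A"], "C": ["A"], "A": ["B"]}))
```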
Anchor Text: This is the text of a link. It usually gives some description of the page the link points to, and hence is used in calculating page relevance.
Google System Anatomy
• Web Crawling – Done by distributed crawlers.
• StoreServer – Compresses and stores the crawled web pages in a repository.
• Indexer – Reads the repository, uncompresses the documents, and parses them. It converts each document into hits and stores them in barrels. It also parses and stores anchor text.
• URLresolver – Converts relative URLs in the anchors file into absolute URLs and docIDs.
• Sorter – Takes barrels, which are sorted by docID, and generates the inverted index (sketched below).
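A toy sketch of that inversion step: re-sorting forward-barrel records by wordID turns the docID-ordered forward index into a wordID-keyed inverted index. The record layout here is invented for illustration.

```python
from collections import defaultdict

# forward barrel: (docID, wordID, hits) records, naturally ordered by docID
forward_barrel = [
    (1, 42, ["title-hit"]), (1, 7, ["plain-hit"]),
    (2, 42, ["anchor-hit"]),
]

def invert(barrel):
    inverted = defaultdict(list)
    # re-sort by (wordID, docID) to turn the forward index inside out
    for doc_id, word_id, hits in sorted(barrel, key=lambda r: (r[1], r[0])):
        inverted[word_id].append((doc_id, hits))
    return inverted  # wordID -> doclist, what the searcher scans at query time

print(dict(invert(forward_barrel)))
```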
Data Structures
• Big Files – Virtual files spanning multiple file systems
• Repository – Contains full HTML of each page compressed using zlib
• Document Index – An ISAM index, ordered by docID, which stores all information about documents
• Lexicon – Consists of a list of words and a hash table of pointers
• Hit Lists - A hit list corresponds to a list of occurrences of a particular word in a particular document including position, font, and capitalization information
• Forward Index – Partially sorted and stored in barrels, each barrel holding a range of wordIDs.
• Inverted Index - For every valid wordID, the lexicon contains a pointer into the barrel that wordID falls into.
Searching
• Ranking System – It considers hits of different types (title, anchor, URL, plain text in a large font, plain text in a small font, etc.)
• Feedback – When the search algorithm is modified, feedback given by users is compared against the new results
This paper explains the architecture of Google, which is currently one of the major search engines in use. The technological aspects and specific details are well organized, and they explain why Google is so widely used compared to other search engines. The main issue addressed in this paper is the scaling of the web and the steps taken to handle it. The number of documents keeps increasing manifold, and handling them requires a high level of engineering; fast crawling technology and efficient storage of the indices and documents are a few examples. Google stores all the documents that it crawls in compressed form, using the zlib algorithm.
Features
It is clearly observed that high precision is more important on the web than recall. The two main things that allow Google to produce high precision are:
• Use of the link structure of the web to calculate a quality ranking.
• Utilization of links to improve the search results.
PageRank
PageRank is an objective measure of citation importance that goes hand in hand with people's subjective idea of importance. PageRank does not count links from all pages equally.
Anchor Text
Google gives importance to anchors since they often provide more accurate descriptions of web pages than the pages themselves. Moreover, anchors can exist for documents that cannot be indexed by a text-based engine, such as images, programs, and databases.
Apart from PageRank, Google also considers the location information of all hits, makes extensive use of word proximity and font size, and keeps the full raw HTML of pages.
Architecture
Google uses distributed crawlers for web crawling. Each crawled document is stored in the repository and then converted into a set of word occurrences called hits. These hits are distributed into barrels. The sorter then re-sorts the barrels by wordID to generate the inverted index. With this, for a given query, the top matching documents with higher PageRank values are returned very fast. The performance of the crawler is 48.5 pages per second, and that of the indexer is 54 pages per second, which is considered good.
The quote in their future work, “We do not expect this Future Work section to become much shorter in the near future”, depicts their urge to invent new things, and today the word Google is synonymous with “Search”!!
I have written a review rather than a critique...!!!
Below are some comments about the paper “The Anatomy of a Large-Scale Hypertextual Web Search Engine”.
The paper gives us an overview of the architecture of the Google system in its early days. Several things in the paper impressed me:
1. Traditional IR techniques cannot be used in the web scenario unless modifications are made.
2. Google makes very good use of the link information in web pages to improve search performance. This is quite reasonable: links and hypertextual navigation are the core characteristic of text-dominant web pages. Even on pages dominated by images or objects, link information, including anchor text, is still a very important feature.
3. The paper presents a lot of implementation details. I believe most of them have been improved along with the development of the web and of Google itself, but they cover almost all the important aspects a search engine should have, from web crawling, through parsing and indexing, to searching. On the other hand, these details are mostly technical: the data structures used for storage and indexing are illustrated, and even the empirical values of some parameters are discussed. All of this makes the paper read like an extended experiments section.
4. Thus, one question is: could we build a search engine as the paper describes? Perhaps we could construct a ‘toy’ from scratch, following the description in the paper. I call it a ‘toy’ because it might not be really useful, even though it could possess ‘everything’ Google has. The huge difference lies at a deeper level, where the idea is not as important as it sounds and engineering plays the major role in improving performance.
5. The mathematics used by Google is not so complex, and that is part of the reason many people can exploit the PageRank rule to promote their pages; some people even earn money that way. Usually it is not easy to distinguish this ‘cheating’ pattern from ‘right’ ones. One solution might borrow ideas from the TrustRank mechanism.
6. Another issue I am concerned about is topic-specific search on top of Google. There seem to be two approaches: one is to classify the user's query, feed it into the RIGHT topic-specific search engine, and return the results to the user; the other prepares DIFFERENT kinds of topic-specific search engines, feeds the user's query into all of them, and finally combines the results according to some ranking mechanism.
7. Different people tend to be interested in specific information. For example, students in a computer science department may be more interested in information about developments in computer science, so when they use Google, more results about that topic should be returned. Creating profiles for users is one way to realize such a goal: the user's search history could be recorded and analyzed by the search engine to provide a user-specific service. Open questions include what kind of information about the user should be recorded and how to process it. The user's privacy must be considered as well.
Anatomy of a search engine.
General Observations:
PageRank models web pages as a graph and models their importance by assuming a random-surfer model. This model gives an unconditional probability, whereas actual user behavior is conditional. Traditionally, many techniques have been established to capture conditional behavior (like Bayes networks). Could these techniques be useful for user-specific search? Does topic-sensitive PageRank absorb the conditional aspect, or is it just an approximation?
The overall ranking is decided by the vector-space-model-based rank, PageRank, the weight of anchor text, the weight of visual information, and the damping factor. These can be considered the tuning knobs of the search engine, and user- or domain-specific knowledge can be used to influence the system, to tackle the ‘one size fits all’ nature of PageRank. A sketch of this combination follows.
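A toy sketch of such a weighted combination; the signal names and weights here are invented “knobs”, not Google's actual ranking function:

```python
# hypothetical knobs; a real engine would tune these against relevance feedback
WEIGHTS = {"ir_score": 0.5, "pagerank": 0.3, "anchor": 0.15, "visual": 0.05}

def overall_rank(signals):
    """Combine per-signal scores (each assumed normalized to [0, 1]) into one number."""
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)

print(overall_rank({"ir_score": 0.8, "pagerank": 0.6, "anchor": 0.9, "visual": 0.2}))
```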
Ranking of docIDs in the doclist may help in finding hits early; does this mean the entire list is not actually considered before ranking?
Challenging tasks from a programming point of view: extracting links; handling the huge variation in important information (zip codes, email addresses, etc.); tackling potholes laid with adversarial intentions; designing data structures around disk size; and using tricks like borrowing length bits from the wordID.
Interesting facts from the architecture:
IO-operation speed has influenced the data structures and components much more than I expected before reading the details.
Pipelined processing is used very effectively in the design of the entire system, at the high level (crawler, storage, indexing, sorting) and at the component level too (forward and inverted barrels, the internal design of the crawler).
Wherever effective, redundancy is used to provide efficient performance (ignoring rare cases); this may be due to the cheap cost/space ratio the Google cluster has.
The entire system has a number of trade-off decision points, and selecting the optimum combination of all of them seems a difficult task, considering the whole search-engine architecture, the file system, and the underlying Linux cluster.
The Anatomy of a Large-Scale Hypertextual Web Search Engine
1. Overview
This paper presents Google, a prototype of a large-scale search engine that makes heavy use of the structure present in hypertext. Google maintains a large database, with at least 24 million pages indexed for efficient retrieval. Along with a high-level introduction to Google, the authors provide details for building large-scale search engines, including algorithms and data structures.
2. System Architecture
Google has separate components for the different functions of a search engine, mainly: web crawler, indexer, sorter, URL resolver, and searcher.
First, the web crawler crawls as many pages as it can and sends them to the store server, which compresses them and stores them in a repository.
Second, the indexer reads the repository, uncompresses the pages, and parses them. Each page is converted into hits; the indexer then distributes these hits into a set of “barrels”, creating the forward index. The indexer also extracts link information from pages and stores it in an anchors file.
Third, the URL resolver reads the anchors file and converts relative URLs into absolute URLs and then into docIDs. It also produces the link-structure information later used to compute PageRank.
Fourth, the sorter computes the inverted index.
Finally, the searcher uses the lexicon, the inverted index, and the PageRanks to answer queries.
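To make the URL-resolver step above concrete, a small sketch using the standard library's urljoin; the docID-assignment scheme here is an assumption for illustration:

```python
from urllib.parse import urljoin

doc_ids = {}  # absolute URL -> docID, assigned on first sight

def resolve(base_url, relative_url):
    absolute = urljoin(base_url, relative_url)        # "../a.html" -> full URL
    doc_id = doc_ids.setdefault(absolute, len(doc_ids))
    return absolute, doc_id

print(resolve("http://example.com/dir/page.html", "../other.html"))
# ('http://example.com/other.html', 0)
```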
3. System Features
1. PageRank
Google makes use of the link structure of the web to calculate a quality ranking for each web page with the PageRank algorithm. In this way, not only the content of a page but also the link structure of the web affects the result.
2. Anchor Text
While most search engines associate the text of a link with the page the link is on, Google associates it with the page the link points to, since anchor text often provides a more accurate description of a web page.
4. Interesting and Important Points
1. The search engine might return a page that never actually existed but had hyperlinks pointing to it.
2. Both time efficiency and storage efficiency are very important for Google. This key idea influences the design throughout the whole architecture.
3. In order to respond quickly, Google does not use information from user queries to refine results.
4. Barrels are used to keep pages partially sorted.
5. Could the inverted index be built directly?
6. While the size of the web increases at high speed, will the computation of PageRank remain applicable?
Challenges in Web Search Engines
1. Overview
With the fast development of the world wide web, web search engines are faced with a number of difficult problems in information retrieval. This paper raises many interesting and challenging points that have arisen in recent years.
2. Challenges
1. Spam
Techniques to achieve higher rankings:
1. Text-based approaches
2. Link-based approaches
3. Cloaking approaches
4. A combination of the above three
2. Content Quality
Identifying high-quality content, since the web is full of noisy, low-quality, unreliable, and indeed contradictory content.
3. Quality Evaluation
Evaluating the quality of different ranking algorithms.
4. Web Conventions
Using the conventions followed in web pages to improve search results.
5. Duplicate Hosts
Detecting duplicate hosts in order to avoid crawling and indexing duplicate and near-duplicate pages.
6. Vaguely-Structured Data
Using vaguely-structured data to improve the quality of the results.
3. Interesting and Important Points
1. Is there any possibility of making the search engine able to learn while working? It might then find interesting features of web pages which are not obvious to people.
2. For spam, in my opinion, it would be better to make the handling user-oriented, since some pages may be useless to most users while other users are quite interested in them.
Anatomy of Google Paper
1. Section 1.3.1: “... as of Nov 1997, only one of the top 4 commercial search engines found itself in the top 10 results”.
This raises an important point: the number of documents has drastically increased, but the user's ability to look beyond the top 10 results, and his patience, have not increased much. This illustrates the importance of precision over recall in the web IR scenario.
2. Section 1.3.2: Is what the authors describe as “academic search” a precursor to Google Scholar, and were they borrowing ideas at that point from CiteSeer?
3. Section 2.2: Anchor Text "... makes it possible to return web pages that have not actually been crawled."
If Google was the first to use this idea, it is a very novel one, because you don't even have to crawl some pages to retrieve them as hits. This saves a lot of time and other resources.
4. Section 3.1, paragraph 1: What do the authors mean when they say “reasonable results”?
To me, the result “Bill Clinton sucks” is just as relevant and important as “whitehouse.gov” if all I am trying to do is find people's views on Bill Clinton. The authors should either be more transparent about their notion of importance or not make statements about it that hang in the air.
Also important in this context is the example of hotels: when a user searches for a specific hotel on Google, invariably the first few results are third-party sites that make a commission for every click that takes a user to the hotel's website, while the actual website of the hotel itself is often way down the list.
5. Section 4.1, paragraph 4: “sorting is done in place”.
A lot of space is ostensibly saved by using in-place sorting.
6. Section 4.2.5, paragraph 2: “... all positions higher than 4095 are labelled 4096.”
This points to the fact that, as conceived, Google only tracked exact positions for roughly the first 4095 words of a document, so it is always better to put the things most relevant to your page near the beginning.
7. Section 4.2.6: "... forward index is actually already partially sorted" (by barrels)
This cuts down on the sorting time required and will increase the efficiency.
8. Section 4.2.6: “... each wordID is stored as the difference from the minimum wordID in the barrel”.
This is a very clever idea, because it saves a lot of bits that would otherwise be spent representing large numbers; the same idea is already used in multimedia for data compression and transmission.
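A tiny sketch of that delta encoding; the wordID values are made up:

```python
def encode_barrel(word_ids):
    base = min(word_ids)
    # deltas from the barrel minimum need far fewer bits than full wordIDs
    return base, [w - base for w in word_ids]

def decode_barrel(base, deltas):
    return [base + d for d in deltas]

base, deltas = encode_barrel([1_000_003, 1_000_017, 1_000_042])
print(base, deltas)            # 1000003 [0, 14, 39]
print(decode_barrel(base, deltas))
```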
9. Section 4.5.1: “... ranking function so that no particular factor can have too much influence”.
This is a necessary feature for handling the adversarial nature of the web and the people who would like to get to the top of Google's search results for commercial gain (GoogleKing).
10. Section 5, paragraph 1 and Figure 4: This shows the validity of point #3 (relying on anchor text), because an e-mail address result (which cannot be crawled) is returned that is actually quite relevant to the query.
The paper describes Google, a large-scale search engine that uses the PageRank scheme to rank its search results. The complete architecture and working of Google are presented in the paper in considerable depth.
The paper describes how traditional search engines had failed to keep up with users' expectations. The authors mention a very interesting fact: precision is more important than recall because of the restless user community.
It discusses how to produce better search results by exploiting the link structure of the web, an idea called PageRank. PageRank is a very interesting idea, and the paper also discusses how various factors affect it. Although the idea is very good, certain flaws have to be corrected:
--If a page is new, fewer pages will point to it, which results in a lower rank than an older, less important page that has a higher rank merely because more pages point to it.
--PageRank is sensitive to the trust of a page, and how one defines this trust matters a lot for the search results.
--The k words before and after an anchor text should be considered, because the anchor text alone does not describe the page completely.
--PageRank can be exploited by the user community to get a higher rank in the search results.
1. The authors mention the utility of page views in calculating the rank of a page, yet the algorithm itself does not use the number of page views. Clearly, directly obtaining the number of page views is currently not possible, and indirect methods have to be used.
2. The PageRank idea is based on the assumption that the link structure of the web is a good predictor of page importance. Using the number of page links as the major predictor of rank has many problems; several of these, like link farms, have been addressed by later papers. Also, with pages containing a lot of other types of data, like images, Flash, and videos, the HTML link structure alone is not enough to compute a good page rank.
3. The paper talks in detail about scaling properties and implementation details, since Google search is a centralized system. It would be interesting to see how a decentralized search system would work. (See Wikia Inc.'s search for work in this direction.)
4. In section 2.1.2, the authors mention that the PageRank algorithm models the behavior of a random surfer, and that the probability a random surfer visits a page is its PageRank. But in reality, user behavior varies very widely, and the probability that a user visits a page is determined by that particular user's preferences.
5. In section 3.2, the authors mention that Yahoo's home page must be treated very differently from a historical article that might receive one view every ten years, but the paper does not mention anywhere how the search engine actually achieves this. In fact, using page links has an inherent bias toward static pages on a single topic compared to transient pages about obscure topics, which have fewer links pointing to them.
6. The query-evaluation section mentions: “Scan through the doclists until there is a document that matches all the search terms”. So it seems the algorithm does not perform partial query matches. Even for full matches, it seems the same weight is given to each term (apart from proximity), without considering the difference in importance of each term. The paper's explanation of this part of the algorithm is insufficient.
7. The use of anchor text is a good idea and does help provide better results. However, it also induced problems, like Google bombing, that needed to be addressed.
8. Evaluating a search engine objectively is difficult, since relevance itself differs between individuals and circumstances. So even for the results in the paper, arguments can be made about their relevance. The authors just mention that “All of the pages are high quality pages” without saying what they mean by this.
Apart from the usual word matches, the Google search engine uses two other important features that improve the search results: PageRank and the concept of anchor text. PageRank resembles a model of user behavior: if a large number of pages point to a particular page, the page pointed to will have a high PageRank. This is intuitive, because when many links point to the same page it must have some importance, and this is reflected in the PageRank. The other feature Google uses is anchor text, the text on links, which often provides more accurate information about a web page than the page itself.
The authors also discuss the Google architecture and the data structures. The architecture consists of components such as the crawlers, indexer, barrels, sorter, and searcher. The system uses a sophisticated inverted index for storing the documents. The data structures have been designed with a large document collection in mind, and they strike a good balance between efficiency and coding complexity. The design decisions were driven by the desire to have reasonably compact data structures and the ability to fetch a record in one disk seek, so as to reduce the time lost to disk fetches. Another important module is the lexicon, which resides entirely in main memory and is implemented in two parts: a list of words and a hash table of pointers.
The good points…
PageRank gives varied importance to different links on a page.
The data structures used are efficient both in terms of space and time. Though complex, every bit of space is used efficiently, and the structures are designed so that a record can be accessed with one disk seek. This can make a lot of difference.
Sorting the documents based on their checksums improves lookup efficiency.
WordIDs are stored as their relative distance from the minimum wordID in the barrel. This can save a lot of space.
It uses short barrels and long barrels in such a way that more precise pages can be obtained by searching just the short barrel, as it stores the important anchor and title hits.
Which could have been better…
Google stores all the actual HTML documents. Storing all the documents in the world is not a viable task; instead of storing the entire document, why not just store the critical parameters and keywords from the document in a structure?
The random-surfer model, though good, is not very effective. The model assumes the user never clicks the back button, but in reality most users click the back button more than anything else. A single-valued damping factor cannot help much in this scenario; it should be more like a Gaussian distribution, with the higher and mid values assigned to parent pages and other important links on the page, and the low-end values of the Gaussian assigned to the other links.
With the advent of context-based ads, recognizing the relevant documents becomes even more difficult. Hard-coded links and links within the frame of the keyword hits can be considered relevant links and assigned higher weights, while the lesser links and conceptual links can be assigned values near the tail of the Gaussian curve.
Observation
Concepts like page ranking and anchor text were not new; bringing all these existing concepts together is what resulted in this successful search engine. Yeah, I love Google ;-)
Review of the Google Paper
Bhushan Pendharkar
ASU ID 993934582
Being an ardent fan of Google, I read the paper with great expectations, and the paper did live up to them (although it was a bit long). I am quite surprised that it was rejected at the conference.
To start with, the paper is presented in such a manner that even a person who does not have in-depth knowledge of search engines can get some idea of the working and functionality of Google. To show the success of their work, the authors document the history of search engines well and compare it to their own work. The vivid features of their system, especially the PageRank technique which forms the crux of the system, are explained in lucid language. However, it would have been better if the concept of anchor text had been introduced with a definition. An important highlighted aspect of the Google system is that, apart from the frequency of occurrence of words, Google considers their font size, assigning more weight to words in a larger or bolder font.
The overview of the Google architecture is presented quite well, but it does not help a reader who wants more details; it is literally an overview. One of the most striking features observed is the variety of data structures used by Google; every data structure specified has almost equal importance in the overall working of the system.
The remaining concepts of crawling, indexing, searching, results, storage requirements, and system performance are described very well. This paper gives a complete and elaborate picture of the Google search engine, and it can always serve as a good reference for someone who wishes to build a small search engine specific to a particular domain or institution.
I am quite sure that if the Google founders were to present a paper on the “present Google system”, they would receive high accolades, since Google now performs not only document search but also image, map, video, and news search.
HATS OFF TO GOOGLE….!!!!