Friday, January 19, 2007

Some interesting readings + a discussion topic

Here are two interesting (not-fully-technical) readings:

http://www.wired.com/wired/archive/12.10/tail.html
 (This discusses the long tail phenomenon on the web.)

http://rakaposhi.eas.asu.edu/prec-recall-rev.pdf
 (This applies the ideas of precision and recall to conference paper reviewing
  and, in so doing, strengthens your intuitions about precision/recall.)

========================

Here is something to think about.  We noticed that information retrieval uses
precision/recall measures to evaluate systems.

Database systems, on the other hand, do not care about precision/recall.

Why don't they? 

Should they?

When should they?

Feel free to add your comments to the blog.

Rao



8 comments:

Zheshen(Jessie) said...

In a general database system, the data is structured and well organized, and the query used for search is also structured (like SQL), so the results are fully determined, although unknown before searching. In other words, the "feature" used to compare the query against the data and the definition of "relevance" are both explicit. So both precision and recall should be 1, given enough time.

But if some unstructured or semi-structured data is included in the database (such as a database of 100 untagged images, where the query is about the contents of each image), in other words, when the "feature" or "relevance" or both are ambiguous, then the database system should be evaluated with precision and recall.
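To make the contrast above concrete, here is a minimal Python sketch. The toy `cars` table, the `predicate`, and the `user_relevant` set are all made-up illustrations, not anything from the readings. When relevance is defined by the query predicate itself, precision and recall are both 1; once relevance comes from a separate user judgment, the same answer set can score lower.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a returned set against a relevant set."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    # Convention assumed here: an empty denominator counts as a perfect score.
    precision = hits / len(returned) if returned else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall


# Hypothetical toy table of used cars.
cars = [
    {"id": 1, "make": "Honda", "year": 2004},
    {"id": 2, "make": "Honda", "year": 1999},
    {"id": 3, "make": "Toyota", "year": 2004},
    {"id": 4, "make": "Acura", "year": 2005},
]


def predicate(car):
    """The structured query: make = 'Honda'."""
    return car["make"] == "Honda"


# Structured case: the relevant set is, by definition, whatever satisfies the predicate.
exact_answer = {c["id"] for c in cars if predicate(c)}
relevant_by_definition = {c["id"] for c in cars if predicate(c)}
structured_scores = precision_recall(exact_answer, relevant_by_definition)

# Ambiguous case: relevance is an (assumed) user judgment that the query only approximates.
user_relevant = {1, 4}
ambiguous_scores = precision_recall(exact_answer, user_relevant)

print(structured_scores)  # (1.0, 1.0)
print(ambiguous_scores)   # (0.5, 0.5)
```

The database code is identical in both cases; only the definition of "relevant" changes.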

oneuponzero said...

As Zheshen said, database systems are precise because of the inherent nature of their querying, i.e., structured queries (unlike web search, where imprecise queries must be handled). They can also be extended to handle degrees of precision for tagged/heterogeneous data.

As for recall, databases are expected to return each and every relevant tuple; the user waits for the results of the query.

But for systems where we need super-fast answers (impatient users), we may settle for some degree of recall and return only a few of the tuples
(or the top k ranked tuples found in the time available for retrieval).
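The trade-off described above can be sketched as recall measured at a cutoff k. This is only an illustration: the ranked tuple ids and the relevant set below are invented, and `recall_at_k` is a hypothetical helper name.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant tuples that appear among the first k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0  # assumed convention: nothing to find means nothing was missed
    top_k = ranked_ids[:k]
    return len(relevant.intersection(top_k)) / len(relevant)


# Hypothetical ranking of tuple ids, best first, and the tuples that are truly relevant.
ranked = [7, 3, 9, 1, 4, 8]
relevant = {3, 7, 4, 8}

# Returning fewer tuples is faster but gives up recall.
recalls = [recall_at_k(ranked, relevant, k) for k in (1, 2, 4, 6)]
print(recalls)  # [0.25, 0.5, 0.5, 1.0]
```

Only when the full result list is returned (k = 6 here) does recall reach 1, which is the behavior a traditional database promises.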

Aravind Krishna K said...

The discussion in the second paper is interesting. It reminds me of a paper I read some time ago that discusses the issues in conference paper reviewing. (This might be a digression!) The relevant parts are the authors' comparisons of conference reviewing to university admissions, spam filtering, Amazon's product ratings, etc.
Overall, that paper is fun to read; after all, the conference itself was named FUN 2004 :-P (3rd International Conference on Fun with Algorithms)

Title: How to increase acceptance ratio of top conferences?
- Graham Cormode, Artur Czumaj, S. Muthukrishnan

Link: http://www.cs.rutgers.edu/~muthu/ccmfun.pdf

VJ said...

Probably an important reason we need precision and recall for information retrieval on the web is that, if we treat the whole web as a database, it does not have a well-defined structure that can be queried. Even if it did have a well-defined structure, the user would never understand how to communicate effectively with the database. Imagine everyone learning some kind of SQL for the web. A user typically queries with 1 or 2 words and expects the right answer each time.

To some extent, I feel that as IR systems get smarter and smarter, the number of words in the query will also shrink, thereby raising the expectation that the IR system will perform better magic.
In the case of a database, by contrast, the user knows exactly how to communicate with it. Image and video retrieval from a database could be an area where precision and recall might be needed.

VJ

Subbarao Kambhampati said...

Here are a couple of (additional) angles to consider:

--> Who were the original users of databases and who are they now (that the databases are widely accessible via web)?

--> How does this change the nature of database queries?

Aravind Krishna K said...

Originally, databases were mainly used internally in companies, universities, and financial institutions. The users were employees who used them to manage information efficiently, run transactions over it, and analyze it statistically. In those days, traditional databases with structured query languages like SQL were expected to have full precision and recall.

But with the advent of the WWW, web databases and autonomous data sources are becoming popular. As a result, the use of databases by the lay user community has increased drastically. Examples: online movie databases, used car databases, etc. This introduces uncertainty, imprecision, and incompleteness into database querying.

The issue of relevance to the user comes into the picture, just as it does when searching for documents with search engines. Thus, database systems now need precision and recall for their evaluation.

PS: Doesn't precision also carry a degree with it? I mean, the degree of relevance (the level of precision) is also important, since ranking is a major issue in querying, right?

Yang Qin said...

In my opinion, both precision and recall depend on the user's perspective. The Internet today is still a data-centric entity, not a user-centric one. Although queries on well-structured database systems can get results with 100% precision and recall, these low-level operations can be done only by gurus, not by normal users. As long as there is somebody who cannot use languages that machines understand precisely, the ambiguity exists. In that case, some kind of intermediary between the users and the machines has to be used; hence, precision and recall are still necessary issues.
In conclusion, for low-level usage of database systems, precision and recall perhaps do not make any sense, but for high-level usage, they are necessary.

Subbarao Kambhampati said...

I think Yang Qin hit the nail right on the head. With the advent of the web, the "normal users" of databases are unwashed masses like you and me--the same dunderheads who write keyword queries averaging 2.1 words. They are unlikely to know enough to exactly specify the query that they have in their head. This means "relevance" cannot be judged just in terms of exact match with the query, and we need to think of precision/recall measures just as we do in IR.

(If you are interested, look at
http://rakaposhi.eas.asu.edu/quic-short.pdf
)

--Rao