Thursday, January 25, 2007

Discussion Topic: (mis)intuitions about high-dimensional space..

Here are two interesting facts.

1. High-dimensional apples are all peel and no core. Specifically, consider spherical apples of
n dimensions. Suppose you were to peel an epsilon-thick skin off the apple; what fraction of
the apple is left? If you consider unit-radius apples (spheres), you can show that
the fraction of apple left is about 97% in 3-D, but only about 0.004% in 1000-D (i.e., thousand-dimensional space). (You can
assume that the volume of an n-dimensional sphere is O(r^n), where r is the radius.)


2. In high dimensions, any randomly chosen pair of vectors is, with very high probability, (nearly) perpendicular
to the other.
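Fact 1 is easy to check numerically. Here is a minimal sketch (ε = 0.01 is an assumed skin thickness, chosen to match the 97%/0.004% figures above):

```python
# Fraction of an n-dimensional unit-radius apple left after peeling off
# an eps-thick skin: volume scales as r^n, so the core fraction is (1 - eps)^n.
def core_fraction(n, eps=0.01):
    return (1 - eps) ** n

for n in (3, 100, 1000):
    print(n, core_fraction(n))
# 3-D keeps about 97% of the apple; 1000-D keeps about 0.004%.
```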

You are welcome--on the blog--to (a) try to prove these facts
and/or (b) discuss why either of them is relevant to information retrieval using vector-space
ranking.

Rao

ps: If intellectual adventure/stimulation is not enough of an incentive to post on the blog,
    note that blog participation is viewed as class participation--and part of your grade for this course will be participation-based.



5 comments:

VJ said...

Suppose the peel has constant thickness ε.
Let us assume ε is a constant value of 0.01, r is 1, and k is a constant (so the volume is k·r^n).

Core = k(1 − ε)^3 = 0.99^3 ≈ 0.97 = 97%
Peel = k(1^3) − k(1 − ε)^3 ≈ 0.03 = 3%
For 3-D the core is pretty big.
But as the value of n (i.e. the number of dimensions) increases,
Peel >> Core
For n = 1000:
Core = k(1 − ε)^1000 = 0.99^1000 ≈ 4.3 × 10^−5 = 0.004%
Peel ≈ 99.996%, which is very large.
This is relevant to information retrieval in the sense that the vector space has very high dimension: the English dictionary lists about 991,207 words (http://www.languagemonitor.com/wst_page7.html), which is huge. The problem with such a huge dimension space is that it does not really make sense to compare similarities when their values are very high.

e.g., if the similarity of doc1 with Q1 is 1, and the similarity of doc2 with Q1 is 5, then it can be said that doc2 is more similar to Q1.
But if the similarities were 10,000 and 10,001, then the distinction loses its meaning.


When we randomly choose a pair of vectors and compute the cosine similarity between them, the vectors will be pretty sparse. The normalized cosine similarity is then approximately equal to zero, so the vectors are more or less perpendicular to each other.
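This sparsity argument can be simulated. A rough sketch, where the dimension, number of nonzeros, and trial count are all arbitrary assumptions:

```python
import random

def sparse_vector(dim, nnz, rng):
    """A random vector with nnz nonzero (positive) coordinates out of dim."""
    v = [0.0] * dim
    for i in rng.sample(range(dim), nnz):
        v[i] = rng.random()
    return v

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

rng = random.Random(0)
dim, nnz, trials = 10000, 50, 200
avg = sum(cosine(sparse_vector(dim, nnz, rng),
                 sparse_vector(dim, nnz, rng))
          for _ in range(trials)) / trials
print(avg)  # very close to 0: most pairs share no nonzero coordinate
```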

VJ

Aditya Kanitkar said...

Q2)

In n-dimensional space, let:
A = (a1, a2, a3, ..., an)
B = (b1, b2, b3, ..., bn)

The cosine of the angle between them is (a1b1 + a2b2 + ... + anbn)/(|A||B|).
When this product is 0, it means the vectors are orthogonal to each other.

Hence, to be non-perpendicular to each other, both vectors must have a non-zero component on at least one common axis. (So vectors in 1 dimension can never be perpendicular to each other.) As we keep increasing the number of dimensions, the chance of the vectors having components on the same axes decreases. Moreover, even when the dot product is not exactly 0, it will often be close to 0, signifying that the vectors are almost orthogonal. Hence, in a very high-dimensional space, more often than not, vectors will be (nearly) orthogonal to each other.
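A quick simulation of this concentration, using dense random ±1 vectors (the ±1 coordinate distribution is just an assumption for illustration):

```python
import random

def avg_abs_cosine(n, trials, rng):
    """Average |cos(angle)| between random pairs of +/-1 vectors in n dimensions."""
    total = 0.0
    for _ in range(trials):
        u = [rng.choice((-1, 1)) for _ in range(n)]
        v = [rng.choice((-1, 1)) for _ in range(n)]
        dot = sum(a * b for a, b in zip(u, v))
        total += abs(dot) / n  # |u| = |v| = sqrt(n), so cosine = dot / n
    return total / trials

rng = random.Random(1)
for n in (3, 100, 10000):
    print(n, avg_abs_cosine(n, 200, rng))
# The average |cosine| shrinks roughly like 1/sqrt(n) as n grows.
```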

Assume that we are using the vector model for answering queries. In a high-dimensional space the similarity between documents is reduced, affecting the quality of our results.

Aditya

Yang Qin said...

for (1), simple: (r − ε)/r is smaller than 1, so as n -> infinity, ((r − ε)/r)^n gets closer to 0.
for (2), as the others said, the dot product is likely to be 0.
in fact, it's quite noticeable that the more keywords in your query, the fewer results you will get.

Aravind Krishna K said...

The discussion gives us an intuitive understanding of the 'curse of dimensionality' problem (how learning becomes difficult when tons of features are needed to represent a vector; here the vectors are document vectors and the dictionary of words is the feature space).

- From an IR point of view, this motivates the need for dimensionality-reduction techniques or better feature-selection methods, rather than building the vector space with each word as a unit, in document (and query) representation.

feelingbird said...

Problems in high-dimensional spaces become rather complex, since the number of possible solutions increases exponentially. The best way is to reduce the number of dimensions under consideration, and many methods can be used, such as MDS, SVD, etc. However, many questions must be considered in such a process. For example: what criteria should be used for the reduction, how many dimensions should be removed, and how many should be left? It seems there is no answer unless you know the specific problem you want to solve very well (a kind of heuristic).
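As a concrete instance of the SVD route mentioned above, here is a minimal sketch; the tiny term-document matrix is made up purely for illustration:

```python
import numpy as np

# A toy term-document matrix: 5 terms (rows) x 4 documents (columns).
A = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 2., 0., 1.],
              [0., 0., 3., 1.],
              [1., 0., 1., 2.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                        # keep only the top-2 "concepts"
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-2 approximation of A

# The Frobenius error equals the energy in the dropped singular values.
print(np.linalg.norm(A - A_k), np.sqrt((s[k:] ** 2).sum()))
```

This truncation is the idea behind Latent Semantic Indexing: documents and queries are compared in the reduced k-dimensional concept space instead of the full word space.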