Tuesday, February 16, 2010

(re-sent) [Thinking Cap] on Latent Semantic Indexing

[Sorry--the previous version got sent prematurely. Here is the correct one]

0. We have 100 documents that are all described as vectors in the space of 1000 keywords. What is the largest number of non-zero singular values this document-term matrix can have?
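This can be checked numerically. Below is a minimal numpy sketch (the random matrix and sizes are illustrative, not from the post): a 1000×100 term-document matrix has rank at most min(1000, 100) = 100, so at most 100 non-zero singular values.

```python
import numpy as np

# Hypothetical setup: 100 documents in a 1000-keyword space.
rng = np.random.default_rng(0)
A = rng.random((1000, 100))  # term-document matrix (terms x docs)

s = np.linalg.svd(A, compute_uv=False)
# rank(A) <= min(1000, 100) = 100, so at most 100 non-zero singular values;
# a random matrix is full rank with probability 1, so we hit the bound here.
print(np.sum(s > 1e-10))  # 100
```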

1. Suppose we have two documents d1 and d2 whose cosine similarity in the original space is 0.6. What is their cosine similarity in the factor space (i.e., the df*ff representation) if:
   1.1. We decide to retain *all* dimensions
   1.2. We decide to retain just one dimension
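A small numpy sketch of both sub-cases (the matrix below is random and purely illustrative). With A = U S V^T and documents as columns, the df*ff representation of the documents is S V^T; since U is orthonormal, keeping all dimensions preserves every pairwise cosine exactly, while keeping one dimension collapses each document to a scalar, forcing the cosine to ±1.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((1000, 100))  # hypothetical term-document matrix (terms x docs)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
F = np.diag(s) @ Vt          # documents as columns in factor space (df*ff)

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# 1.1: all dimensions retained -- cosines are unchanged
print(np.isclose(cosine(A[:, 0], A[:, 1]), cosine(F[:, 0], F[:, 1])))  # True

# 1.2: only the top dimension retained -- each doc is a scalar,
# so the cosine collapses to magnitude 1 (sign depends on the projections)
g1, g2 = F[:1, 0], F[:1, 1]
print(abs(cosine(g1, g2)))
```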

2. We considered the "malicious oracle" model, where the true documents were distorted by (a) introducing fake terms corresponding to linear combinations of real terms, (b) adding noise to these fake terms (so they are not exact linear combinations), and (c) removing the original "true" terms. To what extent is LSI analysis able to recover the original data thus corrupted? Speak specifically to how LSI handles parts (a), (b), and (c).
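The corruption steps can be simulated directly; the sketch below is one hypothetical instance (all sizes, noise level, and variable names are made up). The fake terms (a) leave the rank unchanged, the noise (b) only adds small singular values that truncation discards, but the dropped true terms (c) mean LSI can at best recover the structure in the fake-term coordinates, not the original vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
true = rng.random((20, 5)) @ rng.random((5, 100))   # rank-5 "true" term-doc data

mix = rng.random((30, 20))                          # (a) fake terms = linear combos
fake = mix @ true                                   #     of real terms; rank still 5
fake += 0.01 * rng.standard_normal(fake.shape)      # (b) small noise breaks exactness
corrupted = fake                                    # (c) original 20 term rows dropped

s = np.linalg.svd(corrupted, compute_uv=False)
# A clear spectral gap after the 5th singular value: truncating there recovers
# the rank-5 subspace (a) and suppresses the noise (b), but only in the
# fake-term basis -- the removed original terms (c) cannot be reconstructed.
print(s[4] / s[5] > 5)
```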

3. We have documents (data) that are described in the space of 3 keywords. It turns out that the specific documents we have all fall on a 2-D plane. What does LSI on this data tell you?
   3.1. In the previous case, suppose the data that we have forms a 2-D paraboloidal surface. What does LSI do for this case?
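Both cases can be sketched with synthetic data (the points below are randomly generated for illustration). For the plane, SVD finds exactly 2 non-zero singular values; for the paraboloid, the surface is still intrinsically 2-D but curved, so a linear method like LSI sees full rank 3 and cannot flatten it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# 200 hypothetical docs in a 3-keyword space, confined to a 2-D plane through
# the origin (SVD subspaces pass through the origin; an offset plane would
# first need mean-centering)
basis = rng.random((3, 2))
plane_docs = basis @ rng.random((2, n))        # 3 x n, rank 2
print(np.linalg.matrix_rank(plane_docs))       # 2: LSI detects the plane

# Same 2-D intrinsic dimension, but curved: z = x^2 + y^2
x, y = rng.random(n), rng.random(n)
parab_docs = np.vstack([x, y, x**2 + y**2])    # 3 x n
print(np.linalg.matrix_rank(parab_docs))       # 3: curvature defeats linear LSI
```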


1 comment:

  1. Question 0: The largest number of non-zero singular values would be 100, because the documents are represented by a 1000×100 document-term matrix, whose rank is at most 100.

