Saturday, February 27, 2010

Learning to Rank challenge by Yahoo...

Though over 100 papers have been published in the learning to rank (LTR) field, most of the large-scale, real-world datasets are not publicly available. This makes drawing comparisons between algorithms difficult.

In the spirit of changing this, Yahoo! is hosting the Learning to Rank Challenge. We'll offer up two of our never before released actual datasets. These datasets—used for learning our search ranking function—can only be accessed through participation in the competition.

This exciting machine learning challenge will consist of two tracks: the first is a standard LTR track and the second is a transfer-learning track. Both tracks are open to all external research groups in academia and industry.

Here is the link

Friday, February 26, 2010

Using map reduce in a declarative paradigm to do tf-idf

I found this very nice post on how to do TF-IDF very quickly using a declarative language called Pig. The main benefit is that Pig integrates well with Hadoop and MapReduce. Anyone who knows how mechanically tedious it is to write a Java program for MapReduce on Hadoop will appreciate this.
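For anyone curious what such a script does under the hood, here is a small Python sketch (Python standing in for Pig; the toy corpus and names are made up for illustration) that mimics the two map-reduce phases: a "map" phase counting term frequencies per document, and a "reduce" phase aggregating document frequencies to weight them.

```python
import math
from collections import Counter, defaultdict

# Toy corpus standing in for the documents a Pig script would load.
docs = {
    "d1": "the quick brown fox".split(),
    "d2": "the lazy dog".split(),
    "d3": "the quick dog barks".split(),
}

# "Map" phase: per-document raw term counts.
tf = {d: Counter(words) for d, words in docs.items()}

# "Reduce" phase: document frequency of each term across the corpus.
df = defaultdict(int)
for d, counts in tf.items():
    for term in counts:
        df[term] += 1

# Combine: tf-idf weight = tf * log(N / df).
N = len(docs)
tfidf = {
    d: {t: c * math.log(N / df[t]) for t, c in counts.items()}
    for d, counts in tf.items()
}
```

Note how a term appearing in every document (like "the" above) gets weight zero: log(N/N) = 0, which is exactly the discrimination effect idf is after.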

Thursday, February 25, 2010

on the "uniqueness" of Singular value decomposition (answer to the blog question)

There is a question on the blog as to whether SVD is guaranteed to be unique for a given matrix.

dt= df * ff * tf'

It is easy to see that ff is unique--its diagonal entries are just the square roots of the eigenvalues of dt*dt' (or dt'*dt).

The question is about df and tf'. Their "uniqueness" depends on whether or not the singular values are all distinct (which in turn depends on whether the eigenvalues of dt*dt' are distinct).
(Recall that a square matrix can have repeated eigenvalues--which happens when its characteristic equation has a repeated factor, say (x - l)^k; then we have k roots at l, and the eigenvalue l is said to have multiplicity k.)

When the singular values are all distinct, then df and tf' are unique *up to* the sign of the vectors (that is, their magnitudes are unique, but each sign can be either positive or negative--as long as the signs in df and tf' are flipped consistently).

When the singular values are degenerate, then df and tf' are no longer unique.

The lack of uniqueness doesn't however affect any of the low-rank approximation properties of SVD we cared about in doing LSI.
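To make this concrete, here is a small numpy sketch (numpy standing in for the MATLAB we used in class; the matrix is made up): the singular values are pinned down by the eigenvalues of dt'*dt, while df and tf' tolerate consistent sign flips without changing the product.

```python
import numpy as np

dt = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # a made-up document-term matrix
df_, ff, tfT = np.linalg.svd(dt, full_matrices=False)  # dt = df * diag(ff) * tf'

# ff is determined by dt alone: the singular values are the square roots
# of the eigenvalues of dt'*dt (eigvalsh returns them in ascending order).
eigvals = np.linalg.eigvalsh(dt.T @ dt)[::-1]
assert np.allclose(ff, np.sqrt(eigvals))

# df and tf' are only unique up to consistent sign flips: negating a column
# of df together with the corresponding row of tf' leaves dt unchanged.
S = np.diag([1.0, -1.0])            # flip the sign of the 2nd singular pair
dt2 = (df_ @ S) @ np.diag(ff) @ (S @ tfT)
assert np.allclose(dt, dt2)
```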


ps: Here is a longer technical description on SVD uniqueness:

Wednesday, February 24, 2010

Please participate by tomorrow morning (if you plan to)...Fwd: CSE 494/598 Class Survey: Please participate..


 If you intend to take part in this survey, please do so by tomorrow (Thursday) morning at the latest. I plan to close the survey and digest the comments after that.


---------- Forwarded message ----------
From: Subbarao Kambhampati <>
Date: Tue, Feb 23, 2010 at 1:51 PM
Subject: CSE 494/598 Class Survey: Please participate..
To: Rao Kambhampati <>

Here is a brief class survey. This is your chance to provide anonymous feedback.


Tuesday, February 23, 2010

SVD queries

Dear Dr. Rao

I got a little confused when I started computing:
  1. df from [df ff tf]=svd(dt)
  2. df from [df dl]=eig(dd)

Also see your pdf of SVD MATLAB.

Why do these two answers differ in magnitude as well as in sign?

In practice, shouldn't both be exactly the same?

Also, I found on Wikipedia that
A=U S V1
A=U S V2
Does this mean SVD is not unique?
Can you elaborate on these two points here?
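The discrepancy described here can be reproduced in a few lines of numpy (numpy standing in for the MATLAB session; the matrix is arbitrary): when the eigenvalues of dd are distinct and both routines normalize their vectors, the two answers agree up to the sign of each column.

```python
import numpy as np

dt = np.array([[2.0, 0.0], [1.0, 2.0]])  # an arbitrary document-term matrix
df_svd, ff, tfT = np.linalg.svd(dt)      # 1. df from svd(dt)

dd = dt @ dt.T
vals, df_eig = np.linalg.eigh(dd)        # 2. df from eig(dd); eigh sorts ascending
df_eig = df_eig[:, ::-1]                 # reorder to match svd's descending order

# The singular values squared are exactly the eigenvalues of dd...
assert np.allclose(ff ** 2, vals[::-1])
# ...and the eigenvectors agree column by column, but only up to sign.
assert np.allclose(np.abs(df_svd), np.abs(df_eig))
```

Both numpy routines return unit-length vectors, so if the magnitudes also differ in a hand computation, the likely culprit is that the eigenvectors were not normalized.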

[Thinking cap] on anchor text/page importance etc. (with a carrot about homework 2 deadline extension)

Some of you have asked for an extension on the homework 2 due date. Since the requester set includes a non-zero number of the "rain-or-shine" brigade (i.e., those who actually show up to class regularly), I am willing to extend the deadline to next Tuesday (2nd March).

Here, however, is the catch. Since the UNCF says a mind is a terrible thing to waste, I want you to keep yours busy by thinking about (and *commenting on*) the following thinking-cap questions:

1. We said that the relative importance of the anchor text characterizations of a page P depends on how many other pages are pointing to P with that characterization. How should the "number of pages" be taken into account? Should the "type of pages" somehow matter? (i.e., does it matter *which* page is saying those things about page P?) If so, how do you propose it should be taken into consideration?

2. We had a slide about the page importance desiderata. Comment on
  2.1. To what extent are each of those desiderata actually subsumed by link-based analysis?
  2.2.  In the old days, we used to put links to various pages because that was the easiest way to get back to them when you need to. Now, with search engines getting more and more effective, there is not as much of a reason to put links to each other. How does this affect the utility of link-based analysis techniques for finding page importance?
  2.3. Can you give some examples of how current day search engines actually handle notions of importance that are not strictly subsumed by link-based analysis?

3. [The "I finally had an orgasm, but my doctor said it was the wrong kind" question]: At the start of the IR discussion, we said what we are trying to compute is the "relevance" of a document d, given a user U and query Q. We then decided to approximate the relevance by a similarity computation between the document and the query (and spent the intervening weeks getting deeper and deeper into how best to compute this similarity). Now that we have decided to throw in the notion of page importance, do you think this should still be seen as a part of relevance (just a more accurate computation of relevance..) or is it some other orthogonal dimension? (extra credit: How is the orgasm quote related to this question?)

4. [The "The woods will be silent indeed, if no birds sang except those that sing the best" flame war]: Suppose you search Google to find the exact quote and source of the bird quote by Henry van Dyke; you get frustratingly many minor variations of the quote (including one by yours truly, which attributes it to Henry David Thoreau!). It looks as if letting the unwashed masses put up web pages is leading to all sorts of inaccurate information. Don't you think it would be simpler to go back to the peace-and-quiet of the age of poll-taxes and control web page creation? (I know this is beginning to look suspiciously like an SAT essay prompt.... you can also focus on whether life will be more or less interesting for CS folks if society were to go to this model.)


CSE 494/598 Class Survey: Please participate..

Here is a brief class survey. This is your chance to provide anonymous feedback.


project 1 blues..

The energy level in today's class seemed to be about the same as that in James Ray's Sedona sweat lodge ---
too many sleepy and absent faces.

I thought a bit of morale boosting was in order, so here goes (only if you need it):

1. If you are unable to complete project 1, and are contemplating dropping the course because of that, you should get in touch with me first.

2. While it is only reasonable for you to have great anxiety about this class, I prefer that it not be solely grade anxiety.. It might help you to note that there are no fixed cut-offs for letter grades in this class; do not assume the usual 90-80-70 system.  (To the extent I have a choice of making you sweat during the semester vs. after the semester, I prefer the former.)

3. Take part in the mr poll survey you will be getting in  a bit and express any suggestions you may have about the class.


delay in class ending today

My apologies to all those who were inconvenienced by our going over time; apparently my quixotic venture to make the class clock align with the national atomic clock went awry somehow. I will try to avoid that in the future.


ps: While we are on the subject of class ending time, we should all remind ourselves that the beginning time is *10:30* ... :-b

Thursday, February 18, 2010

a good lecture on SVD (if you haven't done it in your linear algebra class)

You might want to check out this lecture on SVD from Gilbert Strang's course on linear algebra, if you want to get a good background on the development.


Tuesday, February 16, 2010

(re-sent) [Thinking Cap] on Latent Semantic Indexing

[Sorry--the previous version got sent prematurely. Here is the correct one]

0. We have 100 documents that are all described as vectors in the space of 1000 keywords. What is the largest number of non-zero singular values this document-term matrix can have?

1. Suppose we have two documents d1 and d2 whose cosine similarity in the original space is 0.6. What is their cosine similarity in the factor space (i.e., the df*ff representation) if
   1.1. we decide to retain *all* dimensions?
   1.2. we decide to retain just one dimension?

2. We considered the "malicious oracle" model where the true documents were distorted by (a) introducing fake terms corresponding to linear combinations of real terms (b) adding noise to these fake terms (so they are not exact linear combinations) and (c) removing the original "true" terms. To what extent is LSI analysis able to recover the original data thus corrupted? Speak specifically of how LSI handles a, b and c parts. 

3. We have documents (data) that are described in the space of 3 keywords. It turns out that the specific documents that we have all fall on a 2-D plane. What does LSI on this data tell you?
   3.1. In the previous case, suppose the data that we have forms a 2-D paraboloidal surface. What does LSI do in this case?
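For question 1, a small numpy check (a sketch with a made-up document-term matrix, not part of the homework) illustrates both sub-cases: retaining all dimensions leaves cosine similarity unchanged, because tf' is orthonormal and merely rotates the space, while retaining one dimension collapses every document onto a single axis.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy document-term matrix: rows are documents d1, d2 in a 4-term space.
dt = np.array([[1.0, 0.0, 2.0, 1.0],
               [0.0, 1.0, 1.0, 2.0]])
df_, ff, tfT = np.linalg.svd(dt, full_matrices=False)
factor = df_ @ np.diag(ff)   # the df*ff representation of the documents

# Retaining all dimensions: cosine similarity is exactly preserved.
assert np.isclose(cosine(dt[0], dt[1]), cosine(factor[0], factor[1]))

# Retaining just one dimension: each document becomes a single number,
# so any two documents have cosine similarity +1 or -1.
one_d = factor[:, :1]
assert np.isclose(abs(cosine(one_d[0], one_d[1])), 1.0)
```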


Project 1 - Both TF based and TF-IDF based similarities are required

Hi all,
  There was some misunderstanding regarding the similarity computation needed for the project. The project description has now been changed to state that you need to do both the raw term frequency based similarity and the tf-idf based similarity. You can now also attempt an extra credit part where you evaluate the effectiveness of tf-idf scheme.

Here is a direct link to the project page.

Thanks and Regards,
Sushovan De

Solutions for homework 1 (along with marks distribution) posted


Homework 2 posted; due Feb 25th.

Homework 2 is now available and will be due next Thursday.

Annotated matlab session available at the following URL

An annotated version of the  matlab session I did in the class--with annotations/explanations in red--is now available at  

I also linked it from the lecture notes page.

Make sure you look at it and understand what is going on..

I will start the homework 2 socket and add problems for correlation analysis and LSI; you should try to work on them to make sure you understand.


Homework 1 stats

Hi all,
  Homework 1 grades are out. The average score is 45.9 out of a possible 60. The standard deviation is 14.2.

  The two most common mistakes were
  (1) q1 - Not normalizing the eigen vectors (from [1, 1] to [1/sqrt(2), 1/sqrt(2)]) -- a loss of one mark.
  (2) q2 - Not realizing that when an irrelevant document is shown, the recall stays put, but the precision goes down, and that fact should reflect on the graph. -- a loss of 3 marks.
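The point in (2) is easy to see with a tiny sketch (an illustrative ranking, not the actual homework data): as each retrieved document is shown, recall is (relevant seen)/(total relevant) and precision is (relevant seen)/(documents seen), so an irrelevant document leaves recall flat while precision drops.

```python
# R marks a relevant document, N an irrelevant one; 2 relevant docs in total.
ranking = ["R", "N", "R"]
total_relevant = 2

seen_relevant = 0
points = []  # (recall, precision) after each document is shown
for i, d in enumerate(ranking, start=1):
    if d == "R":
        seen_relevant += 1
    points.append((seen_relevant / total_relevant, seen_relevant / i))

# After step 2 (the irrelevant doc): recall stays at 0.5,
# while precision falls from 1.0 to 0.5 -- a horizontal-then-down move on the graph.
```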

Thanks and Regards,
Sushovan De

Monday, February 15, 2010

Are there any papers on the term-term correlation matrix?

The last class was quite interesting, especially the discussion of the term-term correlation matrix. Are there any papers about it?

I'm doing some text mining, and I want to know more about the related work.

Thursday, February 11, 2010

Recommended talk... Fwd: SEMINAR: Randy Guthrie: Microsoft: Cloud Computing/Windows Azure

Folks in CSE494 are encouraged to attend this talk as I won't have as much time to cover "cloud computing"..

---------- Forwarded message ----------
From: Audrey Avant <>
Date: Thu, Feb 11, 2010 at 1:18 PM
Subject: SEMINAR: Randy Guthrie: Microsoft: Cloud Computing/Windows Azure

On behalf of Dr. Yann-Hang Lee








Cloud Computing/Windows Azure


Randy Guthrie

Microsoft Corporation


Monday, February 15, 2010

11:00am – 12:00pm






In this presentation Randy will give a brief history of computing-infrastructure paradigms, review in depth what cloud computing is and the business model for implementing it, compare the different offerings from Microsoft and Google, and give a live demonstration of deploying a simple web application and data storage to the Windows Azure cloud.




Randy Guthrie is an Academic Developer Evangelist (ADE) for Microsoft Corporation. As an ADE, Randy provides academic and research support to faculty and students at top-tier schools in Colorado, Arizona, New Mexico, Oklahoma & Kansas. Prior to working at Microsoft, Randy was a professor at Cal Poly Pomona, where he taught software engineering, programming & MIS courses. Before his appointment at Cal Poly, Randy spent thirteen years working for the Northrop Grumman Corporation, where he was a contract manager on the Stealth Bomber project, a project manager, and a financial analyst. Randy earned his MBA in 1998 from the Peter F. Drucker School of Management, a master's degree in Information Science in 1999, and a PhD in Information Systems and Technology in 2008 from Claremont Graduate University. Randy's dissertation studied how developer teams bias software interfaces toward particular types of users. Randy's favorite gadgets are his Zune music player and Xbox 360.




Regarding orthonormality vs. Spanning (or my slightly botched up example about slanting axes in 2-D)


 If my discussion and retraction of the 2-D example (meant to convey the importance of orthonormal dimensions) left you confused, here is a more elaborate answer (which might require a bit more understanding of linear algebra).

We say that a space is spanned by a set of vectors (called "basis vectors", normally taken to be unit vectors) if every other vector in that space can be written as some linear combination of these vectors.

I.e. for every vector u in the space

 u = a1*v1+a2*v2....+an*vn   for some arbitrary real numbers (a1, a2... an)

In this case, the tuple (a1, a2, ..., an) is considered the *coordinates* of u in the space.

Note that this definition does not say anything about the inter-relation between the basis vectors v1...vn. In particular, nothing in the definition says they must all be pairwise orthogonal. [*By the way, an easy way to check whether two vectors are orthogonal is to take their dot product and see if it is zero. Because of this, if you make a matrix M which has the n orthogonal vectors as its columns, then M'*M will be a diagonal matrix. If the original vectors are also unit vectors, the result is the identity matrix.*]

The *rank* of a space is the minimum number of vectors needed to span that space.

The whole issue of dimensionality reduction is that the space I want to span is really a (linear) manifold in a higher-dimensional space, and thus doesn't really need all those dimensions.
(Suppose your entire data/documents fall on a plane--oriented in some funny way--in 3-D space; then although all the datapoints on the plane have three coordinates, they really don't need all three; they only need 2! If all your data falls on a line in 3-D, then you only need 1 coordinate..)

Thus the worry is that I might give you a set of basis vectors that do span the space, but are more in number than  the minimum needed.

If this happens, then it must be the case that some of the basis vectors are themselves writable as linear combinations of the other basis vectors (and are thus essentially redundant). For example, suppose all our data is on the plane 3x+4y+2z=7 (the vector (3,4,2) is orthogonal to this plane); then knowing the x and y coordinates of a point allows us to figure out its z coordinate...


Now, coming back to orthonormality, it is easy to show that *IF* we are starting with more basis vectors than are required by the rank of the space, then the basis vectors cannot all be pairwise orthogonal (at least some of them must be linearly dependent on the others, which means some pair of basis vectors must have a non-zero dot product).

The converse, however, is not true--it is possible to span a space with a minimum number of non-orthogonal vectors.
In particular, you can span 2-D space with *any* two vectors in that space that are not parallel (i.e., neither is a scalar multiple of the other).

And if you span the space with the minimum number of (possibly non-orthogonal) vectors, then none of them can be written as a linear combination of the others (and thus you cannot guess any coordinate even when given all the other coordinates).
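A quick numpy sketch of the two facts above (the vectors are chosen arbitrarily for illustration): the dot-product test for orthogonality, and the fact that any two non-parallel vectors span 2-D space.

```python
import numpy as np

# Two orthogonal (and unit) basis vectors for 2-D space...
v1 = np.array([1.0, 0.0])
v2 = np.array([0.0, 1.0])
assert v1 @ v2 == 0.0                   # dot product zero => orthogonal

M = np.column_stack([v1, v2])
assert np.allclose(M.T @ M, np.eye(2))  # orthonormal columns => M'M = I

# ...but any two non-parallel vectors also span 2-D space:
w1 = np.array([1.0, 0.0])
w2 = np.array([1.0, 1.0])               # not orthogonal to w1
W = np.column_stack([w1, w2])
u = np.array([3.0, 5.0])
a = np.linalg.solve(W, u)               # coordinates of u in the basis {w1, w2}
assert np.allclose(a[0] * w1 + a[1] * w2, u)
```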

====What does this have to do with the class 2-D example=======

What I was trying to show was that if your basis vectors are orthogonal, then you cannot guess the nth coordinate even if you are given all the other coordinates.
This is said assuming implicitly that the basis vectors (in addition to being orthogonal) also span the space you are interested in.

Conversely, if your basis vectors are not minimal--i.e., there are more of them than are needed to span your space--then you can guess some coordinate given the other coordinates.


Let me know/post on blog if you have any questions.


Wednesday, February 10, 2010

tomorrow's lecture--you might find it useful to have printed version of slides with you..

The slides for tomorrow are online; I will do correlation analysis and Latent Semantic Indexing.
It might be useful for you to take a quick look at the slides before the class (and certainly useful to
refresh your linear algebra )


Slides from linear algebra review session with Sushovan

are available on the notes page (here is the direct URL )

I understand that only about 7 students showed up for the review. If you didn't show up, and you are not too sure about eigenvalues etc., make sure you look at these slides.


Tuesday, February 9, 2010

Project part A due on Feb 23rd

Project part A will be due on Feb 23rd in class.


Linear algebra review session

The linear algebra review session will be held tomorrow, Feb 10 at 2-3pm in BYENG 576. (That's 5th floor, exit the elevator and turn left, and then look for the room near the far end of the corridor).

Thanks and Regards,
Sushovan De

On Mon, Feb 8, 2010 at 8:19 PM, Sushovan De <> wrote:
Hi all,
  A few of you have requested a review session of linear algebra. If you want to attend, please reply to this email with your time preference. If there is a sufficiently large number of students who wish to attend, the review session will take place at the time that gets more votes.

Wednesday Feb 10, 2-3pm
Wednesday Feb 10, 5-6pm

Thanks and Regards,
Sushovan De

Monday, February 8, 2010

RSVP: linear algebra review session

Hi all,
  A few of you have requested a review session of linear algebra. If you want to attend, please reply to this email with your time preference. If there is a sufficiently large number of students who wish to attend, the review session will take place at the time that gets more votes.

Wednesday Feb 10, 2-3pm
Wednesday Feb 10, 5-6pm

Thanks and Regards,
Sushovan De

Sunday, February 7, 2010

regarding the linear algebra refresher question

Some of you seem to be having trouble with linear algebra refresher question. This means you need to read up on linear algebra--especially
eigen vectors/eigen values/matrix decomposition. This question is put there to make you realize the importance of reviewing this material if you don't
already know it (as we will need it starting about Thursday of this week for the rest of the semester).

You can review this material by reading any of the linear algebra refresher readings at the end of the readings list (the section titled "background and refreshers").

If you think you will need more direct help, send a note to the TA ( ). If enough of you need help, he can hold a review/recitation session
on Wednesday during his office hours.


Saturday, February 6, 2010

Does anybody know more about principal eigenvalues?

Hello Professor and my compeers,

I am confused about the principal eigenvalue, so here I am posting two queries regarding it.
What is the principal eigenvalue in each of the following cases?
Q1. If C1 = -2 and C2 = 2
Q2. If C1 = -8 and C2 = 3

If anybody knows the solutions, please enlighten this post.
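A hedged sketch of an answer: by the most common convention (the eigenvalue that power iteration converges to), the principal eigenvalue is the one with the largest absolute value. A one-liner illustrates it (the function name is made up for illustration):

```python
def principal(eigenvalues):
    # Principal eigenvalue: largest in absolute value (magnitude).
    return max(eigenvalues, key=abs)

# Q2: |-8| > |3|, so the principal eigenvalue is -8 even though it is negative.
print(principal([-8, 3]))   # -> -8

# Q1: |-2| == |2| is a tie in magnitude, so neither eigenvalue dominates
# (power iteration would not converge in that degenerate case).
```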

Tuesday, February 2, 2010

Homework 1 will be due on Feb 9th in class


Homework 1 will be due on 9th Feb (which is next Tuesday).
 Everything for homework 1 has now been covered (with the exception of the "relevance feedback question" in part 4, which you won't need to do).