Friday, April 30, 2010

on the extra class and Tuesday's class

Thanks to all the students who showed up for the information extraction lecture today.  It would certainly not have been any fun lecturing to the walls ;-)

To those who couldn't make it, the video as well as the audio are available on the class page. I hope you will make time to watch or listen to it. As I said, IE is one of the more active topics of research now, and I would hate for you not to have an entry point into it--especially given that all you have done to date provides you a pretty good infrastructure.

For Tuesday's last class, I plan to spend the first 45 minutes or so discussing some issues in "information integration" and then do a quick wrapup.

The slides for the information integration material are online, but are likely to change by class time.


Thursday, April 29, 2010

Clarification regarding tomorrow's "extra class"

There was a question today as to whether the material to be covered in tomorrow's extra class will be "on the test."

My flippant response was possibly driven by the fact that I am always freshly surprised to find out that there actually might be students taking one of my courses with an eye towards their GPA (clearly, ASU student corporate memory/social networking isn't all that it is cracked up to be :-).

Thinking about it again, I can imagine some students being unduly inconvenienced by having to keep up with an extra meeting that they haven't bargained for, in the last week of classes. 

So, I will certify that the material covered in tomorrow's lecture will not be "on the test". 


Apology to the student who asked the "transitive relation" question

To the student who asked the question about "transitive relation" in semantic web:

 My sincere apologies for my poor choice of words in my first response. As I clarified, all I meant to tell you was to hold on a minute, as the answer to your question was going to be the center of the discussion that followed immediately. No disrespect was meant either to the question (which, as I repeatedly came back to it, is an important one) or to the questioner.

Rao "Thank God I am not Gordon Brown running for re-election" Kambhampati

Re: Tomorrow's class will be in BYAC 240

Oh--just to clarify, the class will start at 10:30 and end at 11:45 (like normal classes). The room is reserved for a longer period just to allow you to get in ahead of time.


On Thu, Apr 29, 2010 at 12:02 PM, Subbarao Kambhampati <> wrote:

 The "extra class" tomorrow will be in BYAC 240 (i.e., the same classroom building, but on the second floor).

See you there (hopefully). Just in case it might help clinch the deal, let me go ahead and announce that there will likely be donuts ;-)


Breakout session - Friday, April 30th, 2010, 10:15 AM to 11:45 AM in BYAC 240

Room reserved for : 10:05 AM to 11:55 AM


Tomorrow's class will be in BYAC 240


 The "extra class" tomorrow will be in BYAC 240 (i.e., the same classroom building, but on the second floor).

See you there (hopefully). Just in case it might help clinch the deal, let me go ahead and announce that there will likely be donuts ;-)


Breakout session - Friday, April 30th, 2010, 10:15 AM to 11:45 AM in BYAC 240

Room reserved for : 10:05 AM to 11:55 AM


Wednesday, April 28, 2010

a clarification regarding blog comments (especially to thinking caps of yore)

 There seems to be a sudden newfound interest in responding to the "thinking cap" questions from many moons ago. I can't be sure of its origins, but one explanation may be that people think they need to beef up their participation credit ;-)

 I would like to clarify  that I only follow the thinking cap responses in the short window surrounding their posting. For now, the two threads I will be following are in response to the two messages I posted yesterday (the "what did you like" question and the "what would you recommend from WWW 2010" question).



Current cumulatives


 Here are the cumulatives that take project 2 and homework 4 marks into account.

Still to come:

 --> Project 3 and Demo points
 --> participation points
 --> Final


Tuesday, April 27, 2010

Assignment: Recommend a WWW 2010 paper to your classmates...[Due by 5/11--post to the blog]

So NPR has this nifty segment called "You Must Read This," which gets writers and authors to recommend books that they think others should read.

Your last "homework" assignment for this course is patterned after it.

Here is the assignment:

1. Look at the papers being presented at the World Wide Web Conference this week
(the program--along with most of the pdf files--is available at ; if you are interested in a particular paper but the pdf is not available, you can probably google the authors' pages--technical-paper authors tend to be a narcissistic bunch and will put every paper up on their web pages as soon as it is accepted ;-)

2. Check out the "abstracts" of the papers whose titles seem interesting to you based broadly on the aims of this course.

3. If you like the abstract, try reading the introduction (optional, but recommended).

4. By 5/11, post a short comment in response to this article giving

    4.1 the paper title and a link to its pdf
    4.2 why you would like to read it and/or why you think others in the class should read it
    4.3 how the paper is connected to what we have done in the course (you could also phrase this as a recommendation: "if you liked that power iteration discussion, you will probably like this paper as it gives ways to speed the...")

    (your inputs to 4.2 and 4.3 can be interleaved).


Here is the rationale for the assignment--unlike Physics 101, after which you don't expect to be able to read state-of-the-art papers, this course is about an area that is very much recent and in progress (recall the Far Side neanderthal archeologist cartoon..). So, you actually
do have a shot at understanding the directions of most work being done at the state of the art (and in some cases even understanding its contribution).

Rather than ask you to take this assertion at face value, I would like to encourage you to "do it" and thus "see it to believe it" as it were ;-) [Plus, this is a rather cheap way for me to figure out which WWW papers to read.]


Mandatory Blog qn (You should post your answer as a comment to this thread )

Here is the "interactive review" question that I promised.

"List five nontrivial ideas you came to appreciate during the course of this semester"
(You cannot list generic statements like "I thought pagerank was cool". Any cross-topic connections you saw are particularly welcome.)

Note that you must post your response as a comment to this particular message.
All comments must be posted by the end of May 4th (next Tuesday--the last class).

If you have forgotten what we have done, see the class home page for a description.


Project demo time slots

Hello everyone,
  The project demos will be held during the week of May 4th, Tuesday through Friday. Slots are available on a first-come, first-served basis.
How to reserve your slot:
2. Find a slot that hasn't already been taken (Count == 0)
3. Click the checkbox for that slot, write your name and click 'Save'
On the day of the demo:
1. Bring your laptop with your code and program on it, or a link to the webserver where your program is running, or if nothing else works, your code and program on a flash drive.
2. Be prepared to answer questions regarding the implementation (did you use sparse arrays? did you cache the document norms in a file?)
3. Be prepared to run arbitrary queries on your system using any of the methods you have implemented.
If you absolutely cannot make any of the time slots, email me so we can work something out.

Thanks and Regards,
Sushovan De

A poll on a proposed extra class on Friday


 I seem to have gone slower than expected, and at the current rate I am not going to be able to do justice to the information integration and information extraction topics in the remaining two classes. So, I am contemplating holding an extra class meeting on Friday. Please vote on the poll below so I can get a sense of your availability. I will use that to decide whether, and at what time, to hold the class. [Please vote by Wednesday night so I can make arrangements and announce the decision by Thursday's class.]


Monday, April 19, 2010

[Thinking Cap] on collaborative filtering...

Qn 1. We considered using the user-item matrix to find the users most similar to the current user, and using them to predict the current user's rating of new items. What if we decided instead to focus on items, and figure out which items seem to be most "similar" to each other in terms of the users who buy them?
   1.a. What techniques, that we have already seen, can be used to find the items most similar to each other?
   1.b. If you use those techniques to find, for each item, the k most similar items, then how can we use this information to do ratings prediction for the current user on an item she hasn't yet rated?
   1.c. How does this item-centered analysis compare to the user-centered one in terms of justifying the recommendations?

Qn 2. The user-item matrix has some similarities to the document-term matrix. In the case of the d-t matrix, we saw that latent semantic indexing can be quite useful in finding "latent regularities" (e.g., the "size" factor of the fishes, or the "database-oriented" vs. "statistics-oriented" factors of the db example).
  2.a. Do you think LSI-style analysis could be useful in improving collaborative filtering?
  2.b. What are some of the most immediate problems you run into when you try to do LSI-style analysis on a user-item matrix (as against a d-t matrix)?
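Not an answer key, but to make the item-centered idea in Qn 1 concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the toy rating matrix, plain (uncentered) cosine over item columns as the similarity measure for 1.a, and a similarity-weighted average over the k most similar rated items for the prediction in 1.b.

```python
import numpy as np

# Made-up user-item rating matrix: rows = users, columns = items; 0 = unrated.
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [1.0, 0.0, 0.0, 4.0],
])

def item_cosine_sim(R):
    """1.a: the same cosine similarity we used on documents, applied to
    the item *columns* of the user-item matrix."""
    norms = np.linalg.norm(R, axis=0)
    norms[norms == 0] = 1.0        # items nobody rated stay all-zero
    unit = R / norms
    return unit.T @ unit           # S[i, j] = cosine between items i and j

def predict(R, S, user, item, k=2):
    """1.b: similarity-weighted average of the user's own ratings on the
    k rated items most similar to the target item."""
    rated = np.nonzero(R[user])[0]                   # items this user rated
    neighbors = rated[np.argsort(-S[item, rated])][:k]
    weights = S[item, neighbors]
    if weights.sum() == 0:
        return 0.0
    return float(weights @ R[user, neighbors] / weights.sum())

S = item_cosine_sim(R)
prediction = predict(R, S, user=1, item=1)   # user 1's predicted rating for item 1
```

On 1.c, notice that the neighbors used in the prediction are items the user herself has rated, so the recommendation can be justified as "because you rated these similar items", without appealing to other, anonymous users' tastes.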


Optional question: What does the following parable have to do with a CSE494/598 student who keeps asking the instructor "is there any way I can get extra credit to improve my grade?" ;-)

A man stood on top of his roof in the middle of one of the worst floods ever recorded. The water was rising around his feet, so he looked up at the black sky and prayed, "Mighty God, please save me!" An hour passed and the water was already over his feet when a man in a fishing boat came and called for him to get in. The man refused, saying, "No, I am waiting for God to save me." The man in the boat waited as long as he could and then left the man on his roof. Standing soaked on the roof as the waters rose, he stared at the sky waiting for God to save him. Then a rescue boat pulled up beside his house and threw him a rope, but again the man refused: "No, I am waiting for God to save me." Hours passed and the man had to cling to his chimney to avoid being washed away. He thought he was going to drown when he saw a bright light overhead; he thought God had come to save him this time. Then he heard a loudspeaker and realized it was a helicopter. It dropped a ladder but again the man refused: "No! God will save me!" The man clung to the chimney until the water was over his head and he drowned. A moment later he stood before God, weeping: "Mighty God, I prayed and prayed--why didn't you save me?"
God said, "YOW. I sent you a fishing boat, a rescue boat *and* a helicopter to save you. What the heck are you doing here?"

Sunday, April 18, 2010

a more useful paper on netflix prize (and how LSI-like techniques helped the winning team)

I added the following paper to the collaborative filtering readings:

This is a paper from IEEE Computer last year, written by Yehuda Koren--the same guy who wrote the CACM paper. Unlike the CACM paper, which focuses just on the temporal concept drift that helped them vault over the last little percentage, this one talks about LSI-like techniques for doing collaborative filtering, and how they allowed him to combine multiple sources of feedback.

Certainly worth reading as most of the ideas he is using are in the epsilon-neighborhood of things we have covered in the class (and thus should make you feel warm and fuzzy about your knowledge state ;-)


Saturday, April 17, 2010

Correlation coefficient viewed as cosine theta measure..

Although I didn't mention it explicitly in class, the Pearson correlation coefficient can be seen as the
vector similarity between "centered" rating vectors.

Suppose the two rating vectors are

[r11 r12 r13 r14]


[r21 r22 r23 r24]

Centering means subtracting the mean of the vector from the vector

let r1 be the mean of r11..r14  and r2 be the mean of r21 ..r24

then centered vectors are

[r11-r1  r12-r1 r13-r1 r14-r1]

[r21-r2  r22-r2 r23-r2  r24-r2]

now if you take the cosine theta metric between these two centered vectors, you get the
dot product divided by the product of the norms of the two vectors.

dot product will be of the form  [r11-r1]*[r21-r2]+ ...+[r14-r1]*[r24-r2]

this is the numerator of the Pearson correlation coefficient.

the norm of the first vector is
sqrt[(r11-r1)^2 + ... + (r14-r1)^2]

which is sqrt(n) times the standard deviation of the first vector (the sum under the square root is n times its variance)..
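A quick numerical sanity check of the above (the two rating vectors are made up for illustration): the cosine between the centered vectors should match numpy's built-in Pearson coefficient.

```python
import numpy as np

# Two made-up rating vectors over the same four items
r1 = np.array([4.0, 2.0, 5.0, 1.0])
r2 = np.array([3.0, 1.0, 4.0, 2.0])

# Center: subtract each vector's own mean from its components
c1 = r1 - r1.mean()
c2 = r2 - r2.mean()

# Cosine theta between the centered vectors
cosine = (c1 @ c2) / (np.linalg.norm(c1) * np.linalg.norm(c2))

# Pearson correlation coefficient, straight from numpy
pearson = np.corrcoef(r1, r2)[0, 1]
```

The two numbers agree to machine precision, which is exactly the point of the derivation above.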



Friday, April 16, 2010

Re: An article on collaborative filtering sensitive to changing user preferences..

oh--by the way--I forgot to mention that this approach also won the netflix prize..


On Fri, Apr 16, 2010 at 10:59 AM, Subbarao Kambhampati <> wrote:
Here is an optional reading on collaborative filtering under temporally evolving user preferences, that came out in this month's CACM.

It may be useful for those of you curious about the current status of collaborative filtering research:


An article on collaborative filtering sensitive to changing user preferences..

Here is an optional reading on collaborative filtering under temporally evolving user preferences, that came out in this month's CACM.

It may be useful for those of you curious about the current status of collaborative filtering research:


Thursday, April 15, 2010

Required reading for next class: Database refresher


 The next five classes will be about getting structure and exploiting it during search. Much of this material straddles IR on one end--where we assume no structure--and databases on the other--where we assume the data is fully structured. It will thus be very helpful to have some rudimentary background in relational databases.

If you haven't taken a course in databases, you might want to look at the following set of "refresher" slides for databases--so you have enough familiarity with the concepts.


Tuesday, April 13, 2010

Clarification on the "rand-index" based external cluster evaluation


 In talking to a student, I realized I didn't make something clear in describing the rand-index based external cluster evaluation method.

Rand-Index classifies every pair of entities e1,e2 into four categories

[in-the-same-ground-truth-cluster, in-the-same-generated-cluster]   A
[in-different-ground-truth-clusters, in-the-same-generated-cluster]   B
[in-the-same-ground-truth-cluster, in-different-generated-clusters]   C
[in-different-ground-truth-clusters, in-different-generated-clusters]   D

If A, B, C, D are the number of pairs in each category, then A+B+C+D will be n*(n-1)/2 (which is the total number of pairs over n entities).

I modified the slide to make this point clear..
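For those who want to check their numbers, the pair classification above can be sketched as follows (the two toy clusterings and the helper `rand_index_counts` are made up for illustration, not code from the class):

```python
from itertools import combinations

def rand_index_counts(ground_truth, generated):
    """Classify every pair of entities into the four categories above.
    ground_truth / generated: dicts mapping entity -> cluster label."""
    A = B = C = D = 0
    for e1, e2 in combinations(ground_truth, 2):
        same_gt = ground_truth[e1] == ground_truth[e2]
        same_gen = generated[e1] == generated[e2]
        if same_gt and same_gen:
            A += 1
        elif not same_gt and same_gen:
            B += 1
        elif same_gt and not same_gen:
            C += 1
        else:
            D += 1
    return A, B, C, D

# Toy example: ground truth puts e1,e2 together and e3,e4,e5 together;
# the generated clustering puts e1,e2,e3 together and e4,e5 together.
gt  = {"e1": 0, "e2": 0, "e3": 1, "e4": 1, "e5": 1}
gen = {"e1": 0, "e2": 0, "e3": 0, "e4": 1, "e5": 1}

A, B, C, D = rand_index_counts(gt, gen)
n = len(gt)
assert A + B + C + D == n * (n - 1) // 2   # 5*4/2 = 10 pairs, as claimed above
rand_index = (A + D) / (A + B + C + D)     # fraction of pairs the clusterings agree on
```

The Rand index itself is then just the agreement fraction (A+D)/(A+B+C+D).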


The Napoleon Dynamite problem

Check this New York Times article for some background on collaborative filtering and the Netflix prize..

Monday, April 12, 2010

Current snapshot of the grade book


 Attached please find the current snapshot of the 494/598 grade book (posted by your *posting-id*--this is not your student id). It contains the grades for the three homeworks, project 1, and the mid-term. I will return homework 3 and the mid-term in class tomorrow.

To give you an idea of your relative standing, I also put in total and percentage columns. The total is calculated out of 45--
with 10 points for project 1, 5 points each for the three homeworks, and 20 points for the mid-term. Note that these weights are approximate and subject to change. Note that the extra credit is shown but is *not* included in the total.

The percentage column basically converts the raw score out of 45 into a percentage.

The first part--in light green--is the 494 section and the second part in the light red is the 598 section.

Please let me or the TA know if you find any discrepancies.

If you have any concerns about your performance please talk to me.


Thursday, April 8, 2010

mid-term grading and course withdrawal deadline


 I was hoping to be done with the mid-term grading by today, so you would have that information by 4/9-11, which apparently is the course withdrawal period. Unfortunately, I didn't manage to complete it.

 I think most students have probably made the "drop/stay" decision already, but if you are one of those waiting to decide based on your mid-term score, please do let me know and I will try to get you that information before 4/11.


Homework 4 Released--will be due April 20th in class

I released homework 4, on clustering, classification and recommendation systems. It will be due on April 20th.
It contains a question on collaborative filtering--that material will be covered next Tuesday.

(I know I said the last homework would be due at the end of the semester--but I decided to split it into two parts, and make the first part due 4/20 so you can get it back by the end of the semester.)


Wednesday, April 7, 2010

Project part 3 released

It is now available from the projects page.

Please note that the deadline falls on the last day of classes. It will be a non-negotiable deadline, so please plan to make sure you are done by then.

You will also be required to show a demo of all three parts sometime during the week of May 4th. The TA will give out slot sign-up sheets. Note that the demo marks are not just part of project part 3, but rather will be added to the cumulative--so those who didn't complete project part 1 or part 2 in time will get the demo credit for those parts as long as they are running by demo time.


Tuesday, April 6, 2010

[Thinking Cap--Easter Resurrection] on Classification/clustering

Comment on this on the blog

At the end of today's class, we saw that classification is in some sense a pretty easy extension of clustering--training data with different labels can be seen as making up the different clusters. When test data comes, we just need to figure out which cluster it is closest to and assign it the label of that cluster.

1. If classification is so darned straightforward, how come the entire field of machine learning is obsessed with more and more approaches to classification? What could possibly be wrong with the straightforward one we outlined? Can you list any problems our simple approach can run into? (Alternately, it is fine to just decide that Jieping Ye and Huan Liu cannot leave well enough alone... :-)

2. If you listed some problems in 1 (as against casting aspersions on Ye and Liu), then can you comment on the ramifications of those problems on clustering itself? Or is it that clustering is still pretty fine as it is?
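For reference, the straightforward scheme described above can be sketched in a few lines (the 2-D toy data is made up, and I am assuming one common way of making "closest cluster" concrete: distance to each label's centroid):

```python
import numpy as np

# Toy 2-D training data with two labels (made up for illustration)
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [4.8, 5.2]])
y = np.array(["spam", "spam", "ham", "ham"])

# Treat each label's training points as one "cluster" and summarize it
# by its centroid...
labels = np.unique(y)
centroids = np.array([X[y == lab].mean(axis=0) for lab in labels])

def classify(x):
    """...then assign a test point the label of the nearest centroid."""
    dists = np.linalg.norm(centroids - x, axis=1)
    return labels[np.argmin(dists)]

pred = classify(np.array([1.1, 0.9]))
```

Whether this simple scheme is good enough--and what goes wrong with it--is exactly what question 1 is asking you to think about.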


Sampling and Census... (according to Mr. Willis of Ohio )

Check this out for some context on the idea of using sampling in the census (you could say I was almost channeling this in today's class, since, as I told you, my spring break involved watching West Wing episodes that I bought for my son ;-)

Monday, April 5, 2010

An idf solution to the project 2 deadline

Some people have asked for an extension for the project. (Based on the mails I am getting, it is also possible that there is a bit of "behind the scenes organization" going on--since the frequency of mails has increased in the last couple of hours).

In extending deadlines at the last minute, my concern is always for those students who probably re-organized their lives to keep up with the announced deadline. To make sure those people do not feel penalized, I will do the following: you can submit the project on Thursday with a flat 20% late penalty (this may look steep compared to zero penalty, but you should compare it to the "infinite" penalty implicit in the concept of a "deadline" and feel blessed ;-).

(If it is indeed the case that the whole class needs the two extra days, then all of you get 20% penalty--and the idf of the penalty will be zero.. )

Please note that projects are due at the *beginning* of the class; resist the temptation to skip class and turn in the project at the end of the day..


ps: As I mentioned, you will still get partial credit for the earlier parts if you get all parts working by the final deadline.

Thursday, April 1, 2010

Fwd: On the Topeka Pagerank questions..

Some of you were confused as to how Topeka pagerank compares to the traditional one.
The important thing to note is that Topeka pagerank first separates web pages into smaller partitions based on their
color, and then does an equal amount of processing on each of them. It is thus a significantly more efficient computation--since for
each partition, we have a much smaller transition matrix M. The convergence is fast, but the final stationary pagerank can be significantly different from the traditional one
(in particular, the separate-but-equal processing can sometimes prevent pages in some partitions from reaching their full importance--as the other partitions effectively act as rank sinks).
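To see the rank-sink effect numerically, here is a toy sketch (the 4-page graph, the partition into two "colors", and the column renormalization after dropping cross-partition links are all my illustrative assumptions): running power iteration separately on each color block gives a uniform ranking, while the traditional global computation lets the cross-partition link shift importance to the second partition.

```python
import numpy as np

def pagerank(M, alpha=0.85, tol=1e-12):
    """Standard power iteration on a column-stochastic transition matrix M."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)
    while True:
        r_new = alpha * (M @ r) + (1.0 - alpha) / n
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new

# Toy 4-page web: pages {0,1} are one "color", {2,3} another.
# Page 1 links across the partition boundary into page 2.
M = np.array([
    [0.0, 0.5, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])

global_rank = pagerank(M)   # traditional pagerank over the whole graph

def renormalize(block):
    """Make a block's columns stochastic after cross-partition links are dropped."""
    col_sums = block.sum(axis=0)
    col_sums[col_sums == 0] = 1.0
    return block / col_sums

# "Topeka" version: separate-but-equal processing of each color block.
topeka_rank = np.concatenate([
    pagerank(renormalize(M[:2, :2])),
    pagerank(renormalize(M[2:, 2:])),
]) / 2.0   # each partition gets an equal share of the total rank mass
```

In the partitioned version every page ends up with rank 0.25, whereas the global computation gives pages 2 and 3 well over 0.25 each: the first partition, forbidden from passing its importance across the boundary, caps what the second partition can accumulate.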