Monday, April 19, 2010

[Thinking Cap] on collaborative filtering...

Qn 1. We considered using the user-item matrix to find most similar users to the current user, and use them to predict the rating of new items for the current user. What if we decided instead to focus on items, and figure out items that seem to be most "similar" to each other in terms of the users who buy it.
   1.a. What techniques, that we have already seen, can be used to find the items most similar to each other?
    1.b. If you use those techniques to find, for each item, k most similar items, then how can we use this information to do ratings prediction for the current user on an item she hasn't yet rated?
      1.c. how does this item-centered analysis compare to user-centered one in terms of justifying the recommendations?

Qn 2. the User-item matrix has some similarities to the document-term matrix. In the case of d-t matrix, we saw that latent semantic indexing can be quite useful in finding "latent regularities" (e..g the "size" factor of the fishes, or the "database-oriented" vs. "statistics oriented" factors of the db example).
  2.a. Do you think LSI style analysis could be useful in improving collaborative filtering?
  2.b. What are some of the most immediate problems you run into when you try to do LSI-style analysis on user-item matrix (as against d-t matrix)?


Optional question: What does the following parable have to do with a CSE494/598 student who keeps asking the instructor "is there any way I can get extra credit to improve my grade?" ;-)

A man stood on top of his roof in the middle of one of the worst floods ever recorded. The water was rising around his feet so he looked up at the black sky and prayed. "Mighty God please save me!" An hour passed and the water was already over his feet when a man in a fishing boat came and called for him to get in. The man refused saying. "No, I am waiting for God to save me." The man in the boat waited as long as he could then left the man on his roof. Standing soaked on the roof as the waters rose he stared at the sky waiting for God to save him. Then a rescue boat pulled up beside his house and threw him a rope but again the man refused. "No, I am waiting for God to save me" Hours passed and the man had to cling to his chimny to avoid being washed away. He thought he was going to drown when he saw a bright light overhead he though God had come to save him this time. Then he heard a loudspeeker and realized it was a hellicopter. It dropped a ladder but again the man refused. "No! God will save me!" the man clung to the chimney until the water was over his head and he drowned. A moment later he stood before God weeping. "Mighty God, I prayed and prayed why didnt you save me?"
God said "YOW. I sent you a fishing boat, a rescue boat *and* a helecopter to save you. What the heck are you doing here?"


  1. 1.a - We can do co-relation analysis on the user-items matrix and find the set of items that are the most related to each other and use the same for predicting new items to users. The co-relation clusters so obtained will let us know what other items have been liked by people who have liked the items u have liked.

    1.b - Ratings for a new item a, can be predicted as follows:

    The Summation here is to be performed over the top k co-related Items to Item a.
    Where, Avg=current user's Average Rating of all the Items, CVal(i) is the co-relation value of the Item i with a.

    1.c - As explained in 1.a, the co-relation clusters so obtained will let us know what other items have been liked by people who have liked the items u have liked. Therefore, Justification can be done by saying Items A and B were frequently bought together by others, if by co-relation analysis it was found that there is a high co-relation between Items A and B,

  2. 1.a- One way we could find items that are most similar to each other based on users is to do association rules. After finding the rules take the new set and see if it is the antecedent in any of the rules. If it is you should find in the consequents similar results based on the user. The best result would be the one from the rule with the highest confidence. The most similar users would be the ones with the most rules in common.

    1.b- Get the users with the most similar rules and then use the the average rating in the antecedents to predict the consequents rating.

    1.c- This wouldn't be as fine tuned. Instead of getting combination ratings you'd just get the same rating as the antecedents. For example one rule could be if A B and C are rated 3 stars, D will be 3 stars. You'd never get 3.5, 2.1 or any amount that would take multiple ratings into account.

    2. Yes I think it could be useful but the problem is in most cases people haven't rated everything so the matrix would be sparse and you'd have to figure out what to do with the null values.

    As far as the parable goes, you've already given a lot of ideas for fleshing out the project. I'm assuming that just getting to work on making the project amazing (and implementing all we can) would cover all the extra credit anyone would need.

  3. 2-a. LSI on user-item matrix might be effective in comparing items/users on some "ineffable" qualities, which otherwise cannot be clearly articulated.

    2-b. The sparsity of the user-item matrix might make SVD computation a challenge, which was never a problem for the d-t matrix.

  4. 1.a,
    I think we can use the Pearson coefficient itself to find the items most similar to each other, a little variation on the covariance would be,
    Cix, iy = Sum(ui=1ton)[(Rui,ix – Rui) * (Rui,iy – Rui)]

    Essentially you capture the correlation between item x and item y, by computing how closely items x & y are rated by each user, ui=1 to n.
    Using this we can compute the top-k similar items for each item.

    The rating prediction for an item ‘x’ and active user ‘a’ can be done as follows:
    Sum(i=1tok)[Wx,i(Ra,i – Ra) ]
    here, Wx,i is the correlation weightage of (item x,item i).

    Essentially the prediction is made by calculating the sum of the ratings given by the active user to each the most similar items, taking into account the correlation weightage of each similar item. So if the active user has rated the most similar items low in general, then it is predicted that the user would rate this item also low.

    I feel, item-cenetered recommendation is more justifiable for recommending objects to be bought scenario, for example: recommend the users to buy mobile accessories when he is interested in buying a mobile, as many other users who has bought a mobile also bought the accessories.

    Whereas, the user-centered analysis is more suited for rating prediction as ratings depend on a user’s likes and dislikes and may depend on other user’s who have similar tastes.

  5. 1a. An answer not yet posted yet is that we can leverage the information in user queries to find similar items. Google does this by taking similar iterative queries and realizing that the user is substituting words for their synonyms to narrow results. An example would be if the user searched for "cute dog pictures" and then right afterwards searched for "cute puppy pictures". If this happens enough over the user base, then puppy can be associated with dog. We can apply this idea to the database of items. If a user is searching for an item, then successive queries that they do can be used to build a similarity matrix.

  6. 1.a Since in terms of representation, the item-user matrix is quite similar to user-item one (both sparse matrix with missing values). The techniques we used in finding similar users can be directly applied here. For example, we can use Pearson's similarities.

    1.b Yes. We can find k most similar items which the particular user has already rated, and predict the rating based on them.

    1.c When justifying the recommendations, we can just say: "you will like this because you liked A, B and C...". This is much more convincing than say "because some of your anonymous soul mates like it as well".

    2.a Yes. If we can use SVD to find some "eigen user/item", it will for sure help us reduce the huge dimension and improve the analysis.

    2.b The firs issue will be how to handle the matrix with missing values when perform SVD. One common technique will be do matrix completion first. The matrix completion can then be formulated as an optimization problem, where we try to find a matrix that matches the original one in the existing entries, and at the same time with low rank.

  7. 1.a We can use something like item-item similarity, that we used for term-term similarity to recommend the query expansions. We can make use of scalar clusters and association clusters to take care of transitive related products.

    1.b A metric which normalizes the global rating predicted from item-item co-relation and the active users can be used to find the rating of the current user.

    1.c. The item-centered analysis would not consider user's preferences as does user-centered analysis. Item-centered analysis would suggest products which are globally preferred

  8. This comment has been removed by the author.

  9. 2.a. Yes. It knows what the important dimensions why the user likes the product.

    2.b. If you use LSI, it is very difficult to tell the user why they recommended the product, even if they give the principal axis which is no understanding to the users. (the users were not cared about why the document is recommended in d-t case)


Note: Only a member of this blog may post a comment.