Tuesday, April 27, 2010

Assignment: Recommend a WWW 2010 paper to your classmates...[Due by 5/11--post to the blog]

So NPR has this nifty segment called "You must read this" (http://www.npr.org/templates/story/story.php?storyId=5432412 ) which gets writers and authors to recommend books that they think others should read.

Your last "homework" assignment for this course is patterned after it.

Here is the assignment:

1. Look at the papers being presented at the World Wide Web Conference this week
(the program --along with most of the pdf files--is available at http://www2010.org/www/program/papers/ ; if you are interested in a particular paper but the pdf is not available, you can probably google the authors' pages--technical paper authors tend to be a narcissistic bunch and will put every paper up on their web page as soon as it is accepted ;-) )

2. Check out the "abstracts" of the papers whose titles seem interesting to you based broadly on the aims of this course.

3. If you like the abstract, try reading the introduction (optional, but recommended).

4. By 5/11, post a short comment in response to this article  giving

    4.1 paper title and link to its pdf
    4.2. why you would like to read it and/or why you think others in the class should read it
     4.3. how the paper is connected to what we have done in the course (you could also phrase this as a recommendation: "if you liked that power iteration discussion, you will probably like this paper as it gives ways to speed up the computation")

    (your inputs to 4.2 and 4.3 can be interleaved).


Here is the rationale for the assignment--unlike Physics 101, after which you don't expect to be able to read state-of-the-art papers, this course is about an area that is very much recent and in progress (recall the Far Side neanderthal archeologist cartoon..). So, you actually
do have a shot at understanding the directions of most work being done at the state of the art (and in some cases even understanding their contributions).

Rather than ask you to take this assertion at face value, I would like to encourage you to "do it" and thus "see it to believe it" as it were ;-) [Plus, this is a rather cheap way for me to figure out which WWW papers to read.]



  1. 4.1 Cross-Domain Sentiment Classification via Spectral Feature Alignment [http://www.cse.ust.hk/~sinnopan/publications/%5BWWW10%5DCross-Domain%20Sentiment%20Classification%20via%20Spectral%20Feature%20Alignment.pdf]
    4.2 and 4.3: I would like to read it because it talks about classifying the polarity of sentiments in one domain using the annotation provided in a related domain. This is interesting to me because I work with biomedical and clinical text. The main source for biomedical text is PubMed. There has been extensive research on sentiment classification over PubMed because it is openly available and its topics are already manually annotated with MeSH terms -- so it was easy for researchers to move straight to the second step, finding the sentiments. Finding sentiment polarity in clinical data is tough because the data is limited and is not annotated. It is an interesting idea that sentiment polarity learned on PubMed could be used to find the sentiment polarity of clinical notes. This paper relates to the text clustering and LSI lecture(s) of the course, since SFA (the method they propose) is essentially a novel way of classifying text in an LSI-like vector space.

  2. A Characterization of Online Search Behavior
    Authors: Ravi Kumar and Andrew Tomkins, Yahoo Labs

    4.2 and 4.3: Most of the discussion in the course has centered on search and the incorporation of users into search results, so I think this paper is worth reading as it touches upon an important aspect of online search, namely the users. The paper studies and describes the characteristics of users in terms of their search behavior, such as the fraction of multimedia search versus general web search. The authors use search data collected via the Yahoo! toolbar to conclude that search queries account for 1 in 5 pageviews online, which is a significant proportion of web traffic. The data was collected, with the users' permission, over one week from March 18 to March 24, 2009. This paper is most related to the first few discussions of the course and helps us understand the significance of the search domain and the potential for research in the area. Some of the other findings include: 1) longer sessions correspond to general web search rather than any particular vertical; 2) searches for direct references to SOs (structured objects, i.e., objects with sufficient metadata to identify them, e.g., Florida, Television) formed 52.9% of the searches.

  3. The Anatomy of a Large-Scale Social Search Engine
    Authors: Damon Horowitz, Sepandar D. Kamvar

    4.2) Unlike the keyword queries of a regular search engine, this system leaves us free to ask a question in whatever format is comfortable -- we can pose it as a complete sentence -- and the answer we get comes from an experienced person.
    4.3) The key challenge in this type of search engine is finding trustworthy people to get the answers from. From an IR/social-network point of view (six degrees of separation), we can ask questions of anyone in the world, not just our friend circle. Finding the true links (or weak links) is a big, tedious task, but IR techniques can help by computing a trust-based ranking for a particular user and then suggesting answers from those trustworthy people for that user's quest.

  4. Diversifying Web Search Results

    4.2) Often a user's query gives no clear idea of their underlying information need. This paper provides a sensible strategy of diversifying the ranking of retrieved search results, in the hope that users will find at least one document relevant to their information need on the first page of the results. The paper discusses the importance of diversification and how to diversify search results. Also, queries and documents generally belong to more than one category of information, so diversifying search results helps satisfy multiple users.
    4.3) The approach is based on the concept of correlation, which it uses to build an optimization model for this problem. It shows the importance of correlation in diversifying search results. The authors also address the subjective dependency of correlation (the difficulties of finding the correlation between pages) and various approaches to the same. This paper will help us get more insight into the concept of correlation and ways to determine it.

  5. Title: How useful are your comments? - Analyzing and Predicting YouTube Comments and Comment Ratings
    Conference: 19th International World Wide Web Conference, WWW 2010, Raleigh, USA (to appear)
    Link: http://www.l3s.de/~siersdorfer/sources/2010/wfp0542-siersdorfer.pdf

    Main concepts involved in this paper related to what we studied: Application of relevance feedback (by studying previous comments on a video). Context based text classification (by analyzing the sentiments of a comment). Training the classification model and prediction based on the classifier.

    This paper is about predicting the comment rating (community acceptance or community feedback) for a particular unrated comment on a YouTube video. It proposes a classifier that learns from the influence and sentiments expressed in a comment (using the SentiWordNet thesaurus, a lexical WordNet-based resource containing sentiment annotations) and predicts the community acceptance of a new, unrated comment using this classifier. In addition to rating the unrated comments, the analysis of comments and associated ratings constitutes a potentially interesting data source to mine for implicit knowledge about users, videos, categories and community interests.

    I find this paper interesting and recommend it for a read because it involves the direct application of important concepts (as mentioned above) that we studied in this class.

  6. Title: Predicting Positive and Negative Links in Online Social Networks

    pdf link: http://www-cs.stanford.edu/people/jure/pubs/signs-www10.pdf

    4.2 & 4.3:
    This paper talks about identifying the positive and negative links in social networks, which we have talked about a lot in class. I would like to read it because it is very interesting to learn how we can interpret the possible relationship between two people with more accuracy. It could be extremely useful for improving the quality of existing social network sites such as Facebook and Orkut, for instance when suggesting new friends to a newly joined user.

  7. 4.1 - Optimal Rare Query Suggestion With Implicit User Feedback. [http://research.microsoft.com/pubs/118641/wfp0763-song.pdf]

    4.2 and 4.3 - This paper talks about query suggestion techniques for "rare queries". Since these types of queries carry much less information than the popular queries in the query logs, it is very difficult to suggest other relevant queries by analyzing the logs alone. This paper suggests a new technique that leverages implicit feedback (e.g., clicks) from users in the query logs.

    This approach is similar to the pseudo-relevance feedback techniques that we discussed in the course, the difference being that in this method the clicked URLs and the skipped URLs (by the users) are not treated in the same way, since they contain different levels of information. So the query correlation is made by combining both the click and the skip information and using a random walk model to optimize it.

    This paper would be a good reading for people who liked the idea of Pseudo Relevance Feedback and the Rocchio Algorithm for Query Elaboration.
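    The click/skip asymmetry can be sketched in the Rocchio terms we used in class. This is only an illustration of the idea, not the paper's actual model -- the weights (alpha, beta, gamma) and vectors below are the standard textbook defaults, chosen for the example:

```python
# Rocchio-style feedback where clicked and skipped results get
# different treatment: move the query vector toward clicked docs and
# away from skipped docs. Weights are illustrative defaults.

def rocchio_update(query, clicked, skipped, alpha=1.0, beta=0.75, gamma=0.25):
    n = len(query)
    new_q = [alpha * q for q in query]
    for doc in clicked:
        for i in range(n):
            new_q[i] += beta * doc[i] / len(clicked)
    for doc in skipped:
        for i in range(n):
            new_q[i] -= gamma * doc[i] / len(skipped)
    # Negative term weights are usually clipped to zero.
    return [max(0.0, w) for w in new_q]

q = [1.0, 0.0, 0.5]
clicked = [[0.0, 1.0, 1.0]]   # URLs the user clicked
skipped = [[1.0, 0.0, 0.0]]   # URLs shown above a click but skipped
print(rocchio_update(q, clicked, skipped))  # [0.75, 0.75, 1.25]
```

    The paper's contribution is precisely in replacing this flat treatment with a random walk that propagates the click/skip evidence across the query-URL graph.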

  8. What are the Most Eye-Catching and Ear-Catching Features in the Video? Implications for Video Summarization

    I would recommend this paper because, as it talks about the most eye/ear-catching features in a video, it is essentially talking about finding the 'most relevant' data in video clips and then summarizing it (i.e., video retrieval). It provides an insight into the various video classification techniques being used nowadays.

    It touches upon many of the concepts we covered in class about text retrieval that are also being used for video retrieval. The use of indexing and K-means clustering for identifying repeated slots is pretty interesting.

    It also talks about summarizing news videos by using 'anchor audio' and classifying shots into special and normal events. The use of (1) the tf-idf measure and audio analysis based on audio amplitude, and (2) audio analysis combined with image analysis based on face/text detection and camera motion, is very fascinating.

    I would recommend this paper to fellow students who want more information on similarity analysis techniques for video data, since we were not able to talk a whole lot about it in class.

  9. Diversifying Web Search Results

    4.2 One of the interesting ideas I learned in class is that results from a search engine should not only be based on how similar they are to the query, but should also be diversified. Users come from different backgrounds, so they might be looking for quite different topics. This paper turns the need to diversify the results into an optimization problem, taking both precision and diversity into account. The authors claim they can obtain more diverse results than Google, with similar precision. Experimental results are often the trickiest part of a paper, but given that two of the authors actually come from Google, it is very interesting to see how they outperform their own company.

    4.3 In the search engine part of this course, one of the essential questions is how to model "relevance" using computable measurements such as "similarity", "importance", or a combination of them. The issue addressed in this paper adds a new measurement into consideration: the diversity of the results. It is also interesting to investigate the connection of this idea to "user preference" or clustering.
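    For a feel of what "diversity as an objective" means, here is a standard greedy heuristic (maximal marginal relevance). To be clear, this is NOT the paper's own correlation-based optimization model -- just the classic textbook baseline, with made-up relevance and similarity numbers:

```python
# Greedy "maximal marginal relevance": each round, pick the candidate
# with the best trade-off between relevance to the query and
# dissimilarity to the results already chosen. lam balances the two.

def mmr(candidates, relevance, similarity, lam=0.5, k=3):
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(d):
            max_sim = max((similarity(d, s) for s in selected), default=0.0)
            return lam * relevance[d] - (1 - lam) * max_sim
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected

# "a" and "b" are near-duplicates, so after "a" is chosen the more
# diverse "c" wins the second slot despite its lower relevance.
relevance = {"a": 0.9, "b": 0.85, "c": 0.4}
similarity = lambda x, y: 0.95 if {x, y} == {"a", "b"} else 0.1
print(mmr(["a", "b", "c"], relevance, similarity, lam=0.5, k=2))  # ['a', 'c']
```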

  10. 4.1
    Cross-Domain Sentiment Classification via Spectral Feature Alignment

    I thought the idea of sentiment classification was very interesting. This is one step closer to making the web more based on the user U. There are so many uses for this: social networks, reviews and blogs. How about search results that are positive for a certain political topic?

    This reminded me a lot of the Bayes classifier we discussed in class and implemented in our homework. It would be interesting to see if they build on a Bayes foundation and what advances they have made in text classification.

  11. 4.1
    Scalable Techniques for Document Identifier Assignment in Inverted Indexes.


    In our class we learned how to improve the efficiency of naive tf-idf retrieval by using an inverted index to quickly identify which documents contain which words.

    In this paper, they discuss a method to further improve the performance and scalability of inverted indexes. I thought this paper would be interesting because it builds on something that we have already learned and used. It is interesting to learn about new methods people have come up with to further increase the performance of current technologies.
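    As a refresher on the structure itself, here is a toy inverted index (my sketch, not the paper's code). The relevance to the paper: posting lists are usually stored as gaps between consecutive document ids, and smaller gaps compress better, which is why the assignment of the ids themselves matters:

```python
# A toy inverted index: map each term to the set of document ids that
# contain it, then answer boolean queries by set intersection.

from collections import defaultdict

def build_inverted_index(docs):
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = ["the quick brown fox", "the lazy dog", "quick dog"]
index = build_inverted_index(docs)
print(sorted(index["quick"]))                # docs containing "quick": [0, 2]
print(sorted(index["the"] & index["dog"]))   # boolean AND query: [1]
```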

  12. 4.1
    Modeling Relationship Strength in Online Social Networks

    This paper addresses the issue that relationships typically have a greater granularity than just connected or not connected. This is interesting because it's another way of looking at the weighting issue that has come up over the course of the semester in many different forms. In this case, the authors look at the interactions between users in social networks to determine the strengths of their connections. This sort of strength would be very important in the social problems/games discussed in class (e.g., finding a path to Bill Gates), as a chain of best friends leading to your recipient gives a higher probability of success than a chain of acquaintances. These weights could also come into play when looking at computer networks, as stronger links will be faster, either due to physical proximity/connections or regular communication that keeps the port ready for use when needed (as opposed to closing after being idle).

  14. 4.1 Modeling Relationship Strength in Online Social Networks [http://www.cs.purdue.edu/homes/neville/papers/xiang-neville-www2010.pdf]

    4.2 The point they make about the low cost of link creation is dead on. "Friending" somebody on a social network has almost no meaning. Casual acquaintances are mixed uniformly with best friends and relatives. As such, without giving these links some sort of varying weight, they are essentially meaningless.

    4.3 This ties into both social networking and recommendation engines. The former is extremely literal: in order to build better social network models, analyzing link weight through measuring user interaction is going to be crucial. But the latter case is more extrapolated. In systems that take social networks into account in their recommendations, giving a higher weight to things both similar and dissimilar to "stronger" friends should produce better results. Further, shared preferences could be used inversely, to indicate stronger links, or even to recommend new friends.

  15. 4.1 "Time is of the Essence: Improving Recency Ranking Using Twitter Data"

    Source: http://delivery.acm.org/10.1145/1780000/1772725/p331-dong.pdf?key1=1772725&key2=3121853721&coll=portal&dl=GUIDE&CFID=15151515&CFTOKEN=6184618

    4.2 & 4.3. I felt this paper was useful because it improves the recency ranking of pages using Twitter data. Whenever a query is given, it looks at the webpage/company and calculates its trustworthiness based on the number of followers it has, whereas previously this was based on link analysis. As Twitter is getting more popular and people use it more frequently than blogs or webpages, any new event is updated on Twitter first, which can be used to give higher ranks to recently created pages. This is done by tweaking the reset matrix we saw before.

  16. 4.1 "Tracking the Random Surfer: Empirically measured teleportation parameters in PageRank"

    source: http://www.stanford.edu/~dgleich/publications/2010/gleich-2010-teleportation-www2010.pdf

    4.2 & 4.3: This paper covers the calculation and importance of the random surfer element of PageRank. This value is considered one of the "magic numbers", and it's interesting to see how an accurate value is obtained. It is also interesting that the distribution of raw surfing data establishes how the average user surfs the web. This paper breaks down and focuses on the importance of a single value in the PageRank calculation. The course discussed the calculation and function of PageRank as a whole, but the random surfer element was only briefly mentioned. Read this for a deeper understanding of the PageRank formula, and also for a greater appreciation of the effort that goes into calculating it.
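    As a reminder of where that parameter sits, here is a minimal power-iteration PageRank sketch. The teleportation value alpha = 0.85 below is only the commonly cited default -- measuring what it should actually be is the paper's whole point:

```python
# Power-iteration PageRank with an explicit teleportation parameter
# alpha: with probability alpha the surfer follows a link, with
# probability 1 - alpha (and from dangling nodes) the surfer teleports
# to a uniformly random page.

def pagerank(links, alpha=0.85, iters=50):
    n = len(links)
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - alpha) / n] * n  # teleportation mass
        for src, outs in enumerate(links):
            if outs:
                share = alpha * rank[src] / len(outs)
                for dst in outs:
                    new[dst] += share
            else:  # dangling node: redistribute uniformly
                for dst in range(n):
                    new[dst] += alpha * rank[src] / n
        rank = new
    return rank

# 0 -> 1 -> 2 -> 0: a symmetric cycle, so all ranks converge to 1/3.
print(pagerank([[1], [2], [0]]))
```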

  17. 4.1 Context-aware Citation Recommendation

    4.2, 4.3 There are two aspects to this paper: given the global context (title and abstract) of a research paper, return a list of citation recommendations covering the various ideas in the paper; and given a placeholder in a paper, recommend papers that are relevant and authoritative for the topic being discussed in that context. What makes the paper interesting is the approach it uses to create the candidate set for this task, and a different vector-based similarity measure. To create a local context for a placeholder, the paper considers the 50 words before and after it. The in-link context of a paper p is the local context of a paper p' that cites p. For candidate set generation it uses a query-specific approach similar to A/H computation: it creates a root set of the top K papers whose in-link context is similar to the given placeholder, and then generates a candidate set using citation links. For computing the similarity between a local context and a paper p, it uses Gleason's theorem specialized to finite-dimensional vector spaces.

  18. 4.1 The Anatomy of an Ad: Structured Indexing and Retrieval for Sponsored Search. [http://ciir.cs.umass.edu/~bemike/pubs/2010-2.pdf]

    4.2, 4.3 This paper is closely related to what we have been studying this semester. It covers how many of the IR topics we have learned can be applied to retrieving sponsored ads. The approach discussed includes exploiting the structure of the data (ads are arranged hierarchically) when indexing the information. I would like to read this paper, and likely will once the final is over, because it shows another application of the principles we have learned in this class.

  19. I suggest: Optimal Rare Query Suggestion With Implicit User Feedback. Here: http://research.microsoft.com/pubs/118641/wfp0763-song.pdf

    If you liked relevance feedback techniques like the Rocchio algorithm, and are interested in improving relevance feedback in situations where it would be more useful than ever (rare queries), then this is the paper for you. It derives feedback from both the clicked and the skipped links, which has been shown to be superior to the plain random walk method.

  20. The paper I would like to share with each one of you is:

    Competing for Users’ Attention: On the Interplay between Organic and Sponsored Search Results by Cristian Danescu-Niculescu-Mizilz et.al.

    Link: http://research.yahoo.com/files/www10-interplay.pdf

    Have you ever spent time reading half of an article in a magazine, just to find that it's an ad with a slightly different page layout and "promotion" in fine print in the page corner? The reason I found this paper worth reading is that it deals with the very thin line between designated advertisements and "real" search results. I think the trend to embed commercial content in searches will certainly increase, and the question is how to prevent it from spamming our search results.

    The paper uses several distance metrics introduced in class: for instance, the Jaccard coefficient for titles, cosine similarity for snippets, and edit distance for domains. If you recall the properties of these different measurements, the target each of them was applied to will make sense.
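    Quick sketches of the three measures, as covered in class (toy implementations on made-up strings and vectors, not the paper's code):

```python
# Jaccard on word sets (titles), cosine on term vectors (snippets),
# and dynamic-programming Levenshtein edit distance (domain strings).

import math

def jaccard(a, b):
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def edit_distance(s, t):
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]

print(jaccard("cheap flights online", "cheap hotels online"))  # 0.5
print(cosine([1, 1, 0], [1, 0, 1]))                            # 0.5
print(edit_distance("yahoo.com", "yahoo.org"))                 # 3
```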

  22. 4.1
    What are the Most Eye-Catching and Ear-Catching Features in the Video? Implications for Video Summarization


    4.2 This paper stood out to me because I think video and image search technology is such a fascinating field. The authors talk about different methodologies for video summarization. Video summarization essentially tries to capture the most important parts of a video and present them to the user, so they can decide whether or not they want to watch the whole video.

    4.3 If you are interested in creating the automatic cluster summaries in project part 3 (as I was), then this paper might be of interest to you. It would also be of interest to you if you took a liking to alternative search mediums or strategies. If video search is something that you would like to know more about, this paper tries to represent the current state of the field.

  23. 4.1.

    title: "Empirical Comparison of Algorithms for Network Community Detection"

    link: http://www-cs.stanford.edu/people/jure/pubs/communities-www10.pdf

    4.2. , 4.3.

    I think this paper is interesting because it presents a combination of two concepts with no apparent relationship between them: clustering and social networks. Here the concept of a cluster is no longer based on the distance between objects in a feature space; it is based on the connectivity between nodes (a community/cluster as a set of nodes with better internal connectivity than external connectivity). The connection with what we have done in class is straightforward: for social networks this means finding the sub-graphs with strong internal connectivity (groups of friends, groups of interest, etc.), which can lead us to discover interesting relationships between groups of nodes.

  24. Liquid Query: Multi-Domain Exploratory Search on the Web

    I found this paper interesting, as it tackles the new problem of doing exploratory search over multiple vertical domains and then combining the results. This, I think, is a novel and useful idea. For example, if I want to plan a concert trip for my vacation, I would be interested in getting search results from multiple vertical domains such as "music", "hotels", "food", "travel", etc., integrated together. The results obtained from focused search over the individual domains also need to be integrated to present the user with useful results on which to base the vacation plan.

    I have not managed to read this paper completely yet, but I am sure it deals with a whole lot of problems we discussed in class, such as query processing, information extraction, integration, the deep web, exploiting structured data, etc. I think search over structured data on the web, and presenting users with a new kind of search result rather than just pointers to web pages, has a lot of scope for new research ideas; if you are pursuing something in that direction, the application presented in this paper is an interesting one.

    Link: http://portal.acm.org/citation.cfm?id=1772708&dl=GUIDE&coll=portal&CFID=88016521&CFTOKEN=85145801

  25. Factorizing Personalized Markov Chains for Next-Basket Recommendation


    This paper discusses how recommendations can be made by thinking of a user as transitioning from one item to another as he/she buys/rates them. A graph can be constructed based on the user's behavior patterns; this is modeled as a Markov chain, and the next item the user is likely to land on is computed. This takes a different approach compared to the ones we discussed in class. It might be interesting to connect this with the PageRank discussion.
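    The unpersonalized baseline that the paper improves on (by factorizing personalized transition matrices) can be sketched like this -- my own toy example, not the paper's code:

```python
# A first-order Markov-chain recommender: estimate transition
# probabilities between items from purchase sequences, then rank
# candidate next items by transition probability from the last item.

from collections import Counter, defaultdict

def fit_transitions(sequences):
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    # normalize counts into transition probabilities
    return {item: {nxt: c / sum(ctr.values()) for nxt, c in ctr.items()}
            for item, ctr in counts.items()}

def recommend(trans, last_item, k=2):
    ranked = sorted(trans.get(last_item, {}).items(), key=lambda p: -p[1])
    return [item for item, _ in ranked[:k]]

baskets = [["milk", "bread", "butter"],
           ["milk", "bread", "jam"],
           ["bread", "butter"]]
trans = fit_transitions(baskets)
print(recommend(trans, "bread"))  # ['butter', 'jam']
```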

  26. Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors
    I find this paper interesting because it talks about analyzing large amounts of data using machine learning algorithms, and builds a different kind of sensor that detects events like earthquakes based on tweets. Twitter is a microblogging site; the good part is that it is real-time, and millions of tweets are posted each day. Many applications can be built on top of Twitter posts, for example news feeds and live traffic info -- in this case, event detection in real time.
    This paper touches on a lot of topics covered in the course, like social networks, classification, feature extraction and semantic analysis of tweets. It also covers signal processing concepts like Kalman filtering and particle filters. The challenging part is that there are tons of tweets and each tweet can be at most 140 characters long, so extracting features from such short tweets is an interesting problem to solve. The paper also talks about the diffusion of information obtained from the tweet sensor over the social network. Overall, it's a good paper to read, and a source of ideas for tweet-related applications.

  27. 4.1 Stop Thinking, Start Tagging: Tag Semantics Emerge from Collaborative Verbosity
    4.2 & 4.3 This paper caught my eye because of its seemingly strange conclusion. The paper discusses collaborative tagging and folksonomies, where users tag the information themselves in order to classify and search it. Apparently, even though a wide variety of users will have different categorizations of pictures and blog topics, a stable language emerges within the community. One would think that the people who remain the most concise and consistent in their tags would contribute the most to the growth of the semantic structure, but it is actually the more descriptive users who do so. While the paper doesn't propose any new algorithms for extracting semantics from folksonomies, it makes for an interesting read if you want to learn about different Web 2.0 users and which ones matter. As far as 494 goes, the paper relates loosely back to the class in that it discusses classification and information extraction, but without a predefined syntax like XML.

