Tuesday, April 27, 2010

Mandatory Blog Question (you should post your answer as a comment to this thread)

Here is the "interactive review" question that I promised.

"List five nontrivial ideas you came to appreciate during the course of this semester"
(You cannot list generic statements like "I thought PageRank was cool." Any cross-topic connections
you saw are particularly welcome.)

Note that you must post your response as a comment to this particular message.
All comments must be posted by the end of May 4th (next Tuesday -- the last class).

If you have forgotten what we have done, see the class home page for a description.



  1. 1. Computing the primary eigenvector efficiently through power iteration, rather than via the cubic-cost SVD.

    2. The ability to compute the most active collaborating group using authority-hub analysis: I actually want to apply this idea at an SNA company that studies author-author collaborations.

    3. While PageRank is cool and vector similarity is great, the real difference in web search is made by the server farms -- which means distributed computing is a crucial issue. It makes me glad I took a full course on that topic in the past.

    4. The idea of power laws - especially Zipf's law, because my research interest is distributional semantics (understanding the meaning of text by how the text and the content around it statistically occur in a giant unannotated corpus like PubMed). This makes me confident that I can actually get almost 100% of the information about biomedical sentences by using just the abstracts (millions of them, though) of scientific text.

    5. The idea of how close clustering and classification are to each other -- this shattered my preconception that some sort of iron curtain separates clustering from classification. The different clustering algorithms taught will be helpful.

    6. The realization of how useful linear algebra is for practical applications like web search -- it is the core of major things like PageRank, A&H, LSI, etc. This inspired me to study applications of other abstract mathematical theories that have not yet been fully explored.
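The power-iteration idea in the first point above can be sketched in a few lines of Python (a minimal illustration on a made-up 2x2 symmetric matrix; the iteration count is an arbitrary choice):

```python
def power_iteration(M, iters=100):
    """Estimate the primary eigenvector of a square matrix M
    (list of lists) by repeated multiplication and normalization."""
    n = len(M)
    v = [1.0 / n] * n  # arbitrary starting vector
    for _ in range(iters):
        # multiply: w = M v
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        # normalize so the vector doesn't blow up or vanish
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# Toy symmetric matrix; its primary eigenvector is [1, 1] / sqrt(2)
M = [[2.0, 1.0], [1.0, 2.0]]
v = power_iteration(M)
```

Each iteration costs only a matrix-vector product, which is why this beats a full (cubic) eigendecomposition when only the dominant eigenvector is needed.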

  2. 1. The coolest, most applicable idea is the vector space model (tf-idf). Whenever you don't have any idea to work on, just start with this one. tf-idf is a panacea for IR people.
    2. For providing query suggestions or query expansion, the best approach is to use the query log and calculate term-term correlations.
    3. LSI is another approach to finding correlations, either doc-doc or term-term. A query is just a special type of document.
    4. Random Surfer Model: it is the key idea behind PageRank. M* = c(M+Z) + (1-c)K, where K is the reset matrix. We can play with K, making it sensitive to topics, trust, or recency. Topic-specific PageRank is a nice direction for more research.
    5. HITS model: the best way to study whether the content of a page is important or whether the links we get from that page are important.
    6. Conclusion:
    hub < auth < PageRank < PageRank + vector similarity < vector similarity
    7. Social networks: the small-world phenomenon. Social networks are power-law networks. The copy-paste model creates hubs in the network, making it more robust to random disruptions but less robust to targeted attacks.
    8. Clustering and CF: topic-specific search can be viewed in the light of clustering. NBC is a nice idea for filling in missing values. Also, recommendation systems can be viewed as user-centered or item-centered.
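The random-surfer formula in point 4, M* = c(M+Z) + (1-c)K, can be sketched in Python (a minimal illustration on an invented three-page link graph; K is taken to be the uniform reset matrix, and Z -- the fix for sink pages -- is folded in by giving link-less pages a uniform out-distribution):

```python
def pagerank(links, c=0.85, iters=100):
    """PageRank via power iteration on M* = c*(M+Z) + (1-c)*K.
    links[i] = list of pages that page i points to."""
    n = len(links)
    rank = [1.0 / n] * n
    for _ in range(iters):
        new = [(1 - c) / n] * n           # the (1-c)*K reset contribution
        for i, outs in enumerate(links):
            if outs:                       # M: spread rank along out-links
                for j in outs:
                    new[j] += c * rank[i] / len(outs)
            else:                          # Z: sink page gets a uniform row
                for j in range(n):
                    new[j] += c * rank[i] / n
        rank = new
    return rank

# Tiny 3-page web forming a cycle: 0 -> 1 -> 2 -> 0
ranks = pagerank([[1], [2], [0]])
```

Making K non-uniform (e.g. concentrated on pages about one topic, or on trusted pages) is exactly the topic-sensitive / TrustRank variation mentioned above.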

  3. 1. How simple ideas, like representing documents as bags of words and creating inverted indices, are so effective in extracting useful information from the web. Representing documents as vectors in the space of terms was one more concept that I liked a lot; it made us think beyond 3-D and, more importantly, visualize it.

    2. The TF-IDF way of handling user queries -- more specifically, the way we can reduce the effect of more common words by using their IDFs to give them lower weights. Also, the normalization techniques by which we can ensure that the size of a document does not affect query results.

    3. I realized how useful linear algebra can be. We used it practically everywhere, from PageRank/authority-hub computations to dimensionality-reduction techniques like LDA and LSI. It is very interesting to see how LSI can find the actual dimensions in the data (something that is not so intuitive), and also to learn about power iteration as a fast way of computing the primary eigenvector.

    4. I liked the simple idea of using anchor texts and links for retrieving documents that do not have any text or are not labelled (e.g., images, audio files), which I always felt would be a very complicated task.

    5. The way "users" are taken into account while giving query results (relevance feedback) is interesting, as I never thought we could get users' feedback without actually asking them explicitly; we can also now give results specific to different users, which is like customizing the search engine to an individual's needs.
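The TF-IDF weighting in point 2 above can be sketched on a toy corpus (exact tf and idf variants differ across texts; this sketch assumes one common choice, raw term frequency times log(N/df)):

```python
import math

def tfidf(corpus):
    """corpus: list of documents, each a list of words.
    Returns one dict of term -> tf-idf weight per document."""
    N = len(corpus)
    df = {}                      # document frequency of each term
    for doc in corpus:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in corpus:
        w = {}
        for term in doc:
            tf = doc.count(term)             # raw term frequency
            w[term] = tf * math.log(N / df[term])
        weights.append(w)
    return weights

docs = [["web", "search", "engine"],
        ["web", "page", "rank"],
        ["page", "rank", "rank"]]
w = tfidf(docs)
```

Note how "search", which appears in only one document, outweighs "web", which appears in two -- the IDF factor doing exactly the down-weighting of common words described above.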

  4. 1. The vector space model learned in class reaffirms the idea of representing objects as vectors of measurements over the objects' descriptive features. Thus images, documents, and other objects can be represented as single points in n-dimensional space (n = # of features), and the similarity between them arises naturally from the spatial distance between the objects. This approach to representing measurable objects is also applied in other areas such as image processing and retrieval, machine learning, etc.

    2. The concept of inverted index to take advantage of the inherent sparsity of the document-term matrix.

    3. The power iteration method for finding the eigenvectors of symmetric matrices, which represent the solution to the authorities-hubs computation and the PageRank calculation. This method is an easy way to find eigenvectors algorithmically, compared with the inherently high complexity of a full eigendecomposition, so I find it very useful.

    4. The view of the web as a set of pages with different probabilities of being visited according to the links between them and their stationary probabilities, modelled as a Markov chain from the point of view of a random surfer.

    5. The utility of SVD as a linear algebra method for finding an orthonormal basis for a given set of vectors and for reducing dimensionality by discarding the less relevant dimensions. SVD thus allows us to address the "curse of dimensionality," reducing the original space to a subspace with fewer dimensions while losing only a measurable degree of variance. Applications of dimensionality reduction are common in (again) image processing, machine learning, etc.

    6. Classification and clustering methods, such as k-means and hierarchical clustering, to group objects based on their distance in the space described by their feature dimensions. Beyond text classification and document clustering, these methods apply to any type of object that can be represented as a vector defining a space.
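The spatial-distance idea in point 1 is usually realized as cosine similarity between feature vectors; a minimal sketch (the three toy "documents" are invented for illustration):

```python
def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

# Three "documents" over a 3-term vocabulary
d1 = [1.0, 2.0, 0.0]
d2 = [2.0, 4.0, 0.0]   # same direction as d1, different length
d3 = [0.0, 0.0, 5.0]   # orthogonal to d1
```

Because cosine measures angle rather than length, d1 and d2 come out maximally similar even though one is "twice as long" -- the length-invariance that makes this the natural similarity for documents of different sizes.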

  5. 1. How LSI is able to find latent details, and the explanation of LSI as a special case of LDA.

    2. I liked the concept of using LSI to handle non-linear dimensions: starting from a two-dimensional shape such as a circle, increasing the dimensionality, and then coming back to the required dimensions.

    3. The explanation of why the k-means clustering algorithm cannot guarantee an optimal clustering,
    and under what conditions it would provide the optimal clustering (with the example of a group of Democrats moving together).

  6. 4. How the Naive Bayes classifier makes the huge assumption that attributes are independent of each other, and yet still manages to predict the right class.

    5. How XML helps information retrieval in creating specific search engines such as Google Scholar or air-ticket websites (while still managing to support database users).
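The Naive Bayes independence assumption mentioned above amounts to multiplying per-attribute likelihoods; a toy sketch with made-up training data (add-one smoothing is one common choice, assumed here):

```python
from collections import defaultdict

def train_nb(examples):
    """examples: list of (attribute_tuple, label)."""
    priors = defaultdict(int)    # label -> count
    counts = defaultdict(int)    # (label, attr_position, value) -> count
    values = defaultdict(set)    # attr_position -> set of seen values
    for attrs, label in examples:
        priors[label] += 1
        for i, v in enumerate(attrs):
            counts[(label, i, v)] += 1
            values[i].add(v)
    return priors, counts, values

def predict_nb(model, attrs):
    priors, counts, values = model
    total = sum(priors.values())
    best, best_p = None, -1.0
    for label, prior in priors.items():
        p = prior / total
        for i, v in enumerate(attrs):
            # naive independence: multiply smoothed per-attribute likelihoods
            p *= (counts[(label, i, v)] + 1) / (prior + len(values[i]))
        if p > best_p:
            best, best_p = label, p
    return best

# Invented toy data: (outlook, windy) -> play?
data = [(("sunny", "no"), "yes"), (("sunny", "yes"), "no"),
        (("rainy", "yes"), "no"), (("overcast", "no"), "yes")]
model = train_nb(data)
pred = predict_nb(model, ("sunny", "no"))
```

Even though outlook and windiness are clearly not independent in reality, the multiplied estimates still rank the correct class highest -- the surprise the comment is pointing at.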

  7. 1. Vector Similarity: How documents can be represented as a bag of words and further can be viewed as vectors in space of terms. Even though such representations are highly lossy, it was surprising to see that it performed reasonably well in finding similar documents.

    2. Association & scalar clusters: how correlations between terms can be captured by computing the t-t matrix, how transitive dependencies can be found by doing vector similarity on the t-t matrix, and how, if two terms never occur together, it doesn't mean they are independent -- they may be negatively correlated.

    3. **LATENT SEMANTIC INDEXING**: I liked the introduction of how LSI can defeat a very mean and malicious oracle, and how picking the top-k dimensions is as easy as picking the dimensions corresponding to the k largest singular values -- as against feature selection, where picking one dimension affects what the next best dimension is. [The example of how, if GRE score is picked as a feature, GPA becomes more or less meaningless, made the idea very easy to understand.]

    I also enjoyed the readings about how LSI is used in collaborative filtering, and how it can be used to compare people on dimensions which cannot always be explained in words.

    4. How the reset distribution in PageRank can be used to personalise search results.
    The Tyranny of the Majority, and how stability to random perturbations can be increased by considering the normal to the plane that contains the primary and secondary eigenvectors.

    5. The power law and its polynomially decaying long tail, and how this can be used to explain why Amazon continues to sell Iranian classical music, etc.
    How trust propagates in a network, whereas distrust doesn't.

    6. XML only ensures syntactic correctness and does not guarantee semantic correctness. The example of the CV in Mandarin (I presume) with tags in Greek was amazing and helped convey the message.
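The association/scalar-cluster idea in point 2 can be sketched on a toy document-term matrix (the matrix is invented; the association matrix is D-transpose times D, and scalar similarity is the cosine between rows of that matrix, which can surface a correlation between terms that never co-occur):

```python
def tt_matrix(D):
    """Association (term-term) matrix C = D^T D from a
    document-term count matrix D (rows = docs, cols = terms)."""
    nterms = len(D[0])
    return [[sum(D[d][i] * D[d][j] for d in range(len(D)))
             for j in range(nterms)] for i in range(nterms)]

def scalar_sim(C, i, j):
    """Scalar-cluster similarity: cosine of rows i and j of C."""
    dot = sum(a * b for a, b in zip(C[i], C[j]))
    ni = sum(a * a for a in C[i]) ** 0.5
    nj = sum(b * b for b in C[j]) ** 0.5
    return dot / (ni * nj)

# Terms t0 and t2 never co-occur, but both co-occur with t1
D = [[1, 1, 0],
     [0, 1, 1]]
C = tt_matrix(D)
```

Here the direct association C[0][2] is zero, yet the scalar similarity between t0 and t2 is positive -- the transitive ("second-order") dependency the comment describes.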

  8. Everything covered in this course was very very interesting.
    1) The concept of finding 'similarity' itself, without any human intervention, using trivial ideas like a 'bag' of words; the significance of the inverted index.
    2) The use of linear algebra, especially matrices. I never knew they could be such an effective tool for calculating similarity: finding correlations between terms using clusters, finding similarity between terms based on their co-occurrence, tackling synonymy and polysemy, tolerant dictionaries, and taking user input on the go using relevance feedback.
    3) The concept of high-D apples being all peel and no core. Reducing dimensionality using LSI -- it's amazing how you can recover the actual dimensions using SVD.
    4) Six degrees of separation and the small-world phenomenon, and also the long-tail phenomenon.
    5) XML and information extraction; the concept of RDF Schema.

  9. Of course non-trivial is the key word here :)
    In order of the ideas I found most intriguing, weighted by how non-trivial they were to me:

    1. Two things: finding the dimensions of the data that can be used to capture the most variance in the data, and calculating the primary eigenvector by repeated multiplication. Not really that trivial to me previously, but it is very relevant to the work I have been doing in the mathematics department on weather simulation. We took a weather simulation with 40-50 million dimensions and were able to use both concepts to determine that the dimension of the subspace of most error growth was around size 22-24.

    2. PageRank using a reset matrix to ensure convergence to a unique eigenvector. I was familiar with PageRank and primary-eigenvector multiplication before, but I would have thought that using a reset matrix would make the results worse, as there is no reason to assume a user would visit a random page. However, I think that using a non-zero, non-uniform reset matrix would be a big improvement over a uniform one.

    3. The web is not an exponential distribution but a power-law distribution. My naive thought was that it would be exponential, but now it makes sense that it is a power law.

    4. Latent semantic indexing: Sort of used in number 1, but I would have never thought to reduce the dimensionality of a matrix representing the web, nor would I have thought that this dimensionality could have been some linear combination of various words. But after seeing the results, there is no denying that it works.

    5. Authorities/Hubs calculation. It makes some sense to distinguish the web into two types of pages, and it certainly gives results. The conclusion that I draw from this leap of insight though is that maybe there are more than just two types of pages? Maybe there are really 3,4,5 or more types of categories and that the scores can be improved by integrating these other types.

  10. 1) LSI as a special case of LDA. The concept of LSI itself was new to me, and I found the discussion very interesting and useful in understanding how the dimensionality of documents can be reduced. The idea of thinking of LSI as a special case of LDA was unique and interesting.
    2) Before this class I had a vague idea of how the feedback of a user can be incorporated in the search results and the discussion on the Rocchio method helped me understand the idea of Relevance Feedback.
    3) The discussion on different forms of information extraction. DOM tree patterns. parse trees. etc. was very useful. The talk in the extra class on Friday cleared many questions I had in terms of wrappers and their alternatives.
    4) Tolerant dictionaries and the different edit distances helped me understand the benefits of tolerant dictionaries, having used the feature many times in Google Search.
    5) One of the more interesting topics that really took me by surprise was the Benford's law that can be used to detect forgery in financial documents.

  11. 1. Pagerank was not completely new and unexpected. The field of IR already had a large body of knowledge that was just poorly implemented, and Brin and Page made the incremental improvements necessary and figured out a way to implement it on a scale that would stand the test of time.

    2. An impressive amount of understanding can be gleaned from queries and documents without any NLP. However, I wonder if NLP will overcome the statistical methods of IR over time.

    3. I found LSI to be a very nice technique for detecting associations between terms. More generally, it is very useful to think of old things in new ways (when SVD was invented, I doubt anyone thought it would be used for IR).

    4. IR is a mix of common-sense ideas and more subtle insights. Lots of the common-sense ideas won't work without that "extra little bit" (i.e. reset matrix in PR, IDF in vector space).

    5. Application of IR techniques to the emerging semantic web. Many people think the semantic web is the future. If it is, adapting IR to it will be important.

  12. 1. I learned how search engines work! Everything learned in this course was new to me. I learned how exciting and complex the topic is. I had never even considered how search engines might work and was pleasantly surprised to find such an interesting science behind it. I loved the orderliness of the math behind everything, and doing it on a small scale made me feel cool. ☺

    2. I learned that what matters is that the user is satisfied. We can measure accuracy in precision and recall, but in the end, the result is best when the user is most satisfied. As a result, shortcuts can be made that may even lessen the accuracy, but return results faster, and that may be more successful than a solution that has perfect precision and recall.

    3. I learned that there are real world applications for the math I have learned over the last 4 years. I learned that even in computer science fields advanced math is important. I had naively assumed that many of the math courses were a formality for an engineering degree. It was refreshing to know that there was a good reason for learning things like linear algebra and discrete math.

    4. I learned that IR really involves many other complex topics such as psychology, databases and AI. I feel like through this course I have also had an introduction to all of those topics. I learned how understanding human psychology is important when developing for the web. It applies not only in search engines but also in ratings prediction, in combating malicious behavior, in social networks, and in almost everything on the web. I learned how artificial intelligence is important for web based information retrieval. How if the search engine can learn it can become better at returning the results that a user is interested in.

    5. I learned about the small world phenomenon and social networks. This topic really ventures into sociology. Learning how people work is integral to learning how web-based social networks work. This topic, like so much of what was taught in this class, is bigger than just data mining and information retrieval. Through this course, we were provided an introduction to understanding the web and its inner workings, and in the process I found myself learning about many other far-reaching concepts.

  13. The course provides a platform to apply the theory learnt in machine learning and linear algebra. The problem of deriving meanings that are understandable by humans from decisions made by machine learning algorithms was interesting to think about. Similarly, the physical meaning of eigenvectors and eigenvalues was good to understand.

    I found classes on Recommendation Systems, Collaborative filtering and Social networks most interesting.
    The idea of expanding labeled data from a small labeled dataset, and the related paper, was good to read.
    It is difficult to take on Google in search, but many ideas like Aardvark can be implemented. We can build interesting research tools using recommendation systems. Citations are a good way to recommend research papers, but a system like Aardvark can work better, as it gives us access to a community of people who have worked in the particular area.

  14. 1. One of the most innovative ideas in IR that has remained in my memory is the concept of representing objects of one class as vectors in the space defined by objects of another class. This idea makes many concepts of linear algebra available as tools ("hammers") for extracting more meaningful information from data.

    2. LSI makes use of one such tool: SVD. Though it's not a panacea, it does provide a common solution to problems like dimensionality reduction, correlation analysis, and similarity measurement. What amazes me is how LSI makes it possible to map data from n dimensions to, say, 2 dimensions with minimal loss of variance -- and this loss can be exactly estimated.

    3. Although authority/hubs and PageRank are more or less similar, the existence of the reset distribution matrix makes PageRank more appealing and better suited for link analysis on the web. Starting as a mandatory requirement, it turns out that the reset distribution matrix can very well be exploited to obtain different kinds of PageRank. In the case of authority/hubs too, there is the novel "if you have lemons, make lemonade" idea, which turns a limitation of the authority/hub computation into an application for identifying communities.

    4. It was quite interesting to learn that the small-world phenomenon exists in our social networks, and in fact that they follow power laws. The notion of rare events being not so rare explains why it isn't difficult to find many pages with high authority/PageRank on the web even though the small-world phenomenon exists there.

    5. In the lecture series on information extraction, what I particularly liked was the motivation behind the origin of XML and how it is viewed by IR and DB folks. Jargon like RDF and OWL became much clearer after learning that their usage is a step towards achieving some form of NLP.

  15. This course turned out to be more interesting than I thought it would be. Not only did I learn a lot of great information, but being able to implement all these different concepts for the project is pretty sweet as well.

    1. tf-idf using inverted index is awesome. It's pretty much the backbone of our search engine. Not only did we learn how to find the top documents to a query, but also how to store them so that the search process is fast enough to be usable in the real world using an inverted index.

    2. I really enjoyed reading Google's rejected publication in the homework assignment. It is really interesting to understand everything that goes into a complete, deployed search engine beyond the simple IR algorithms, covering the distributed computing structure and so on.

    3. As an undergraduate, there haven't been many classes that really go into how concepts are used in the real world. Learning about how social networks are manipulated in the real world was interesting (facebook stuff, aardvark, analyzing user trust...).

    4. XML in IR - Taking advantage of the structure to improve precision/recall, and the future with NLP.

    5. Recommendation systems. This goes kind of with #3, as they are widely used in the real world. Recommendation systems are a great example of putting feedback to good use to make predictions for the user.

  16. 1. The first thing that impressed me is the calculation of vector similarity using an inverted index instead of doing the inner products directly. This showed me how to exploit the sparse structure of the matrix.

    2. Power iteration gives a striking example of how real-world applications differ from theory: three lines of Matlab code bring a lot more implementation problems when the size grows to web scale.

    3. Another very "non-trivial" idea: PageRank is not the reason Google is Google. One key reason is the computing resources they already have and their ability to manage such a huge computing infrastructure.

    4. In this class we learned how the structure of the web helps the quality of search results. By considering different components such as titles and anchor texts, we can make search much more reliable, since we are converting a simple collection of words into a semi-structured one.

    5. The power-law distribution is another mind-blowing idea. The way the web connects together makes it possible for one particular node to receive lots of incoming links. Then, web sites like Google are not that rare in this setting.
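The inverted-index trick in point 1 -- scoring only documents that share a term with the query, instead of taking an inner product with every document -- can be sketched as follows (toy documents; weights are raw counts for simplicity):

```python
def build_index(docs):
    """Inverted index: term -> list of (doc_id, term_count)."""
    index = {}
    for d, words in enumerate(docs):
        for term in set(words):
            index.setdefault(term, []).append((d, words.count(term)))
    return index

def score(index, query):
    """Accumulate dot-product scores by walking only the postings
    of the query terms, never touching unrelated documents."""
    scores = {}
    for term in query:
        for d, w in index.get(term, []):
            scores[d] = scores.get(d, 0) + w
    return scores

docs = [["apple", "pie"], ["apple", "apple"], ["car"]]
idx = build_index(docs)
s = score(idx, ["apple"])
```

Document 2 never enters the computation at all -- exactly the saving that makes this approach scale when the document-term matrix is overwhelmingly sparse.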

  17. 1. The concept of association clusters - and how the concept of finding similar search queries is similar to the idea behind finding correlated genes possibly responsible for disease.
    2. Marginal relevance - and how important it is to consider other results while displaying new results.
    3. Using SVD to obtain linearly independent 'terms'.
    4. Difficulty in cluster evaluation in the absence of class labels and how semi-supervised techniques / self-training could help.
    5. How k-means does not find an optimal clustering and how it could be made to find one.
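The k-means point (5) can be sketched concretely: a single run only reaches a local optimum, and one standard remedy is multiple random restarts, keeping the run with the lowest within-cluster distortion (the toy 1-D data, seed, and restart count here are invented for illustration):

```python
import random

def kmeans_1d(points, k, iters=20):
    """One run of k-means on 1-D data; returns (centers, distortion)."""
    centers = random.sample(points, k)       # random initial centers
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                     # assign to nearest center
            nearest = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]   # recompute means
    distortion = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, distortion

def kmeans_restarts(points, k, restarts=10):
    """Keep the best of several random restarts."""
    return min((kmeans_1d(points, k) for _ in range(restarts)),
               key=lambda r: r[1])

random.seed(0)
points = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
centers, distortion = kmeans_restarts(points, 2)
```

On this well-separated toy data the restarts reliably recover the two natural groups; on harder data, keeping the lowest-distortion run is what guards against a bad random initialization.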

  18. 1. I liked the LSI concept the most, as it makes use of linear algebra to provide extremely useful interpretations of noisy data.
    2. Social networks have multiple interesting concepts, such as the small-world phenomenon.
    3. I liked Aardvark's work in social networks, in which people can answer another person's query.
    4. Recommendation systems based on user profiles are quite good; they basically capture the psychology of the user and try to recommend things accordingly.
    5. The use of XML for providing structure in a way that can be used either by IR people or by database people.

  19. 1. I had a vague understanding of what PageRank was before the class, but using statistical analysis to guess at what a random surfing user would do to gauge importance is clever and non-intuitive, but highly effective. Surprisingly so.
    2. "Dimensional reduction through the amortization of variance" sounds neat on its own, but the mechanisms behind it are also really cool. I honestly never thought I'd have a use for the SVD section of my linear algebra course.
    3. Small changes to k-means, such as multiple initial random guesses or hybridization with hierarchical clustering, can have a profound impact on the usefulness of the results. That such small implementation details can have such enormous effects is really incredible.
    4. I never knew that TF-IDF was such a useful generic measure of similarity. It's the glue that holds almost everything else together. If there's one thing in this class I could count on using again, it'd be TF-IDF.
    5. Having used Lucene and various other search libraries in the past, it has always been a passive experience, where I simply took it on good faith that they worked optimally. Now that I've dealt with and studied the background of search and classification to such a great extent, I feel like I not only understand what they're doing, but am in a position, at least to a moderate extent, to expand and modify the original functionality. Whether I would ever need to do this isn't really important, so much as that I'm capable now.

  20. 1. Learning how mashups work and how they integrate different websites with structured data. Keyword search on structured data on the web/RDF was interesting, as it could improve web search for queries that require precise answers, such as "actor of movie Avatar" or "winner of IPL 2010"; running such queries on Google does not give satisfactory results, as it only provides pointers to pages where users have to search themselves. Search over the structured data available on the web would greatly improve the results fetched. I also found an application by Google, "Google Squared," which I think uses the structured data on the web to fetch and integrate results, and I found it interesting.

    2. TF-IDF, and learning that the idea of TF-IDF is applicable to any problem in IR to get a reasonable result, was awesome. It is thought-provoking that a simple statistical measure can in fact solve some of the most intricate problems quite reasonably.

    3. Collaborative filtering in recommendation systems. It was very interesting to learn the actual processes involved in making a recommendation system such as the one offered by Amazon; the ideas of computing user-user/item-item correlations using a correlation matrix, and of applying LSI to collaborative filtering to categorize users by the kinds of movies they like, were interesting.

    4. The reset distribution in page rank and how it can be exploited to serve the user's interests.

    5. Association clusters & scalar clusters: the idea of how scalar clusters help in identifying the hidden association between two terms even when the two terms do not occur together.

  21. 1) First and foremost, I learnt how a search engine works through this course. The vector space model, being the fundamental topic of this course, helped in picturing the web search results (the URLs) and the query as vectors in the multidimensional keyword space, and showed how tf-idf helps in finding the documents that are relevant to queries. It also helped in identifying the importance of the terms in each document, as they determine the weights for the document now that Boolean weights are no longer used.
    2) The importance of correlation analysis in determining the strength of the relationship between documents. As the semantic web is an essential concept for modern information retrieval, and LSI is one of the most effective approaches in this field, learning and understanding LSI was very valuable. LSI also helped us understand that some terms are more important than others in any given text, and how SVD helps in identifying the relationships between the terms and concepts present in a document. Since LSI is completely mathematical, we learnt that as long as information can be represented as a d-t matrix, LSI can process it.
    3) PageRank was another important concept, indicating the importance of a website. PageRank may be a small factor in judging that importance, as it can sometimes be misleading, but learning how to implement this technique was still very useful.
    4) Authorities and hubs helped in learning how to locate high-ranking pages and how that improves the performance of a search engine. It also helped in understanding how the importance of one page determines the importance of another.
    5) Clustering was another important concept we got out of this course. The web being home to very large data sets, learning how mining these data for patterns helps in providing relevant results to a query was very useful. Through the project, we also learnt various extensions of the simple k-means algorithm that helped in identifying the best clusters.

  22. 1. Before I joined this class, I was really happy, or at least OK, with search engines, as they are kind of our lifeline. But thinking about what more search engines could do, and getting a feeling for the contributions we could make towards that, presented an altogether different perspective.
    2. As part of the subject, the first idea that I liked was how, and especially "why," to find eigenvectors using power iteration. In the beginning I wondered why we should do power iteration when we can find the eigenvectors directly.
    3. I was amazed by the idea of documents being represented in n-dimensional space as vectors, and found a sudden new passion and respect towards mathematics.
    4. I found the LSI concepts overall, and especially the transitive relationships between documents (Saddam being related to Bush), very interesting.
    5. The concept of the reset matrix in PageRank, and how the randomness in surfing is taken into account while calculating PageRank, was really cool. :)
    6. Also, finding the correlation between terms using scalar clusters interested me a lot.
    7. All in all it has been an adventure and I feel privileged to be a part of this class.

  23. 1. I was really surprised that the crudeness of the bag-of-words approach actually yields decent search results. This also goes to show the state of relative infancy web IR is in.
    2. While the simplicity of bag of words surprised me, the complexity of feature selection had me confused at first. As the course went on and SVD was taught in Linear Algebra, I came to appreciate the value of determining the important features of documents.
    3. Social Networks were interesting in their broad usefulness. I figured things like 'the Kevin Bacon' game were possible using social networks, but I hadn't thought of network security/robustness as a potential application.
    4. Likely the most interesting topic to me was recommendation systems. I liked the unique challenges stemming from sparse matrices and personal rating habits (only rating things high, low, etc.), and the fact that this is a thriving area of IR, as can be seen from the Netflix competition.
    5. Finally, I was intrigued by the problem of finding structure in the web. While bag of words is OK, the discussions on XML and database integration were interesting for actually finding an answer, as opposed to finding a resource for the answer. I am curious about the application of natural language techniques to deriving structure from sentences/headers/etc. (i.e., distinguishing between a title of something involving trash and a description of something as trash).

  24. Several points that impressed me in this class:
    1. SVD decomposition of the document-term matrix

    2. Different ways of understanding document-term relations: a document can be treated as a bag of words, and a term can be described by documents or by other terms.

    3. The concept of PageRank and Authority/Hub, and their interpretation in terms of graphs. This gives me some ideas for a new paper I am working on.

    4. Scale-free networks can be understood in a very different way that I was not familiar with, which drove me to think deeply.

    5. Information integration/extraction and trust on graphs are helpful for my current work on a crowd-sourcing project.

    6. Collaborative filtering is something new to me. I understand it completely after this class.
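For readers who, like this commenter, are new to collaborative filtering, here is a bare-bones user-based sketch. The ratings, user names, and the choice of cosine similarity are all illustrative, not from the course:

```python
# User-based collaborative filtering: predict a user's rating of an item as
# a cosine-similarity-weighted average of other users' ratings of that item.
import math

ratings = {  # user -> {item: rating}; toy data
    "ann": {"a": 5, "b": 3, "c": 4},
    "bob": {"a": 4, "b": 2, "c": 5},
    "eve": {"a": 1, "b": 5},
}

def cosine(u, v):
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    return dot / (math.sqrt(sum(x * x for x in u.values())) *
                  math.sqrt(sum(x * x for x in v.values())))

def predict(user, item):
    num = den = 0.0
    for other, r in ratings.items():
        if other != user and item in r:
            s = cosine(ratings[user], r)
            num += s * r[item]
            den += abs(s)
    return num / den if den else 0.0

p = predict("eve", "c")  # eve hasn't rated item c
```

Since ann rated c as 4 and bob rated it as 5, the prediction lands between those two values, weighted by how similar each is to eve.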

    7. I wasn't interested in the clustering (especially k-means and hierarchical clustering) and XML-related parts of this class. They are too old; everybody knows that material already.

    8. I feel like the professor could offer more recent publications on information extraction, information integration, and social-networking-related topics.

    All in all, this class was very exciting for me.

  25. This is a very interesting class, thank you!

    1. The first thing that I learned about is LSI. I like the idea behind LSI of taking a term-document matrix and decomposing it into three parts by SVD. I also like the idea of picking the top-k dimensions corresponding to the top singular values according to a loss threshold.
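Picking k by a loss threshold can be shown in a few lines of NumPy. This is a toy sketch (the term-document matrix and the 90% threshold are made up, not from the lectures):

```python
# LSI-style rank-k truncation via SVD: keep the smallest number of singular
# values whose squared sum retains at least 90% of the total "energy".
import numpy as np

A = np.array([[1.0, 1.0, 0.0],   # rows: terms, cols: documents (toy data)
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# smallest k whose singular values keep >= 90% of the total variance
energy = np.cumsum(s ** 2) / np.sum(s ** 2)
k = int(np.searchsorted(energy, 0.90)) + 1

A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k approximation
```

By the Eckart-Young theorem, the squared Frobenius error of `A_k` equals the sum of the discarded squared singular values, which by construction is at most 10% of the total.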

    2. I also like the link analysis part, all about PageRank and HITS. The power iteration behind PageRank and HITS gives a simple but reasonable view of how people browse the web and of the link topology of the web. Since PageRank measures the importance of a page and vector similarity measures the similarity of a page to the query, we can actually combine them for better results.
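The HITS side of that iteration is easy to sketch. Here is a minimal mutual-reinforcement loop on a made-up four-page graph (edges and names are illustrative only):

```python
# HITS authority/hub iteration on a toy directed graph: authorities are
# pointed to by good hubs, and hubs point to good authorities.
import math

edges = [(0, 2), (1, 2), (2, 3), (1, 3)]  # (from_page, to_page)
n = 4
auth = [1.0] * n
hub = [1.0] * n

for _ in range(50):
    auth = [sum(hub[i] for (i, j) in edges if j == p) for p in range(n)]
    hub = [sum(auth[j] for (i, j) in edges if i == p) for p in range(n)]
    # normalize so the vectors don't blow up
    na = math.sqrt(sum(a * a for a in auth)) or 1.0
    nh = math.sqrt(sum(h * h for h in hub)) or 1.0
    auth = [a / na for a in auth]
    hub = [h / nh for h in hub]
```

Pages 0 and 1 have no in-links, so their authority scores stay at zero, while pages 2 and 3 accumulate authority from the hubs pointing at them.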

    3. I learned a lot about social networks, such as the small-world phenomenon, power-law distributions, etc., and about collaborative filtering. These ideas are hot, and in particular I like Aardvark's way of asking and answering questions within one's social network circle.

    4. I also learned how structured data or snippets can help enhance search quality by considering different components on the web, such as anchor texts, tags, etc. We can make the web more structured.

    5. Clustering of search results is also very useful. I especially appreciated the discussion of the limitations of k-means (local optima) and the motivation behind the Buckshot algorithm, which uses HAC to find the initial centroids.
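The k-means loop itself is tiny. Below is a bare-bones 1-D sketch on toy data; note it seeds centroids with the first k points rather than a real Buckshot-style HAC sample, so it can still get stuck in the local optima mentioned above:

```python
# Bare-bones 1-D k-means: alternate between assigning points to their
# nearest centroid and recomputing each centroid as its cluster's mean.
def kmeans(points, k, iters=20):
    centroids = points[:k]  # naive seeding (Buckshot would use HAC here)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:
            i = min(range(k), key=lambda c: abs(x - centroids[c]))
            clusters[i].append(x)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

cents = kmeans([1.0, 1.1, 0.9, 10.0, 10.2, 9.8], 2)
```

On this toy data the two centroids converge to roughly 1.0 and 10.0, the means of the two obvious groups.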

    6. The information extraction and integration parts are also very exciting. From IR's perspective, we can make the web more structured by extracting information and putting it into a database. Different information extraction methods were discussed, and I like the wrappers that use path expressions on the DOM for template documents. For integration, LAV vs. GAV is an important issue that I learned about in today's class.

  26. 1) I really enjoyed LSI and SVD and how to reduce dimensionality. It makes a lot of sense and can be applied to many problems. I just used it last week in CSE 494 High Performance Computing to remove noise from a wav file. Very helpful.
    2) I really enjoyed learning about vector similarity. It is a really good thing to learn because it can serve as a basis for so many document classification algorithms. It can be applied to many different web applications for different uses.
    3) Collaborative filtering in recommendation systems is something that will stick with me. We see it every day and sometimes wonder: why did they think I would like that? Understanding how they come up with what we would like, and how they use our reviews to help other people decide what they might like, is good to know.
    4) Every project was very helpful. Being a web developer who has to display information to users, each project showed me how to implement document similarity, find authorities and high-valued pages, and cluster information together. I am excited to take this knowledge and apply it to web applications I develop in the future.
    5) Learning about power iteration to find the principal eigenvector of a symmetric matrix, and then seeing how it can be applied to find authority and hub values as well as PageRank, was also valuable.
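That power iteration takes only a few lines. Here is a sketch on a made-up symmetric 2x2 matrix (the example matrix and function name are illustrative):

```python
# Power iteration for the principal eigenvector of a symmetric matrix,
# avoiding a full O(n^3) decomposition: repeatedly multiply and normalize.
import math

def power_iteration(A, iters=100):
    n = len(A)
    v = [1.0] * n
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient recovers the dominant eigenvalue
    lam = sum(v[i] * sum(A[i][j] * v[j] for j in range(n)) for i in range(n))
    return lam, v

lam, v = power_iteration([[2.0, 1.0], [1.0, 2.0]])
```

This matrix has eigenvalues 3 and 1, so the iteration converges to eigenvalue 3 with the symmetric eigenvector (both components equal).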

  27. 1. I was amazed at how effective tf-idf was at getting good results based on bags of words. It is amazing that so much meaning can come from such a seemingly trivial idea.
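To show just how little machinery tf-idf needs, here is a toy sketch over a made-up bag-of-words corpus (documents and weights are illustrative only):

```python
# tf-idf over a tiny corpus: term frequency in the document, damped by how
# many documents contain the term. A term in every document scores zero.
import math

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cats and the dogs".split(),
]

def tfidf(term, doc, docs):
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df) if df else 0.0
    return tf * idf

w_cat = tfidf("cat", docs[0], docs)   # rare term: positive weight
w_the = tfidf("the", docs[0], docs)   # appears everywhere: weight zero
```

The surprisingly good results mentioned above come from exactly this effect: discriminative terms like "cat" get weight, while ubiquitous terms like "the" are silenced without any stopword list.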

    2. I love how you explained Google's use of PageRank, and I use it all the time now. For example, I'm building a music sharing application, and part of it involves building trust over time. Like Google boosts new pages' PageRank, I will boost songs' trust initially to help them "catch up" with the others already in the system.

    3. I thought dimensionality reduction was very interesting. I like the idea of being able to take complex data and reduce it to two or three dimensions. As part of that I thought it was fascinating how in high dimensional space the apple is all peel.

    4. The project was huge for me too. I'd never used Java before, so your class taught me Java. More importantly, it taught me how to implement JSP and Ajax between the server and client, which is something that will be extremely useful for me. Also, the hands-on building of the actual algorithms helped me understand them in a way that I couldn't have before.

    5. I love the idea that you ended class on, i.e. that this is a brand new field and there are all sorts of things to discover. I love that I can already read a lot of the leading research being done in the field and understand it. To me that is a very exciting idea, I would like to pioneer some great ideas.

  28. 1. What easily had the largest "wow" effect for me was the straightforward TF and TFIDF ranking methods. I hadn't looked into data mining at all before this class, so this introduction served as a tangible way to apply the early ideas.
    2. Like everyone else, I was very impressed with the concept of SVD, and reducing dimensions and determining how much variety was lost. And actually being able to apply what I've learned in the linear algebra class felt good on its own.
    3. PageRank was the algorithm I looked forward to the most out of the topics in the class, along with understanding how random surfers are taken into account. But I think I was still more impressed by how linear algebra methods are used in PageRank to get the rank vector using power iteration. Maybe I'm just easy to impress...
    4. At first I was confused as to why Social Networks were included in this class, but its applications in dealing with page "trust" measures and trust propagation turned out to be very interesting.
    5. I was also very surprised at how easy and intuitive the k-means clustering algorithm is. I know another student wasn't too impressed with clustering, but not having seen it before, it was an impressive yet straightforward idea for classifying data.

  29. This comment has been removed by the author.

  30. This comment has been removed by the author.

  31. This comment has been removed by the author.

  32. [part I]

    Note: This post got too long, so I split my comments into two posts; please be aware that the next post is a continuation of this one.

    The course is in general very thought-provoking; the discussion in class and the questions were very helpful in retrospect, once I actually understood the material. Here are several points that I found especially meaningful and innovative.

    1. I found the discussion on the relationship/difference between supervised and unsupervised learning (classification vs. clustering) very insightful. I have always thought about the relationship between the two. From a data-dimension point of view, the treatments can be roughly categorized into two types:

    Approach 1: LSI -> k-means: reduce dimensions before clustering, to get useful dimensions and avoid being overwhelmed by irrelevant ones (even though normalizing the data along each dimension might help to a certain extent).

    Approach 2: SVM: increase dimensions before classification, to achieve a linear decision boundary in high-dimensional space.
    And I realized that in real life, this choice is very much domain dependent.

    2. It is interesting to learn that large search engines use simple schemes such as NBC. The distributed storage scheme means that some global feature extraction might not be easily applicable to distributed data. Given the trend in data storage (thanks to silicon prices, we can afford to store tons more data these days), I believe that data mining and text classification on large-scale, distributed architectures will be a research/engineering topic of great importance. In fact, during the past decade, the following data mining techniques have been explored in the context of distributed architectures:

    classification, decision trees, association rule analysis, clustering, and stream mining.
    A full list of distributed data mining works can be found in this paper.


  33. [part II] continuing from previous post..

    3. Social networks are powerful. This course revealed to me the fact that the Internet is changing our social structure in a very fundamental manner. I am interested in seeing how the web evolves as a platform that provides collaboration and greater communication channels, as well as how it handles the accompanying privacy concerns.

    As we are busy celebrating the blooming of social media, Mark Bauerlein, an English professor from Emory University, wrote a book criticizing how social media misdirects kids' attention and hinders their education. It's a fun read.


    4. Information integration is a topic of particular interest to me. The discussion covering a whole spectrum of views (from the database point of view to the NLP point of view) certainly shows a whole new angle on how we look at data on the web. I enjoyed the discussion a lot. On the one hand, I think there is an interesting balance as to whether we should spend more computing power restructuring data that already exists, or create content in a more structured manner. From an HCI perspective, it is unlikely that we can ask end users to create content using XML schemas or RDF. On the other hand, we can't count on any immediate breakthrough in NLP in the short term, even though people are working really hard on this, both linguists and computer scientists. (Check out the special track on automated discovery of entities in text at the TAC 2010 conference.) I think one middle ground is to provide a more "moderate" form of content creation. For instance, instead of asking people to "write as you wish" or "fill in every field in this form", ask them to "write in a form", which is not totally different from "writing in a blank field". The form could have predefined categories; people could either follow them or not, depending on the ease of interaction. On the server side, we can make use of the structure the user picked up and do our best NLP for what we didn't have control over. Meanwhile, on the front end, the HCI folks could try to design interfaces that are more "organic" or "usable", where the form fields minimize users' cognitive load while still getting them to fill in most content in a desirable way.

    5. This one is simple. Even though I initially struggled with the project and ended up spending a lot of time on it, having us build all of these by hand was excellent. I strongly believe this practice should be encouraged in future offerings of this course.

