Tuesday, April 6, 2010

[Thinking Cap--Easter Resurrection] on Classification/clustering

Comment on this on the blog

At the end of today's class, we saw that classification is, in some sense, a pretty easy extension of clustering: the training data with different labels can be seen as making up the different clusters. When a test point comes in, we just need to figure out which cluster it is closest to and assign it the label of that cluster.
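That straightforward scheme fits in a few lines. A minimal sketch, where the 2-D points and the labels "A"/"B" are made-up stand-ins for real training data:

```python
import math

# Hypothetical training data: each label's examples form one "cluster".
training = {
    "A": [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1)],
    "B": [(5.0, 5.0), (5.1, 4.8), (4.9, 5.2)],
}

def centroid(points):
    # Mean of the points along each dimension.
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

centroids = {label: centroid(pts) for label, pts in training.items()}

def classify(x):
    # Assign the label of the nearest cluster centroid (Euclidean distance).
    return min(centroids, key=lambda lbl: math.dist(x, centroids[lbl]))
```

Note that the only thing the labels buy us here is one centroid per label; everything else is plain clustering machinery.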

1. If classification is so darned straightforward, how come the entire field of machine learning is obsessed with more and more approaches to classification? What could possibly be wrong with the straightforward one we outlined? Can you list any problems our simple approach can run into? (Alternately, it is fine to just decide that Jieping Ye and Huan Liu cannot leave good enough alone... :-)

2. If you listed some problems in 1 (as against casting aspersions on Ye and Liu), then can you comment on the ramifications of those problems for clustering itself? Or is it that clustering is still pretty fine as it is?



  1. 1. the curse of dimensionality
    2. the want of linear-time models
    3. the want of sparse models

    Clustering as we discussed it in class doesn't address any of the above.

    1. PCA can be used in clustering as well.
    2. This is a tougher problem - I can't think of an answer that doesn't fall back on less accurate solutions like K-means.
    3. Random indexing is the clustering-side alternative to the SVD-based sparse models used in classification.
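Since K-means comes up above as the "less accurate" linear-time fallback, here is a bare sketch of Lloyd's algorithm in plain Python. The toy points and the naive first-k initialisation are illustrative assumptions, not anything from the discussion:

```python
import math

def kmeans(points, k, iters=20):
    # Lloyd's algorithm: alternate between assigning each point to its
    # nearest centroid and recomputing the centroids. Each iteration is
    # linear in the number of points, hence the appeal for scaling.
    centroids = list(points[:k])  # naive init: just take the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step (keep the old centroid if a cluster goes empty).
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = tuple(sum(c) / len(cl) for c in zip(*cl))
    return centroids, clusters
```

On well-separated blobs this converges in a couple of iterations, but the answer still depends on the initialisation, which is part of why it counts as the "less accurate" option.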

  2. 1. Classification may require a non-isotropic decision boundary. The "clustering scheme for classification" relies on the distance between a new data point and each cluster centroid, which assumes the clusters have homogeneous shapes in feature space. A problem such as a "ring"-shaped class is not supported in this case.

    2. In an SVM, the "ring" problem can be mapped into a higher-dimensional space where a linear classifier exists. We could potentially do the same for k-means; the problem is how to find the extra dimension that guarantees the separability of the data points.
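The mapping idea can be made concrete with a hand-picked extra dimension. The points below are invented, and the squared radius is only a valid choice because we already know the classes are arranged around the origin - finding such a dimension automatically is exactly the hard part the comment points at:

```python
# Hypothetical "ring" data: one class near the origin, one on a ring around it.
inner = [(0.1, 0.2), (-0.2, 0.1), (0.0, -0.15)]
outer = [(2.0, 0.0), (0.0, 2.1), (-1.9, 0.3), (1.4, -1.5)]

def lift(p):
    # Map (x, y) -> (x, y, x^2 + y^2); the squared-radius coordinate
    # makes the ring linearly separable by a single threshold.
    x, y = p
    return (x, y, x * x + y * y)

THRESHOLD = 1.0  # a plane z = 1.0 separates the two classes after lifting

def classify(p):
    return "inner" if lift(p)[2] < THRESHOLD else "outer"
```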

  3. I want people to try answering without the jargon (you can bring the jargon in after you have pointed out what the problem is).

  4. Problems:
    Problem 1 - the decision:
    The decision for a test point is based on a proximity measure. So after making poor use of the class labels during training, we end up with a model that does the same thing at test time: it computes the distance to each cluster. Instead of outputting the exact label, it still has to look over all the clusters.

    Problem 2 - the use of labels:
    The labels get lost in clustering. Since we are using labels, after training we should have a model that can state the exact label of a new test point, instead of saying "it is closest to this cluster, so it belongs to this group." We lose the value of the labels in the grouping, and since proximity under one view can differ from proximity under another, the overall decision becomes vague.

    Instead of building clusters with the help of labels, we should build a function or model that can precisely output the exact label of a new test point rather than a vague proximity. I think using the labels in the right way (in the training part) is the real difference between clustering and classification: a model whose decision parameter is proximity is doing clustering, whereas a model whose decision parameter is the exact label is doing classification.

  5. 1. Research on text classification may be popular because of the need to improve its scaling and speed.
    2. When the data is huge, it may become practically impossible to manually label documents, so automatic labeling of documents might be a research focus.
    3. A single label per data point may not be good enough; automatic techniques to find multiple labels for data could be a heavily researched topic.
    4. Selecting the right training data, such that it covers all the different classes, can itself be a hard problem requiring expert knowledge of the domain; automatic techniques to achieve this might be a topic of research interest.

  6. 1) Classification leverages the information in the available training examples. This is, however, not always possible in the real world, and that is where clustering is very useful.

    2) Clustering relies on overall distance/similarity measures between objects to create clusters, whereas classification can utilize a more fine-grained similarity between objects to perform the task. For example, in decision trees the similarity on the most discriminating attributes is given preference over other attributes while determining the class of a data point.

  7. The classification problem is actually not straightforward, because even when training data is given, a new data point may not belong to any of the existing clusters! I was wondering whether HAC could handle this. It would be easy to insert the new data point into an existing HAC hierarchy as a separate cluster at any level, if required. The hierarchy can also be restructured later on, and classification would become simpler once this cluster grows massive enough to carry sufficient information.

  8. Classification can be difficult because discerning which characteristics of the data are telling is hard when you only have a limited amount of it. You don't know whether a characteristic truly correlates with the class or whether it's just noise clouding the picture.

    Also, in order to classify, you have to have an idea of what the classes are from the beginning. If you don't choose good classes, you probably won't get very good results.

    Clustering can also require you to make decisions up front (like choosing the k in k-means), but you can use other algorithms that don't require that information (like density-based clustering).
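A bare-bones sketch of the density-based alternative mentioned above (a DBSCAN-style procedure, without the usual optimisations; the `eps` and `min_pts` settings and the toy points in the test are assumptions):

```python
import math

def density_cluster(points, eps, min_pts):
    # Minimal DBSCAN: a point with at least min_pts neighbours (itself
    # included) within eps is a core point; clusters grow outward from
    # core points, and everything left over is noise (-1). Note that no
    # number of clusters is specified in advance.
    labels = {}  # point index -> cluster id, or -1 for noise

    def neighbours(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if i in labels:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        labels[i] = cluster
        seeds = list(nbrs)
        while seeds:
            j = seeds.pop()
            if labels.get(j) == -1:
                labels[j] = cluster  # noise point absorbed as a border point
            if j in labels:
                continue
            labels[j] = cluster
            jn = neighbours(j)
            if len(jn) >= min_pts:
                seeds.extend(jn)  # j is also a core point, keep expanding
        cluster += 1
    return labels
```

The trade-off, of course, is that `eps` and `min_pts` are themselves up-front decisions - just different ones from k.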

  9. The significant issues we face when classifying with a clustering scheme are the proximity measure, assessment of the results, and data preparation. First of all, the lack of information in the unsupervised setting makes it difficult to decide which algorithm to use. Clusters do not necessarily correspond to the desired information classes. Assigning a label based on the nearest cluster may not be correct all the time, since we are only finding the closest matching class label for the test data from the set of clusters. There is also the issue of how an algorithm can handle different types of attributes and irregularly shaped clusters.

    One way to overcome these issues is to select the algorithm best able to provide exact classification through a clustering scheme. Hierarchical clustering can handle any form of distance or similarity and can be applied to any attribute type. Probabilistic clustering can also handle complex structure and results in an easily interpretable system.

  10. No, classification cannot be considered a straightforward extension of clustering. If the model had assigned a wrong class to the data against which we measure the closeness of the current test point, that error would affect the current test point as well. Thus an error in one place would propagate.

    So we have lots of classification algorithms that identify the class of a test point independently of the other predictions.

    If you do want to consider classification as an extension of clustering, then instead of just taking the closest cluster, a better metric that accounts for misclassified data should be used.

