Improving the Setwise Stream Classification Problem

Srishti and I worked on the Setwise Stream Classification Problem. Here are a few quick details.

A few of the drawbacks that we found, and the improvement strategies that we proposed to address these drawbacks, were as follows:

  1. Selection of the min_stat parameter has not been discussed

     The original algorithm uses min_stat to decide when to classify an entity: an entity is classified only once more than min_stat of its points have been received. However, the paper doesn’t mention how to obtain min_stat. In our proposed method, min_stat is chosen so that a class profile is returned as soon as possible, while the likelihood that this early prediction matches the final prediction (the one made once all of the entity’s points are available) exceeds a chosen confidence level (say, 80%). This is done using the entire training data: for each training entity, we track how the nearest class profile changes as more points are folded into its fingerprint, and pick the smallest number of points after which that prediction rarely changes. In other words, we determine how many points must go into the fingerprint before the class profile can be predicted with more than the required certainty (a rough sketch of this calibration appears after this list).

  2. k-means was used for clustering the initial sample, which is affected by choice of initial seeds

     To reduce the dependency on the initially chosen seeds, we proposed using bisecting k-means instead.

  3. The anchors remain constant throughout

     Anchors play a crucial role in determining fingerprints, which in turn determine which class profile an entity is assigned to. However, in the original algorithm, the anchors remain constant over time. In our algorithm, we propose updating the anchors incrementally over time, offline. While this would introduce some overhead, it would make the system less prone to concept drift. This improvement is of particular interest when parallelizing the algorithm, since a parallelized implementation of the anchor update would reduce the overhead significantly.

  4. The problem of concept drift has not been dealt with

     Our improved implementation proposes using a distance-based method to detect concept drift: entities that cross the distance threshold are classified into a separate “concept drifted” class (with the possibility of being reassigned to one of the regular classes once sufficient data becomes available, although we haven’t looked into this in detail). A sketch of this check also appears after this list.
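
Here is the rough calibration sketch promised in point 1, written in C++ for concreteness. The function name calibrateMinStat, the 80% default, and the idea of working from precomputed per-point predictions (obtained by replaying the training streams through the fingerprint update elsewhere) are all assumptions made for illustration, not part of the original algorithm; the toy data in main() is likewise made up.

    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // predictions[i][m] holds the class profile that training entity i would be
    // assigned to after its first m+1 points have been folded into its fingerprint
    // (these values are computed elsewhere by replaying the training streams).
    std::size_t calibrateMinStat(const std::vector<std::vector<int>>& predictions,
                                 double confidence = 0.8) {
        std::size_t maxLen = 0;
        for (const auto& preds : predictions) maxLen = std::max(maxLen, preds.size());

        for (std::size_t m = 1; m <= maxLen; ++m) {
            std::size_t stable = 0;
            for (const auto& preds : predictions) {
                // An entity counts as "stable at m" if the prediction after m points
                // already equals its final prediction (entities with fewer than m
                // points have, by definition, reached their final answer).
                if (preds.size() <= m || preds[m - 1] == preds.back()) ++stable;
            }
            if (static_cast<double>(stable) / predictions.size() >= confidence)
                return m;   // earliest point count with the required certainty
        }
        return maxLen;      // fall back to using all available points
    }

    int main() {
        // Toy example: three training entities whose per-point predictions
        // settle after 2, 1, and 3 points respectively.
        std::vector<std::vector<int>> predictions = {
            {0, 1, 1, 1}, {2, 2, 2}, {0, 0, 1, 1, 1}};
        std::printf("min_stat = %zu\n", calibrateMinStat(predictions));  // prints 3
    }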
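
And here is a minimal sketch of the distance-based drift check from point 4, under the assumption that the distance in question is the Euclidean distance from an entity’s fingerprint to its nearest class profile and that the threshold is supplied by the user; neither detail is fixed by the original algorithm, and the name classifyWithDriftCheck and the -1 sentinel for the “concept drifted” class are ours.

    #include <cmath>
    #include <cstddef>
    #include <cstdio>
    #include <limits>
    #include <vector>

    // Euclidean distance between a fingerprint and a class profile.
    double distance(const std::vector<double>& a, const std::vector<double>& b) {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
        return std::sqrt(s);
    }

    // Returns the index of the nearest class profile, or -1 when even the nearest
    // profile lies beyond driftThreshold, i.e. the entity goes into the separate
    // "concept drifted" class.
    int classifyWithDriftCheck(const std::vector<double>& fingerprint,
                               const std::vector<std::vector<double>>& profiles,
                               double driftThreshold) {
        int best = -1;
        double bestDist = std::numeric_limits<double>::max();
        for (std::size_t i = 0; i < profiles.size(); ++i) {
            double d = distance(fingerprint, profiles[i]);
            if (d < bestDist) { bestDist = d; best = static_cast<int>(i); }
        }
        return bestDist > driftThreshold ? -1 : best;
    }

    int main() {
        std::vector<std::vector<double>> profiles = {{0.9, 0.1}, {0.1, 0.9}};
        std::printf("%d\n", classifyWithDriftCheck({0.8, 0.2}, profiles, 0.5));  // 0
        std::printf("%d\n", classifyWithDriftCheck({3.0, 3.0}, profiles, 0.5));  // -1 (drifted)
    }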

Our strategy for parallelizing the original algorithm is as follows:

  1. Parallelize the k-means run on init_sample
  2. Parallelize the process of determining the closest anchor (required only for a large number of anchors and/or for high-dimensional data; see the MPI sketch after this list)
  3. Parallelize updating the fingerprint of the appropriate entity (required only if there is a very large number of anchors)
  4. Parallelize determining the class profile that is closest to the fingerprint in question (required only if there is a very large number of class profiles)
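
As an illustration of step 2, here is a minimal MPI sketch, not our actual implementation: each rank owns a contiguous slice of the anchors, finds its locally closest anchor, and an MPI_MINLOC reduction selects the global winner. The slice-per-rank layout, the names closestAnchorParallel and squaredDistance, and the toy data in main() are assumptions made for the sketch.

    // closest_anchor.cpp -- sketch of finding the closest anchor across MPI ranks.
    #include <mpi.h>
    #include <cstddef>
    #include <cstdio>
    #include <limits>
    #include <vector>

    double squaredDistance(const std::vector<double>& a, const std::vector<double>& b) {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    // Each rank scans only its own slice of the anchors; an MPI_MINLOC reduction
    // then picks the globally closest anchor (smallest distance wins, and the
    // global index of that anchor travels along with it).
    int closestAnchorParallel(const std::vector<double>& point,
                              const std::vector<std::vector<double>>& localAnchors,
                              int globalOffset) {
        struct { double dist; int index; } local{std::numeric_limits<double>::max(), -1}, global;
        for (std::size_t i = 0; i < localAnchors.size(); ++i) {
            double d = squaredDistance(point, localAnchors[i]);
            if (d < local.dist) { local.dist = d; local.index = globalOffset + static_cast<int>(i); }
        }
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE_INT, MPI_MINLOC, MPI_COMM_WORLD);
        return global.index;
    }

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        // Toy data: every rank owns two 2-D anchors; rank r's anchors sit near x = 2r.
        std::vector<std::vector<double>> localAnchors = {
            {2.0 * rank, 0.0}, {2.0 * rank + 1.0, 0.0}};
        int globalOffset = 2 * rank;

        int winner = closestAnchorParallel({0.2, 0.0}, localAnchors, globalOffset);
        if (rank == 0) std::printf("closest anchor index: %d\n", winner);  // expect 0

        MPI_Finalize();
        return 0;
    }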

And finally, here are a few quick steps on how to get OpenMPI running:

  1. Install Cygwin for Windows
  2. In the Cygwin setup, select “openmpi”, “libopenmpi”, “libopenmpicxx1” from Libs, “gdb”, “gcc-core”, “gcc-g++” from Devel, and “openmpi-debuginfo” from Debug (version 1.8.2 at the time of writing)
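
To verify the setup, a minimal MPI program can be compiled and run from the Cygwin shell (the file name hello_mpi.cpp and the process count below are, of course, arbitrary):

    // hello_mpi.cpp -- a minimal check that the OpenMPI installation works
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);                    // start the MPI runtime
        int rank = 0, size = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);      // this process's id
        MPI_Comm_size(MPI_COMM_WORLD, &size);      // total number of processes
        std::printf("Hello from rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }

Compile and run with:

    mpic++ hello_mpi.cpp -o hello_mpi
    mpirun -np 4 ./hello_mpi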