Improving the Setwise Stream Classification Problem
Srishti and I worked on Setwise Stream Classification Problem. Here are a few quick details.
A few of the drawbacks that we found, and the improvement strategy that we proposed to address these drawback, were as follows:
The original algorithm uses min_stat to determine when to classify an entity (it classifies an entity only if the more than min_stat number of points have been received exceeds min_stat number of points). However, it doesn’t mention anything about how to obtain min_stat. In our proposed method, min_stat is determined in a way so that it returns a class profile as soon as possible, but with likelihood of it having predicted the final class profile that it will predict, given all points in the entity, being beyond a certain amount (say, 80%). This is done by using the entire training data, and determining after how many points are used to obtain the fingerprint of an entity, the class profile the said entity is nearest to doesn’t change very frequently. In other words, after using how many points to determine the fingerprint can the class profile be determine with a certainty more than a certain amount.
To reduce the dependency on seeds chosen initially, we proposed using bisected k-means
Anchors play a crucial role in determining fingerprints, which in term determines which class profile an entity is assigned to. However, in the original algorithm, anchors remain constant with time. In our algorithm, we propose updating the anchors incrementally with time, offline. While this would cause some overhead, it would make the system less prone to concept drift. This improvement would be of particular interest when parallelizing the algorithm, since the parallelized implementation of the update of anchors would reduce the overhead significantly.
Our improved implementation proposes using a distance-based method to measure concept drift, and those entities which cross the distance threshold are classified into a separate “concept drifted” class (with a possibility of being classified in one of the classes when sufficient data becomes available, although we haven’t looked into this in detail).
Our strategy for parallelizing the original algorithm is as follows:
- Parallelize the K-Means run on init_sample
- Parallelize the process of determining the closest anchor (Required only for a large number of anchors, and/or for high dimensional data)
- Parallelize updating the fingerprint of the appropriate entity (this is required if there are a very large number of anchors)
- Determine class profile that is closest to the fingerprint in question (this is required only if there are a very large number of class profiles)
And finally, here is a quick few steps on how to get OpenMPI running:
- Install cygwin for windows
- Select “openMPI”, “libopenmpi”, “libopenmpicxx1” from lib, “gdb”, “gcc-core”, “gcc-g++” from Devel, and “openmpi-debuginfo” from Debug option (1.8.2 at the time of writing)