CS 395 Assignment: Agglomerative Clustering
This time around, you will code up a version of agglomerative clustering.
The approach is inspired by, but simpler than, the CURE approach.
You should use the same datasets that you used on the last assignment, and
code up a version of agglomerative clustering with the following characteristics:
- Distance between two points: Euclidean.
- Distance between two clusters: Single link (closest distance between
points, one from each cluster).
To make this efficient, you should use a heap as the CURE paper does to keep
track of which clusters to merge next. The k-d tree that CURE uses is not
necessary here, as we will not decide which points are representative and
which are not. Instead, we will keep all points ("all points are representative").
This means that when you merge two clusters together, any other clusters
that had either of these as the previous nearest will have the new cluster
as the nearest.
I would encourage you not to code up your own heap, but rather to use one
already available. Any of the languages that you are using should have libraries
available.
As last time, you should vary the number of final clusters, and choose an
appropriate number. Turn in on paper an explanation of the methodologies
that you used, the results that you found, and a clear description of the
final clusters that you discovered.