Hierarchical clustering using C++

Our aim is to implement hierarchical clustering algorithm with O(N*N*log(N)) complexity using STL.

The idea of hierarchical clustering is merging two nearest clusters on every step starting from N one-element clusters. To find two nearest clusters on step K we need (N-K+1)x(N-K+1) matrix of distances between current clusters. We need, in worst case, N times of updating and finding minimum in sequence of NxN, (N-1)x(N-1), (N-2)x(N-2), …, 1×1 matrix. So, in usual realization of HC algorithm it should cost you O(N*N*N) operations.

To improve algorithm complexity up to O(N*N*log(N)) we can use vector V of similar to priority queues structures. V[i] should consist of sorted by distance points P(i,j) where distance is distance between clusters i and j. Our similar to priority queues data structure should support: ability to extract minimal element with complexity O(1), ability to delete element with complexity O(log(n)), and ability to add and element with complexity O(log(n)).

Standard STL priority_queue allows us to extract minimum and add element with required time, but it has no ability to find quickly an element, so we can’t delete elements with complexity O(log(n)). I propose to use set or multiset STL structure instead. Set and multiset uses “red-black tree” data structure with O(log(n)) complexity of deletion and insertion. Also, in STL sets all elements stored in sorted order and we can easily access to minimum (maximum) element [using standard begin() or end() methods].

In my realization I used following structure to store distances and cluster’s indexes:

struct distances
{
	double dist;
	int index;
};

Also I defined appropriate comparing operator so that elements in my set always sorted in right order.

struct Cmp{
	bool operator()(const distances d1, const distances d2) const
	{
		if(d1.dist == d2.dist)
		{
			return d1.index < d2.index;
		}
		return d1.dist < d2.dist;
	}
};

Then declaration of our vector of sets should look like this one:

std::vector> P;

Here is source codes with sample input file and some comments:

sources.zip

Tags: , ,

AddThis Social Bookmark Button

4 Responses to “Hierarchical clustering using C++”

  1. moss Says:

    Hey i have same problem can u pls send me the whole code ?

  2. aldo Says:

    would you like to send me your code? i really need it.
    thanks

  3. Atuk Says:

    Hello,
    I’ve added .zip with sources to the end of the post. Good luck.

  4. alvarisi Says:

    Hello, thanks for sharing..
    could you mention what IDE you use?
    cz i’m using codeblocks and it’s appear error.
    The error is in
    std::vector<std::multiset> P;
    std::vector<std::vector> C;
    std::vector II;
    std::vector<std::vector> A;
    and the codeblocks say that
    ‘>>’ must be ‘> >’

    i have changed the code by it but the code while i run is can not run again..
    thanks…

Leave a Reply