Hierarchical clustering using C++

Our aim is to implement hierarchical clustering algorithm with O(N*N*log(N)) complexity using STL.

The idea of hierarchical clustering is merging two nearest clusters on every step starting from N one-element clusters. To find two nearest clusters on step K we need (N-K+1)x(N-K+1) matrix of distances between current clusters. We need, in worst case, N times of updating and finding minimum in sequence of NxN, (N-1)x(N-1), (N-2)x(N-2), …, 1×1 matrix. So, in usual realization of HC algorithm it should cost you O(N*N*N) operations.

To improve algorithm complexity up to O(N*N*log(N)) we can use vector V of similar to priority queues structures. V[i] should consist of sorted by distance points P(i,j) where distance is distance between clusters i and j. Our similar to priority queues data structure should support: ability to extract minimal element with complexity O(1), ability to delete element with complexity O(log(n)), and ability to add and element with complexity O(log(n)).

Standard STL priority_queue allows us to extract minimum and add element with required time, but it has no ability to find quickly an element, so we can’t delete elements with complexity O(log(n)). I propose to use set or multiset STL structure instead. Set and multiset uses “red-black tree” data structure with O(log(n)) complexity of deletion and insertion. Also, in STL sets all elements stored in sorted order and we can easily access to minimum (maximum) element [using standard begin() or end() methods].

In my realization I used following structure to store distances and cluster’s indexes:

struct distances
	double dist;
	int index;

Also I defined appropriate comparing operator so that elements in my set always sorted in right order.

struct Cmp{
	bool operator()(const distances d1, const distances d2) const
		if(d1.dist == d2.dist)
			return d1.index < d2.index;
		return d1.dist < d2.dist;

Then declaration of our vector of sets should look like this one:

std::vector> P;

Here is source codes with sample input file and some comments:


Friday, January 23rd, 2009

Problems with SSE2 in OpenCV

When I tried to use some methods from OpenCV library (for example CvResize()) in my MS Visual Studio 2005 project I encountered with errors in cxtypes.h header file near

#if CV_SSE2
__m128d t = _mm_load_sd( &value );
int i = _mm_cvtsd_si32(t);
return i - _mm_movemask_pd(_mm_cmplt_sd(t,_mm_cvtsi32_sd(t,i)));

These errors were related with SSE2. The problem is that my AMD Sempron 2200+ family 6 (x86 family 6 model 8 ) processor do not support SSE2 instructions. SSE – Streaming SIMD Extensions 2 are extends on MMX instructions to operate on special registers to increase computational speed in some cases.

To fix these problems you need to set CV_SSE2 variable to 0 in cxcore’s cxtypes.h (near 65 line).
For example like here:

//  #if defined WIN32 && (!defined WIN64 || defined EM64T) && 
//      (_MSC_VER >= 1400 || defined CV_ICC) 
//      || (defined __SSE2__ && defined __GNUC__ && __GNUC__ >= 4)
//    #include 
//    #define CV_SSE2 1
//  #else
    #define CV_SSE2 0
//  #endif

Tuesday, January 6th, 2009

XML serialization for Dictionary generic

For some reasons Microsoft made their generic Dictionary<TKey, TValue> class not XML-serializable. But making your own Dictionary that supports XML serialization is very simple. All you have to do is to write an implementation of IXmlSerializable interface.

Tuesday, July 24th, 2007

Generic Pair .NET class

If you ever worked with C++ STL library you might remember pair template, which sometimes was very useful. .NET Framework has similar generic KeyValuePair<TKey, TValue>. But as it is designed to work with Dictionary class KeyValuePair has one sufficient constraint: it does not have public setters for Key and Value properties.

So I wrote my own generic Pair class and extended it with some useful static routines.

Monday, July 16th, 2007