fastcluster: Fast hierarchical clustering routines for R and Python

Copyright © 2011 Daniel Müllner
<http://math.stanford.edu/~muellner>

The fastcluster package is a C++ library for hierarchical (agglomerative) clustering on data with a dissimilarity index. It efficiently implements the seven most widely used clustering schemes: single, complete, average, weighted, Ward, centroid and median linkage. (The “weighted” distance update scheme of Matlab and SciPy is called “mcquitty” in R.) The library currently has interfaces to two languages: R and Python/SciPy. The interfaces are designed as drop-in replacements for the existing routines. Once the fastcluster library is loaded at the beginning of the code, every program that uses hierarchical clustering can benefit immediately and effortlessly from the performance gain.

See the author's home page <http://math.stanford.edu/~muellner> for more information, in particular a performance comparison with other clustering packages.

The fastcluster package is licensed under the GNU General Public License (GPL) Version 3. See <http://www.gnu.org/licenses/gpl.html>


Installation
‾‾‾‾‾‾‾‾‾‾‾‾
See the file INSTALL in the source distribution.


Usage
‾‾‾‾‾
1. R
‾‾‾‾
In R, load the package with the following command:

    library('fastcluster')

The package overwrites the function hclust from the “stats” package (in the same way as the flashClust package does). Please remove any references to the flashClust package from your R files so that the hclust function is not accidentally overwritten with the flashClust version.

The new hclust function has exactly the same calling conventions as the old one. You may just load the package and immediately and effortlessly enjoy the performance improvements. The function is also an improvement over the flashClust function from the “flashClust” package. Just replace every call to flashClust by hclust and expect your code to work as before, only better (see Warning 1 below).

The agnes function from the “cluster” package does a bit more than hclust. If you do not need the extra functionality, you may also wish to replace agnes by fastcluster's hclust for higher speed.

If you need to access the old function or make sure that the right function is called, specify the package as follows:

    stats::hclust(…)
    fastcluster::hclust(…)
    flashClust::hclust(…)

WARNING 1
‾‾‾‾‾‾‾‾‾
The “flashClust” package has a bug in its clustering algorithm. The clustering methods “centroid” and “median” produce wrong results.

(Here is a proven, rough lower bound on the error rate: If there is a so-called “inversion” in the dendrogram, flashClust produces wrong results in at least 1/3 of all cases. The actual error rate is much closer to 1, and errors also occur if no inversions are present.)

WARNING 2
‾‾‾‾‾‾‾‾‾
R and Matlab/SciPy use different conventions for the “Ward”, “centroid” and “median” methods. R assumes that the dissimilarity matrix consists of squared Euclidean distances, while Matlab and SciPy expect non-squared Euclidean distances. The fastcluster package respects these conventions and uses different formulas in the two interfaces.

If you want the same results in both interfaces, then feed R with the entry-wise square of the distance matrix, D^2, for the “Ward”, “centroid” and “median” methods and later take the square root of the height field in the dendrogram. For the “average” and “weighted” alias “mcquitty” methods, you must still take the same distance matrix D as in the Python interface for the same results. The “single” and “complete” methods only depend on the relative order of the distances, hence it does not make a difference whether one operates on the distances or the squared distances.
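
The reason for squaring can be checked by hand on three points. The sketch below is plain Python for illustration only. It applies the standard Lance-Williams update for centroid linkage, which is an assumption about the formula behind both interfaces and not a quotation of fastcluster's code; the function name centroid_update is hypothetical. Fed with squared distances and followed by a square root, the update reproduces the Euclidean distance to the merged centroid. Fed with the plain distances, it yields a different value.

```python
import math

def centroid_update(d_ki, d_kj, d_ij, n_i, n_j):
    """Lance-Williams centroid update, applied to whatever dissimilarities are passed in."""
    n = n_i + n_j
    return (n_i * d_ki + n_j * d_kj) / n - n_i * n_j * d_ij / n**2

# Three singleton clusters; i and j are merged, k is the bystander.
i, j, k = (0.0, 0.0), (2.0, 0.0), (4.0, 3.0)
d_ki, d_kj, d_ij = math.dist(k, i), math.dist(k, j), math.dist(i, j)

# Ground truth: Euclidean distance from k to the centroid of {i, j}.
centroid = ((i[0] + j[0]) / 2, (i[1] + j[1]) / 2)
true_height = math.dist(k, centroid)

# Squared-distance convention: feed squared distances, take the square root afterwards.
from_squared = math.sqrt(centroid_update(d_ki**2, d_kj**2, d_ij**2, 1, 1))

# Same formula fed with plain distances: a different, non-Euclidean value.
from_plain = centroid_update(d_ki, d_kj, d_ij, 1, 1)
```

Here from_squared agrees with true_height up to rounding, while from_plain does not, which is why the square root of the height field is needed after clustering D^2.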

The code example in the R documentation (enter ?hclust or example(hclust) in R) contains an instance where the squared distance matrix is generated from Euclidean data.

2. Python
‾‾‾‾‾‾‾‾‾
The fastcluster package is imported as usual by

    import fastcluster

It provides the following functions:

    linkage(D, method='single', metric='euclidean', preserve_input=True)
    single(D)
    complete(D)
    average(D)
    weighted(D)
    ward(D)
    centroid(D)
    median(D)

The argument D is either a compressed distance matrix or a collection of m observation vectors in n dimensions as an (m×n) array. Apart from the argument preserve_input, the methods have the same input and output as the functions of the same name in the package scipy.cluster.hierarchy. Therefore, I do not duplicate the documentation and refer to the SciPy documentation for further details:

    http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html
    http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html
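
As a minimal sketch of the two accepted input shapes, the plain-Python snippet below turns a small set of observation vectors into a compressed distance vector. It assumes the layout produced by scipy.spatial.distance.pdist: the upper triangle of the full distance matrix, row by row, without the diagonal. The helper names condensed_distances and condensed_index are hypothetical and are not part of fastcluster or SciPy.

```python
import math

def condensed_distances(points):
    """Pairwise Euclidean distances in condensed (upper-triangle, row-major) order."""
    m = len(points)
    return [math.dist(points[i], points[j])
            for i in range(m) for j in range(i + 1, m)]

def condensed_index(m, i, j):
    """Position of the pair (i, j), i != j, in a condensed vector for m observations."""
    if i == j:
        raise ValueError("the diagonal is not stored")
    if i > j:
        i, j = j, i  # the matrix is symmetric, so order the pair
    return m * i - i * (i + 1) // 2 + (j - i - 1)

# m = 4 observations in n = 2 dimensions
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (0.0, 1.0)]

# Condensed vector of length m*(m-1)/2 = 6
D = condensed_distances(points)

# Distance between observations 0 and 2
d_02 = D[condensed_index(len(points), 0, 2)]
```

In practice the condensed vector would be a NumPy array, for example the output of pdist, and not a Python list; the list is used here only to keep the sketch free of dependencies.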

The additional, optional argument preserve_input specifies whether the fastcluster package first copies the distance matrix or writes into the existing array. If you generate the distance matrix only for the clustering step and do not need it afterwards, you may save half the memory by passing preserve_input=False.

Note that the input array D contains unspecified values after this procedure. You may want to write

    linkage(D, method="…", preserve_input=False)
    del D

to make sure that you do not accidentally use the matrix D after it has been used as scratch memory.

The single linkage algorithm does not write to the distance matrix or its copy anyway, so the preserve_input flag has no effect in this case.
