Constrained K-means: controlling group sizes

fmarthoz
3 min read · Jul 8, 2021



Previously, we saw a very simple implementation of K-means and a method for choosing the number of clusters. In some instances, we also want to avoid empty clusters, or clusters of very different sizes. Here, constrained optimisation comes to the rescue.

The algorithm is based on a paper by Bradley et al. and has been implemented by Joshua Levy-Kramer: https://github.com/joshlk/k-means-constrained, using the excellent Google OR-Tools library (https://developers.google.com/optimization/flow/mincostflow).

The algorithm uses ideas from Linear Programming, in particular Network Models. Network models are used, among other things, in logistics to optimise the flow of goods across a network of roads.

An example of a network flow model (Source)

We can see in the simple figure above that we have 5 nodes with directed arcs (the arrows) between them. Each node has a demand (negative) or supply (positive) value and the arcs have flow and cost values. For instance, the arc 2–4 has a flow of 4 and a cost of $2. Similarly, node 1 supplies 20 units and node 4 requires 5 units.

Those problems can be solved by using optimisation algorithms such as the simplex algorithm.
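
To make this concrete, here is a minimal sketch of a small min-cost flow problem solved with OR-Tools. The network below (arcs, capacities, costs and supplies) is made up for illustration and does not reproduce the figure above; note also that the module name varies between OR-Tools versions (newer releases expose ortools.graph.python.min_cost_flow instead of pywrapgraph).

from ortools.graph import pywrapgraph

# A small illustrative network: node 0 supplies 20 units,
# nodes 3 and 4 demand 5 and 15 units respectively.
start_nodes = [0, 0, 1, 1, 2]
end_nodes   = [1, 2, 3, 4, 4]
capacities  = [15, 12, 20, 8, 15]
unit_costs  = [4, 4, 2, 2, 3]
supplies    = [20, 0, 0, -5, -15]

mcf = pywrapgraph.SimpleMinCostFlow()
for s, e, cap, cost in zip(start_nodes, end_nodes, capacities, unit_costs):
    mcf.AddArcWithCapacityAndUnitCost(s, e, cap, cost)
for node, supply in enumerate(supplies):
    mcf.SetNodeSupply(node, supply)

if mcf.Solve() == mcf.OPTIMAL:
    print("Minimum total cost:", mcf.OptimalCost())
    for i in range(mcf.NumArcs()):
        print(f"Arc {mcf.Tail(i)} -> {mcf.Head(i)}: flow {mcf.Flow(i)}")

The solver returns the cheapest way of routing all the supply through the arcs while respecting their capacities, which is exactly the kind of problem the constrained K-means assignment step is turned into.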

At a very high level, we can represent this constrained K-means problem with a network, where we want to minimise the sum of the distances between each point and the centroid of its cluster, with some constraints on cluster sizes.
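
Written out (roughly following the notation of Bradley et al.), the underlying optimisation problem is:

$$
\min_{T,\,C}\ \sum_{i=1}^{n}\sum_{h=1}^{k} T_{i,h}\,\lVert x_i - C_h\rVert^2
\quad \text{subject to} \quad
\sum_{i=1}^{n} T_{i,h} \ge \tau_h,\qquad
\sum_{h=1}^{k} T_{i,h} = 1,\qquad
T_{i,h} \ge 0,
$$

where T_{i,h} indicates the assignment of point x_i to cluster h, C_h is the centroid of cluster h, and \tau_h is the minimum number of points required in cluster h.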

We can see what this would look like in the figure below. Every point x(i) is a supply node with a value of one and a directed arc to every cluster C. The cost of each arc is the distance between the point and the centroid of the corresponding cluster. The clusters C1, C2, etc. are demand nodes with values equal to the minimum size required. Finally, we add an artificial demand node to make sure that the sum of the supplies equals the sum of the demands.

The optimisation problem seen as a network (Source)

A more detailed explanation can be found in the original paper.
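
As a rough illustration of that construction (the function below and its arguments are hypothetical, not part of the k-means-constrained API), the data fed into a network solver could be assembled like this:

import numpy as np

def build_assignment_network(X, centroids, min_size):
    """Sketch of the supply/demand/cost data behind the assignment network."""
    n, k = X.shape[0], centroids.shape[0]
    # Arc costs: distance from every point to every cluster centroid
    costs = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    supplies = np.ones(n, dtype=int)           # each point supplies exactly one unit
    demands = np.full(k, min_size, dtype=int)  # each cluster demands at least min_size points
    artificial_demand = n - k * min_size       # artificial node balances supplies and demands
    return costs, supplies, demands, artificial_demand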

A Python implementation

A package has been developed by Joshua Levy-Kramer and others; it is available on GitHub at https://github.com/joshlk/k-means-constrained.

It can be installed with pip:

pip install k-means-constrained

It can then easily be applied to a dataset. We'll reuse the one from the previous articles, but we first need to transform our pandas DataFrame into a NumPy array:

import numpy as np

# Reshape the data so that each row is one observation and each column one feature
X = np.array(X).transpose()

We can then fit a KMeansConstrained instance to the data, specifying the number of clusters we want (n_clusters) and the minimum and maximum cluster sizes (size_min and size_max):

from k_means_constrained import KMeansConstrained

# 4 clusters, each containing between 8 and 12 points
clf = KMeansConstrained(
    n_clusters=4,
    size_min=8,
    size_max=12,
    random_state=0,
)
clf.fit_predict(X)
print(clf.cluster_centers_)
print(clf.labels_)

We can then access the centroids with clf.cluster_centers_ and the cluster assignments with clf.labels_.
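
As a quick sanity check (not part of the original snippet), we can count how many points end up in each cluster:

import numpy as np

# Count the points assigned to each cluster label;
# every count should lie between size_min and size_max.
print(np.bincount(clf.labels_))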

As we can see below, we obtain 4 clusters with 8 elements each.

Results with 4 clusters of minimum size=8

So far we have used a very small dataset with only two features, so we can easily visualise the results on a 2D plot. In the next article, we'll see how to select variables when dealing with datasets containing a large number of variables. Reducing the dimensionality usually helps find better clusters.
