Previously, we saw a very simple implementation of K-means and a method for choosing the number of clusters. In some instances, we also want to avoid empty clusters, or clusters of very different sizes. This is where constrained optimisation comes to the rescue.
The algorithm is based on a paper by Bradley et al. (Constrained K-Means Clustering, 2000) and has been implemented by Joshua Levy-Kramer: https://github.com/joshlk/k-means-constrained. It uses the excellent Google OR-Tools library (https://developers.google.com/optimization/flow/mincostflow).
The algorithm uses ideas from Linear Programming, in particular Network Models. Network models are used, among other things, in logistics to optimise the flow of goods across a network of roads.
We can see in the simple figure above that we have 5 nodes with directed arcs (the arrows) between them. Each node has a demand (negative) or supply (positive) value, and each arc has a capacity and a cost per unit of flow. For instance, the arc 2–4 has a capacity of 4 and a unit cost of $2. Similarly, node 1 supplies 20 units and node 4 requires 5 units.
Such problems can be solved with optimisation algorithms such as the simplex algorithm or specialised network flow algorithms.
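To make this concrete, here is a minimal sketch of how such a problem can be set up and solved in Python with OR-Tools' SimpleMinCostFlow solver (the same solver linked above). The network data below is illustrative rather than copied from the figure, and the snake_case method names follow recent OR-Tools releases.
from ortools.graph.python import min_cost_flow

smcf = min_cost_flow.SimpleMinCostFlow()

# Arcs: (tail node, head node, capacity, cost per unit of flow)
arcs = [(0, 1, 15, 4), (0, 2, 8, 4), (1, 3, 4, 2), (2, 3, 10, 1)]
for tail, head, capacity, cost in arcs:
    smcf.add_arc_with_capacity_and_unit_cost(tail, head, capacity, cost)

# Node supplies: positive values supply flow, negative values demand it
for node, supply in enumerate([10, 0, 0, -10]):
    smcf.set_node_supply(node, supply)

if smcf.solve() == smcf.OPTIMAL:
    print("Minimum cost:", smcf.optimal_cost())
    for i in range(smcf.num_arcs()):
        print(f"Arc {smcf.tail(i)} -> {smcf.head(i)}: flow {smcf.flow(i)}")
The solver finds the cheapest way to route all the supply to the demand nodes while respecting each arc's capacity.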
At a very high level, we can represent this constrained K-means problem with a network, where we want to minimise the sum of the distances between each point and the centroid of its cluster, with some constraints on cluster sizes.
We can see what it would look like in the figure below. Every point x(i) is a supply node with a value equal to one and a directed arc to every cluster C. The cost of each such arc is the distance between the point and the centroid of the corresponding cluster. The clusters C1, C2, etc. are demand nodes with values equal to the minimum size required. Finally, we add an artificial demand node to make sure that the sum of the supplies equals the sum of the demands.
A more detailed explanation can be found in the original paper.
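To illustrate the formulation, here is a rough sketch of the assignment step expressed as a min-cost flow problem with OR-Tools. This is not the implementation used by the package below: the function name, the node numbering, and the scaling of distances to integer costs (SimpleMinCostFlow expects integer costs) are all assumptions made for illustration.
import numpy as np
from ortools.graph.python import min_cost_flow

def assign_with_min_sizes(X, centroids, size_min):
    # Node ids: points 0..n-1, clusters n..n+k-1, then one artificial node
    n, k = len(X), len(centroids)
    artificial = n + k
    smcf = min_cost_flow.SimpleMinCostFlow()
    for i in range(n):
        smcf.set_node_supply(i, 1)  # each point supplies exactly one unit
        for j in range(k):
            # scale distances to integers, as required by the solver
            cost = int(round(1000 * np.linalg.norm(X[i] - centroids[j])))
            smcf.add_arc_with_capacity_and_unit_cost(i, n + j, 1, cost)
    for j in range(k):
        smcf.set_node_supply(n + j, -size_min)  # each cluster demands its minimum size
        # points beyond the minimum flow on to the artificial node at zero cost
        smcf.add_arc_with_capacity_and_unit_cost(n + j, artificial, n, 0)
    # the artificial demand balances total supply against total demand
    smcf.set_node_supply(artificial, -(n - k * size_min))
    assert smcf.solve() == smcf.OPTIMAL
    labels = np.empty(n, dtype=int)
    for a in range(smcf.num_arcs()):
        if smcf.tail(a) < n and smcf.flow(a) == 1:
            labels[smcf.tail(a)] = smcf.head(a) - n
    return labels
Because each point-to-cluster arc has capacity one, every point is assigned to exactly one cluster, and the minimum-size constraints are enforced by the cluster demands.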
A Python implementation
A package, k-means-constrained, has been developed by Joshua Levy-Kramer and others; it is available on GitHub (https://github.com/joshlk/k-means-constrained).
It can be installed with pip:
pip install k-means-constrained
It can then easily be applied to a dataset. We'll reuse the one from the previous articles, but we first need to transform our pandas DataFrame into a NumPy array.
import numpy as np

# convert the DataFrame to an array, transposed so each row is one observation
X = np.array(X).transpose()
We can then fit a KMeansConstrained model to the data, specifying the number of clusters we want (n_clusters) and the minimum and maximum cluster sizes (size_min and size_max):
from k_means_constrained import KMeansConstrained
clf = KMeansConstrained(
    n_clusters=4,
    size_min=8,
    size_max=12,
    random_state=0,
)
clf.fit_predict(X)
print(clf.cluster_centers_)
print(clf.labels_)
We can then access the centroids with clf.cluster_centers_ and the cluster assignments with clf.labels_.
As we can see below, we obtain 4 clusters with 8 elements each.
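As a quick sanity check (an illustrative snippet, assuming NumPy is imported as np), we can count the points assigned to each cluster:
# count how many points were assigned to each of the 4 clusters
print(np.bincount(clf.labels_))
With size_min=8 and 32 points in 4 clusters, every cluster ends up at exactly its minimum size.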
So far we have used a very small dataset with only two features, so we can easily visualise the results on a 2D plot. In the next article, we'll see how to select variables when dealing with datasets containing a large number of variables. Reducing dimensionality usually helps find better clusters.