A few useful tweaks for K-Means

  • Our dataset has too many variables and the K-Means algorithm struggles to identify an optimal set of clusters

Constrained K-Means: controlling group size

The algorithm is based on a paper by Bradley et al. and has been implemented by Joshua Levy-Kramer (https://github.com/joshlk/k-means-constrained) using the excellent Google OR-Tools library, specifically its minimum-cost-flow solver (https://developers.google.com/optimization/flow/mincostflow).
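To make the idea of controlling group size concrete, here is a minimal sketch of a K-Means loop whose assignment step caps the size of each cluster. It is not the minimum-cost-flow formulation used by the library above: as a simpler stand-in, it gives every centroid a fixed number of "slots" and solves the resulting assignment problem with SciPy's linear_sum_assignment. The function names capped_assignment and capped_kmeans, the size_max parameter and the toy data are all hypothetical and chosen for illustration only.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def capped_assignment(X, centers, size_max):
    """Assign each point to a centroid so that no cluster exceeds size_max."""
    n_points, n_clusters = X.shape[0], centers.shape[0]
    if n_clusters * size_max < n_points:
        raise ValueError("size_max is too small to place every point")
    # Squared Euclidean distance from every point to every centroid.
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # Give each centroid size_max identical slots (repeated columns),
    # then find the cheapest one-to-one matching of points to slots.
    cost = np.repeat(dist, size_max, axis=1)
    rows, cols = linear_sum_assignment(cost)
    labels = np.empty(n_points, dtype=int)
    labels[rows] = cols // size_max  # map each slot back to its centroid
    return labels


def capped_kmeans(X, n_clusters, size_max, n_iter=20, seed=0):
    """K-Means-style loop with a size-capped assignment step (illustrative)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        labels = capped_assignment(X, centers, size_max)
        # Update step: mean of each cluster; keep the old centroid if empty.
        centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)
        ])
    return labels, centers


# Toy data: two unbalanced blobs (30 and 10 points), invented for the demo.
rng = np.random.default_rng(42)
X_demo = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (10, 2))])
labels_demo, centers_demo = capped_kmeans(X_demo, n_clusters=2, size_max=20)
sizes_demo = np.bincount(labels_demo, minlength=2)
print("cluster sizes:", sizes_demo)
```

Plain K-Means would most likely split this data roughly 30/10, whereas the cap forces some points from the larger blob into the other cluster. The slot trick only handles a maximum size and grows expensive on large datasets, which is one reason a flow-based formulation is attractive when both minimum and maximum sizes are needed.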

An alternative approach, used by Moody’s among others, to rate companies

  1. The silhouette method
from sklearn.metrics import silhouette_samples
sample_silhouette_values = silhouette_samples(X, clusters, metric='euclidean')
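As an illustration of how the silhouette method can be used to compare candidate numbers of clusters, the sketch below fits scikit-learn's KMeans for several values of k and records the average silhouette value for each. The synthetic data from make_blobs, the range of k and the variable names scores and best_k are assumptions made for the example, not part of the original analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic data standing in for the real dataset (assumption for the demo).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

scores = {}
for k in range(2, 8):
    clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # One silhouette value per observation, each between -1 and 1.
    sample_silhouette_values = silhouette_samples(X, clusters, metric='euclidean')
    # The overall score is simply the mean of the per-sample values.
    scores[k] = silhouette_score(X, clusters, metric='euclidean')
    assert np.isclose(scores[k], sample_silhouette_values.mean())
    print(f"k={k}: average silhouette = {scores[k]:.3f}")

# Pick the k with the highest average silhouette.
best_k = max(scores, key=scores.get)
print("best k by average silhouette:", best_k)
```

The per-sample values are what a silhouette plot displays: clusters with many values near zero or below suggest observations sitting between groups, which the average alone can hide.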

If the Olympics in Japan are not postponed, which models could we use to predict how nations will perform? The results of any such analysis will probably be affected by the pandemic, as countries struggle with the virus and some athletes are already suffering from Covid-19. This is a good example of one of Nassim Taleb’s black swan events and of the limitations of forecasting. It looks like the world is more Taleb than Tetlock.

