A few useful tweaks for K-Means

Photo by Patrick Schneider on Unsplash
  • Our dataset has too many variables and the K-Means algorithm struggles to identify an optimal set of clusters

Constrained K-Means: controlling group size

The algorithm is based on a paper by Bradley et al. and has been implemented by Joshua Levy-Kramer (https://github.com/joshlk/k-means-constrained) using the excellent Google OR-Tools library, specifically its minimum-cost flow solver (https://developers.google.com/optimization/flow/mincostflow).
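To give a feel for what "controlling group size" means in the assignment step, here is a minimal sketch in plain Python. It is not the min-cost-flow formulation used by the library above: it is a simplified greedy stand-in that only enforces a maximum cluster size, and the function name `assign_with_max_size` and its parameters are hypothetical, chosen for illustration.

```python
import math


def assign_with_max_size(points, centroids, size_max):
    """Assign each point to the nearest centroid that still has room.

    Simplified greedy illustration (an assumption of this sketch), NOT the
    min-cost-flow assignment described in the text.
    """
    if size_max * len(centroids) < len(points):
        raise ValueError("size_max * n_clusters must be >= number of points")

    # Every (distance, point index, cluster index) triple, closest pairs first.
    pairs = sorted(
        (math.dist(p, c), i, k)
        for i, p in enumerate(points)
        for k, c in enumerate(centroids)
    )

    labels = [None] * len(points)
    counts = [0] * len(centroids)
    for _, i, k in pairs:
        # Take the pair only if the point is still free and the cluster has room.
        if labels[i] is None and counts[k] < size_max:
            labels[i] = k
            counts[k] += 1
    return labels


# Three points sit near the first centroid, but each cluster may hold only two.
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0)]
centroids = [(0.0, 0.0), (5.0, 5.0)]
labels = assign_with_max_size(points, centroids, size_max=2)
print(labels)
```

In this toy example the third point is pushed to the second cluster because the first one is already full. A flow-based solver would instead choose the assignment that minimises the total distance under the size constraints, which a greedy pass does not guarantee.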

An alternative approach used by Moody’s, among others, to rate companies

Photo by Dylan Gillis on Unsplash

Photo by Edu Grande on Unsplash

Image source: Giphy (https://giphy.com/gifs/l41YtZOb9EUABnuqA)
  1. The silhouette method
from sklearn.metrics import silhouette_samples
sample_silhouette_values = silhouette_samples(X, clusters, metric='euclidean')
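For readers curious about what the per-sample silhouette value measures, the sketch below computes it by hand in plain Python. For each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the score is (b - a) / max(a, b). The function name `silhouette_samples_sketch` and the toy data are assumptions made for illustration; in practice the scikit-learn call above is the one to use.

```python
import math


def silhouette_samples_sketch(X, labels):
    """Per-sample silhouette s(i) = (b - a) / max(a, b), Euclidean distance."""
    # Group sample indices by cluster label.
    members_by_label = {}
    for idx, lab in enumerate(labels):
        members_by_label.setdefault(lab, []).append(idx)
    if len(members_by_label) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in members_by_label[lab] if j != i]
        if not own:
            # Usual convention: a point alone in its cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other points of the same cluster.
        a = sum(math.dist(X[i], X[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(math.dist(X[i], X[j]) for j in members) / len(members)
            for other, members in members_by_label.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Two tight, well-separated groups: every score should be close to 1.
X = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
clusters = [0, 0, 1, 1]
values = silhouette_samples_sketch(X, clusters)
print(values)
```

Values near 1 indicate points that sit well inside their cluster, values near 0 indicate points on the border between two clusters, and negative values suggest a point may have been assigned to the wrong cluster.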

Photo by Gertrūda Valasevičiūtė on Unsplash

Image Source: https://www.nytimes.com/2019/10/01/travel/the-tokyo-2020-olympics-what-you-need-to-know.html


If the Olympics in Japan are not postponed, which models could we use to predict how nations will perform? The results of the analysis will probably be affected by the pandemic, as countries struggle with the virus and some athletes are already suffering from Covid-19. This is a good example of one of Nassim Taleb’s black swan events and of the limitations of forecasting. It looks like the world is more Taleb than Tetlock.

Frederic Marthoz

MSc in Mathematics and Statistics, Data Analytics
