When using K-means, we can face two issues:

- We end up with **clusters of very different sizes**, some containing thousands of observations and others just a few
- Our dataset has too **many variables** and the K-means algorithm struggles to identify an optimal set of clusters


Bankruptcy prediction has been a very active field of research for many years. Important papers include Edward Altman’s 1968 *Financial Ratios, Discriminant Analysis and the Prediction of Corporate Bankruptcy*, which gave birth to his famous *Z-score*, still used today, and James Ohlson’s 1980 *Financial Ratios and the Probabilistic Prediction of Bankruptcy*, with its *O-score*, also still applied today. Those papers, and many after them, are based on **accounting data** and financial ratios. …

When we are dealing with high-dimensional datasets, we can run into issues with clustering methods. Feature selection is a well-known technique for supervised learning, but much less so for unsupervised learning methods like clustering. Here we’ll develop a relatively simple greedy algorithm to perform variable selection on the Europe Datasets on Kaggle.

The algorithm will have the following steps:

0. Make sure the variables are numeric and scaled, for example using StandardScaler() and its fit_transform() method
1. Choose the maximum number of variables you want to retain (*maxvars*), the minimum and maximum number of clusters (*kmin* and *kmax*) and create an…
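Since the list above is truncated, here is a minimal sketch of one plausible greedy forward-selection loop. The function name `greedy_select` and the use of the silhouette score as the selection criterion are assumptions for illustration, not the article's own code; only *maxvars*, *kmin*, *kmax* and the scaling step come from the text.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def greedy_select(X, maxvars, kmin, kmax, random_state=0):
    """Greedy forward selection: repeatedly add the variable that most
    improves the best silhouette score over k in [kmin, kmax].
    (Hypothetical sketch; the criterion is an assumption.)"""
    X = StandardScaler().fit_transform(np.asarray(X, dtype=float))
    selected, remaining = [], list(range(X.shape[1]))
    best_overall = -1.0  # silhouette scores live in [-1, 1]
    while remaining and len(selected) < maxvars:
        scores = {}
        for j in remaining:
            cols = X[:, selected + [j]]
            # best silhouette over the allowed range of cluster counts
            scores[j] = max(
                silhouette_score(
                    cols,
                    KMeans(n_clusters=k, n_init=10,
                           random_state=random_state).fit_predict(cols),
                )
                for k in range(kmin, kmax + 1)
            )
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best_overall:
            break  # no remaining variable improves the clustering
        best_overall = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected, best_overall
```

The early stop means the algorithm may keep fewer than *maxvars* variables when adding more only degrades cluster quality.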

Previously, we have seen a very simple implementation of K-means and a method to choose the number of clusters. In some instances, we also want to avoid having empty clusters, or clusters of very different sizes. Here comes constrained optimisation to the rescue.

The algorithm is based on a paper by Bradley et al. and has been implemented by Joshua Levy-Kramer: https://github.com/joshlk/k-means-constrained, using the excellent Google OR-Tools library (https://developers.google.com/optimization/flow/mincostflow)

The algorithm uses ideas from Linear Programming, in particular Network Models. Network models are used, among other things, in logistics to optimise the flow of goods across a network of roads.

There are two popular methods:

- The elbow method
- The silhouette method

We’ll focus on the silhouette method in this article. The silhouette method proceeds as follows:

For every single data point i in the dataset we calculate:

a(i) = the average distance from that point to all other points in its cluster

b(i) = the lowest average distance from that point to all points in a different cluster, i.e. the average distance to its nearest neighbouring cluster

And the silhouette value for this point is then:

s(i) = (b(i) - a(i)) / max{a(i), b(i)}
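As a sanity check, the a(i) and b(i) definitions above can be computed by hand and compared against scikit-learn's silhouette_samples. This is a sketch assuming euclidean distances and no singleton clusters:

```python
import numpy as np

def silhouette_by_hand(X, labels):
    """Compute s(i) = (b(i) - a(i)) / max(a(i), b(i)) for every point."""
    n = len(X)
    # pairwise euclidean distance matrix
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    s = np.zeros(n)
    idx = np.arange(n)
    for i in range(n):
        own = labels == labels[i]
        # a(i): average distance to the other points in i's own cluster
        a = D[i, own & (idx != i)].mean()
        # b(i): smallest average distance to the points of any other cluster
        b = min(D[i, labels == c].mean() for c in set(labels) - {labels[i]})
        s[i] = (b - a) / max(a, b)
    return s
```

The O(n²) distance matrix makes this suitable only for small datasets; scikit-learn's implementation is the one to use in practice.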

Those silhouette values can then be plotted in a silhouette plot for every point in the dataset, grouped by clusters:

```python
sample_silhouette_values = silhouette_samples(X, clusters, metric='euclidean')

y_lower = 0

for…
```

In this series, we’ll cover the K-means algorithm, starting with a simple Python implementation to understand the logic behind it. We’ll see how we can choose the optimal number of clusters and then we’ll move to a constrained version, where we want to impose minimum or maximum size limits on the clusters. Finally, we’ll build a simple greedy algorithm for variable selection for K-means, especially useful when we are dealing with datasets with many features.

**The idea behind K-means**

K-means is an unsupervised clustering algorithm: it helps identify groups with certain similarities within a dataset (note: the dataset should…
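To make the logic concrete, here is a from-scratch sketch of the two alternating steps at the heart of K-means. This is an illustration with assumed names, not the article's own implementation:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and update steps."""
    rng = np.random.default_rng(seed)
    # initialise centroids by picking k distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # assignment step: each point joins its nearest centroid
        labels = np.linalg.norm(
            X[:, None] - centroids[None], axis=-1).argmin(axis=1)
        # update step: each centroid moves to the mean of its points
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break  # converged: centroids stopped moving
        centroids = new
    return centroids, labels
```

The empty-cluster guard simply keeps the old centroid, one of several common ways to handle that degenerate case.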

If the Olympics in Japan are not postponed, what models could we use to predict how nations will perform? The results of the analysis will probably be affected by the pandemic, as countries struggle with the virus and some athletes are already suffering from Covid-19. This is a good example of one of Nassim Taleb’s black swan events and of the limitations of forecasting. It looks like the world is more Taleb than Tetlock.

For this project we will use Olympic Games data from 2000 to 2012 to predict the total number of medals and the number of…