In unsupervised learning, choosing the right number of clusters can have a dramatic impact on clustering results (and their usefulness). Here, I will discuss some techniques that can be used to determine the optimal number of clusters to use in our analysis. Obviously, I am not going to consider algorithms that can determine the number automatically (like density-based algorithms, hierarchical algorithms, or GMM).
Silhouette Score
The silhouette score quantifies how well each data point fits into its assigned cluster compared to other clusters. It is defined in the interval $[-1, 1]$, with values close to 1 indicating that the cluster assignment is appropriate. The optimal number of clusters maximises the average silhouette score across all data points.
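For reference, the silhouette of a single point $i$ is $s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$, where $a(i)$ is the mean distance from $i$ to the other points in its own cluster and $b(i)$ is the smallest mean distance from $i$ to the points of any other cluster; the score used below is the average of $s(i)$ over all points.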
In code it can be calculated as:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

# Compute the average silhouette score for k = 2..10
silhouette_scores = []
for i in range(2, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(data)
    score = silhouette_score(data, kmeans.labels_)
    silhouette_scores.append(score)

plt.plot(range(2, 11), silhouette_scores)
plt.title('Silhouette Score')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.show()

In this case the optimal number of clusters is 8, but 9 and 10 are also close to that.
The silhouette score provides an intuitive measure of cluster cohesion and separation, considering both intra-cluster similarity and inter-cluster distance. This makes it a robust metric for determining the optimal number of clusters, with the additional benefit of being less sensitive to the shape and density of clusters than other methods.
Davies-Bouldin index
This index measures the average similarity between each cluster and the cluster it most resembles, where similarity weighs intra-cluster scatter against the distance between cluster centroids. It therefore considers both intra-cluster compactness and inter-cluster separation, with lower values indicating better clustering.
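Formally, if $s_i$ is the average distance of the points in cluster $i$ to its centroid and $d_{ij}$ is the distance between the centroids of clusters $i$ and $j$, the index for $k$ clusters is $DB = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{s_i + s_j}{d_{ij}}$.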
In code this calculation can be performed like this:
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score
import matplotlib.pyplot as plt

# Compute the Davies-Bouldin index for k = 2..10 (lower is better)
db_scores = []
for i in range(2, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(data)
    score = davies_bouldin_score(data, kmeans.labels_)
    db_scores.append(score)

plt.plot(range(2, 11), db_scores)
plt.title('Davies-Bouldin Index')
plt.xlabel('Number of Clusters')
plt.ylabel('Davies-Bouldin Index')
plt.show()

Similarly to the silhouette score, the Davies-Bouldin index identifies the optimal number of clusters for the dataset (the same as above) as 8.
The Davies-Bouldin index, too, provides a measure of intra-cluster compactness together with an indication of inter-cluster separation.
Elbow Method
The elbow method involves plotting the within-cluster sum of squares (WCSS) against the number of clusters. As the number of clusters increases, the WCSS typically decreases. However, beyond a certain point, the rate of decrease slows down, resulting in an “elbow” point on the plot. This elbow point signifies the optimal number of clusters.
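Concretely, for $k$ clusters $C_1, \dots, C_k$ with centroids $\mu_1, \dots, \mu_k$, $WCSS = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2$; scikit-learn exposes this quantity as the inertia_ attribute of a fitted KMeans model.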
Working on the same dataset as the previous cases, we can apply the Elbow method as follows:
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Record the within-cluster sum of squares (inertia) for k = 1..10
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(data)
    wcss.append(kmeans.inertia_)

plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()

The graph shows the results of applying the Elbow method alongside its biggest weakness: the elbow is not always clearly defined, so it can be difficult to draw a definitive conclusion from the plot alone, and the method is best used in conjunction with other techniques.
Even though it is not always enough on its own, the elbow method offers a visual representation of the trade-off between the number of clusters and the WCSS, and it can be helpful in some situations.
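If a more objective reading of the plot is needed, the elbow can also be located programmatically. The following is a minimal sketch, assuming the third-party kneed package is installed and reusing the wcss list computed above:

from kneed import KneeLocator

# The WCSS curve is convex and decreasing; find its point of maximum curvature
kl = KneeLocator(range(1, 11), wcss, curve='convex', direction='decreasing')
print(f'Elbow located at k = {kl.elbow}')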
Gap statistics
Gap statistics compare the total intracluster variation for different numbers of clusters with their expected values under a null reference distribution. The optimal number of clusters is where the gap statistic is maximized. This method takes into account both the dispersion within clusters and the overall dataset’s structure.
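In formulas, $\mathrm{Gap}(k) = \mathbb{E}^*[\log W_k] - \log W_k$, where $W_k$ is the pooled within-cluster sum of squares for $k$ clusters and the expectation $\mathbb{E}^*$ is estimated by clustering several reference datasets drawn from the null distribution (typically uniform over the range of the data).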
In code it can be implemented like this:
import numpy as np
from gap_statistic import OptimalK  # third-party gap-stat package

# Search k = 1..10 and return the k suggested by the gap statistic
optimalK = OptimalK(parallel_backend='multiprocessing')
n_clusters = optimalK(data, cluster_array=np.arange(1, 11))
print(f'Optimal number of clusters: {n_clusters}')
The value returned by this method in our situation is 9, which is in line (although not perfectly) with previous analyses.
The gap statistic compares the observed within-cluster variation with a null reference distribution, aiming to identify the number of clusters at which the gap is maximized. This approach can provide valuable insight into the cluster structure, at the price of computational intensity and sensitivity to the choice of reference distribution. These factors can limit the usefulness of this technique in some scenarios.
Average Distance to Centroid
This technique involves calculating the average distance between each data point and its assigned cluster centroid. The optimal number of clusters corresponds to a significant decrease in the average distance, indicating tight and well-separated clusters.
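In symbols, with centroids $\mu_1, \dots, \mu_k$, the quantity plotted below is $\bar{d} = \frac{1}{n} \sum_{i=1}^{n} \min_j \lVert x_i - \mu_j \rVert$, i.e. the mean distance of each point to its nearest centroid.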
It can be implemented and visualised like this:
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

avg_distances = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(data)
    centroids = kmeans.cluster_centers_
    # Mean distance of each point to its nearest centroid
    avg_distance = np.mean(np.min(cdist(data, centroids, 'euclidean'), axis=1))
    avg_distances.append(avg_distance)

plt.plot(range(1, 11), avg_distances)
plt.title('Average Distance to Centroid')
plt.xlabel('Number of Clusters')
plt.ylabel('Average Distance')
plt.show()

In this case the optimal number of clusters is identified to be 10, which somewhat aligns with the previous techniques.
The average distance to centroid offers a straightforward way to evaluate cluster tightness and separation, even though it may not capture the full complexity of cluster structures, especially in datasets with irregular shapes or varying cluster densities. Another potential issue is that average distance values can be less intuitive to interpret than other metrics, and that the choice of distance metric should depend on the dataset, its dimensionality, and its characteristics; this choice can have a strong impact on the final result.
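As a minimal illustration of that last point, swapping the metric only requires changing the argument passed to cdist; for example, using the Manhattan distance instead of the Euclidean one in the snippet above:

# Same computation as above, but with the Manhattan (cityblock) distance
avg_distance = np.mean(np.min(cdist(data, centroids, 'cityblock'), axis=1))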