Class adapters using dynamic inheritance: a terrible idea turned into a crazy implementation

The background

It all started when two separate teams were working on a very similar problem within the same domain. The structure behind the challenges we were facing was basically the same but we decided to tackle it in two different ways (we didn’t know we were looking at similar problems yet).

My team took a strongly functional approach, whereby we would implement specialised functions for all the possible cases and pass them around as operators to our routines.

For example:

import pandas as pd


def specialised_operator(data: pd.DataFrame = None) -> pd.DataFrame:
    # perform a very specific task
    ...


def data_prep_operator(data: pd.DataFrame = None) -> pd.DataFrame:
    ...


def execution_routine(*operators, **kwargs):
    # run each operator in turn on the same inputs
    for operator in operators:
        operator(**kwargs)

This is obviously an extremely simplified version to give you an idea…

The other team decided to define a class hierarchy using a Factory pattern to allow for specialised classes that catered to different use cases. Something like:

import abc


class BaseOperator(abc.ABC):
    # a bunch of abstract methods and properties
    # a bunch of concrete methods
    ...


class OperatorCase1(BaseOperator):
    # overriding what needs to be overridden
    # implementing what needs to be implemented
    ...


class OperatorCase2(BaseOperator):
    # overriding what needs to be overridden
    # implementing what needs to be implemented
    ...

Again, a very simplified version of the actual implementation.

This was the situation when we realised that we were doing essentially the same thing and that it would be a damn good idea to merge our approaches and not duplicate our efforts.

Given the complexity of the code and the expected growth of our application, there was no immediate reason to prefer one method over the other… The main strength of the class-based approach (and I tend not to favour classes if I don’t need them) was that the team had also implemented a full-fledged pipeline orchestration suite based on the interface defined by the BaseOperator class…

Now, that was a compelling reason to use the class approach… We still had one problem though: converting the whole functional pipeline into the class hierarchy…

A terrible idea

I consider myself to be a decent developer, and like many other decent developers I am constantly struggling to write as little code as possible… Therefore, my very first thought was: “Can I dynamically encapsulate a function as part of an existing class and return an instance of said class, for which the function gets promoted to a method?”.

Or in simpler terms: can I turn the function into a method of a class and dynamically add BaseOperator as a parent?

This would be a reasonably simple task, which we could solve like this:

class OperatorAdapter(BaseOperator):

    def __init__(self, func):
        self._func = func
    
    def execute(self, *args, **kwargs):
        """
        Overriding the execution of BaseOperator with our function
        """
        return self._func(*args, **kwargs)
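
For instance, one of the existing functions could be wrapped like this (a hypothetical usage sketch; `some_dataframe` is just a placeholder for the function’s input):

# wrap one of the existing functional operators so it satisfies the BaseOperator interface
op = OperatorAdapter(specialised_operator)
result = op.execute(data=some_dataframe)  # 'some_dataframe' is a placeholder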

The problem was that the pipeline didn’t just inherit from a single BaseOperator… The hierarchy was quite a bit deeper, with several specialised Base*Operator classes.

And every Base*Operator would have multiple classes inheriting from it. The pipeline depended on having the right operator at the right stage… Doing this manually for every function would have meant rewriting most of our application, and that would have been almost the same effort as adapting everything into classes inheriting from the appropriate base. I wanted to be able to do this dynamically and quickly!

I read many blogs looking for the answer to this, and the most relevant answer I found was, and I quote, “Why the hell would you want to do this??”

It’s a strong argument…

A crazy implementation

At least until I remembered how Python types work… It turns out you can do exactly what I was trying to achieve with the `type` function: you CAN add a parent class to a Python object at runtime using nothing but the language primitives!

# a 'type' call with three arguments is essentially a dynamic version of the 'class' statement

# this
class X(Parent):
    a = 1

# is equivalent to this
X = type('X', (Parent, ), dict(a=1))

We then used this to build a generic operator that let us inject our code wherever necessary…

The final adapter looked something like this, similar to the above definition, but without explicitly inheriting from anything:

class Operator:

    def __init__(self, func):
        self._func = func
    
    def execute(self, *args, **kwargs):
        return self._func(*args, **kwargs)

The instantiation of each operator then looked like this:

# instantiate an operator with the embedded function
op = Operator(function)

# create a dictionary of the properties we need in the new class
properties = {
    name: getattr(op, name) for name in dir(op) if name in properties_we_need
}

# dynamically add inheritance as:
new_op = type("op", (TheRelevantParentClass, ), properties)

# and now new_op inherits from whatever BaseOperator we required...

The next step to make everything more obfuscated was to embed this process in a decorator so that the code changes on our part amounted to sprinkling decorators on top of our function definitions and nothing more…

## old code

def do_something():
    ...


## new code

@inherit(BaseDataPrepOperator)
def do_something():
    ...

This allowed us to adapt an entire codebase with just a handful of lines of code and a few decorators here and there.
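
For reference, here is a minimal sketch of what such an `inherit` decorator could look like. This is an assumption on my part rather than the actual implementation (which also had to carry over a few more attributes):

def inherit(parent_cls):
    """Hypothetical sketch: turn a function into a class deriving from parent_cls."""
    def decorator(func):
        namespace = {
            # expose the decorated function as the 'execute' method of the generated class
            "execute": lambda self, *args, **kwargs: func(*args, **kwargs),
        }
        # same trick as above: a three-argument 'type' call builds the class dynamically
        return type(func.__name__, (parent_cls,), namespace)
    return decorator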

All in all a fun experience which taught us a lot of fun things, including that dynamic inheritance can get complex FAST!

Analysis of sequences with High Order Markov chains

NOTE: This post is based on a talk I recently (2018/05) gave at Pycon9 in Firenze.

I have been working as a Data Scientist for a while now, and I can safely say that I got to know a few things. The main thing I learned is that, when you can abstract a problem you are working on and derive generic tools out of it to solve similar problems, you should DEFINITELY go ahead and do it. Even if you won’t end up using your solution for your current project, it will almost certainly be useful for something else in the future (or for someone else).

This is what drove me to spend quite some time implementing a library to help during a project I was working on a few months ago.

The problem

Clearly, I can’t say too much about the project, the data, the issue, or the specific solution we ended up using. I’ll have to get creative and keep everything extremely vague and mysterious… Sorry…

Essentially: we had a system used by internal and external users. Users would go through a series of steps (we’ll call these ‘states’ or ‘nodes’) in sequence; each step was associated with an outcome. Depending on how each session ended, we had sessions that were problematic and sessions that were not.

Secretive much?
I know… I know…

Our project was to analyse these sequences (the sessions) in order to determine what drives good vs. bad final outcomes.

I know it sounds complex when described like this, and it actually was pretty hard to get right…

The side project

While working on that project, we used Markov Chains to describe some sequences and to determine which nodes were responsible for bad user outcomes. The problem was: we were using 1st-order Markov Chains, whereas our problem would have been better defined by a 2nd-order Markov Chain (more details below).

That’s when I decided to start working on a general library that could implement Markov Chains of any order.

But first things first…

What is a Markov Chain

A Markov chain is a collection of random variables X_t having the property that, given the present, the future is conditionally independent of the past:

P\left(X_t=j|X_0=i_0,X_1 = i_1,...,X_{t-1}=i_{t-1}\right) = P \left(X_t=j|X_{t-1} = i_{t-1}\right)

That is, the probability of being in a certain state depends ONLY on the previous state.

This sounds like a big approximation that should have nothing to do with real life… and yet it works quite well and has a lot of practical applications.

Anyway, given this definition, we can construct a simple example to show how exactly a MC works:

Suppose we have a sequence:

1,2,1,2,1,2,3,1,2

Based on these numbers we can say that:

  • If the current state is 1, there is a 100% probability of moving to state 2;
  • if the current state is 2, there is a 66% probability of evolving to state 1 and a 33% probability of evolving to state 3;
  • if the current state is 3, there is a 100% probability of evolving to state 1.

And this is, in essence, all there is to it. We can represent this evolution using different techniques: we can use a matrix to represent the probabilities of evolving from one state to another (useful for calculations), or we can use a graph to visualise all the paths.

The transition matrix and the graph are two views of the exact same representation of the numeric sequence.
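
To make the example concrete, here is a quick sketch in plain numpy (not the library described later) that builds the transition matrix for the sequence above and recovers those percentages:

import numpy as np

sequence = [1, 2, 1, 2, 1, 2, 3, 1, 2]
states = sorted(set(sequence))                  # [1, 2, 3]
index = {s: i for i, s in enumerate(states)}    # state -> row/column index

# count transitions from each state to the one that follows it in the sequence
T = np.zeros((len(states), len(states)))
for current, nxt in zip(sequence, sequence[1:]):
    T[index[current], index[nxt]] += 1

# normalise each row so it sums to 1, turning counts into probabilities
T = T / T.sum(axis=1, keepdims=True)
print(T)
# approximately:
# [[0.     1.     0.   ]
#  [0.667  0.     0.333]
#  [1.     0.     0.   ]]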

How is this useful?

Well…

Now we have a representation of the probabilities of evolution and an initial state. We can derive the (probabilistic) future state.

The next state in the sequence can be represented as:

S_{t+1}=S_t \times T

where T is the transition matrix.

This can be extended to an arbitrary number of steps by taking the matrix power of the transition matrix:

S_{t+N}=S_t \times T^N
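
Continuing the numpy sketch above (reusing the `T`, `states` and `index` names defined there), propagating a state one or more steps forward looks like this:

# one-hot vector for "currently in state 2"
S_t = np.zeros(len(states))
S_t[index[2]] = 1.0

# one step ahead: S_{t+1} = S_t x T
print(S_t @ T)                              # approximately [0.667, 0., 0.333]

# N steps ahead: S_{t+N} = S_t x T^N
N = 3
print(S_t @ np.linalg.matrix_power(T, N))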

Another extension to this definition is to include subsequences of states in the transition matrix, so that you’re not only considering the current state when predicting the next, but also the previous (or more).
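
For instance, in a 2nd-order chain over three states the “current state” becomes a pair of consecutive observations, so the transition matrix is indexed by all such pairs (a quick sketch of the enumeration):

import itertools

# all possible composite states of a 2nd-order chain over 3 states: 3**2 = 9 of them
composite_states = list(itertools.product(range(3), repeat=2))
print(composite_states)
# [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)]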

A bit of code

NOTE: for the sake of clarity, the code shown below is not complete and lacks a lot of necessary features (error handling, docstrings, ...).
You should NOT run it as is...

Class initialisation:

import itertools

import numpy as np
from scipy import sparse
from sklearn import preprocessing  # used for row normalisation further down


class MarkovChain(object):

    def __init__(self, n_states, order=1):
        self.number_of_states = n_states
        self.order = order
        # map every possible tuple of 'order' consecutive states to a row/column index
        self.possible_states = {
            j: i for i, j in enumerate(
                itertools.product(range(n_states), repeat=order)
            )
        }
        # empty sparse transition matrix, one row/column per possible (composite) state
        self.transition_matrix = sparse.dok_matrix(
            (len(self.possible_states), len(self.possible_states)),
            dtype=np.float64,
        )

Essentially, we generate all the possible states (whose number depends on the order of the MC), and we initialise an empty `dok` matrix from `scipy.sparse`.
We use a `dok` matrix because it’s fast to initialise and update, but it’s not the most efficient structure for computation, as we’ll see below.

Now that we have the initialiser for the parameters and the transition matrix, we need a method that updates it when a state change is detected in a sequence:

def update_transition_matrix(self, states_sequence):
    # all the length-'order' windows appearing in the sequence
    visited_states = [
        states_sequence[i: i + self.order]
        for i in range(len(states_sequence) - self.order + 1)
    ]
    # stop 'order' windows before the end so the lookahead below stays in range
    for state_index, current in enumerate(visited_states[:-self.order]):
        self.transition_matrix[
            self.possible_states[tuple(current)],
            self.possible_states[tuple(visited_states[state_index + self.order])]
        ] += 1


def normalise_transitions(self):
    # rows must sum to 1 to be interpretable as probabilities
    self.transition_matrix = preprocessing.normalize(
        self.transition_matrix, norm="l1"
    )

This snippet of code takes a sequence of states as argument and builds the list of state windows that need updating. Then it adds `1` to the transition matrix at the appropriate position. This is basically a count of all the times the sequence goes from state `i` to state `j`.
Now, since the transition matrix holds probabilities, we need to make sure all the rows are normalised. That is the purpose of `normalise_transitions`.
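
To make this concrete, assuming the two methods above are defined on the MarkovChain class sketched earlier, fitting the toy sequence from the first example (relabelled to 0-2) would look roughly like this:

# hypothetical usage of the class sketched above
mc = MarkovChain(n_states=3, order=1)
mc.update_transition_matrix([0, 1, 0, 1, 0, 1, 2, 0, 1])  # 1, 2, 3 relabelled to 0, 1, 2
mc.normalise_transitions()
print(mc.transition_matrix.toarray())
# rows approximately [[0, 1, 0], [0.667, 0, 0.333], [1, 0, 0]]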

We have all the building blocks to start making predictions now!

def predict_state(self, current_state, num_steps=1):
    # propagate the state vector through the chain: S_{t+N} = S_t x T^N
    # note: '**' on a scipy sparse matrix is a matrix power, not an element-wise one
    _next_state = sparse.csr_matrix(current_state).dot(
        self.transition_matrix ** num_steps
    )
    return _next_state

And there we go. We have all the pieces in place to start using this.
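
A quick sanity check on the same toy chain (again assuming `predict_state` is a method of the class above):

# starting from the second state, look one step ahead
current = [0, 1, 0]
print(mc.predict_state(current, num_steps=1).toarray())
# approximately [[0.667, 0, 0.333]]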

A quick demo

I built a small interactive tool to help me use this library (not included in the repo :p ).

The library requires the sequences to be numeric, with each number corresponding to a unique state. To achieve the best performance, I relabelled the data so that it spanned from 0 to the number of states.
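
A possible relabelling step could look like this (a sketch: `raw_sequences` is a hypothetical list of sequences with arbitrary labels):

# map every distinct raw label to an integer in 0..n_states-1
unique_labels = sorted({label for seq in raw_sequences for label in seq})
encoding = {label: i for i, label in enumerate(unique_labels)}
encoded_sequences = [[encoding[label] for label in seq] for seq in raw_sequences]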

Having fit the model with the sequences, I built a small tool with NetworkX, Jupyter widgets and some good old Python magic to let me inspect the future of a generic state.

In this case I used a 2nd-order MC (seen in the `inital_state` box, as it contains two values).

And that’s all folks!

If you liked it or if you think it might be useful to you, the code is open source (link below).

Slides and code snippets are available on my Github page: http://bit.ly/2H8giAW