Introduction

With some free time these past weeks, I explored resources that had been sitting in my backlog. And I was astonished. There are so many cool (and not necessarily well-known) machine learning packages that never made it to the top of the charts. They remain niche even though they tackle fundamental challenges in machine learning.

So here they are: 9 really useful ML packages to learn and use at work: interpretML, dtreeviz, CatBoost, plaidML, mlflow, kedro, sklearn_pandas, Streamlit and pandas_profiling.

They are presented within the following topics:

  • Model interpretation (interpretML, dtreeviz)
  • Model building (CatBoost, plaidML)
  • Model industrialization (mlflow, kedro, sklearn-pandas)
  • Apps creation (Streamlit)
  • Data auditing (pandas_profiling)

Model interpretation

1-InterpretML

Developed by Microsoft as an open source project, InterpretML is “a toolkit to help understand models and enable responsible machine learning”. Four packages are part of this project, two of which are my favorites:

interpret (2.7k stars on GitHub) allows you to build a GA2M model (coined “Explainable Boosting Machine” or “EBM”) for classification or regression problems. GA2M is short for “Generalized Additive 2 Model”, a family of models that are both highly accurate and highly interpretable.

Generalized Additive Models (or GAM) equation. The model’s final prediction is the sum of the impact of each variable taken independently. It is labelled a ‘white-box’ model. The impact of each variable can be plotted individually and understood simply by a human.
Generalized Additive 2 Models (or GA2M) equation. It is the natural evolution of GAM models, with multiple combinations of 2 variables. This model remains ‘white-box’ and can be interpreted and understood by humans more easily than classic Gradient Boosted Trees models.
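
For reference, the two equations described in the captions above are usually written as follows. This is standard textbook notation rather than a copy of the original images: g is a link function, and f_j and f_{ij} are the learned per-variable and pairwise functions.

```latex
% GAM: one learned function per variable, summed
g(\mathbb{E}[y]) = \beta_0 + \sum_{j} f_j(x_j)

% GA2M: the GAM terms plus a selection of pairwise interaction terms
g(\mathbb{E}[y]) = \beta_0 + \sum_{j} f_j(x_j) + \sum_{i < j} f_{ij}(x_i, x_j)
```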

This family of models is almost as accurate as classic Gradient Boosted Trees models (like LightGBM, CatBoost or XGBoost) but is far more interpretable. In most business problems, interpretability is key, but it usually comes at the cost of lower accuracy than advanced models could achieve. These models let you keep high accuracy together with a relatively high degree of interpretability. I expect them to become a standard in the machine learning community, in the sense that data scientists will try to implement GA2M instead of XGBoost.
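
To make the “sum of per-variable impacts” idea concrete, here is a minimal sketch in plain Python. It is not how the interpret package is implemented: the functions, the intercept and all the numbers are made up for illustration. The point is that each term's contribution can be read separately before everything is summed.

```python
# Minimal sketch of a GA2M-style additive prediction (not InterpretML's code).
# All functions and numbers below are invented for illustration.
import math

INTERCEPT = -1.0


def f_age(age):
    # Hypothetical per-variable term: contribution rises with age, then flattens at 50.
    return 0.04 * (min(age, 50) - 35)


def f_hours(hours):
    # Hypothetical per-variable term for weekly working hours.
    return 0.03 * (hours - 40)


def f_age_hours(age, hours):
    # Hypothetical pairwise term: a bonus for older people working long hours.
    return 0.5 if (age > 45 and hours > 50) else 0.0


def explain(age, hours):
    """Return each term's contribution to the log-odds, keyed by term name."""
    return {
        "intercept": INTERCEPT,
        "age": f_age(age),
        "hours": f_hours(hours),
        "age x hours": f_age_hours(age, hours),
    }


def predict_proba(age, hours):
    """Sum the contributions and map the log-odds to a probability."""
    log_odds = sum(explain(age, hours).values())
    return 1.0 / (1.0 + math.exp(-log_odds))


for term, value in explain(age=50, hours=60).items():
    print(f"{term:>12}: {value:+.2f}")
print(f"probability: {predict_proba(50, 60):.3f}")
```

Because the prediction is just a sum, plotting f_age over a range of ages gives exactly the kind of per-variable curve an EBM displays.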

Output of a GA2M (“EBM”) trained on the Adult Income dataset (url). You can see that the impact of the variable ‘Age’ is additive based on the shape of the curve.

More on the topic of GA2M: https://blog.fiddler.ai/2019/06/a-gentle-introduction-to-ga2ms-a-white-box-model/

interpret-community (only 123 stars on GitHub) is a project bringing AzureML’s capabilities to the open source community. Its most interesting feature is the fabulous TabularExplainer, which allows you to:

  • plot 2 variables with colour coding on a 3rd variable (a variable can also be the prediction of the model)
  • plot the overall most predictive variables
  • plot the local explanations (reasons for individual predictions taken separately)
  • run simulations using a what-if scenario logic

And this works on any given model. [Note: the UI is very Microsoft-friendly, which is what you get from open source developed by private companies 🙂]

Main dashboard created by TabularExplainer, allowing you to visualise your data, the model’s predictions and the variable importances together. I use the Titanic dataset and plot the probability of dying against the variable ‘Age’ (scaled), with gender as the colour coding. You can see that my model outputs 5 different probabilities of dying (with a few exceptions), that females usually have a below-50% chance of dying (red dots), and that the oldest people have a >50% chance of dying (‘Age’ above 0.8).
You may select an individual data point and visualise the relative variable importances for its prediction.
You may select an individual data point and run different scenarios by manually changing the input.

The last 2 packages of the interpretML project, interpret-Text and DiCE, help respectively to interpret text classification models and to create “Diverse Counterfactual Explanations” (perturbations of a data point that give the opposite prediction). Both look really cool, even though I haven’t had time to investigate them in depth yet.

With interpret-Text you can visualise the impact of individual words on a binary classification model (url)
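
The counterfactual idea mentioned above can be sketched in a few lines of plain Python. This is a toy brute-force search, not DiCE's algorithm: the loan model, the feature names and the step sizes are all hypothetical. For each feature, it looks for the smallest change that flips the model's prediction.

```python
# Toy counterfactual search (illustrative only; this is not DiCE's algorithm).
# The model, the features and the step sizes are all hypothetical.

def predict(point):
    """Hypothetical loan model: approve (1) when the score reaches 0, else reject (0)."""
    score = point["income_k"] - 16 * point["missed_payments"] - 30
    return 1 if score >= 0 else 0


def counterfactuals(point, steps, max_steps=20):
    """For each feature, walk in the given direction until the prediction flips."""
    original = predict(point)
    found = {}
    for feature, step in steps.items():
        candidate = dict(point)  # copy, so each feature is perturbed on its own
        for _ in range(max_steps):
            candidate[feature] += step
            if predict(candidate) != original:
                found[feature] = candidate[feature]
                break
    return found


applicant = {"income_k": 30, "missed_payments": 1}
# Allowed directions of change: earn more, or miss fewer payments.
flips = counterfactuals(applicant, steps={"income_k": 1, "missed_payments": -1})

print("original prediction:", predict(applicant))
for feature, value in flips.items():
    print(f"prediction flips when {feature} becomes {value}")
```

Each entry in flips is one counterfactual: a minimally changed version of the applicant that the model would have approved.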

2-dtreeviz

Whether you use XGBoost, Random Forest or CatBoost, your model is built from trees. Stop using sklearn’s ugly plot_tree function to display them. Instead, use dtreeviz. [Note: I love sklearn nonetheless]

This is what sklearn’s plot_tree function displays. Aside from a poor colour coding, you have to read box by box to distinguish between big and small leaves.
Output of dtreeviz on a Decision Tree built on the Titanic dataset. Incredibly transparent visualisation.
dtreeviz also has a specific visualisation for regressions, showing scatter plots and dotted lines for thresholds.

It can be a bit tedious to install it and make it work but it’s worthwhile. You won’t go back from it.

Model building

3-CatBoost

Developed by Yandex, this package is just mind-blowing in terms of accuracy, speed, and interpretability. I’ve written an article which summarises its capabilities. Data scientists usually know about XGBoost and may not be keen to learn about other similar boosting libraries like CatBoost or LightGBM. But CatBoost has so many cool additional features (visualisation of training, interpretability tools…) that it would be a shame to stick to XGBoost by default.

4-plaidml


Who said you had to buy expensive GPUs from Nvidia to do deep learning? As a deep learning newbie myself, I don’t want to buy expensive GPUs just for the purpose of learning. And while Google Colab provides a good alternative, I still don’t want to rely on my internet connection and Google to do deep learning.

So here’s PlaidML. Developed by Intel, PlaidML aims at making deep learning accessible to people and organisations who don’t have expensive GPUs at hand. You’ll be able to run Deep learning models with your MacBook Pro (Yay! That’s what I’ve got!)

One important note: the community using it doesn’t seem big today, and it is a bit of a challenge to set up, as I detail below.

I’ve tried to set it up for my Jupyter Notebook and had to overcome some difficulties (like this). Basically, I’ve managed to make it work by creating a custom environment and then launching a Jupyter Notebook from it. Here are the detailed steps:

1. From JupyterLab’s Terminal, create an environment for plaidML and activate it.

conda create --name plaidML
conda activate plaidML

2. Install plaidML package

pip install plaidml-keras plaidbench

3. Set up your preferences for your existing hardware

plaidml-setup

You will have to choose between Default and Experimental devices. I chose one of the Default devices.

4. Open a Jupyter Notebook from your regular Jupyter Notebook screen, or using JupyterLab.

From Jupyter Notebook starting window, click on ‘New’ and click ‘plaidml’. This will open a Notebook with your plaidml environment you just set up.

You can also change the environment (or Kernel) in a notebook that is running and select your newly created plaidml env.


5. Now you can use plaidML in your notebook using the 2 lines of code:

import os
os.environ["KERAS_BACKEND"] = "plaidml.keras.backend"

Test it with

import keras

Which should output something like this:

Image for post

And here you go! Happy Deep learning 😉

Model industrialization

5-mlflow

“An open source platform for the machine learning lifecycle”, MLflow allows data scientists to run experiments, deploy their models as APIs and manage their models in a centralised registry. Although it is tailored towards the engineering side (you have to use the command line a lot), it’s still quite simple for data scientists to use.

The next step is definitely to set it up on AWS or a cloud provider to see if I — a data scientist with no computer science background — can make it work.

I was able to test this library by running it locally on my machine. There are two features I love because they are so simple compared to how I used to do these things manually:

  • running a model with varying parameters and visualising the accuracy with a heatmap / “contour plot” to identify the best combination of parameters:
Comparing 10 models (same model but different set of parameters) with the “Contour Plot” chart
  • deploying your model as an API:
Deploying your model as an API is as simple as using a command line: > mlflow models serve. I test the API with a fake data point and it returns the prediction instantly, like magic!
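
The first feature boils down to a parameter sweep with one record per run. The sketch below shows that bookkeeping in plain Python. It deliberately does not use mlflow's API, and validation_score is a hypothetical stand-in for training and evaluating a real model. A tracking tool stores these records for you and draws the comparison charts on top of them.

```python
# Sketch of the bookkeeping an experiment tracker automates: one record per run.
# Not mlflow's API; `validation_score` is a hypothetical stand-in for real training.
from itertools import product


def validation_score(max_depth, learning_rate):
    """Pretend metric that peaks at max_depth=4, learning_rate=0.1."""
    return 0.9 - 0.01 * abs(max_depth - 4) - abs(learning_rate - 0.1)


grid = {"max_depth": [2, 4, 6], "learning_rate": [0.01, 0.1, 0.3]}

runs = []
for max_depth, learning_rate in product(grid["max_depth"], grid["learning_rate"]):
    runs.append({
        "params": {"max_depth": max_depth, "learning_rate": learning_rate},
        "accuracy": validation_score(max_depth, learning_rate),
    })

# The "best combination of parameters" is simply the run with the highest metric.
best = max(runs, key=lambda run: run["accuracy"])
print(len(runs), "runs; best params:", best["params"])
```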

6-Kedro

Kedro is an open-source Python framework for creating reproducible, maintainable and modular data science code. It borrows concepts from software engineering and applies them to machine-learning code; applied concepts include modularity, separation of concerns and versioning.
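
To give a feel for what “modularity” and “separation of concerns” mean here, below is a minimal sketch of a pipeline made of small pure functions wired together by dataset names. It is not Kedro's API: Node, run_pipeline and the dataset names are hypothetical, chosen only to illustrate the idea.

```python
# Minimal sketch of the "pipeline of small pure functions" idea.
# This is NOT Kedro's API: Node, run_pipeline and the dataset names are hypothetical.
from typing import Callable, NamedTuple


class Node(NamedTuple):
    func: Callable  # pure function doing one job
    inputs: list    # names of the datasets the function reads
    output: str     # name of the dataset the function produces


def clean(raw):
    """Drop missing values."""
    return [x for x in raw if x is not None]


def scale(cleaned):
    """Rescale values to the 0-1 range."""
    lo, hi = min(cleaned), max(cleaned)
    return [(x - lo) / (hi - lo) for x in cleaned]


def run_pipeline(nodes, catalog):
    """Run nodes in order, reading and writing datasets by name in `catalog`."""
    for node in nodes:
        args = [catalog[name] for name in node.inputs]
        catalog[node.output] = node.func(*args)
    return catalog


pipeline = [
    Node(clean, ["raw_data"], "clean_data"),
    Node(scale, ["clean_data"], "model_input"),
]
catalog = run_pipeline(pipeline, {"raw_data": [2, None, 4, 6]})
print(catalog["model_input"])
```

Each function can be tested on its own, and the data loading details live in the catalog rather than inside the functions.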


I’ve spent a bit of time on it but I haven’t used it for professional projects. To my understanding, Kedro is mostly aimed at engineers who are focused on designing maintainable data science code.


The GitHub repo is available here: https://github.com/quantumblacklabs/kedro

7-sklearn-pandas

Have you tried using sklearn with pandas and realized (the hard way, trying to solve weird errors) that sklearn has been designed for numpy? Then sklearn-pandas is for you!

I’ve talked about it in a previous article. Basically, it saves you from wasting time converting dataframes to numpy arrays and back again to dataframes to make your sklearn pipeline work.
https://towardsdatascience.com/media/4a5381d6a406aa4df06fc42ef8cb63d0

ML Apps

8-Streamlit

Streamlit is, how to put it, redefining the “soft skills” a data scientist should have 🙂

Indeed, it allows a data scientist to create beautiful apps in Python that non-data-scientists can use to try out a machine learning model.

If you haven’t tried Streamlit before, I recommend going to their website and starting to use it ASAP. The gallery section is highly inspirational.


To me, this is the new standard for delivering an ML model. The data scientist should not only hand over the model file to the data engineering team but, most importantly, hand over a demo app so the business team can figure it out.

The best feature in Streamlit to me? The extension with Spacy! You can create beautiful apps interacting with a NER and other NLP models. And it’s a piece of cake 🍪

Data Auditing

9-pandas_profiling

I’ve talked about it in a previous article, so I won’t dive too deep here. This is now the second package I load, right after running import pandas as pd.


Note you may run into some trouble installing it or running it properly, since it relies on various other packages.

Thanks for reading!

Author Bio:

Félix Revert
PM @Doctolib after 5 years as data scientist. Worked for DataRobot, Capgemini & Accenture
