Having some free time these past weeks, I went through the resources that had piled up in my backlog. And I was astonished: there are so many cool, and not necessarily well-known, machine learning packages that never made it to the top of the charts. They remain niche even though they tackle fundamental challenges in machine learning.
So here they are. 9 really useful packages in ML to learn and implement at work: interpretML, dtreeviz, CatBoost, plaidML, mlflow, kedro, sklearn_pandas, Streamlit, pandas_profiling.
They are presented within the following topics:
- Model interpretation (interpretML, dtreeviz)
- Model building (CatBoost, plaidML)
- Model industrialization (mlflow, kedro, sklearn-pandas)
- Apps creation (Streamlit)
- Data auditing (pandas_profiling)
Developed by Microsoft as an open source project, InterpretML is “a toolkit to help understand models and enable responsible machine learning”. 4 packages are part of this project, among which 2 are my favorite:
interpret (2.7k stars on GitHub) allows you to build a GA2M model (coined “Explainable Boosting Machine” or “EBM”) for classification or regression problems. GA2M stands for “Generalized Additive Models plus Interactions” (the “2” refers to the pairwise interaction terms), a family of models that are both highly accurate and highly interpretable.
This family of models is almost as accurate as classic Gradient Boosted Trees models (like LightGBM, CatBoost or XGBoost) but far more interpretable. In most business problems, interpretability is key, but it usually comes at the cost of accuracy compared with what advanced models could achieve. GA2Ms let you keep a high accuracy together with a relatively high degree of interpretability. I expect them to become a standard in the machine learning community, in the sense that data scientists will reach for a GA2M where they would previously have defaulted to XGBoost.
More on the topic of GA2M: https://blog.fiddler.ai/2019/06/a-gentle-introduction-to-ga2ms-a-white-box-model/
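To make the EBM idea concrete, here is a minimal sketch, assuming the `interpret` package is installed. The breast cancer dataset from scikit-learn is only a stand-in, and the function name `train_ebm` and the parameter values are my own illustrative choices, not something taken from the InterpretML docs.

```python
def train_ebm(max_interactions=5, seed=0):
    """Fit an Explainable Boosting Machine and return (model, test accuracy)."""
    # Third-party imports live inside the function so the file still loads
    # on a machine where `interpret` is not installed.
    from interpret.glassbox import ExplainableBoostingClassifier
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )

    # `interactions` caps the number of pairwise terms the model may add
    # on top of the one-feature-at-a-time terms.
    ebm = ExplainableBoostingClassifier(
        interactions=max_interactions, random_state=seed
    )
    ebm.fit(X_train, y_train)
    return ebm, ebm.score(X_test, y_test)
```

After `ebm, acc = train_ebm()`, calling `ebm.explain_global()` returns the per-feature contributions, and in a notebook `interpret.show(...)` renders them as interactive charts.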
interpret-community (only 123 stars on GitHub) is a project bringing AzureML’s interpretability capabilities to the open source community. Let’s see the most interesting feature of this package: the fabulous TabularExplainer class that allows you to:
- plot 2 variables with colour coding on a 3rd variable (a variable can also be the prediction of the model)
- plot the overall most predictive variables
- plot the local explanations (reasons for individual predictions taken separately)
- run simulations using a what-if scenario logic
And this on any given model! [Note the UI is very Microsoft-friendly, which is what you get from open-source developed by private companies 🙂 ]
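Here is a rough sketch of the global and local explanations listed above, assuming `interpret-community` is installed (it exposes `TabularExplainer` under `interpret.ext.blackbox`). The random forest, the dataset and the helper name `explain_any_model` are placeholders I picked for illustration; any fitted scikit-learn style model should be usable in its place.

```python
def explain_any_model(n_local=3, seed=0):
    """Return (global feature importances, local explanation) for a demo model."""
    # Imports are local so the snippet loads without interpret-community.
    from interpret.ext.blackbox import TabularExplainer
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    data = load_breast_cancer()
    X_train, X_test, y_train, y_test = train_test_split(
        data.data, data.target, test_size=0.25, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X_train, y_train)

    # The explainer wraps the fitted model together with its training data.
    explainer = TabularExplainer(
        model,
        X_train,
        features=list(data.feature_names),
        classes=list(data.target_names),
    )
    # Overall most predictive variables, computed on the test set.
    global_explanation = explainer.explain_global(X_test)
    # Reasons behind a handful of individual predictions.
    local_explanation = explainer.explain_local(X_test[:n_local])
    return global_explanation.get_feature_importance_dict(), local_explanation
```

The returned dictionary maps each feature name to its overall importance, which is the same information the dashboard plots.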
The last 2 packages of the InterpretML project, interpret-text and DiCE, respectively help interpret text classification models and create “Diverse Counterfactual Explanations” (perturbations of a data point that flip the prediction). Both look really cool, even though I haven’t had time to investigate them in depth yet.
Whether you use XGBoost, Random Forest or CatBoost, your model has trees built in it. Stop using sklearn’s ugly plot_tree function to display them. Instead, use dtreeviz. [Note I love sklearn nonetheless]
It can be a bit tedious to install and get working, but it’s worth it. You won’t go back.
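For reference, a minimal sketch of what using it looks like, assuming dtreeviz 2.x (older releases exposed a different entry point) and a single scikit-learn decision tree on the iris dataset. The helper name `build_tree_view` is mine.

```python
def build_tree_view(max_depth=3, seed=0):
    """Fit a small decision tree and wrap it in a dtreeviz model object."""
    # Imports are local so the snippet loads without dtreeviz installed.
    import dtreeviz  # assumption: dtreeviz >= 2.0
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    iris = load_iris()
    clf = DecisionTreeClassifier(max_depth=max_depth, random_state=seed)
    clf.fit(iris.data, iris.target)

    # dtreeviz needs the training data to draw the per-node distributions.
    return dtreeviz.model(
        clf,
        X_train=iris.data,
        y_train=iris.target,
        feature_names=list(iris.feature_names),
        target_name="species",
        class_names=list(iris.target_names),
    )
```

Rendering is then `build_tree_view().view().save("iris_tree.svg")`, which also requires the Graphviz binaries on your machine (usually the tedious part of the install).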
Developed by Yandex, CatBoost is just mind-blowing in terms of accuracy, speed, and interpretability. I’ve written an article which summarises its capabilities. Data scientists usually know about XGBoost and may not be keen to learn about other similar boosting libraries like CatBoost or LightGBM. But CatBoost has so many cool additional features (visualisation of training, interpretability tools…) that it would be a shame to stick to XGBoost by default.
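If you have never tried it, a first run can be as short as the sketch below. It assumes `catboost` and scikit-learn are installed; the dataset, the hyperparameter values and the function name `train_catboost` are illustrative choices of mine.

```python
def train_catboost(seed=0):
    """Train a CatBoost classifier; return (model, accuracy, importances)."""
    # Imports are local so the snippet loads without catboost installed.
    from catboost import CatBoostClassifier
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )

    model = CatBoostClassifier(
        iterations=200,
        learning_rate=0.1,
        random_seed=seed,
        verbose=0,                  # silence the per-iteration log
        allow_writing_files=False,  # do not create a catboost_info/ folder
    )
    model.fit(X_train, y_train)

    # Built-in interpretability: one importance score per input feature.
    importances = dict(zip(X.columns, model.get_feature_importance()))
    return model, model.score(X_test, y_test), importances
```

In a notebook, passing `plot=True` to `fit` displays the live training visualisation mentioned above.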
Who said you had to buy expensive GPUs from Nvidia to do deep learning? As a deep learning newbie myself, I don’t want to buy expensive GPUs just for the purpose of learning. And while Google Colab provides a good alternative, I still don’t want to rely on my internet connection and Google to do deep learning.
So here’s PlaidML. Developed by Intel, PlaidML aims at making deep learning accessible to people and organisations who don’t have expensive GPUs at hand. You’ll be able to run Deep learning models with your MacBook Pro (Yay! That’s what I’ve got!)
One important note: the community using it doesn’t seem big today, and setting it up is a bit of a challenge, which I detail below.
I’ve tried to set it up for my Jupyter Notebook and had to overcome some difficulties (like this). Basically, I’ve managed to make it work by creating a custom environment and then launching a Jupyter Notebook from it. Here are the detailed steps:
1. From JupyterLab’s Terminal, create an environment for plaidML and activate it.
conda create --name plaidML
conda activate plaidML
2. Install plaidML package
pip install plaidml-keras plaidbench
3. Set up your preferences for your existing hardware
You will have to choose between Default or Experimental devices. I chose one hardware among the Default devices.
4. Open a Jupyter Notebook from your regular Jupyter Notebook screen, or using JupyterLab.
You can also change the environment (or Kernel) in a notebook that is running and select your newly created plaidml env.
5. Now you can use plaidML in your notebook using these 2 lines of code, run before Keras is imported:
import os
os.environ["KERAS_BACKEND"] = "plaidml.keras.backend"
Test it with
Which should output something like this:
And here you go! Happy Deep learning 😉
“An open source platform for the machine learning lifecycle”, MLflow allows data scientists to run experiments, deploy their models as APIs and manage them in a centralised registry. Although it is tailored towards the engineering side (you have to use the command line a lot), it’s still quite simple to use for data scientists.
The next step is definitely to set it up on AWS or a cloud provider to see if I — a data scientist with no computer science background — can make it work.
I was able to test this library by running it locally on my machine. Here are two features that I love because they are so simple compared to how I used to do things manually:
- running a model with varying parameters and visualising the accuracy with a heatmap / “contour plot” to identify the best combination of parameters:
- deploying your model as an API:
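To give an idea of the first feature, here is a minimal sketch of a parameter sweep logged with MLflow’s tracking API, assuming `mlflow` and scikit-learn are installed. The ElasticNet model, the diabetes dataset, the parameter grids and the function name `run_parameter_sweep` are illustrative assumptions of mine.

```python
def run_parameter_sweep(alphas=(0.1, 0.5, 1.0), l1_ratios=(0.1, 0.5, 0.9), seed=0):
    """Log one MLflow run per (alpha, l1_ratio) pair; return the results."""
    # Imports are local so the snippet loads without mlflow installed.
    import mlflow
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import ElasticNet
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    X, y = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)

    results = []
    for alpha in alphas:
        for l1_ratio in l1_ratios:
            # Each combination becomes its own run in the tracking store.
            with mlflow.start_run():
                model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=seed)
                model.fit(X_train, y_train)
                rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
                mlflow.log_param("alpha", alpha)
                mlflow.log_param("l1_ratio", l1_ratio)
                mlflow.log_metric("rmse", rmse)
                results.append((alpha, l1_ratio, rmse))
    return results
```

Once the runs are logged, `mlflow ui` starts the local web interface, where you can select them and compare the logged parameters against the metric.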
Kedro is an open-source Python framework for creating reproducible, maintainable and modular data science code. It borrows concepts from software engineering and applies them to machine-learning code; applied concepts include modularity, separation of concerns and versioning.
I’ve spent a bit of time on it but I haven’t used it for professional projects. To my understanding, Kedro is mostly aimed at engineers who are focused on designing maintainable data science code.
Github repo is available here: https://github.com/quantumblacklabs/kedro
Have you tried using sklearn with pandas and realized (the hard way, trying to solve weird errors) that sklearn has been designed for numpy? Then sklearn-pandas is for you!
I’ve talked about it in a previous article. Basically, it helps you not waste any more time converting dataframes to numpy arrays and back again to dataframes to make your sklearn pipeline work.
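As a quick illustration, here is a minimal sketch using `DataFrameMapper`, assuming `sklearn_pandas`, pandas and scikit-learn are installed. The toy dataframe and the function name `scale_and_encode` are made up for the example.

```python
def scale_and_encode():
    """Apply per-column transformers and get a DataFrame back."""
    # Imports are local so the snippet loads without sklearn_pandas installed.
    import pandas as pd
    from sklearn.preprocessing import LabelBinarizer, StandardScaler
    from sklearn_pandas import DataFrameMapper

    df = pd.DataFrame(
        {
            "city": ["Paris", "Lyon", "Nice", "Paris"],
            "age": [31.0, 45.0, 27.0, 52.0],
        }
    )

    mapper = DataFrameMapper(
        [
            # A plain string passes the column as a 1-D array.
            ("city", LabelBinarizer()),
            # A list passes a 2-D array, which StandardScaler expects.
            (["age"], StandardScaler()),
        ],
        df_out=True,  # return a DataFrame instead of a numpy array
    )
    return mapper.fit_transform(df)
```

The result keeps column names, so the next step of your pipeline can keep working with a dataframe.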
Streamlit is, how to put it, redefining the “soft skills” a data scientist should have 🙂
Indeed, it allows a data scientist to create beautiful apps in Python that non-data scientists can use to try out a machine learning model.
To me, this is the new standard for delivering an ML model. The data scientist should not only hand over the model file to the data engineering team but, most importantly, hand a demo app to the business team so they can explore the model themselves.
The best feature in Streamlit to me? The extension with spaCy! You can create beautiful apps interacting with an NER model and other NLP models. And it’s a piece of cake 🍪
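To show how little code a demo app takes, here is a minimal sketch assuming `streamlit` is installed. The pricing formula in `estimate_price` is a hypothetical stand-in for a real model’s `predict()` call, and all names are mine.

```python
def estimate_price(surface_m2, rooms):
    """Hypothetical stand-in for a trained model's prediction."""
    return 3000.0 * surface_m2 + 10000.0 * rooms


def run_app():
    """Render a tiny what-if app around estimate_price."""
    # Imported here so the pure function above works without streamlit.
    import streamlit as st

    st.title("Flat price estimator (demo)")
    # Widgets return the current value each time the script reruns.
    surface = st.slider("Surface (m²)", min_value=10, max_value=200, value=50)
    rooms = st.number_input("Rooms", min_value=1, max_value=10, value=2)
    st.write(f"Estimated price: {estimate_price(surface, rooms):,.0f} EUR")
```

Save this as `app.py`, add a final `run_app()` call at the bottom of the file, and launch it with `streamlit run app.py`. Business users then move the sliders instead of reading your notebook.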
I’ve talked about pandas_profiling in a previous article so I won’t dive too deep here. It is now the second package I load, right after running import pandas as pd.
Note you may run into some trouble installing it or running it properly, since it relies on various other packages.
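For completeness, a minimal sketch of generating a report, assuming `pandas_profiling` is installed under that import name (the project has since been renamed ydata-profiling, so the import may differ on recent installs). The toy dataframe and helper names are mine.

```python
def make_demo_frame():
    """Small dataframe with a numeric, a categorical and a missing value."""
    import pandas as pd

    return pd.DataFrame(
        {
            "age": [31, 45, 27, None],
            "city": ["Paris", "Lyon", "Nice", "Paris"],
        }
    )


def profile_frame(df, title="Data audit"):
    """Build a profiling report object for the given dataframe."""
    # Imported here so the file loads without pandas_profiling installed.
    from pandas_profiling import ProfileReport

    # minimal=True skips the most expensive computations.
    return ProfileReport(df, title=title, minimal=True)
```

`profile_frame(make_demo_frame()).to_file("report.html")` then writes a standalone HTML report you can share.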
Thanks for reading!
PM @Doctolib after 5 years as data scientist. Worked for DataRobot, Capgemini & Accenture