Manage your Machine Learning Lifecycle with MLflow — Part 1.
Reproducibility, good management, and experiment tracking are necessary to make it easy to test and build on other people's work and analyses. In this first part we will start learning, with simple examples, how to record and query experiments, and how to package Machine Learning models so they are reproducible and can run on any platform using MLflow.
The Machine Learning Lifecycle Conundrum
Machine Learning (ML) is not easy, but creating a good workflow that you can reproduce, revisit, and deploy to production is even harder. There have been many advances towards creating a good platform or managed solution for ML. Note that this is not the Data Science (DS) lifecycle, which is more complex and has many more parts.
The ML lifecycle exists inside the DS lifecycle.
You can check some of the projects for creating ML workflows here:
pachyderm - Reproducible Data Science at Scale! github.com
mleap - MLeap: Deploy Spark Pipelines to Production github.com
These packages are great, but not so easy to follow. Maybe the solution is a mix of them, or something along those lines. Here I'll present the latest solution in this space, created by Databricks: MLflow.
Getting started with MLflow
MLflow is an open source platform for the complete machine learning lifecycle.
MLflow is designed to work with any ML library, algorithm, deployment tool or language. It is very easy to add MLflow to your existing ML code so you can benefit from it immediately, and to share code using any ML library that others in your organization can run. MLflow is also an open source project that users and library developers can extend.
Installing MLflow is very easy, you just have to run:
pip install mlflow
That is according to the creators, but I faced several issues while installing it. So here are my recommendations (if you can run mlflow in your terminal after installing, skip this section):
From Databricks: MLflow cannot be installed on the macOS system installation of Python. We recommend installing Python 3 through the Homebrew package manager using brew install python. (In that case, the install command becomes pip3 install mlflow.)
That did not work for me and I got this error:
~ ❯ mlflow
Traceback (most recent call last):
  File "/usr/bin/mlflow", line 7, in <module>
    from mlflow.cli import cli
  File "/usr/lib/python3.6/site-packages/mlflow/__init__.py", line 8, in <module>
    import mlflow.projects as projects  # noqa
  File "/usr/lib/python3.6/site-packages/mlflow/projects.py", line 18, in <module>
    from mlflow.entities.param import Param
  File "/usr/lib/python3.6/site-packages/mlflow/entities/param.py", line 2, in <module>
    from mlflow.protos.service_pb2 import Param as ProtoParam
  File "/usr/lib/python3.6/site-packages/mlflow/protos/service_pb2.py", line 127, in <module>
    options=None, file=DESCRIPTOR),
TypeError: __init__() got an unexpected keyword argument 'file'
Solving that was not straightforward (I'm using macOS, by the way). I needed to update the protobuf library, so I installed Google's protobuf library from source:
protobuf - Protocol Buffers - Google's data interchange format github.com
Download the 3.5.1 version (I had 3.3.1 before) and follow the installation steps described in the protobuf repository, or try installing it through Homebrew.
If your installation works, run mlflow and you should see this:
Usage: mlflow [OPTIONS] COMMAND [ARGS]...

Options:
  --version  Show the version and exit.
  --help     Show this message and exit.

Commands:
  azureml      Serve models on Azure ML.
  download     Downloads the artifact at the specified DBFS...
  experiments  Tracking APIs.
  pyfunc       Serve Python models locally.
  run          Run an MLflow project from the given URI.
  sagemaker    Serve models on SageMaker.
  sklearn      Serve SciKit-Learn models.
  ui           Run the MLflow tracking UI.
Quickstart with MLflow
Now that you have MLflow installed let’s run a simple example.
from mlflow import log_metric, log_param, log_artifact

if __name__ == "__main__":
    # Log a parameter (key-value pair)
    log_param("param1", 5)

    # Log a metric; metrics can be updated throughout the run
    log_metric("foo", 1)
    log_metric("foo", 2)
    log_metric("foo", 3)

    # Log an artifact (output file)
    with open("output.txt", "w") as f:
        f.write("Hello world!")
    log_artifact("output.txt")
Save that to train.py and then run it with python train.py.
You will see the following:
And that's it? Nope. MLflow also gives you a UI, which you can launch by simply running mlflow ui.
And you will see (localhost:5000 by default):
So what have we done so far? If you look at the code you'll see we used three things: log_param, log_metric and log_artifact. The first logs the passed-in parameter under the current run (creating a run if necessary), the second logs the passed-in metric under the current run (again creating a run if necessary), and the last logs a local file or directory as an artifact of the currently active run.
So with this simple example we learned how to save the log of params, metrics and files in our lifecycle.
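By default, MLflow stores all of this as plain files under an mlruns/ directory in your working directory. The sketch below is my own toy illustration of that file-per-param, line-per-metric-value idea, not MLflow's actual code, and the exact layout is an assumption:

```python
import tempfile
from pathlib import Path

# Toy sketch of a file-based tracking store, in the spirit of MLflow's
# local "mlruns" backend (the exact layout is an assumption).
run_dir = Path(tempfile.mkdtemp()) / "mlruns" / "0" / "run-001"

# A param is a single key-value pair: one file holding one value.
params = run_dir / "params"
params.mkdir(parents=True)
(params / "param1").write_text("5")

# A metric keeps its full history: one line per logged value.
metrics = run_dir / "metrics"
metrics.mkdir()
(metrics / "foo").write_text("1\n2\n3\n")

# Querying a run is then just reading those files back.
print((params / "param1").read_text())        # 5
print((metrics / "foo").read_text().split())  # ['1', '2', '3']
```

This file-based approach is why the UI can reconstruct a metric's whole history: every logged value is appended rather than overwritten.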
If we click on the date of the run, we can see more about it.
Now if we click the metric, we can see how it got updated through the run:
And if we click the artifact we can see a preview of it:
The MLflow Tracking component lets you log and query experiments using either REST or Python.
Each run records the following information:
Code Version: Git commit used to execute the run, if it was executed from an MLflow Project.
Start & End Time: Start and end time of the run.
Source: Name of the file executed to launch the run, or the project name and entry point for the run if the run was executed from an MLflow Project.
Parameters: Key-value input parameters of your choice. Both keys and values are strings.
Metrics: Key-value metrics where the value is numeric. Each metric can be updated throughout the course of the run (for example, to track how your model's loss function is converging), and MLflow will record and let you visualize the metric's full history.
Artifacts: Output files in any format. For example, you can record images (for example, PNGs), models (for example, a pickled SciKit-Learn model) or even data files (for example, a Parquet file) as artifacts.
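To make the difference between these fields concrete, here is a toy run record in plain Python (my own illustration, not MLflow's internal data model): params store a single string value, while metrics accumulate their whole history:

```python
# Toy run record (illustrative only, not MLflow's internal representation).
run = {
    "source": "train.py",
    "params": {},    # key -> single string value
    "metrics": {},   # key -> full numeric history
    "artifacts": [], # paths of logged output files
}

def log_param(run, key, value):
    # Both keys and values are stored as strings.
    run["params"][key] = str(value)

def log_metric(run, key, value):
    # Metrics can be updated throughout the run, so every value is kept.
    run["metrics"].setdefault(key, []).append(float(value))

log_param(run, "alpha", 0.5)
for loss in (0.9, 0.6, 0.4):  # e.g. a loss function converging
    log_metric(run, "loss", loss)

print(run["params"]["alpha"])   # 0.5 (stored as a string)
print(run["metrics"]["loss"])   # [0.9, 0.6, 0.4]
```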
Runs can optionally be organized into experiments, which group together runs for a specific task. You can create an experiment via the mlflow experiments CLI, with mlflow.create_experiment(), or via the corresponding REST parameters.
mlflow experiments create face-detection
# Prints "Created an experiment with ID <id>"

# Set the ID via an environment variable
export MLFLOW_EXPERIMENT_ID=<id>
And then you just launch an experiment:
# Launch a run. The experiment ID is inferred from the MLFLOW_EXPERIMENT_ID environment variable
Example of Tracking:
A simple example using the Wine Quality dataset: two datasets are included, related to red and white vinho verde wine samples from the north of Portugal. The goal is to model wine quality based on physicochemical tests.
First download this file:
And then in the same folder create the file train.py with the following content (I've included the imports, the eval_metrics helper, the fit call and the MLflow logging calls so the script runs end to end):

import sys
import numpy as np
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def eval_metrics(actual, pred):
    return (np.sqrt(mean_squared_error(actual, pred)),
            mean_absolute_error(actual, pred),
            r2_score(actual, pred))

# Read the wine-quality csv file
data = pd.read_csv("wine-quality.csv")

# Split the data into training and test sets. (0.75, 0.25) split.
train, test = train_test_split(data)

# The predicted column is "quality" which is a scalar from [3, 9]
train_x = train.drop(["quality"], axis=1)
test_x = test.drop(["quality"], axis=1)
train_y = train[["quality"]]
test_y = test[["quality"]]

alpha = float(sys.argv[1]) if len(sys.argv) > 1 else 0.5
l1_ratio = float(sys.argv[2]) if len(sys.argv) > 2 else 0.5

with mlflow.start_run():
    lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
    lr.fit(train_x, train_y)
    predicted_qualities = lr.predict(test_x)
    (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)

    print("Elasticnet model (alpha=%f, l1_ratio=%f):" % (alpha, l1_ratio))
    print("  RMSE: %s" % rmse)
    print("  MAE: %s" % mae)
    print("  R2: %s" % r2)

    mlflow.log_param("alpha", alpha)
    mlflow.log_param("l1_ratio", l1_ratio)
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("mae", mae)
    mlflow.log_metric("r2", r2)
    mlflow.sklearn.log_model(lr, "model")
Here we also test MLflow's integration with SciKit-Learn. After running it with python train.py (optionally passing alpha and l1_ratio as arguments) you will see this in the terminal:
Elasticnet model (alpha=0.500000, l1_ratio=0.500000):
  RMSE: 0.82224284976
  MAE: 0.627876141016
  R2: 0.126787219728
Then run mlflow ui from the same working directory, the one that contains the mlruns directory, and navigate your browser to http://localhost:5000. You will see:
And you will have this for each run, so you can track everything you do. The model also gets a pkl file and a YAML file for deployment, reproduction and sharing.
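For reference, that YAML file is called MLmodel and lists the "flavors" the saved model can be loaded with. Roughly, it looks like this (the field values below are placeholders based on my reading of the MLflow docs, so treat them as an approximation):

```yaml
artifact_path: model
flavors:
  python_function:
    loader_module: mlflow.sklearn
  sklearn:
    pickled_model: model.pkl
    sklearn_version: 0.19.1
```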
Stay tuned for more
In the next post I'll cover the Projects and Models APIs, with which we will be able to run these models in production and build out the full lifecycle.
Make sure to check the MLflow project for more:
mlflow - Open source platform for the complete machine learning lifecycle github.com
Thanks for reading this. I hope you found something interesting here :)
If you have questions just follow me on Twitter
Favio Vázquez (@FavioVaz) | Twitter
Favio Vázquez — Principal Data Scientist — OXXO | LinkedIn
See you there :)