Accelerating MLOps with the power of GPUs

MLOps at a glance

Rutik Pol
7 min read · Nov 22, 2021
Source: https://blogs.nvidia.com/blog/2020/09/03/what-is-mlops/

MLOps is a set of practices that aims to deploy and maintain machine learning models in production reliably and efficiently. The word is a compound of “machine learning” and the continuous development practice of DevOps in the software field.

DevOps vs MLOps

ML systems differ from conventional software solutions in the following ways:

  • CI (continuous integration) is no longer only about testing and validating code and components, but also about testing and validating data, data schemas, and models.
  • CD (continuous delivery) is no longer about a single software package or service, but a system (an ML training pipeline) that should automatically deploy another service (a model prediction service).
  • CT (continuous training) is a new property, unique to ML systems, that is concerned with automatically retraining and serving the models.

Benchmarking

To start, we created two notebooks, one with CPU support & another with GPU support, each containing the entire Data Science Pipeline in a single notebook. Running both produced the stats below comparing CPU vs GPU performance.

NYC Taxi Fare Prediction Pipeline Times, CPU vs GPU

The GPU wipes the floor with the CPU in this Pipeline Runtime Test. The CPU took over 5 hours for a task that the GPU finished in just over a minute, a whopping 267 times faster. And this is the best-case scenario for the CPU, since multithreading was utilized for both Data Wrangling & Training. A Data Scientist's work consists largely of experimenting with data, features & parameters, so such speedups in development can be the deciding factor in getting your product to market before your competitor. (A minimal sketch of the GPU pipeline follows the links below.)

Machine : NC24rs_v3 (24 vcpus, 448 GB RAM, 4 × NVIDIA Tesla V100)

Data : https://www.kaggle.com/c/new-york-city-taxi-fare-prediction/data

Code : https://github.com/rppol/MLOps_GPU/tree/main/notebooks
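For reference, the core of the GPU notebook looks roughly like the sketch below: dask_cudf for Data Wrangling and xgboost.dask for distributed training. This is a minimal sketch, assuming the competition's train.csv and a fare_amount target; the outlier removal, feature engineering & exact parameters live in the notebook itself.

import dask_cudf
from xgboost import dask as dxgb
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One Dask worker per GPU (4 x V100 on the NC24rs_v3)
cluster = LocalCUDACluster()
client = Client(cluster)

# Read the CSV straight into GPU memory, partitioned across workers
ddf = dask_cudf.read_csv("data/train.csv")

# (outlier removal & feature engineering omitted here)
X = ddf.drop(columns=["fare_amount"])
y = ddf["fare_amount"]

# DaskDMatrix shards the data across the GPU workers
dtrain = dxgb.DaskDMatrix(client, X, y)

# gpu_hist builds the trees on the GPUs
output = dxgb.train(
    client,
    {"objective": "reg:squarederror", "tree_method": "gpu_hist"},
    dtrain,
    num_boost_round=100,
)
booster = output["booster"]  # a plain xgboost.Booster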

Hands-On MLOps Implementation

Based on the above notebooks, we will now showcase how we can implement the MLOps lifecycle for our Data Science Pipeline.

We’ll be using RAPIDS & Dask here for GPU-based Data Wrangling & Training. Jenkins will be used for CI/CD pipelines & MLflow for Model Tracking.

Ideally there would be separate servers for Development & Production, but here we’ll use the same server for everything.

Basic knowledge of Linux & Python is required.

Environment Setup

You will need an NVIDIA GPU-equipped machine for this which satisfies these prerequisites. I’ve used the NC24rs_v3 VM with Ubuntu 20.04 on Azure.

  1. Setup VM & SSH into it
  2. Install Nvidia Drivers
  3. Install Docker
  4. Install Nvidia Container Toolkit
  5. Install & Configure Git
  6. Install & Start Jenkins
  7. Install & Set up Conda for Jenkins with these steps :

8. Build Directory Structure

  • Triton Inference Server Directory
  • Data Directory

9. Download Data

Ideally, production systems have an API feeding in data constantly, and one should use tools like DVC for data version control. Here we will be using static data from the New York City Taxi Fare Prediction competition. The training file is split in a 9:1 ratio into train.csv & test.csv respectively. Use the script below to download these from Google Drive (~5GB).
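This is a minimal sketch of such a download script, using the gdown package; the Drive file IDs are placeholders (assumptions), so substitute the real shared-file IDs.

import gdown

# Placeholder Google Drive file IDs; substitute the real ones
files = {
    "data/train.csv": "TRAIN_FILE_ID",
    "data/test.csv": "TEST_FILE_ID",
}

for path, file_id in files.items():
    gdown.download(f"https://drive.google.com/uc?id={file_id}", path, quiet=False)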

10. Install & Set up Triton Inference Server

We’re developing an XGBoost model. Triton has a FIL backend built on the RAPIDS Forest Inference Library for fast GPU-based inference.
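Triton serves models from a model repository: one directory per model, holding a config.pbtxt and numbered version folders. The sketch below lays one out for the FIL backend; the model name fare_predictor and the feature count are assumptions, while the config keys follow the FIL backend's documented format.

import os

MODEL_NAME = "fare_predictor"   # illustrative name
NUM_FEATURES = 14               # features after feature engineering (assumed)

repo = f"model_repository/{MODEL_NAME}"
os.makedirs(f"{repo}/1", exist_ok=True)  # version folder 1 will hold xgboost.model

config = f"""
backend: "fil"
max_batch_size: 32768
input [{{ name: "input__0", data_type: TYPE_FP32, dims: [ {NUM_FEATURES} ] }}]
output [{{ name: "output__0", data_type: TYPE_FP32, dims: [ 1 ] }}]
parameters [
  {{ key: "model_type", value: {{ string_value: "xgboost" }} }},
  {{ key: "output_class", value: {{ string_value: "false" }} }}
]
instance_group [{{ kind: KIND_GPU }}]
"""
with open(f"{repo}/config.pbtxt", "w") as f:
    f.write(config.strip())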

11. Open ports on VM

Code

GitHub Repository: https://github.com/rppol/MLOps_GPU

Structure:

  1. notebooks : These notebooks aren’t for deployment; they are for performance benchmarking & experimentation.
  • nyc-taxi-fare-cpu.ipynb : An intuitive CPU-only notebook built with pandas (Data Wrangling), modin (multithreading) & XGBoost (all-core training).
  • nyc-taxi-fare-gpu.ipynb : An intuitive GPU-only notebook built with dask_cudf (Data Wrangling) & xgboost.dask (distributed GPU-based training).

2. src : This folder contains all the source files. We have modularized the above notebook to make it easy to configure for MLOps deployment. Every major function is split into an intuitively named Python script, and every script supports independent execution.

  • read_params.py : Changing hardcoded values can be very annoying & error-prone. This Python script provides a helper function to read the params.yaml file, i.e. a one-stop destination to modify parameters.
  • dask_client.py : Builds the Dask client; this client is carried through the scripts, as creating a new one takes time. The maximum allocated GPU memory must be specified in the params.yaml file.
  • load_data.py : Helps in loading both Training & Test data into GPU memory. Uses dask_cudf to speed up & distribute reading. The “test” parameter must be set to true for reading test data. The relative path to the data directory & the file names must be provided in the params.yaml file.
  • feature_engg.py : This is a Data Cleaning, Manipulation & Feature Addition script. It removes outliers and unreasonable data and adds interesting features like distances to various landmarks.
  • split_data.py : Splits the data into training & validation sets. The split ratio can be varied using the params.yaml file.
  • generate_Dmatrix.py : Converts data into DMatrix, the format XGBoost can handle.
  • train.py : Training script for local use & troubleshooting. This isn’t used in the automated pipeline.
  • train_with_tracking.py : Model training script with MLflow tracking. The MLflow server must be running before this script is executed. It tracks all the parameters & the model built. It also tests the model on unseen data and tracks metrics. Visit this for a guide on fine-tuning XGBoost models.
  • log_production_model.py : This script finds the best model out of all versions by comparing rmse. It loads, converts & saves this model in the production_model folder for further processing. (A minimal sketch of this tracking-and-selection flow follows this list.)
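To make the tracking flow concrete, here is a minimal sketch of what train_with_tracking.py & log_production_model.py do together, using a tiny synthetic dataset in place of the real one; the tracking port matches this post's setup, but the experiment name, parameters & paths are illustrative assumptions, not the repo's exact code.

import numpy as np
import xgboost as xgb
import mlflow
import mlflow.xgboost
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri("http://localhost:8003")  # MLflow port used in this post
mlflow.set_experiment("nyc-taxi-fare")            # illustrative experiment name

# Tiny synthetic stand-in for the real fare data
X = np.random.rand(1000, 4).astype(np.float32)
y = X.sum(axis=1)
dtrain = xgb.DMatrix(X, label=y)

params = {"objective": "reg:squarederror", "max_depth": 8, "learning_rate": 0.1}

# train_with_tracking.py: log parameters, metrics & the model itself
with mlflow.start_run():
    mlflow.log_params(params)
    booster = xgb.train(params, dtrain, num_boost_round=50)
    rmse = float(np.sqrt(((booster.predict(dtrain) - y) ** 2).mean()))
    mlflow.log_metric("rmse", rmse)
    mlflow.xgboost.log_model(booster, artifact_path="model")

# log_production_model.py: pick the run with the lowest rmse
client = MlflowClient()
exp = client.get_experiment_by_name("nyc-taxi-fare")
best = client.search_runs([exp.experiment_id], order_by=["metrics.rmse ASC"], max_results=1)[0]
best_model = mlflow.xgboost.load_model(f"runs:/{best.info.run_id}/model")
best_model.save_model("production_model/xgboost.model")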

3. test : Holds testing Python scripts for both the model & the server.

  • test_and_evaluate.py : Generates predictions on unseen data & compares them with actual values, producing metrics like Root Mean Squared Error, Mean Absolute Error & R-squared. The model can be passed as an argument, or it’s loaded from the production_model directory.
  • triton_inference_test.py : Tests whether the Triton server is working, using 1% of the data. Uses gRPC for networking. (A minimal sketch of such a request follows this list.)
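For reference, a minimal sketch of the kind of gRPC request triton_inference_test.py sends is shown below; the model name & feature count are assumptions, while input__0/output__0 are the FIL backend's conventional tensor names.

import numpy as np
import tritonclient.grpc as grpcclient

# Triton's gRPC endpoint (port 8001, as listed later in this post)
client = grpcclient.InferenceServerClient(url="localhost:8001")

batch = np.random.rand(64, 14).astype(np.float32)  # 64 rows, 14 features (assumed)

inp = grpcclient.InferInput("input__0", batch.shape, "FP32")
inp.set_data_from_numpy(batch)
out = grpcclient.InferRequestedOutput("output__0")

result = client.infer(model_name="fare_predictor", inputs=[inp], outputs=[out])
preds = result.as_numpy("output__0")
print(preds[:5])  # predicted fares for the first five rows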

4. scripts : Contains shell scripts executed by Jenkins.

5. artifacts : Stores models in MLflow format.

6. Jenkinsfile : Describes CI/CD Pipeline.

7. params.yaml : One-stop destination to control everything; stores configuration, parameters, paths, IPs & ports. (A minimal reader sketch follows.)
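A minimal sketch of the read_params.py idea: load params.yaml once and hand its sections to every script. The keys shown are illustrative assumptions, not the repo's exact schema.

import yaml

def read_params(config_path: str = "params.yaml") -> dict:
    """Load the whole params.yaml into a plain dict."""
    with open(config_path) as f:
        return yaml.safe_load(f)

config = read_params()
train_path = config["data"]["train_csv"]   # e.g. data/train.csv
split_ratio = config["split"]["ratio"]     # train/validation split
xgb_params = config["model"]["xgboost"]    # hyperparameters to tune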

Building CI/CD Pipeline

  1. Fork https://github.com/rppol/MLOps_GPU
  2. Log into Jenkins ( your_public_ip_here:8080 )

Username : admin

Password can be found using the following command (assuming the default Jenkins install path) :

sudo cat /var/lib/jenkins/secrets/initialAdminPassword

3. Add your GitHub credentials

4. Click New Item

5. Choose Pipeline, enter a valid name & click OK

6. Choose “Do not allow concurrent builds”, since a single build consumes all GPU resources

7. Scroll down to Pipeline & choose the following options :

  • Enter your repository’s URL
  • Attach the GitHub credentials created in Step 3
  • Choose the main branch

8. Click Apply & Save

This pipeline can be configured to build periodically, or to trigger builds on events like a git push or the availability of new data. To keep things simple, we’ll trigger it manually.

Running CI/CD Pipeline & Deployment

  1. Make sure Jenkins, MLflow Server & Triton are running

2. Servers can be accessed here :

  • Jenkins : your_ip_here:8080
  • JupyterLab : your_ip_here:9999/lab?
  • MLflow : your_ip_here:8003
  • Triton Inference Server : your_ip_here:8001
  • Triton Metrics : your_ip_here:8002

3. Executing Pipeline

  • Go to : your_ip_here:8080/job/your_pipeline_name/
  • Hit Build Now to start execution.
Jenkins Pipelines
  • Hover over each stage to check logs.
Console Output of train_test_log.sh in Build & Test Model Stage

4. MLflow Tracking

  • Each Pipeline Build generates a new model & MLflow provides an intuitive UI to monitor everything.
MLflow experiments
  • Select two or more runs and click Compare to view them side by side
Model Comparison in MLflow

Tuning Hyperparameters

  • Go to : your_ip_here:9999/lab?
  • Edit the values in params.yaml to your heart’s content
  • Visit this to tune hyperparameters like a pro
params.yaml

Conclusion

Through the above demonstration, it is now evident that an MLOps pipeline can be accelerated end-to-end on GPUs. The GPU VM costs 1.8× as much as the CPU VM, but with a speedup of around 267×, it pays for itself.

Data Scientists can spend more time on Model Optimization & Tuning rather than on Deployment & waiting while the CPU crunches numbers one by one.

LinkedIn : https://www.linkedin.com/in/rutikpol/
