Azure ML vs Databricks for deploying machine learning models

Azure Machine Learning (Azure ML) and Databricks Machine Learning (Databricks ML) are two popular cloud-based platforms for data scientists. Both offer a range of tools and services for building and deploying machine learning models at scale. In this blog post, we'll compare Azure ML and Databricks ML, examining their features and capabilities, and highlighting their differences. 


Experimentation

Azure ML
The Python API allows you to easily create experiments that you can then track from the UI, and you can do interactive runs from a Notebook. Logging metrics in these experiments still relies on the MLflow client.
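
As a rough sketch (assuming azureml-core, azureml-mlflow and mlflow are installed; the experiment name and metric values are made up), pointing the MLflow client at the workspace and logging from a notebook or script could look something like this:

```python
from azureml.core import Workspace
import mlflow

ws = Workspace.from_config()                           # reads the workspace config.json
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())  # point MLflow at the Azure ML workspace
mlflow.set_experiment("churn-baseline")                # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_param("max_depth", 6)
    mlflow.log_metric("accuracy", 0.91)
```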

Databricks ML
Creating experiments is also easy with the MLflow API and the Databricks UI. Tracking metrics is really nice with the MLflow API (so nice that Azure ML also uses this client for its model tracking).
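
A minimal sketch, assuming you're inside a Databricks notebook where the tracking server is pre-configured (experiment path and values are illustrative):

```python
import mlflow

# On Databricks the tracking server is already configured for the workspace.
mlflow.set_experiment("/Shared/churn-baseline")   # hypothetical experiment path

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("auc", 0.87)
```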

Winner
They are pretty much on par here, although the fact that Azure ML uses MLflow (a Databricks product) maybe gives the edge to Databricks.

Model Versioning

Azure ML
Easy out of the box. Every time you register a model without specifying the version, the version number is bumped automatically and every model version is stored in the model registry (docs). For endpoint deployments you can specify the version in the endpoint URL, so you can keep previous models running for backwards compatibility.
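
For illustration (model name and path are assumptions), registering without an explicit version could look like this:

```python
from azureml.core import Workspace
from azureml.core.model import Model

ws = Workspace.from_config()

# No version passed: Azure ML bumps the version number automatically.
model = Model.register(
    workspace=ws,
    model_name="churn-model",        # hypothetical name
    model_path="outputs/model.pkl",  # hypothetical path to the trained artifact
)
print(model.name, model.version)
```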

Databricks ML
Easy out of the box. Just use the MLflow API inside your code to register models into the MLflow Model Registry on Databricks. Endpoints also have flexible URLs that include versioning for backwards compatibility. It seems easier to stage a model version within the MLflow environment in the same Databricks workspace, whereas in Azure ML we currently have a dev and a prod service for this.
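
A small sketch of registering a logged model into the registry (the run ID placeholder and model name are illustrative):

```python
import mlflow

# Register a model logged in a previous run into the workspace Model Registry.
result = mlflow.register_model(
    model_uri="runs:/<run-id>/model",  # placeholder run ID
    name="churn-model",                # hypothetical registry name
)
print(result.version)                  # version is bumped automatically, as in Azure ML
```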

Winner
Pretty similar behaviour besides the small point regarding staging a model within a workspace. However, I don't think it's bad that Azure ML forces you to have different workspaces for model staging and production; this creates a clear separation of concerns.

Model Serving

Azure ML
Both batch prediction and deployment to an endpoint (Kubernetes or Container Instance) are supported. The endpoints are really easily configurable via YAML files. Moreover, it's really easy to deploy the model via CI/CD with the azureml.core package or via the Azure CLI.
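
As an illustration of the SDK route (model name, scoring script and environment file are assumptions, not taken from our actual setup), a CI/CD step could deploy a registered model to a Container Instance roughly like this:

```python
from azureml.core import Workspace, Environment
from azureml.core.model import Model, InferenceConfig
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()
model = Model(ws, name="churn-model", version=3)                               # hypothetical model

env = Environment.from_conda_specification("churn-env", "environment.yml")     # assumed conda file
inference_config = InferenceConfig(entry_script="score.py", environment=env)   # assumed scoring script
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=2)

service = Model.deploy(ws, "churn-endpoint", [model], inference_config, deployment_config)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)
```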

Databricks ML
Both batch prediction and deployment to an endpoint are supported. From the UI, deploying to an endpoint on Databricks seems fairly easy, but there's no clear documentation on how to do this programmatically via CI/CD. In fact, in the Databricks MLflow REST API docs, I cannot see any endpoint exposed that would let you serve a model.

Winner
Here I believe Azure ML definitely has the edge. It allows you to host your model in a much more structured way, offering a wider range of hosting options (Kubernetes, Container Instance) and much better possibilities to manage these endpoints via CI/CD. Since we deliberately avoid clicking things together in the UI, this is a big drawback for serving models in Databricks.

Model monitoring

Azure ML
Azure ML uses MLflow for this part and gathers the tracking information and artifacts in its UI (docs). MLflow was, however, originally created and is still maintained by people from Databricks (which maybe makes for a better integration on their side).

Databricks ML
You can track any run with the MLflow tracking API (see also the Databricks docs). This API allows you to log any metric (built-in or custom), which lets you analyse your runs' performance over time (both for training and batch scoring).
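
For example, a nightly batch-scoring run could log a few custom metrics so they can be compared over time in the MLflow UI (run name and metric names are made up):

```python
import mlflow

# Hypothetical nightly batch-scoring run logging custom health metrics.
with mlflow.start_run(run_name="batch-scoring-2024-01-15"):
    mlflow.log_metric("rows_scored", 1_250_000)
    mlflow.log_metric("mean_prediction", 0.23)
    mlflow.log_metric("null_feature_ratio", 0.004)
```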

Winner
The fact that MLflow is more native to Databricks may give Databricks the edge, but at the end of the day it is an open-source product, so anybody can build a nice integration with it. No clear winner in this one.

Supported file formats

Azure ML
Quite limited. If you want to register an Azure ML Dataset object, you are limited to the supported file formats. Moreover, Dataset type conversions have proved to be problematic (timestamp issue). Of course, you can always provide pointers (URIs) to files in the storage account and avoid the Azure ML Datasets problem altogether.
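
A small sketch of the Dataset route (datastore, path and dataset name are made up), which only works for the supported formats:

```python
from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()
datastore = Datastore.get(ws, "workspaceblobstore")   # default blob datastore

# Only works for supported file formats (here Parquet); path/name are illustrative.
tabular = Dataset.Tabular.from_parquet_files(path=(datastore, "curated/churn/*.parquet"))
tabular = tabular.register(workspace=ws, name="churn-features", create_new_version=True)
df = tabular.to_pandas_dataframe()
```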

Databricks ML
This is probably the biggest advantage of Databricks. Since we already use Delta Lake, reading data from a Delta table requires no additional effort. The table can easily be read with Spark and converted into a pandas.DataFrame, which is usually the preferred object for data scientists.
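
A minimal sketch of that flow inside a Databricks notebook (the table name is illustrative and `spark` is the notebook's built-in SparkSession):

```python
# `spark` is the SparkSession that Databricks notebooks provide; table name is illustrative.
features = spark.read.table("analytics.churn_features")
pdf = features.toPandas()   # pandas.DataFrame for single-node model training
```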

Winner
This is one of the biggest drawbacks of using Azure ML. The Dataset object has proven to be problematic and does not support Delta tables. You can always work around this by avoiding Datasets and reading files directly from storage, but the edge here clearly goes to Databricks.

Scale

Azure ML
In Azure ML you can easily select different compute targets via a run configuration. This makes it quite easy to switch from a local compute target to a bigger machine in the cloud. However, big compute on Spark clusters requires connectivity with either Databricks or Synapse (in preview).
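
For instance (script, cluster and experiment names are assumptions), switching compute is just a different `compute_target` in the run configuration:

```python
from azureml.core import Workspace, Experiment, ScriptRunConfig

ws = Workspace.from_config()

# Swapping compute is a one-line change: "local" vs. a registered AmlCompute cluster.
src = ScriptRunConfig(
    source_directory=".",
    script="train.py",             # assumed training script
    compute_target="cpu-cluster",  # hypothetical cluster name
)
run = Experiment(ws, "churn-training").submit(src)
run.wait_for_completion(show_output=True)
```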

Databricks ML
In Databricks, scale is native (since Databricks is Spark) and configuring big machines or clusters is considerably easy. However, using a big cluster to train models on a single node with pandas does not make sense, so a specific node should be provisioned for these workloads with the needed requirements (i.e. single machine, high memory, GPU if necessary, etc.).

Winner
The edge goes to Databricks for being Spark-native; however, this will most likely never be a problem with Azure ML. The main reason you might consider using Spark is batch model scoring of many millions of records, in which case being able to embed the scoring function in a UDF and letting Spark distribute it across a cluster can be helpful.
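
A rough sketch of that pattern using MLflow's Spark UDF helper (model URI, table and column names are assumptions):

```python
import mlflow.pyfunc

# Load a registered model as a Spark UDF and score a large table in parallel.
score_udf = mlflow.pyfunc.spark_udf(spark, model_uri="models:/churn-model/3")

scored = (
    spark.read.table("analytics.churn_features")
         .withColumn("prediction", score_udf("tenure", "monthly_charges"))
)
scored.write.format("delta").mode("overwrite").saveAsTable("analytics.churn_scores")
```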

Orchestration

Azure ML
Azure Data Factory has direct integration with Azure ML via the Machine Learning Execute Pipeline activity. This makes it pretty easy to use ADF for both training and scoring pipelines. Alternatively, you can use Azure ML's built-in scheduler, although it seems quite limited.
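
A sketch of what the built-in scheduler looks like (the pipeline ID, schedule name and experiment name are placeholders):

```python
from azureml.core import Workspace
from azureml.pipeline.core import Schedule, ScheduleRecurrence

ws = Workspace.from_config()

# Run a previously published pipeline every night at 02:00.
recurrence = ScheduleRecurrence(frequency="Day", interval=1, hours=[2], minutes=[0])
schedule = Schedule.create(
    ws,
    name="nightly-churn-scoring",
    pipeline_id="<published-pipeline-id>",   # placeholder
    experiment_name="churn-scoring",
    recurrence=recurrence,
)
```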

Databricks ML
For Databricks, you need to embed the batch scoring code in a Databricks Notebook and use the ADF integration via the Databricks Notebook activity. Alternatively, you can use Databricks Jobs, which even allow you to define dependencies (e.g. between the training pipeline and the scoring pipeline). You can run these jobs on a cron schedule or trigger them via the REST API from ADF.
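
For instance, a job could be triggered from ADF (e.g. via a Web activity) or any other orchestrator with a plain REST call; the host, token and job ID below are placeholders:

```python
import requests

resp = requests.post(
    "https://<databricks-instance>/api/2.1/jobs/run-now",        # placeholder host
    headers={"Authorization": "Bearer <personal-access-token>"},  # placeholder token
    json={"job_id": 1234},                                        # placeholder job ID
)
resp.raise_for_status()
print("run_id:", resp.json()["run_id"])
```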

Winner
Bearing in mind that ADF is being used for orchestration, the native integration gives Azure ML the edge.

Pipeline / Infra Monitoring

Azure ML
Really good and clear integration with Datadog that covers not only pipeline monitoring but also model deployment monitoring (e.g. model deployment failures).

Databricks ML
Databricks has cluster integration with Datadog, meaning Datadog will read the logs from either the driver or all the nodes in the cluster. There are no out-of-the-box metrics and no direct integration with, for example, Databricks Jobs. This means that metrics need to be defined either in the Datadog UI or by using a Datadog client library. Moreover, it is not clear how to implement this in the context of model deployments.

Winner
This is again a big win for Azure ML, which has a clear and meaningful integration covering all relevant metrics. For Databricks, most things would need to be custom.

Additional Observations

  • The MLflow API seems a bit more complex and developer-focused. Azure ML seems to expose more limited functionality, which makes it easier to use but maybe less feature-rich. I think for not-too-complex ML set-ups this can be an advantage for Azure ML.
  • MLflow docs are spread between the Azure Databricks documentation and the actual MLflow docs. This makes it a bit confusing for the MLflow user, since some things may be specific to the Databricks/MLflow integration and others may not.
  • It feels like configuring deployments is way easier in Azure ML (or the documentation around it is just better/easier to find).

