Azure vs. Databricks: When to choose Azure Batch instead of Databricks?

Mon Aug 2 2021

Technology

Azure devops

Databricks

Topic

Data Engineer

In need of compute power? Are you more attracted to a serverless, massively parallel processing framework running dockerized code than to porting all your work to Spark? Hang on! We’ve been using Azure Batch and found that, in our case, it offers good bang for the buck compared to Databricks, all while reducing coding effort.

In this blog, we begin by describing what Azure Batch is and why certain workloads fly better on Azure Batch than on Spark/Databricks. Next, we provide a "Hello World" that uses the Python API of Azure Batch to scale out your containerized workloads in a serverless and distributed fashion.

First things first: What is Azure Batch, and how is it used?

Azure Batch is a cloud platform that you can use to effectively provision a pool of Virtual Machines (VMs) and manage workloads to run on them. It is useful in a variety of high-performance computing scenarios, e.g. machine learning parameter search, 3D rendering using Blender, or big data processing jobs.

Using this cloud platform, you can have thousands of VMs, organized in pools with certain (scaling) properties. Jobs contain a collection of tasks, each of which gets dispatched to a VM. It comes with a handy user interface to monitor what is running. You pay roughly in proportion to core-minutes (cores multiplied by running time), at a rate that depends on the type of VM but not on the pool size, so one hour on a hundred workers may be just as cheap as ten hours on ten.
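To make the pricing arithmetic concrete, here is a minimal sketch. The price per VM-hour is a made-up placeholder, not an actual Azure rate, and the function name is ours; the only point is that the bill scales with total VM-hours, not with the size of the pool.

```python
def pool_cost(n_workers: int, hours: float, price_per_vm_hour: float) -> float:
    """Cost of running n_workers identical VMs for the given number of hours."""
    return n_workers * hours * price_per_vm_hour


# Hypothetical price per VM-hour; look up the real rate for your VM type.
PRICE = 0.25

wide_and_short = pool_cost(n_workers=100, hours=1, price_per_vm_hour=PRICE)
narrow_and_long = pool_cost(n_workers=10, hours=10, price_per_vm_hour=PRICE)

print(wide_and_short, narrow_and_long)
```

Both pools cost the same under this simple model, but the wide one delivers the result ten times sooner, provided the work splits cleanly over a hundred workers.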

There are different OS choices for Azure Batch. We love docker, so in this blog, we focus on the dockerized lift-and-shift scenario, where we dispatch containers with different parametrization. It somewhat overlaps with provisioning Azure Container Instances (ACIs), but then on a larger scale.

Should you use Azure Batch or something else, such as Spark? An example

Spark is an obvious, and often inevitable, solution for big data problems. For many of us, especially the ones on Azure, Databricks is the de-facto road to Spark. But it isn’t always the best option.

Let’s reason about the nature of problems a bit before we see why. For example, consider a biggish-data scenario (less than a trillion rows) that touches upon the following:

Your problem is embarrassingly parallel (i.e., you have a task for many independent parameters/combinations thereof). Your data is organized correspondingly such that intermediate shuffling isn’t necessary.
The problem isn’t a straightforward row-wise operation or map-reduce, but rather a convoluted aggregation that isn’t a great match for Spark (example given below). Each chunk is heavy enough to break small-scale serverless solutions (e.g., AWS Lambda, Azure Functions and the like), but you can fit it on a virtual machine.
You’ve invested in a software package, and you want to keep it for local usage. Porting it or keeping two versions of it isn’t desirable (e.g., because it means double work, the team isn’t Spark-savvy, or there is too little time).
Your codebase is rather complex, and packages such as Koalas only partly support your purposes while increasing overhead and code complexity.
You’re faced with dependency management and the need for fine-grained control. Fiddling with vendor’s base images to incorporate your libraries can be cumbersome, and dependencies do not always resolve.

We recently worked on a problem that ticked all the boxes above, and initially adopted Koalas so that we could swiftly move from local to cloud (Azure in this case). Using reasonable file sizes and partitioning, we carried out some custom and unbalanced groupby’s based on different interval trees, and then called .apply() on the groups (of up to millions of rows each) to run a fairly complex time-series computation that reduces each group to a few key figures.

This required pretty big workers, and even then the amount of overhead and reshuffling led to disappointing execution times and a fat bill. Worse, we didn’t spot any low-hanging optimization fruit to mitigate these problems. If you have the expertise, Spark from Scala might be an interesting avenue, as its RDDs are optimized. In our case, this was beyond what the team would like to do and maintain.

Containerizing code and dispatching it to a pool of serverless workers can be simpler, faster and cheaper than using Databricks

As we weren’t using the extra features of Databricks anyway, we gave up on our Spark efforts and moved to Azure Batch. We simplified our codebase back to pure Pandas, implemented our chunking, dockerized the codebase, tested it, and made a handful of API calls (so far all local) to get the party started on a large scale.

This serverless approach drastically reduced computation time and costs. Moreover, it left the team happy for two reasons: 1) going from local to cloud is just a docker build away, and 2) we ended up with a solution that is easier to grasp, easier to debug, and, therefore, easier to maintain.

By now, you should ask yourself ‘what are the caveats?’ Well, you need to determine how to structure things, including your data, in an embarrassingly parallel way, and you have to implement this choice such that each chunk is compatible with what a single Virtual Machine (VM) can do. An example could be to solve an aggregation for thousands of locations, where each problem (per location) fits on a VM. The results also need to be collected, and you will have to implement this too. Some of the cost reduction is an effect of using low-priority nodes (something like AWS spot instances). These are cheap, but your work is not guaranteed to run: in theory, it may be interrupted in case of capacity issues.
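As a rough illustration of these caveats, the sketch below mimics the pattern locally: one independent task per location, each writing its result to its own file, and a separate step that collects the files afterwards. The function names, the toy data and the local temporary directory (standing in for whatever shared storage your tasks write to) are assumptions for the sake of the example, not part of our actual solution.

```python
import json
import tempfile
from pathlib import Path
from statistics import mean


def run_task(location: str, values: list[float], out_dir: Path) -> Path:
    """Aggregate one location's data and write the result to its own file."""
    out_file = out_dir / f"result_{location}.json"
    if out_file.exists():
        # Already computed, e.g. when a task is re-run after an interruption.
        return out_file
    result = {"location": location, "mean": mean(values), "n": len(values)}
    out_file.write_text(json.dumps(result))
    return out_file


def collect_results(out_dir: Path) -> list[dict]:
    """Merge the per-task result files into one list, sorted by location."""
    results = [json.loads(p.read_text()) for p in out_dir.glob("result_*.json")]
    return sorted(results, key=lambda r: r["location"])


# Toy data: one independent chunk per location.
chunks = {"amsterdam": [1.0, 2.0, 3.0], "utrecht": [10.0, 20.0]}

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    for location, values in chunks.items():
        run_task(location, values, out_dir)  # in the cloud: one task per chunk
    summary = collect_results(out_dir)

print(summary)
```

Writing one result file per chunk keeps tasks independent of each other, and skipping chunks whose result already exists makes a re-run after an interrupted low-priority node cheap.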

How do I run a job on Azure?

We use the Azure Batch Python API, in combination with our own AzureBatchManager. You can use it to dispatch your work in a serverless and massively parallel way. Have the following ready:

A docker image. You’ll pass environment variables to the containers so they know what they should do.
An Azure Batch resource, contributor rights on it, and its access key.
A container registry holding the aforementioned image, and its credentials.

You can find this AzureBatchManager class in our git repository. There is also a configuration file to fill out with relevant properties and secrets. Analogous to the three hierarchical concepts of Azure Batch, the manager has the methods create_pool(), create_job(), and create_task(). A minimal example of how to dispatch your images with different environment variables (that determine the chunking) is in the next section.

An Azure Batch Service Example

from batch_manager import AzureBatchManager
manager = AzureBatchManager()

docker_img = 'myimage:0.0.7'
pool_id = "mypool"

# The pool
# Using STANDARD_D4_V2 workers
manager.create_pool(
    pool_id,
    docker_img,
    vm_type='STANDARD_D4_V2',
    low_prio_nodes=35,
)

## The job
job_id = "what_a_job"
manager.create_job(job_id, pool_id)

Create a pool and a job in Azure Batch using the python API and our AzureBatchManager.

To control the horizontal scaling, you may set the number of low-priority and/or high-priority (dedicated) nodes; the example above requests 35 low-priority nodes via low_prio_nodes. Low-priority nodes are cheaper but not guaranteed to be provisioned, although you’ll seldom have problems with that.

With the pool and the job defined, let's add the actual work:

script = "src/main.py" # in your docker image

# countries_of_the_world is assumed to be an iterable of country names defined elsewhere
for year in range(2000, 2022):
    for country in countries_of_the_world:
        envs = {
            "AZB_YEAR": str(year),
            "AZB_COUNTRY": country,
            "AZB_CONSTANT": "Hey you",
        }

        manager.create_python_task(job_id, script, docker_img, envs)
        print(f"created task: {envs}", 30*"-", sep='\n')

print("Waiting for tasks to complete...")
manager.wait_for_tasks_to_complete(job_id)
print("Done.")

Create the workloads in a for-loop.
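On the receiving end, the script inside the container has to pick up these environment variables. The following is a minimal sketch of what src/main.py could look like; the function names and the placeholder "work" are hypothetical, only the variable names AZB_YEAR, AZB_COUNTRY and AZB_CONSTANT are taken from the loop above.

```python
import os
from typing import Mapping


def read_task_config(environ: Mapping[str, str] = os.environ) -> dict:
    """Read the chunk parameters that the dispatcher passed as environment variables."""
    return {
        "year": int(environ["AZB_YEAR"]),  # sent as a string, so convert it back
        "country": environ["AZB_COUNTRY"],
        "constant": environ.get("AZB_CONSTANT", ""),
    }


def main(environ: Mapping[str, str] = os.environ) -> str:
    config = read_task_config(environ)
    # Placeholder for the real work on this (year, country) chunk.
    return f"processing {config['country']} for {config['year']}"


if __name__ == "__main__":
    # Fall back to example values so the sketch also runs outside a container.
    example = {"AZB_YEAR": "2021", "AZB_COUNTRY": "Netherlands", "AZB_CONSTANT": "Hey you"}
    print(main({**example, **os.environ}))
```

Because the configuration is read through a plain mapping, the same entry point can be exercised locally with a dictionary before you build the image.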

Note that the docker image is specified both upon pool creation and per task. This is intentional, because the two serve different purposes: images specified at pool level are prefetched and thus readily available to tasks. You may, of course, still use a different image per task.

Finally, let's clean up the resources:

manager.delete_job(job_id)
manager.delete_pool(pool_id)

Delete the resources.

Where you may want to tweak the code a bit

Note that the provided AzureBatchManager is just a demo, and you’re invited to adapt it to your purposes. For example, it includes the use of a vnet, which is useful if you want or need to connect to something through a private endpoint (read further on the best practices and on deploying Batch in a vnet), but you can also strip it out. The same holds for the use of a key vault: it’s good practice, but you are free to work without one if you prefer. Using a service principal with the Batch contributor role to authenticate is also a choice, but note that, at the time of writing, managed identities aren’t supported for this scenario.

Conclusion

For straightforward and embarrassingly parallel workloads, Azure Batch can be an effective way to scale out your code in a serverless way, while keeping you in control of packaging and chunking. For certain aggregations, this distributed computation approach can be considerably faster, and much cheaper, than porting your project to Spark and running it on Databricks.

In addition to this, there are the swift transitions from local to cloud, the architectural simplicity, supportability, and the time you save developing. Altogether, Azure Batch is certainly worth considering, and might be preferable if your problem suits it.
