In need of compute power? Would you rather run dockerized code on a serverless, massively parallel processing framework than port all your work to Spark? Hang on! We’ve been using Azure Batch, and found that in our case it offers good bang for the buck compared to Databricks, all while reducing coding effort.
In this blog post, we begin by describing what Azure Batch is, and why certain workloads fly better on Azure Batch than on Spark/Databricks. Next, we provide a "Hello World" that uses the Python API of Azure Batch to scale out your containerized workloads in a serverless and distributed fashion.
First things first: What is Azure Batch, and how is it used?
Azure Batch is a cloud platform that you can use to effectively provision a pool of Virtual Machines (VMs) and manage workloads to run on them. It is useful in a variety of high-performance computing scenarios, e.g. machine learning parameter search, 3D rendering using Blender, or big data processing jobs.
Using this cloud platform, you can have thousands of VMs, organized in pools with certain (scaling) properties. Jobs contain a collection of tasks, and each task gets dispatched to a VM. Azure Batch comes with a handy user interface to monitor what is running. You pay in proportion to the core-minutes you consume, at a rate that depends on the type of VM but not on the pool size, so one hour on a hundred workers may be just as cheap as ten hours on ten.
There are different OS choices for Azure Batch. We love Docker, so in this blog, we focus on the dockerized lift-and-shift scenario, where we dispatch containers with different parametrization. This somewhat overlaps with provisioning Azure Container Instances (ACIs), but on a larger scale.
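To make the pricing argument concrete, here is a small back-of-the-envelope sketch. It assumes billing that is strictly proportional to core-hours; the price and the number of cores per node are made-up placeholders, not actual Azure rates.

```python
import math


def pool_cost(nodes: int, hours: int, cores_per_node: int,
              price_per_core_hour: float) -> float:
    """Total cost when billing is proportional to the core-hours consumed."""
    return nodes * hours * cores_per_node * price_per_core_hour


# Hypothetical values, for illustration only.
PRICE_PER_CORE_HOUR = 0.05
CORES_PER_NODE = 4

# One hour on a hundred workers versus ten hours on ten workers.
wide = pool_cost(100, 1, CORES_PER_NODE, PRICE_PER_CORE_HOUR)
narrow = pool_cost(10, 10, CORES_PER_NODE, PRICE_PER_CORE_HOUR)

print(f"wide pool:   {wide:.2f}")
print(f"narrow pool: {narrow:.2f}")
print("same cost:", math.isclose(wide, narrow))
```

Under this assumption, both setups consume the same number of core-hours, so the wide pool simply delivers the result ten times sooner for the same money.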
Should you use Azure Batch or something else, such as Spark? An example
Spark is an obvious, and often inevitable, solution for big data problems. For many of us, especially the ones on Azure, Databricks is the de facto road to Spark. But it isn’t always the best option.
Let’s reason about the nature of problems a bit before we see why. For example, consider a biggish-data scenario (less than a trillion rows) that touches upon the following:
- Your problem is embarrassingly parallel (i.e., you have a task for many independent parameters/combinations thereof). Your data is organized correspondingly such that intermediate shuffling isn’t necessary.
- The problem isn’t a straightforward row-wise operation or map-reduce, but rather a convoluted aggregation that isn’t a great match for Spark (example given below). Each chunk is heavy enough to break small-scale serverless solutions (e.g., AWS Lambda, Azure Functions and the like), but you can fit it on a virtual machine.
- You’ve invested in a software package, and you want to keep it for local usage. Porting it or keeping two versions of it isn’t desirable (e.g., because of the double work, a team that isn’t Spark-savvy, or too little time).
- Your codebase is rather complex, and packages such as Koalas only partly support your purposes while increasing overhead and code complexity.
- You’re faced with dependency management and the need for fine-grained control. Fiddling with vendor’s base images to incorporate your libraries can be cumbersome, and dependencies do not always resolve.
We’ve recently worked on a problem that met the above bullets, and initially adopted Koalas so that we could swiftly move from local to cloud (Azure in this case). Using reasonable file sizes and partitioning, we carried out some custom and unbalanced groupby’s based on different intervaltrees, and then called .apply() on the groups (of up to millions of rows) with some time-series complexity that would reduce to a few key figures.
This required pretty big workers, and even then the amount of overhead and reshuffling led to disappointing execution times and a fat bill. Worse, we didn’t spot any low-hanging optimization fruit to mitigate these problems. If you have the expertise, Spark from Scala might be an interesting avenue, as its RDDs are optimized. In our case, this was beyond what the team wanted to build and maintain.
Containerizing code and dispatching it to a pool of serverless workers can be simpler, faster and cheaper than using Databricks
Since we weren’t using the extra features of Databricks anyway, we gave up on our Spark efforts and moved to Azure Batch. We simplified our codebase back to pure Pandas, implemented our chunking, dockerized the codebase, tested it, and made a handful of API calls (so far all from a local machine) to get the party started on a large scale.
This serverless approach drastically reduced computation time and costs. Moreover, it left the team happy for two reasons: 1) moving from local to cloud was just a docker build away, and 2) we ended up with a solution that was easier to grasp, easier to debug, and, therefore, easier to maintain.
By now, you should be asking yourself: what are the caveats? First, you need to decide how to structure your problem, including your data, in an embarrassingly parallel way, and each resulting chunk must be compatible with what a single Virtual Machine (VM) can do. An example could be to solve an aggregation for thousands of locations, where each problem (per location) fits on a VM. Second, the results need to be collected, and you will have to implement this yourself. Third, some of the cost reduction is an effect of using low-priority nodes (something like AWS spot instances). These are cheap, but your work is not guaranteed to run: in theory, it may be interrupted in case of capacity issues.
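The chunking and collection steps can be sketched in a few lines. This is only an illustration under our own assumptions: the helper names chunked and collect are hypothetical, the location names are made up, and each task is assumed to return a small dictionary of key figures per location.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def chunked(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Split the work items into chunks of at most `size` elements."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def collect(partials: Iterable[Dict[str, float]]) -> Dict[str, float]:
    """Merge per-chunk results, refusing to silently overwrite a location."""
    merged: Dict[str, float] = {}
    for partial in partials:
        overlap = merged.keys() & partial.keys()
        if overlap:
            raise ValueError(f"duplicate results for: {sorted(overlap)}")
        merged.update(partial)
    return merged


# Hypothetical locations; in practice there may be thousands.
locations = ["amsterdam", "utrecht", "rotterdam"]
chunks = list(chunked(locations, size=2))
print(chunks)

# Stand-in for the per-chunk work that would run on separate VMs.
partial_results = [{name: float(len(name)) for name in chunk} for chunk in chunks]
print(collect(partial_results))
```

Because low-priority nodes can be interrupted, it helps if a task can simply be rerun for the same chunk; the duplicate check in collect then flags the case where a chunk was accidentally delivered twice.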
How do I run a job on Azure?
We use the Azure Batch Python API, in combination with our own AzureBatchManager. You can use it to dispatch your work in a serverless and massively parallel way. Have the following ready:
- A docker image. You’ll pass environment variables to the containers so they know what they should do.
- An Azure Batch resource, contributor rights on it, and its access key.
- A container registry, holding the aforementioned image, and credentials.
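On the container side, reading the parametrization from environment variables could look roughly like the sketch below. The variable name LOCATIONS, its comma-separated format, and the function read_chunk are assumptions made for illustration; your image may use any convention you like.

```python
import os
from typing import List, Mapping


def read_chunk(env: Mapping[str, str] = os.environ) -> List[str]:
    """Parse the chunk of work from a (hypothetical) LOCATIONS variable."""
    raw = env.get("LOCATIONS", "")
    locations = [item.strip() for item in raw.split(",") if item.strip()]
    if not locations:
        # Fail loudly so that a misconfigured task shows up as failed.
        raise ValueError("LOCATIONS is not set or empty; nothing to do")
    return locations


# Inside the container you would call read_chunk() without arguments.
# Here we pass an explicit mapping so the sketch runs anywhere.
for location in read_chunk({"LOCATIONS": "amsterdam, utrecht"}):
    print(f"processing {location}")  # placeholder for the real computation
```

Passing the environment as an argument keeps the entrypoint easy to test locally, which fits the idea of the same image running on a laptop and in the cloud.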
You can find this AzureBatchManager class in our git repository. There is also a configuration file to fill out with relevant properties and secrets.
Analogous to the three hierarchical concepts of Azure Batch (pools, jobs and tasks), the manager has a method for each level, such as create_task(). A minimal example of how to dispatch your images with different environment variables (that determine the chunking) is in the next section.
An Azure Batch Service Example
To control the horizontal scaling, you may set different values for either the number of low-priority nodes or high_priority_nodes. The low-priority alternative is cheaper but not guaranteed to be provisioned. In practice, you’ll seldom have problems with that.
With the pool and the job defined, let's add the actual work:
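As a rough, SDK-independent sketch of this step, the snippet below builds one task definition per chunk, each with the same image and a different environment. The TaskSpec class, the build_tasks helper, the image name and the LOCATIONS variable are all hypothetical; handing each definition to the manager's create_task() is left out, because its exact arguments depend on the version of the class you use.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical description of one containerized task."""
    task_id: str
    image: str
    env: Dict[str, str] = field(default_factory=dict)


def build_tasks(image: str, chunks: List[List[str]]) -> List[TaskSpec]:
    """Create one task per chunk; only the environment differs per task."""
    return [
        TaskSpec(
            task_id=f"task-{index:04d}",
            image=image,
            env={"LOCATIONS": ",".join(chunk)},
        )
        for index, chunk in enumerate(chunks)
    ]


# Placeholder image name and chunks, for illustration only.
tasks = build_tasks(
    "myregistry.azurecr.io/aggregator:latest",
    [["amsterdam", "utrecht"], ["rotterdam"]],
)
for task in tasks:
    print(task.task_id, task.image, task.env)
```

Keeping the task definitions as plain data makes it easy to inspect or log what will be dispatched before any cloud resources are involved.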
Note that the docker image is specified both upon pool creation and per task. This is intentional, because the two serve different purposes. Images specified at pool level are prefetched, and thus readily available for tasks. You may then, of course, use different images per task.
Finally, let's clean up the resources:
Where you may want to tweak the code a bit
Note that the provided AzureBatchManager is just a demo, and you’re invited to adapt it to your purposes. For example, it includes the use of a vnet, which is useful if you want or need to connect to something using a private endpoint (read further on the best practices and on deploying Batch in a vnet). If you don’t need it, the vnet may be stripped. The same holds for the usage of a Key Vault: it’s good practice, but you are free to work without one if you prefer. Using a service principal that holds the Batch contributor role is also a choice, but note that at the time of writing, managed identities aren’t supported for this scenario.
For straightforward and embarrassingly parallel workloads, Azure Batch can be an effective way to scale out your code in a serverless way, while keeping you in control of packaging and chunking. For certain aggregations, this distributed computation approach is considerably more effective, and much cheaper, than porting your project to Spark and running it on Databricks.
In addition, there are the swift transitions from local to cloud, the architectural simplicity, the supportability, and the development time you save. Altogether, Azure Batch is certainly worth considering, and might be preferable if your problem suits it.