Thu Feb 10 2022

How we set up Airflow on K8s with Spark (HEMA)

MeetUp


HEMA is a Dutch variety store chain known for relatively low-priced generic household goods, most of which are made by and for the chain itself, often with an original design.

In 2020, after a decade of using Teradata technology, HEMA decided to migrate to the cloud on AWS to reduce the costs and limitations of big data processing and to improve customer experience. After trying tools like dbt and Lambda functions for data processing, the team decided to move to an easily scalable, low-cost PaaS EKS environment that could support analytics requirements using Spot compute capacity.

In this edition of Data & Drinks, we will explain how we set up Airflow on K8s from an infrastructure and code perspective, how we integrated it with Spark, and what issues we faced in the process. We will also discuss the value that the scalability of Airflow on Kubernetes, integrated with Spark, has brought to HEMA.

Daniel Galea:
Daniel comes from an academic and professional background in software development. He later transitioned to data engineering after completing a master's degree in computer science focused on big data engineering. He joined Xomnia in 2019 as a data engineer, where he designs and builds data platforms and data pipelines in the cloud for a variety of use cases. He currently works as a data engineer at HEMA, in the team migrating from an on-premises platform to AWS. He aims to help HEMA deliver the best customer experience.

Marcus Azevedo:
With over 13 years in big data engineering and architecture, Marcus currently works as the lead DataOps engineer at HEMA. He holds a BS in Database Technology and an MBA focused on big data & data science technology. His MBA covered modules on high-performance & distributed computing, on understanding how computer systems and software work at the lowest levels, and on how to make them perform as well as possible, especially across network resources.

About the speakers

Daniel Galea

Data Engineer at Xomnia

Marcus Azevedo

Lead DataOps Engineer at HEMA