Mon Feb 24 2020

Why you should dive into Delta Lake

Have you ever missed the Update or Delete method when working with big data? Do you ever think about how much time has been lost since you deployed pipelines that use cloud or on-premise storage and Spark for processing, relying on partition-rebuilding strategies? And how many times have you had to reprocess partitions since then?

If you can relate to these situations, then here’s some great news for you: a new transactional storage layer called Delta Lake has the potential to resolve these issues.

How it works

Delta Lake is an open-source storage layer that brings ACID (atomicity, consistency, isolation, durability) transactions to Apache Spark™ and big data workloads. Users can download open-source Delta Lake and use it on-prem with HDFS. Users can read from any storage system that supports Apache Spark's data sources and write to Delta Lake, which stores the data in Parquet format.
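
As a minimal sketch of that flow in Scala (assuming a Spark session with the delta-core package on the classpath, a hypothetical JSON source and a made-up /data/delta/events path, none of which come from the original post):

    import org.apache.spark.sql.SparkSession

    // Assumes delta-core is available, e.g. spark-shell --packages io.delta:delta-core_2.11:0.5.0
    val spark = SparkSession.builder()
      .appName("delta-lake-intro")
      .getOrCreate()

    // Read from any source Spark supports (a hypothetical JSON directory here)...
    val events = spark.read.json("/data/raw/events")

    // ...and write it as a Delta table: Parquet files plus a _delta_log transaction log.
    events.write.format("delta").mode("overwrite").save("/data/delta/events")

    // Reading it back is just another Spark data source.
    val eventsDelta = spark.read.format("delta").load("/data/delta/events")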

Delta Lake offers ACID transactions via optimistic concurrency control between writes, snapshot isolation so that readers don't see garbage data while someone is writing, data versioning and rollback, and schema enforcement to better handle schema and data type changes.

Although Delta Lake was first introduced in 2017, in 2020 we should start hearing about companies experimenting with this new addition to the open-source universe, from those just beginning to develop their pipelines to more mature organisations already running data pipelines on open-source frameworks.

Architecture used in Databricks

With the Bronze, Silver and Gold layers it is possible to establish version control of the data: because the data's history is stored, you can identify changes, audit them, and roll back to a previous version if necessary.
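
A rough sketch of such a layered flow, with made-up paths, column names and transformations (reusing the SparkSession from the earlier sketch), might look like this:

    import org.apache.spark.sql.functions.col

    // Bronze: raw data landed as-is.
    spark.read.json("/data/raw/events")
      .write.format("delta").mode("append").save("/data/delta/bronze/events")

    // Silver: cleaned and deduplicated.
    spark.read.format("delta").load("/data/delta/bronze/events")
      .filter(col("event_id").isNotNull)
      .dropDuplicates("event_id")
      .write.format("delta").mode("overwrite").save("/data/delta/silver/events")

    // Gold: business-level aggregates ready for consumption.
    spark.read.format("delta").load("/data/delta/silver/events")
      .groupBy("event_type").count()
      .write.format("delta").mode("overwrite").save("/data/delta/gold/event_counts")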

Key features of Delta Lake

ACID Transactions: data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest isolation level.
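
For the curious, every committed write shows up as one JSON commit file in the table's _delta_log directory, which is what makes each write atomic. Purely as an illustration (using the hypothetical paths from the sketches above), you can peek at those commits with plain Spark:

    import org.apache.spark.sql.functions.col

    // Each successful write adds exactly one commit file (00000000000000000000.json, ...) to _delta_log.
    val commits = spark.read.json("/data/delta/events/_delta_log/*.json")

    commits
      .where(col("commitInfo").isNotNull)
      .select("commitInfo.operation", "commitInfo.timestamp")
      .show(truncate = false)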

Scalable Metadata Handling: in big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.

Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments. It's already designed with GDPR in mind - running deletes concurrently on older partitions while newer partitions are being appended.
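
Using the hypothetical events table from before, time travel is just an option on the read (the version number, timestamp and rollback pattern below are illustrative, not prescriptive):

    // Read the table as of an earlier version...
    val asOfVersion = spark.read.format("delta")
      .option("versionAsOf", 0)
      .load("/data/delta/events")

    // ...or as of a timestamp.
    val asOfTime = spark.read.format("delta")
      .option("timestampAsOf", "2020-02-01")
      .load("/data/delta/events")

    // One possible way to roll back: write an old snapshot back over the table.
    asOfVersion.write.format("delta").mode("overwrite").save("/data/delta/events")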

Open Format: all data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.

Unified Batch and Streaming Source and Sink: a table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
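
A sketch with the made-up bronze/silver paths from above (the checkpoint location is also hypothetical): the same Delta table acts as a streaming sink while staying queryable in batch.

    // Continuously stream new rows from the bronze table into the silver table...
    val query = spark.readStream
      .format("delta")
      .load("/data/delta/bronze/events")
      .writeStream
      .format("delta")
      .option("checkpointLocation", "/data/checkpoints/silver_events")
      .outputMode("append")
      .start("/data/delta/silver/events")

    // ...while the very same silver table remains readable as a plain batch DataFrame.
    spark.read.format("delta").load("/data/delta/silver/events").count()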

Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
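
For example (with a made-up mismatched column, assuming the events table from the earlier sketches), an append whose schema does not match the table is rejected instead of silently written:

    import org.apache.spark.sql.AnalysisException
    import spark.implicits._

    // A batch with a column the table does not have (and a wrong type), purely illustrative.
    val badBatch = Seq(("e-1", "not-a-timestamp")).toDF("event_id", "event_time_as_string")

    try {
      badBatch.write.format("delta").mode("append").save("/data/delta/events")
    } catch {
      case e: AnalysisException =>
        println(s"Rejected by schema enforcement: ${e.getMessage}")
    }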

Schema Evolution: big data is continuously changing. Delta Lake enables us to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
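
Continuing the example, a new batch that arrives with an extra column (made-up data and column name) can be merged into the table schema with a single write option:

    import spark.implicits._

    // New batch with an additional "country" column that the table does not have yet.
    val newBatch = Seq(("e-42", "2020-02-24", "NL")).toDF("event_id", "event_date", "country")

    newBatch.write
      .format("delta")
      .option("mergeSchema", "true")   // evolve the table schema instead of failing
      .mode("append")
      .save("/data/delta/events")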

Audit History: Delta Lake's transaction log records details about every change made to the data, providing a full audit trail of the changes.
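
With the DeltaTable API (again on the hypothetical events table), that audit trail is one call away:

    import io.delta.tables.DeltaTable

    val deltaTable = DeltaTable.forPath(spark, "/data/delta/events")

    // Every commit: when it happened, which operation ran and with which parameters.
    deltaTable.history()
      .select("version", "timestamp", "operation", "operationParameters")
      .show(truncate = false)

    // Or only the most recent commit.
    deltaTable.history(1).show(truncate = false)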

Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
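
A hedged sketch of those APIs in Scala, with made-up table paths and column names (user_id, event_id, event_type) and a hypothetical batch of incoming changes:

    import io.delta.tables.DeltaTable

    val eventsTable = DeltaTable.forPath(spark, "/data/delta/events")

    // Delete: e.g. a GDPR "right to be forgotten" request for one user.
    eventsTable.delete("user_id = 'user-123'")

    // Update rows in place with SQL expressions.
    eventsTable.updateExpr(
      "event_type = 'click'",
      Map("event_type" -> "'page_click'"))

    // Merge (upsert) a batch of incoming changes, change-data-capture style.
    val changes = spark.read.json("/data/raw/event_changes")   // hypothetical CDC batch

    eventsTable.as("t")
      .merge(changes.as("c"), "t.event_id = c.event_id")
      .whenMatched.updateAll()
      .whenNotMatched.insertAll()
      .execute()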

100% Compatible with Apache Spark API: developers can use Delta Lake with their existing data pipelines with minimal changes, as it is fully compatible with Spark, the commonly used big data processing engine.

In conclusion, Delta Lake enables developers to spend less time managing partitions. And since the data's history is stored, it becomes much simpler to use the data versioning layer to apply updates and deletions to the data and to support physical data lineage.

photo of Marcus Azevedo, the author of the blog

This blog was written by Marcus Azevedo, one of our data engineers at Xomnia. We’re on the lookout for another data engineer to join our team. Someone with specific knowledge of Hadoop, Spark and Kafka, and the capability to architect highly scalable distributed systems, using different open source tools - like Delta Lake. Does this sound like you? Then check out our vacancy and get in touch with us today.