A robust and scalable data ingestion system for The Ocean Cleanup

To future-proof The Ocean Cleanup's data ingestion infrastructure, we built a scalable Azure-based data streaming platform, moving it from prototype to production in under three months.

"The data ingestion platform does exactly what we need it to do. It is easy to maintain, requires little effort to add new data sources and scales along with our operations. These guys know what they are doing and are just as motivated to deliver quality as we are."
– Maarten van Berkel, Project Manager & Data Architect at The Ocean Cleanup

Case

The Ocean Cleanup designs and develops advanced technologies to rid the oceans of plastic. They are doing this by cleaning up what is already polluting our oceans and by intercepting plastic on its way to the ocean via rivers. Their ocean technology is moved by the ocean's forces – just like the plastic – to passively catch and retain it. The Interceptor™, their solution for river debris, is a 100% solar-powered device that autonomously halts and extracts plastic from rivers before it reaches the ocean.

The Ocean Cleanup believes that to develop optimal cleanup technologies it's essential to truly understand the problem. That is why they have been conducting extensive research using data ranging from low-tech sources (visual counting from bridges) to high-tech sources (automated camera monitoring). Ingesting data from these varied sources, located all over the world, is no easy task for the data engineering team. Over time this led to internal challenges: a fragmented IT landscape, unclear data lineage and scalability issues.

The Ocean Cleanup needed a data ingestion framework that could scale with the ever-growing list of sources and that standardized the way data is handled by the rest of the organization.

Solution

Together with The Ocean Cleanup's data engineering team, we sat down to discuss the current architecture and see how the new ingestion framework would fit in. During these brainstorming sessions we set out to design a system that would work not only for current processes but also for future use cases. This resulted in the following design goals:

  • Standardized: the initial flow of data is the same for every source
  • Flexible: data can arrive in a wide variety of formats (files, binary and JSON)
  • Transparent: a clear trail records the origin of every piece of data
  • Open: applications can access the sources they are authorized to use
  • Maintainable: built with tools and libraries already in use and well known to the team

With these design goals in mind, Xomnia's data engineers started working on the ingestion framework. Leveraging our expertise and experience, we moved from an initial prototype to a productionized system in just three months. During this period Xomnia closely collaborated with The Ocean Cleanup to ensure a seamless integration into the existing Azure cloud environment.

The framework provides an API to which data providers can push their payloads and metadata. These are ingested and stored in raw form before being posted on a messaging bus. This all happens independently of the data source and requires no changes to the framework when new sources are added. Once the data is on the messaging bus, applications (e.g. image recognition algorithms, business logic, cleansing and parsing) apply further processing in a streaming manner.
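The actual API is internal to The Ocean Cleanup, but the pattern is simple enough to sketch. Below is a minimal, purely illustrative Python example of the ingestion step, assuming the Azure services named in the next paragraph; the connection strings, container name and topic name are hypothetical placeholders, not the real configuration.

```python
# Purely illustrative sketch of the ingestion step: persist the payload
# in raw form, then announce it on the messaging bus. All names below
# (connection strings, container, topic) are hypothetical placeholders.
import json
import uuid
from datetime import datetime, timezone

from azure.servicebus import ServiceBusClient, ServiceBusMessage
from azure.storage.blob import BlobServiceClient

BLOB_CONN = "<storage-connection-string>"     # assumed configuration
BUS_CONN = "<service-bus-connection-string>"  # assumed configuration
RAW_CONTAINER = "raw-ingest"                  # hypothetical container
TOPIC = "ingested-data"                       # hypothetical topic


def ingest(source: str, payload: bytes, metadata: dict) -> str:
    """Store one payload exactly as received and publish a pointer to it."""
    blob_name = f"{source}/{datetime.now(timezone.utc):%Y/%m/%d}/{uuid.uuid4()}"

    # 1. Raw storage: keep the original bytes untouched for lineage.
    blob_service = BlobServiceClient.from_connection_string(BLOB_CONN)
    blob_service.get_blob_client(RAW_CONTAINER, blob_name).upload_blob(payload)

    # 2. Messaging bus: tell downstream applications where the data lives.
    with ServiceBusClient.from_connection_string(BUS_CONN) as bus:
        with bus.get_topic_sender(topic_name=TOPIC) as sender:
            sender.send_messages(ServiceBusMessage(
                json.dumps({"blob": blob_name, "metadata": metadata}),
                application_properties={"source": source},
            ))
    return blob_name
```

Because the bus message carries only a reference to the raw blob plus metadata, the bus stays lightweight and the original data remains available in raw form for lineage, whatever its format.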

The framework uses the following services: Azure Kubernetes Service (AKS), Azure Service Bus and Azure Blob Storage.
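On the consuming side, each downstream application could read from its own subscription on the bus and process events as they arrive. Again a hedged sketch rather than the actual implementation: the topic, subscription and `process` function below are assumed names, used only to show the streaming pattern.

```python
# Purely illustrative consumer: stream messages from a Service Bus
# subscription and process the raw payloads they point to. The topic,
# subscription and process() below are assumed names, not the real ones.
import json

from azure.servicebus import ServiceBusClient
from azure.storage.blob import BlobServiceClient

BUS_CONN = "<service-bus-connection-string>"  # assumed configuration
BLOB_CONN = "<storage-connection-string>"     # assumed configuration
TOPIC = "ingested-data"                       # hypothetical topic
SUBSCRIPTION = "image-recognition"            # hypothetical subscription


def process(payload: bytes, metadata: dict) -> None:
    """Placeholder for application logic (e.g. image recognition, parsing)."""
    print(f"processed {len(payload)} bytes with metadata {metadata}")


def run() -> None:
    blob_service = BlobServiceClient.from_connection_string(BLOB_CONN)
    with ServiceBusClient.from_connection_string(BUS_CONN) as bus:
        with bus.get_subscription_receiver(
            topic_name=TOPIC, subscription_name=SUBSCRIPTION
        ) as receiver:
            for message in receiver:  # blocks; yields messages as they arrive
                event = json.loads(str(message))
                raw = (blob_service
                       .get_blob_client("raw-ingest", event["blob"])
                       .download_blob()
                       .readall())
                process(raw, event["metadata"])
                receiver.complete_message(message)  # done, remove from bus


if __name__ == "__main__":
    run()
```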

Impact

The framework is already live in the production environment and in use by some data providers. It will become fully operational once the coronavirus pandemic has passed and operations all over the world can return to normal.

Xomnia has helped overcome the challenges posed by the wide variety of data sources by building a strong, inherently flexible foundation: a framework that moves with a dynamic environment, requires little to no change when new data sources are added, scales with the volume of incoming data and allows that data to be processed in a streaming manner.

This framework enables The Ocean Cleanup to keep focusing on what matters most: designing and developing advanced cleanup technology.