A modern data platform is a stack of tools designed to facilitate extracting business value from data, and it is typically developed on a cloud platform. In the sections below, Xomnia's data engineers answer the frequently asked questions about data platforms that leaders and decision makers of data and analytics teams need to know.
What is a cloud platform, and how does it differ from a cloud service?
A cloud platform is, in general, a suite of cloud services (an online toolbox, so to speak) that forms the building blocks for implementing business applications in the cloud. Cloud platforms are offered by cloud providers such as AWS, Azure or Google Cloud Platform (GCP). Generally speaking, a company will run both its product(s) and its data platform in the cloud.
A mature data and analytics platform built on a cloud platform is composed of various data ingestion services, data storage services, machine learning pipelines and BI layers.
Are all data platforms cloud platforms?
No, they are not. A modern data platform is a stack of tools designed to facilitate extracting business value from data, and it can be developed on a cloud platform. Vendors of cloud platforms have developed a large number of services that facilitate the development of a data platform. However, it is also possible to build a data platform outside of a cloud platform: most services, especially the open-source offerings, can also be hosted in a non-cloud data center. For example, one can use Apache Airflow, Kubernetes and Apache Druid to create an on-premise modern data platform.
Common processes carried out by a data platform are listed below (a minimal orchestration sketch follows the list):
- Connecting to data source systems
- Orchestrating data ingestion tasks to make sure that all company data ends up in the unified data platform
- Processing and transforming incoming data streams
- Storing raw data in blob storage
- Storing (processed) data in use-case-specific storage, e.g. relational databases, NoSQL storage or a big data engine
- Providing components to extract value from data, for example machine learning models or dashboards
- Ensuring the security and authorization of data flows
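To make these processes concrete, below is a minimal sketch of how an ingestion-and-transformation flow could be orchestrated with Apache Airflow (mentioned above). The DAG, task names and source system ("crm") are hypothetical placeholders, not a prescription:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest_source(source: str) -> None:
    # Placeholder: pull data from a source system and land it in blob storage.
    print(f"Ingesting data from {source}")


def transform_raw_data() -> None:
    # Placeholder: clean and reshape the landed data for downstream storage.
    print("Transforming raw data")


with DAG(
    dag_id="daily_ingestion",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = PythonOperator(
        task_id="ingest_crm",
        python_callable=ingest_source,
        op_kwargs={"source": "crm"},
    )
    transform = PythonOperator(
        task_id="transform_raw_data",
        python_callable=transform_raw_data,
    )
    # Ingestion must finish before transformation starts.
    ingest >> transform
```

In a real platform, each task would call out to the relevant ingestion or processing service; the orchestrator's job is only to schedule the work and enforce the dependencies between steps.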
A major advantage of using cloud platforms to implement your data platform is that cloud services abstract away the management of the underlying infrastructure. In other words, the difficulty of managing complex infrastructure, like servers and networks, is largely taken off your hands. In contrast, managing an on-premise data platform is a laborious task that requires a lot of expertise.
What are the benefits of a data platform?
Data platforms come with many advantages to your business:
A single source of truth for your data: In many enterprises, an uncontrollable sprawl of source systems and integrations dilutes the lineage and validity of your data. By centralizing your data in a data platform, all source systems will push their data to a single place, which will improve:
Data governance: Centralization makes it possible to enforce standardization and data quality checks, and to maintain KPI and data definitions in a universal data catalog. This helps end conflicts over the validity of the KPIs that appear in reporting dashboards.
Security: A centralized authentication/authorization mechanism is required to permit any data access. Moreover, a data platform allows managing Personally Identifiable Information (PII) centrally, as well as privacy measures such as the Right to Be Forgotten (RTBF).
Business velocity: Business users access a single platform instead of manually unifying data from multiple APIs or data lakes (which, more often than not, degrade into data swamps). Data silos can finally be replaced by a single data platform.
More efficiency and lower costs: A self-service data platform for all future applications increases efficiency and reduces costs for your business. This makes a data platform essential as your organization scales: new use cases built on top of the data will not require custom onboarding or complex setups, so new machine learning use cases or new reporting tools can immediately start adding value.
Adhering to platform and infrastructure best practices and reducing operational costs: By building a single data platform that adheres to best practices, such as proper CI/CD and Infrastructure as Code (IaC) platform management, platform failures will become a rarity. This allows your business to focus on extracting value from data instead of on controlling it.
Centralized monitoring of application health, performance and security: In an on-premise data center, many tools have to be used to monitor the health of the various hardware and software components of your platform. Tooling exists that centralizes this, but with the cloud, it’s all built-in.
How do you create a data platform?
Before jumping into creating a data platform, it is essential to first define your business data strategy. Stakeholders need to first clearly determine the answers to questions like:
- How is the volume of data at our organization going to increase over the next 10 years?
- Do we see applicable machine learning use cases?
- What data sources do we need to integrate?
Defining the answers to those questions will allow you to craft a platform that will keep up with your organization’s needs in the future while adding maximum value. Take sufficient time for this process, and iterate on it.
Next starts the journey to creating a data platform, which consists of:
- Landscape Exploration: Creating an overview of all available data sources, the requirements under which they should be incorporated into the data platform, and any already existing infrastructure.
- Creating the blueprint: Designing the solution architecture on a high level, independent of specific tools.
- Defining technical details: Specifying which tools are to be used, how they should be connected to each other and how to structure the corresponding code.
- Implementation: Building the solution in functional increments.
- Evaluation: Frequently demonstrating the built components to the client and gathering feedback.
Note: When composing your cloud infrastructure, there is often a choice between implementing a system yourself and using a managed offering from the vendor. It is important to consider the tradeoffs: managed solutions offer an easy setup, easy scalability and integration with other systems, while crafting your own solution allows full control over both the bill and all configuration options. Operational cost, management cost, and the expertise required to maintain custom solutions should be taken into account when making this choice.
What components are a part of the cloud infrastructure for the platform?
On a high level, each data platform is powered by data storage and by compute resources that transform and move data. In reality, however, a data platform consists of many interconnected parts. Data storage in a platform, for example, can manifest itself as data warehouses, databases or raw file storage. Compute (i.e. the hardware running applications) is responsible for tasks that include, but are not limited to:
- Running data ingestion services, e.g. Apache NiFi for batch processing or Apache Kafka for streaming. These move data from source systems to a data storage solution in your platform.
- Running data transformation or analytics services, such as Apache Spark or Databricks. These components transform and analyze data at scale (see the sketch after this list).
- Running machine learning infrastructure, such as training machine learning models or MLOps applications that manage the machine learning lifecycle.
- Running and hosting containerized custom applications on Kubernetes, such as APIs, visualization applications or any of the aforementioned services.
- Running services that unlock (big) data, such as Apache Druid. These components unlock analytic insights into (semi-)structured data, even when it is big data (terabytes or more).
- A layer of abstraction on top of compute: Most cloud vendors will also offer this. For example, instead of hosting your own Airflow cluster for workflow orchestration, Google Cloud Platform (GCP) offers Cloud Composer, Azure offers Azure Data Factory, while you can also achieve a similar architecture using Step Functions on AWS.
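As an illustration of the transformation component mentioned above, here is a minimal PySpark sketch. The storage paths and column names are assumptions made for the sake of the example:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-daily-agg").getOrCreate()

# Read raw order events from blob storage (path is a placeholder).
orders = spark.read.parquet("s3://raw-zone/orders/")

# Transform: aggregate daily revenue per country.
daily_revenue = (
    orders
    .withColumn("order_date", F.to_date("order_timestamp"))
    .groupBy("order_date", "country")
    .agg(F.sum("amount").alias("revenue"))
)

# Write the result to the processed zone for BI and ML consumers.
daily_revenue.write.mode("overwrite").parquet("s3://processed-zone/daily_revenue/")
```

The same job could run on a self-managed Spark cluster, on Databricks, or on a cloud vendor's managed Spark service; the code is largely the same, which is one reason Spark is a common transformation layer.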
What makes a strong data platform?
A strong platform enables the business to rapidly access valid data, design effective reporting and extract more value from data through, for example, machine learning. This will result from a well-planned platform strategy, which should be centered around business goals, such as cost reduction, time to market, increased security or any other business KPIs.
From an engineering perspective, a strong platform is, just like strong software, easily scalable, reliable, secure and maintainable. Any platform strategy should (at least) factor in these considerations. Moreover, a strong platform is built on strong engineering principles. To achieve a strong data platform, Xomnia recommends the following starting points:
- Managing infrastructure and security through automated Infrastructure as Code (IaC); a minimal sketch follows this list
- Automating and testing deployments and security through a rigid process of Continuous Integration and Continuous Deployment (CI/CD). Ideally, no changes should be made to any part of the platform outside of CI/CD pipelines.
- Appointing data stewards who enforce and promote data standards across the business. This should be combined with automated evaluations of and reporting on data quality.
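To illustrate the IaC principle, below is a minimal sketch using the AWS CDK in Python. The stack and bucket are hypothetical; the point is that the storage layer is declared in version-controlled code rather than configured by hand:

```python
import aws_cdk as cdk
from aws_cdk import aws_s3 as s3
from constructs import Construct


class DataLakeStack(cdk.Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Raw landing-zone bucket: versioned, encrypted, and never public.
        s3.Bucket(
            self,
            "RawZone",
            versioned=True,
            encryption=s3.BucketEncryption.S3_MANAGED,
            block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
        )


app = cdk.App()
DataLakeStack(app, "DataLakeStack")
app.synth()
```

Because the infrastructure definition lives in code, it can be reviewed, tested and deployed through the same CI/CD pipelines as the rest of the platform, which is exactly the "no changes outside of CI/CD" principle above.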
How to make your data platform future proof?
Any data platform strategy should consider scalability a top priority. Data volumes tend to grow exponentially, and this warrants extreme care when designing a platform.
To future proof your data platform, Xomnia recommends:
- Making a well-informed choice between cloud-agnostic and cloud-specific tooling, to prevent huge infrastructure overhaul costs in the future.
- Automated deployment and testing of infrastructure using CI/CD and Infrastructure as Code (IaC).
- Automating the documentation to allow future users to easily tap into the platform. What you really want is for data scientists to explore different algorithms and visualizations, not to dig through an outdated data catalog to figure out what data exists in the platform.
- Disassembling monolithic applications and promoting containerized microservices architectures (a minimal sketch follows this list). Make applications responsible for a small piece of work and make them excel at that specific task. This allows testing individual applications more easily and scaling them on any computing platform, while preventing single-point-of-failure situations after deploying a change. Decoupled services can also follow their own release cycle and scale independently from other services. Moreover, they can make optimal use of different programming languages or tools that specialize in that specific use case.
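As a sketch of such a single-responsibility microservice, here is a minimal example using FastAPI. The service, endpoints and exchange rates are invented for illustration; the point is that the service does one small thing and can be containerized, tested and scaled on its own:

```python
from fastapi import FastAPI

app = FastAPI(title="currency-conversion-service")

# Hardcoded rates stand in for a real reference-data source.
RATES_TO_EUR = {"USD": 0.92, "GBP": 1.17}


@app.get("/health")
def health() -> dict:
    """Liveness probe endpoint for the container orchestrator."""
    return {"status": "ok"}


@app.get("/convert")
def convert(amount: float, currency: str) -> dict:
    """Convert an amount in the given currency to EUR."""
    rate = RATES_TO_EUR.get(currency)
    if rate is None:
        return {"error": f"unsupported currency: {currency}"}
    return {"amount_eur": round(amount * rate, 2)}
```

Packaged in a container and run with a server such as uvicorn, this service can be deployed, versioned and scaled on Kubernetes independently of every other component of the platform.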
How can I utilize my cloud data platform for machine learning?
Any data platform that sits on top of modern infrastructure is just seconds away from tapping into its immense potential through machine learning. Most cloud vendors offer managed machine learning services that will ease the complex machine learning lifecycle of training, evaluating and deploying models.
Given that your platform adheres to the principles of scalability, security (i.e. AI is used responsibly) and governance, machine learning engineers and data scientists can start exploring the potential of your organization’s data.
Components of a platform that are required by a machine learning solution are:
- An automated MLOps pipeline: It is responsible for training, deploying and monitoring algorithms (see the sketch after this list). A solid pipeline will allow data scientists to easily prototype and perform testing. Moreover, it lets them monitor model performance, corresponding metadata and model drift. This is the "machine learning in production" feature: the slow - or sometimes fast - changes in the world are reflected in the input data of your model, the model performance and other measures of the model's fit.
- Scalable data sources to access data from: These could include data warehouses, data streams or file storage. Machine learning is a data-hungry process, and the more data, the more valuable it can be - as well as more challenging to implement effectively.
- An integrated environment to collaboratively develop machine learning models: Whether this is a managed environment (such as Databricks, Azure ML, Google Colab or AWS SageMaker) or a combination of local development and CI/CD pipelines, code management should not be a blocker for data scientists.
- Sufficient computational resources to train memory-hungry machine learning models: The automation of deployment and monitoring of costs are essential for making machine learning profitable for your organization.
- A big data processing engine: It performs large-scale transformations on raw data to prepare it for the machine learning pipeline.
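As a sketch of what the training step of such an MLOps pipeline could look like, here is a hypothetical example using scikit-learn for the model and MLflow for experiment tracking. The dataset path, feature names and metric are assumptions for illustration:

```python
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Load features produced by the big data processing engine (placeholder path).
df = pd.read_parquet("processed/daily_revenue.parquet")
X = df[["revenue_lag_1", "revenue_lag_7"]]
y = df["revenue"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    # Log parameters, metrics and the model itself, so the deployment and
    # drift-monitoring stages of the pipeline can pick them up later.
    mae = mean_absolute_error(y_test, model.predict(X_test))
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("mae", mae)
    mlflow.sklearn.log_model(model, "model")
```

In a production pipeline, this script would run as one automated step, triggered by new data or code changes, with the logged metrics feeding the monitoring and model-drift checks described above.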