Mon Sep 24 2018

Data security inspired by the Apache Hadoop environment

In this blog, I’ll tell you about security in the Apache Hadoop environment; a subject I have given a tech talk about earlier at Xomnia. The Apache Hadoop environment is made up of all the components that are associated with Apache Hadoop. You have probably heard of the Apache Hadoop distributed file system, but  Apache Hadoop is more than that: Apache Hadoop has brought us the fundamentals for the world of Big Data.

‘Big data’ is a buzzword that came up to describe data that is too large to be handled and stored on one server. All kinds of solutions were invented to tackle this problem. Data analytics tools like Apache Spark and Flink were developed to analyze large quantities of data. These data tools and the tools used to store the data are all heavily tested on performance and reliability. More and more companies have begun using the tools and the development of the Apache Hadoop environment gained more traction.

One thing, however, kept lagging behind: security. Security in the computer world is often a lower priority. Security is nevertheless important to protect, for example, our privacy. With the recent introduction of the GDPR, companies are forced to pay more attention to security. In this blog, I will first give an introduction to the Hadoop environment, and then introduce security methods and a new concept I have developed.

Apache Hadoop environment

The Apache Hadoop environment in short consists of three important layers, as shown in the figure below. Let’s start with the storage and messaging layer: storing big data is a challenge, and several database systems have been developed to store this data. There are a column, key-value, graph, in-memory, document and object databases; even blockchain is a distributed database. To communicate with these databases, there are messaging services, which act as message brokers to distribute the messages coming in. An Example is Apache Kafka.

These message brokers are important to distribute the data coming in across to several servers and to perhaps even store the messages before they are ready to be handled. In order to handle the messages, computing and memory power is required. This is organized by resource managers such as Apache Hadoop YARN.

The messages are then analyzed by big data analytics tools such as Apache Spark and Flink. These tools can get value out of raw data. An example: if you are flying, it can predict based on the itinerary whether your luggage will make it to the next airplane or not. Another example: if you are visiting a video content website, such as NPO Start, then based on the videos you watch it can help you to find new videos you want to watch.


Two important mechanisms in big data security are MIT Kerberos (Kerberos) and Transport Layer Security (TLS). TLS is used to set up a secure connection between two parties. Secure connections are important to stop people from eavesdropping.

A chain of trust is important in public encryption. The root can give intermediate certificates. The certificate authority can make a certificate for the client with the intermediate and root certificate. This creates a chain of trust. The intermediate trusts the client. This all can be verified by verifying  the chain; the chain of trust.

By using this chain of trust, we can establish a connection. Here the client sends its client certificate, which is made by the same certificate authority that the root and intermediate certificates are made by. When the certificate arrives at the server, it gets verified. The server looks in its store and finds the intermediate and root certificate given by the same authority. This way, the chain of trust can be established. After the client is trusted by the server, a secret is exchanged by which the connection is established.

Kerberos is a trusted third party service authentication mechanism. It is developed so that a user only needs to sign in once and can be accurately identified at multiple services. This happens via keytabs, which make single sign-on possible. A keytab is generated when a client signs in with its password. With the keytab, a ticket can be obtained at Kerberos, which can then execute a task. Kerberos is important because its job is to accurately identify clients, which is important if you want to only give specific users access to an application. The ticket can be used to, for example, launch  a Spark job. A Spark job is launched by data engineers and data scientists to analyze the data. A Spark job is orchestrated by a Spark master and executed by Spark workers.

Kerberos is an important service used by many companies, however, it also has a couple of disadvantages. One of those is that it’s not scalable. As soon as your company hits several thousands of employees, the Kerberos server can become a bottleneck. Kerberos was developed in the 1970s; in the seventies, with local area networks, it was not anticipated that people would have the internet over which they would like to authenticate. So back in these days, the assumption of only several thousands of users was right.

Over the years, some other problems came along, such as legacy risks forthcoming from harder to install configurations, which are required to deal with the newest developments. Of course, you can imagine that if you get the configuration wrong, this might leave one vulnerable to an attack. Kerberos requires users to log in with a password. It would be beneficial if multiple authentication factors could be used because one password might easily be captured by an intruder. In Kerberos, it turns out that multi-factor-authentication is hard to set up. An intruder might forge the certificate by which he can gain the same authentication of the user he has forged the certificate from. Such an attack is better known as  the ‘Golden Ticket’ attack. To overcome these issues, new mechanisms and services are wanted by developers.

New developments: BLESS

In searching for a new solution, inspiration can often be gained from the newest security developments. An example of such a development is Bastion’s Lambda Ephemeral SSH Service (BLESS). BLESS has been developed by Netflix to improve the authentication and authorization of developers in using their SSH services. The service is developed to mitigate the vulnerability that comes from developers having to store their SSH keys on their computers. With these SSH keys, the user can gain access to SSH services without a system verifying whether they are really allowed to enter.

In this concept, the user enters over a Bastion. Over this Bastion, the user can ask for access to the SSH services they need. The user provides his credentials, which are then sent to the authorization/authentication service. This server can also be called BLESS, and the user is then verified. If the user is authenticated correctly, then a certificate is generated. This certificate is generated with a private key located on the BLESS server. The certificate is short-lived (ephemeral), which means that it can only be used for a short time to access SSH services.

An advantage of this concept is that all the components are scalable. A short time certificate has the advantage that the user will ask for new access if he wants to enter a new SSH service and his certificate expired. For the systems the user is already authenticated to, the connection remains open for either a certain time or until the user closes its connection. The process of providing the user with certificates, and the history of users accessing the SSH instances, is logged by a logging system. This logging system contains all the logging so that it is not scattered across all components.

A new concept

During my master thesis at the Fraunhofer Institute SIT in Darmstadt, Germany, I developed a new security concept inspired by the major security components of the Apache Hadoop environment. The concept has a number of requirements.

One of those is that all the logging must be in a single accessible place. This means that auditing is centralized. The certificates should be valid for a short time so that the user is more easily tracked through the system, and it is very hard for an intruder to forge the certificate. When the validity of a certificate cannot be denied, it means that so-called non-repudiation is achieved. Multi-factor authentication should be in place so that there is no longer only a simple password to enter the system. Multi-factor authentication consists of something you have, something you are, and something you possess. The authentication should be distributed, to make scaling of the service possible.

So how does the new process work? We will use an example with Apache Spark, which as you remember is a processing engine to analyze data. The process is as follows:

  1. The user sends the authentication data and the job he wants to execute to the Spark master.
  2. The Spark master then contacts a database to verify that the user entered the  correct authentication details and whether he is authorized to execute the job.
  3. The attempt of the user is logged.
  4. The Spark master connects a certificate authority to get the certificate for this specific client. This client certificate is used together with the intermediate and root certificate already located in the Spark master.
  5. The workers are initialized and have the root and intermediate certificate as well.
  6. With the certificates now a connection can be set up with all Spark workers and  Masters. With the connection established, the job is executed.
  7. The result from the job is returned to the client.

An overview is found in the picture here below.

This process is an improvement because Kerberos is replaced by a database which can handle multiple authentication factors. The certificate authority is used smartly to establish a new connection with a certificate that is only valid for a short time. There is a lot of logging to a central place as you can see in the picture depicted above. This means that our requirements are fulfilled. This new concept seems promising to me, and I’d really like to develop it further so that on one day we can really improve big data security.

ARES conference

On the 30th of August, I presented this new concept at the FARES workshop of the ARES conference in Hamburg, Germany. ARES hosts many security experts in different fields and gave some fantastic insights into how the security market is developing. For example, there was an interesting  presentation about SCION, a next-generation network.

My presentation was important to test whether the concept I developed is feasible. Security experts that attended my talk seemed excited about the idea and said they would like to see it implemented. They are especially interested in how this implementation can give a better overview of who is still in the system. Furthermore, they are excited that the auditing information can be stored in a centralized place. The paper published at the conference can be found here.

For me, the goal now is to pursue this idea further. I am looking forward to see if we can make the world of big data a safer place.

Paul Velthuis
Data Engineer at Xomnia