Throughout my experience as a data engineer, I’ve noticed that most data engineers will opt for Redshift, the (familiar) AWS native solution, without thinking about giving its alternatives a chance.
Redshift is one of the first cloud data warehouse services. Compared to on-premise data center solutions, it is fast, cheap, and overall great. However, times change, and the year 2014 is well behind us, which really calls to consider more modern options too, like Amazon Athena
Like Amazon Redshift, Amazon Athena is a powerful tool for managing data in the Amazon Web Services (AWS) cloud. Both tools are popular choices for data warehousing and analytics, and each has its own unique strengths and capabilities.
In this blog, I will compare Athena and Redshift, and explore why Athena may be the superior choice in an AWS data platform. I will also compare Athena to other popular data warehousing solutions, including Google BigQuery, Azure Synapse Analytics, and Snowflake.
What is AWS Athena, and what is it used for?
Amazon Athena is a serverless query engine based on Presto that allows users to run SQL queries on data stored in S3. It is designed to be easy to use and supports popular SQL clients. Athena is ideal for running ad-hoc queries on large amounts of data, exploring and analyzing data without the overhead of setting up and maintaining a separate data warehouse.
Not setting up a separate data warehouse (DWH) is why AWS calls these “ad-hoc queries”. It supports both batch and streaming data sources, making it a good choice for querying constantly changing data. Athena uses the data catalogue created by AWS Glue to discover and access data stored in S3, allowing organizations to quickly and easily perform data analysis and gain insights from their data.
AWS Glue is an ETL (extract, transform, and load) service provided by AWS. It simplifies the process of discovering, categorizing, and cleaning data from various sources, such as S3 and relational databases, and makes it easier to integrate the data into data lakes and data warehouses. The service also enables users to define and enforce schema and data quality rules. AWS Glue provides a scalable and cost-effective way to prepare and transform large volumes of data for downstream processing and analysis.
Together, AWS Glue and Amazon Athena can be used to extract, transform, and load data from various sources into S3, and then run SQL queries on that data using Amazon Athena. While it's common to use Python for ETLs rather than AWS Glue due to cost and flexibility concerns, the Athena + Glue Data Catalog combination is still a good choice for users who need to run ad-hoc queries on large amounts of data or explore and analyze data without setting up a separate data warehouse.
How do I set up AWS Athena?
Accompanying this blog is a repository: Github: Athena vs Redshift. The code snippets mentioned here are from this repository, and you can copy the repository to try it for yourself.
Athena, as mentioned before, uses the data catalogue created by Glue to run. In the following example, I have set up an s3 bucket with data files. These data files are found in the repo, path ./data, and created using drawdata.xyz .
I wanted to show you the easiest use case, which is combining the results of similar data files into a single table. Using a Glue Crawler, we can automatically discover and update the table in the Glue catalogue:
The Glue crawler created above (in AWS CDK) has a number of parameters. It will have a construct name, and a role to use that is created in the same stack. In its configuration, you see that we’ve enabled a TableGroupingPolicy so that it groups (combines) compatible schemas. All tables will have a prefix, _crawler, and we exclude files that have the word input.
How do I access Athena?
Athena supports many database connections when you want to access it from your code. The cleanest way is to use boto3, but many reported success by using a normal database connector. For BI solutions, such as Redshift, connectors are available.
When the use-case is a simple query or exploration, you can access Athena via the AWS web console. If you’re wondering how to set up a user for Athena, the code for creating a role is in the accompanying repository. A user with that role should suffice. Otherwise, it is mostly a normal IAM user that you want to create.
How to set up my S3 bucket?
A word of caution about the Glue crawler and its table grouping policy (which is not particularly intelligent, and I’m hoping AWS updates this):
The exclusion here is done _before_ it combines the tables. Sadly, it does not exclude the tables being crawled. In one of my projects, we had a directory that contained Excel sheets and parquet files. I tried to combine them both, because the pattern matched only selected parquet files but the resulting schema just pointed to the folder on S3. The only way we found to fix this was to separate the files in S3 by putting the parquet files into another folder. We deprecate tables in the database if they are deleted, and update if they’re changed. Finally, we run this crawler every Sunday.
In conclusion, the way to set up your bucket is ideally to create separate folders within the bucket for different tables, and then group them together with a Glue crawler.
What is AWS Redshift?
Initially, AWS offered their database service (RDS), which was primarily designed for transactional workloads. However, this did not fit the need for (cloud) data warehouses, and as a result, AWS designed Redshift, which became generally available in 2013.
Amazon Redshift is a fully managed data warehouse service that allows users to store and query large amounts of data using SQL. Redshift is based on PostgreSQL, and supports both standard SQL and PostgreSQL-specific extensions. It is designed to be scalable, with support for scaling up and down as needed. Redshift has been a staple of AWS data platforms since its release, to the point where alternatives aren’t even considered by many teams.
Redshift is a good choice for users who need to store and query large amounts of data, and who need a high-performance data warehouse solution. When the workloads join many tables together, the “data warehouse service” power will show itself against the columnar storage method of Athena.
However, Redshift is not fit for ad-hoc requests. It is made to build data warehouses, to have elaborate data models and so forth. Another limitation is that its speed is bound by the compute power you reserve for it, although this does limit the total cost of using it (Athena, on the other hand, is paid by query amount).
Advantages of AWS Athena over AWS Redshift
Let’s run over a few of the advantages of AWS Athena:
- Cost: Athena is a serverless service, which means that users only pay for the queries they run. Redshift is a service that you set up, and you will pay for the uptime of that service.
- Simplicity: Athena is a serverless service, which means that users don't need to worry about setting up or maintaining separate compute resources as they would for Redshift.
- Data formats and sources: Athena and Redshift support a wide range of data formats and sources, including CSV, JSON, Parquet, and ORC, and can query data from both batch and streaming sources. However, Athena can directly query the data, whereas Redshift requires loading the data.
- Integration with AWS Services: Athena is fully integrated with other AWS services, such as AWS Glue, AWS Lake Formation, and AWS EMR. Redshift of course integrates into the AWS ecosystem but is more of a separate entity.
- Performance: Athena is based on Presto, which is a distributed SQL query engine that is designed for high performance. This means that Athena can run queries on large datasets in parallel, and can return results quickly. Redshift, on the other hand, is limited by the performance of a single node. An advantage that Redshift holds over Athena, however, is that it is optimized for complex business logic involving many joins. However, when these warehouse operations are not needed, Athena’s columnar engine and automatic scaling has an incredible speed advantage.
How does AWS Athena compare to other data warehousing solutions?
In addition to comparing Athena and Redshift, it's also important to consider how Athena compares to other popular data warehousing solutions, such as Google BigQuery, Azure Synapse Analytics, and Snowflake:
- Google BigQuery: BigQuery is a cloud-based data warehousing platform that allows users to store and query structured and semi-structured data using SQL. Like Athena, BigQuery is based on a distributed SQL query engine and is designed for high performance.
- Azure Synapse Analytics: Synapse Analytics is a cloud-based data warehousing and analytics platform that allows users to store and query structured and semi-structured data using SQL. Like Athena, Synapse Analytics is based on a distributed SQL query engine and is designed for high performance.
- Snowflake: Snowflake is a cloud-based data platform that allows users to store and query structured and semi-structured data using SQL. Like Athena, Snowflake is based on a distributed SQL query engine and is designed for high performance. However, Snowflake has some key differences from Athena. For example, Snowflake supports various forms of User Defined Functions (UDFs) and data governance trickery. This makes Snowflake a better choice than Athena for users who need custom functionality or tight data governance.
While Redshift and other data warehousing solutions may have their own strengths and capabilities, they may not be as well-suited to certain types of workloads or use cases as AWS Athena. For users who need to run ad-hoc queries on large amounts of data, or who need a simple and cost-effective data warehousing solution, Athena may be the superior choice in an AWS data platform.
Athena's serverless architecture, support for a wide range of data formats and sources, integration with other AWS services, and high performance make it an attractive option for users who need to manage and analyze data in the AWS cloud.
When comparing AWS Athena to other data warehousing solutions, such as BigQuery, Synapse Analytics, and Snowflake, we found that each tool has its own unique strengths and capabilities. Depending on your specific requirements and use case, one of these other tools may be a better fit than AWS Athena or Redshift.
It's important to carefully evaluate your needs and requirements, and to choose the data warehousing solution that best fits your specific needs. Xomnia can provide data architects and engineers with architectural help, in addition to our experience of implementing data platforms.
Note: OpenAI’s ChatGPT was used to generate some of the source material for this blog.