AWS EMR in FS: Presto vs Hive vs Spark SQL

Distributed SQL Query Engines for Big data like Hive, Presto, Impala and SparkSQL are gaining more prominence in the Financial Services space, especially for liquidity risk management. Maximum Cumulative Outflow is one of the key analysis techniques to measure liquidity risk. This analysis technique is used to analyze balance sheet maturities and generates cumulative net cash outflow by time period over a 5-year horizon. Maximum Cumulative Outflow analysis is usually dictated by strict SLA, hence most Financial Services Institutions leverage distributed SQL query engine for processing. As the data size grows over time, resources needed for processing also have to be bumped up proportionally to meet the SLA, and it is easier said than done in an on-premise environment where dynamic provisioning of resources on-demand may not be possible.

AWS EMR provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data across dynamically scalable Amazon EC2 instances. In this article, we’ll take a look at the performance difference between Hive, Presto, and SparkSQL on AWS EMR running a set of queries on Hive table stored in parquet format.

Distributed SQL Query Engines benchmarked: Hive (Map Reduce), SparkSQL (In-Memory), Presto (In-Memory)

Input Data Size: 150 GB

Staging Area: AWS S3

AWS EMR Instance Type: 1* Master Node & 3* Task Node — r5.8xlarge

Billing Mode: Spot Instance

Total Resources: 1 TB RAM, 128 vCPU

Table Format: Hive Table with Partitioning

Storage Format: Parquet

Sample Query 1: CFAGG

Sample Query 2: LDCFAGG

Benchmark Results

AWS QuickSight Visualization

Hive leverages MapReduce capabilities to perform distributed querying, while SparkSQL and Presto are in-memory processing distributed processing engines, so it is definitely unfair to compare Hive with SparkSQL and Presto.

Summary:

  1. Presto is consistently faster than Hive and SparkSQL for all the queries.
  2. Presto scales better than Hive and Spark for concurrent queries.
  3. For small queries Hive performs better than SparkSQL consistently.
  4. Increasing the number of joins generally increases query processing time.
  5. Increased query selectivity resulted in reduced query processing time
  6. JOIN operations between very large tables increased query processing time for all engines.
  7. As the number of joins increases, Presto and Spark SQL are more likely to perform best.

Each engine has its strengths: Presto’s and SparkSQL’s concurrency scaling support, SparkSQL’s handling of large joins, Hive’s consistency across multiple query types. Financial Services Institutions might consider leveraging different engines for different query patterns and use cases. As Hadoop matures, FSIs are starting to use this powerful platform to serve more diverse workloads. Hadoop is no longer just a batch-processing platform for data science and machine learning use cases — it has evolved into a multi-purpose data platform for operational reporting, exploratory analysis, and real-time decision support.

An AWS and Google Cloud certified Cloud Professional leading a team of 60+ world class engineers to deliver business value to enterprises across ASEAN