An Introduction to SQL on Hadoop and SQL off Hadoop
You no longer have to choose between betting the farm on one solution for all SQL access or have multiple hardware platforms for different use cases.
Join the DZone community and get the full member experience.Join For Free
Over time, the inexorable effects of data gravity increased the need for SQL on Hadoop, as SQL is the language of data. Initially, having any way to use SQL against the Hadoop data was the goal, but now there is an increasing requirement to connect business users with tools like Tableau to that data, and give them the performance they expect with high levels of concurrency.
Note that to meet this requirement, it is likely that users will need to have structured data stored in Hadoop (along with the original unstructured data), as good performance is more likely if a transformation is done once rather than per-query as noted in this DZone article.
There are a number of open-source solutions for SQL on Hadoop, including Hive (LLAP), Impala, SparkSQL, and Presto.
As most of these products are relatively young, there are still significant improvements being made, as covered in the later section on benchmarking. So if you can live with their functionality and performance today, you can expect things to improve over the next few years.
The greatest strength of these solutions is they were written from scratch for analyzing data in Hadoop. They were intended to run on Hadoop clusters from day one, and interoperate with the growing number of data formats in that ecosystem.
The greatest weakness of these solutions is they were written from scratch for analysing data in Hadoop. People often say good software takes 10 years, and that certainly applies to SQL products, particularly in the area of query optimisation — Hadoop itself is 10 years old, but most of the SQL on Hadoop products are much younger than that; any focus on real-time, high concurrency SQL on Hadoop is younger still.
That is why a lot of proprietary database products are built on the shoulders of giants. For example, here is a list of products that derive from PostgreSQL, including Greenplum, Netezza, ParAccel, Redshift, and Vertica. The resulting products have a great start in avoiding a lot of mistakes made in the past, particularly in areas such as query optimisation.
By contrast, those developing open-source SQL on Hadoop products from scratch have to learn and solve problems that were long-since addressed in other database products.
That is why promising projects like Presto are only starting to add a cost-based optimizer in 2017, and Impala cannot handle a significant number of TPC-DS queries (which is why Impala TPC-DS benchmarks tend to show less than 80 queries, rather than the full 99 from the query set - more on this in the benchmarking section later).
In addition, some open-source solutions are adopted by specific Hadoop distributions. So Hortonworks uses Hive LLAP, whilst Cloudera prefers Impala. Although the projects are open-source, if you are trying to e.g. get Kudu working on Hortonworks as part of Impala deployment, you may struggle as seen in this Hortonworks community topic.
There are proprietary alternatives for using SQL to query data in Hadoop.
Many of these allow you to run what TDWI call "SQL off Hadoop," requiring a separate platform for the SQL engine in addition to your Hadoop cluster. This is unattractive to many companies, as you have the cost of an additional platform, and the auditing concerns when moving data between different platforms. On the other hand, some might perceive benefits in isolating the SQL workload from their Hadoop cluster, and think the operational complexity and extra platform costs are a price worth paying.
Other products, such as Vertica, have an on-Hadoop offering that does not match their off Hadoop product. In Vertica's case, they advise using the Vertica Enterprise Edition when you need "to boost performance" by enabling optimizations such as projections which aren't available in their on-Hadoop offering.
Finally, some products such as Kognitio have been migrated from an off Hadoop product to run on Hadoop with no missing functionality or performance features.
One major impact of open source for SQL on Hadoop is that having a free-to-use product is now a basic requirement. Users expect to evaluate a fully functional version of the product with no limits on the scale or duration of that evaluation, at no cost. They accept that they will pay for support/consultancy if they decide to move into production with the product at a later date.
Everyone knows that vendors are good at constructing benchmarks which suit their own needs!
However, benchmarks are available as a starting point to judge the alternative SQL on Hadoop options for functionality and performance.
AtScale: This was the second time AtScale benchmarked SQL on Hadoop, and one clear message was the massive improvements made by the open-source offerings they tested. This is what one would expect with relatively new products, but was still a good sign. The other finding I noted was that products had strengths and weaknesses for different queries, suggesting this is not a One Size Fits All market. It should also be noted that the benchmark used the relatively small TPC-H query set, whereas the other benchmarks listed here use the newer and more comprehensive TPC-DS query set.
Comcast ran a benchmarking exercise with TPC-DS queries, comparing different SQL on Hadoop products. They are also the only benchmark to compare the performance of different file formats, so worth reading on that basis alone. It's also worth reading the detail on which of the TPC-DS queries were included (66 of the 99 TPC-DS queries), and the scoring mechanism for long-running/failing queries. There was also no concurrency testing (due to lack of time in their tests, given the number of combinations of product and file format that they considered) which would be essential for almost all real-world use cases.
Kognitio TPC-DS query set a benchmark, including all TPC-DS queries, and concurrency testing. There is more detail on how the benchmark was run, and the per-query results here. Although the benchmark linked here does not include Hive LLAP and Presto, there will be further Kognitio blog posts on these products — without giving everything away, Presto seems comparable to SparkSQL in terms of performance and ease-of-use, whilst LLAP is performant, but not stable under concurrent load.
A number of common themes emerge from the benchmarking process:
Open-source products are improving significantly in terms of functionality and performance — both AtScale and Kognitio see these results (for example, Kognitio report the functionality improvements between Spark 1.6 and 2.0, and IBM fellow Berni Schiefer observed similar improvements starting with Spark 1.5).
Product immaturity for a number of the open-source products means they cannot run all the TPC-DS query set, either because they don't support the required syntax or they generate runtime errors.
Some products (particularly SparkSQL and Presto) need significant tuning for concurrent performance. This was observed by Kognitio and in the Berni Schiefer article mentioned above.
Hive by itself is very slow (highlighted in particular by Comcast). Hive LLAP is a significant improvement, although AtScale still ranked it behind Presto and Impala.
Possible reasons for current open source SQL on Hadoop products not being as performant as some proprietary offerings include:
Overhead of starting and stopping processes for interactive workloads. To run relatively simple queries quickly, you need to reduce latency. If you have a lot of overhead for starting and stopping containers to run tasks, that is a big impediment to interactive usage, even if the actual processing is very efficient
Product immaturity. See the earlier commentary on building from scratch, rather than leveraging years of experience built into existing products.
Evolution from batch processing. If a product like Hive starts off based on Map-Reduce, its developers won't start working on incremental improvements to latency, as they won't have any effect. Similarly, if Hive is then adopted for a lot of batch processing, there is less incentive to work on reducing latency. Hive 2 with LLAP project aims to improve matters in this area, but in benchmarks such as this AtScale referenced earlier, it still lags behind Impala and SparkSQL.
Of course, the best benchmark is one that represents your intended workload, bearing in mind future as well as initial requirements, so you should always conduct your own functional and performance testing rather than relying on benchmarks from vendors or other third parties.
No More "One Size Fits All"
As long as you use SQL on Hadoop solutions which run on your Hadoop cluster, you can use the right tool for the job. If you had already have Hive for ELT and other batch processing, but it can't meet your needs for connecting a community of business users to data in Hadoop, you can utilize a different SQL on Hadoop solution for that task, and keep your current Hive workload in place.
You no longer have to choose between betting the farm on one solution for all SQL access or have multiple hardware platforms for different use cases, with the performance and audit trail pains of having to move data between those platforms.
As I work for a company producing an SQL on Hadoop product, I recuse myself from recommending one product. I'd expect most users to try a product from their Hadoop distribution, then look at free-to-use alternatives as and when they find use cases that product cannot handle.
Opinions expressed by DZone contributors are their own.