Hadoop Distributions: Past, Present, and Future
In a world where open-source software can avoid vendor lock-in, are major Hadoop distributors discarding some of that benefit to the detriment of Hadoop users?
Apache Hadoop seems like a common platform for developers and users alike, but reality dictates that a lot of people want a distribution to help them navigate the Hadoop ecosystem. This is comparable to the earlier evolution of other open source offerings, such as Linux distributions.
Distributions
To that end, a number of alternative Hadoop distributions sprang up, with Cloudera, Hortonworks, MapR, IBM, Intel, and Pivotal being the leading contenders.
Recently, that list has shrunk to Cloudera, Hortonworks, and MapR:
- Intel ditched its Hadoop distribution and backed Cloudera in 2014.
- Pivotal switched to reselling Hortonworks Data Platform (HDP) last year, having earlier moved Pivotal HD to the ODPi specs, then outsourced support to Hortonworks, then open-sourced all its proprietary components, as discussed here.
- IBM announced at the recent DataWorks Summit in San Jose that it would stop shipping its own Hadoop distribution and partner with Hortonworks instead, as described by Gartner.
Market Share
Market share figures are difficult to find, although in 2016 a DeZyre article quoted the figures below, which on further investigation appear to originate from Cloudera:
- Cloudera: 53%
- Hortonworks: 16%
- MapR: 11%
Given the IBM and Pivotal moves mentioned above, one might expect the Hortonworks share to have increased significantly since then.
Divergence
There is a significant, and increasing, divergence between what the remaining distributions offer, as shown in Table 1 below.
This brings a number of challenges for users and developers alike:
- It is difficult to develop and document products and solutions that work across all distributions.
- It is challenging to move from one distribution to another, and increasingly so as distribution-specific projects become more heavily adopted. We have seen, for example, users moving from Cloudera to Hortonworks struggle to migrate their existing Impala workloads, even though Impala is an Apache project. This is not lost on the distributors, of course.
This divergence between distributions makes it increasingly important to choose your Hadoop distribution carefully when planning a deployment, as the decision is hard to reverse later.
It is questionable whether the current divergence between distributions is serving end users and developers well. Open-source competition could, for example, produce the best authentication solution; alternatively, it may simply be making adoption harder by offering several roughly equivalent options and weakening each of them without producing a clear winner.
| Function | Hortonworks | Cloudera | MapR |
| --- | --- | --- | --- |
| File system | HDFS | HDFS | MapR-FS |
| Mutable column-based disk storage | N/A | Kudu | N/A |
| Authentication and authorization | Ranger | Sentry | Sentry |
| SQL | Hive LLAP / HDB / SparkSQL / Big SQL | Impala | Drill |
| NoSQL | HBase | HBase | MapR-DB |
| Management | Ambari | Cloudera Manager | MapR Control System (MCS) |
| Data governance | Atlas | Cloudera Navigator | N/A |
| Data streaming | Kafka | Kafka | MapR Streams |
| Managing cloud clusters | Cloudbreak | Cloudera Director | N/A |
| Cyber security/threat detection | Metron | Open Network Insight | N/A |
Table 1: Divergence between Hadoop distributions.
Note that Apache projects like Kudu can be deployed on, for example, Hortonworks, but they aren't first-class citizens in the same way as they are on Cloudera, as you can see from this Hortonworks Community topic. Likewise, it should be possible to use Metron on Cloudera or other distributions, but if you search for information on doing this, you are most likely to be brought back to Hortonworks.
Originally, the differences between Hortonworks and Cloudera were confined to management tools, but more recently, these two distributors have chosen to back different open-source projects in a variety of areas, creating more lock-in to their distributions.
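To make the migration problem concrete, here is a minimal Python sketch that encodes Table 1 as data and lists the functions where two distributions ship different components. The dictionary layout and the helper name `divergent_functions` are illustrative assumptions, not part of any distribution's tooling, and the entries simply mirror the table rather than any vendor's current product list.

```python
# Table 1 encoded as data: function -> component shipped by each distribution.
# None stands for "N/A" in the table. Names and layout are illustrative only.
TABLE_1 = {
    "File system": {"Hortonworks": "HDFS", "Cloudera": "HDFS", "MapR": "MapR-FS"},
    "Mutable column-based disk storage": {"Hortonworks": None, "Cloudera": "Kudu", "MapR": None},
    "Authentication and authorization": {"Hortonworks": "Ranger", "Cloudera": "Sentry", "MapR": "Sentry"},
    "SQL": {"Hortonworks": "Hive LLAP / HDB / SparkSQL / Big SQL", "Cloudera": "Impala", "MapR": "Drill"},
    "NoSQL": {"Hortonworks": "HBase", "Cloudera": "HBase", "MapR": "MapR-DB"},
    "Management": {"Hortonworks": "Ambari", "Cloudera": "Cloudera Manager", "MapR": "MapR Control System (MCS)"},
    "Data governance": {"Hortonworks": "Atlas", "Cloudera": "Cloudera Navigator", "MapR": None},
    "Data streaming": {"Hortonworks": "Kafka", "Cloudera": "Kafka", "MapR": "MapR Streams"},
    "Managing cloud clusters": {"Hortonworks": "Cloudbreak", "Cloudera": "Cloudera Director", "MapR": None},
    "Cyber security/threat detection": {"Hortonworks": "Metron", "Cloudera": "Open Network Insight", "MapR": None},
}


def divergent_functions(source, target, table=TABLE_1):
    """Return {function: (source_component, target_component)} where the two differ."""
    return {
        function: (components[source], components[target])
        for function, components in table.items()
        if components[source] != components[target]
    }


if __name__ == "__main__":
    # List every area that would need attention in a Cloudera -> Hortonworks move.
    for function, (old, new) in divergent_functions("Cloudera", "Hortonworks").items():
        print(f"{function}: {old or 'N/A'} -> {new or 'N/A'}")
```

Going by the table as given, a move from Cloudera to Hortonworks touches seven of the ten functions listed, including the SQL layer (Impala), and only the file system, NoSQL, and data streaming rows are shared.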
Cloud Providers
The elephant in the room (no pun intended) is another Hadoop distribution not mentioned earlier. Amazon Elastic MapReduce (EMR) is a cloud-based Hadoop option available on demand. It reduces the need for in-house expertise in managing a cluster of servers and for extensive internal Hadoop skills to administer the cluster.
You can run other Hadoop distributions on Amazon as well, of course, but that might be hard to justify on price (the linked content shows EMR coming out at 25% of the cost of Cloudera on EC2) as well as complexity.
Are Hortonworks, Cloudera, and MapR busy trying to take market share from each other, whilst the biggest threat to all three is really Amazon and the other major cloud providers?
Conclusion
In a world where open-source software is seen as a way to avoid vendor lock-in, are the major Hadoop distributors discarding some of that benefit in their battle for market share to the detriment of those trying to produce solutions on Hadoop or use it within their organization?
Will the market demand a clear winner in such an environment, rather than a number of broadly equivalent projects competing in different areas? Or will developers and users follow the inexorable migration of data to the cloud and simply adopt the Hadoop (or other) offerings from Amazon and other cloud providers? What do you think?
Opinions expressed by DZone contributors are their own.