Big Data Is Growing and Apache Hadoop Is Legion
Apache Spark, Apache Hadoop, and Apache Kafka are trending up together since they're a bundle of awesome big data services and projects.
Some pundits who have no experience at major corporations and haven't written big data code in the last five years may have delusions of Apache Hadoop shrinking or vanishing into the cloud or into some imagined Apache Spark ecosphere.
This is beyond wrong.
Apache Hadoop is evolving to the point where people don't even need to mention it by name. It's an "everyone" platform that is taken for granted. Most major players have adopted it and have been running it. Many are moving all of their legacy data to Apache Hadoop and sunsetting dozens of systems from proprietary data warehouses, legacy relational databases, failed weird NoSQL stores, and a mishmash of various data sources.
Apache Hadoop and Apache Spark are part of the Apache big data environment, and they work together like peanut butter and jelly. There's really no good reason not to run Apache Spark on top of YARN in Apache Hadoop. You have the powerful nodes, right near the data they need. Apache Spark SQL is great, but by using the Apache Hive context, you get catalogs and access to all your Apache Hive tables. By running Apache Spark inside Apache Hadoop, you get the advantage of row- and column-level control with Apache Ranger.
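To make the Spark-on-YARN point concrete, here is a minimal sketch of how a submission command might be assembled. The helper name `build_spark_submit`, the script name `etl_job.py`, and the `analytics` queue are hypothetical. The flags are standard `spark-submit` options, and `spark.sql.catalogImplementation=hive` is, to my knowledge, the setting that the Hive-enabled session turns on. Treat it as an illustration, not a tuned production command.

```python
import shlex


def build_spark_submit(app_path, queue="default", deploy_mode="cluster"):
    """Return the argv list for submitting a Spark app to a YARN cluster.

    Hypothetical helper for illustration; it only builds the command,
    it does not run it.
    """
    if deploy_mode not in ("cluster", "client"):
        raise ValueError("deploy_mode must be 'cluster' or 'client'")
    return [
        "spark-submit",
        "--master", "yarn",            # let YARN schedule the executors
        "--deploy-mode", deploy_mode,  # where the driver runs
        "--queue", queue,              # YARN execution queue (assumed name)
        # Use the Hive metastore as the Spark SQL catalog
        "--conf", "spark.sql.catalogImplementation=hive",
        app_path,
    ]


# Assumed script and queue names, purely for the example
cmd = build_spark_submit("etl_job.py", queue="analytics")
print(shlex.join(cmd))
```

Running on YARN this way puts the job under the cluster's queues, so it shares capacity with the other workloads on the platform instead of running on its own island.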
Apache Spark is a popular execution engine that plugs in well to Apache Hadoop. But so do Apache Storm, Apache Flink, Apache Apex, and dozens more. Fortunately, Google contributed Apache Beam, a unified programming model that can run on several of these engines, to help consolidate this execution engine sprawl.
Running Apache Spark without Apache Hadoop is perhaps okay for ephemeral data science purposes, but even then I don't think so. Security, data governance, users, groups, execution queues, data catalog, data model management, machine learning model management, and dozens of other real concerns for real enterprise users require more than just Apache Spark. Apache Spark wasn't designed to replace Hadoop since it has no storage. Compute and storage need to work together for real-world applications. You need to run lots of batch and streaming workloads on top of a cluster as well as store petabytes of data. This same environment allows for deep learning, machine learning, IoT, computer vision, and all other big data concerns to be addressed and run at scale.
Apache NiFi also makes Apache Hadoop the core place to store and retrieve all the data you need for all the IoT, mobile, AI, and "real-time" applications that enterprises need.
For amateur developers, maybe you can just run Apache Spark and Apache NiFi on your desktop and not use Apache Hadoop. You would be losing out on things like Apache Zeppelin for notebooks to easily run and develop machine learning and data federation applications.
One has to remember that Apache Hadoop is not one thing — it's a platform of tools, libraries, and services integrated together for NoSQL, SQL, batch, streaming, storage, and many other purposes.
Apache Hadoop is now in people's on-premises data centers, multiple clouds, and hybrid combinations of the two. Apache Hadoop is inside Azure HDInsight, Hortonworks Data Cloud in Amazon, Hortonworks CloudBreak for Every Cloud... it's hard to avoid Apache Hadoop.
Apache Hadoop may not look like the MapReduce-only platform of old. It's now a multifaceted distributed compute and storage platform that includes streaming, NoSQL, real-time SQL, batch SQL, batch jobs, Apache Spark jobs, deep learning, machine learning, messaging, IoT, and more.
Apache Hadoop is far from dead; Apache Hadoop is legion. Perhaps MapReduce is on the way out, as most services are running on Apache Tez, Apache Spark, and other engines inside the Apache Hadoop big data platform. The projects listed below could exist as services on their own but, as part of an integrated platform, they become incredibly powerful and easy to use.
Let's not forget some of the projects:
- Apache Hive (this is the SQL you are looking for)
- Apache Spark
- Apache HBase
- Apache Phoenix
- Apache Atlas
- Apache Ranger
- Apache Storm
- Apache Accumulo
- Apache Pig
- Apache Sqoop
- Apache Superset
- Apache NiFi
- Apache Kafka
- Apache Knox
- Hortonworks Streaming Analytics Manager
- Hortonworks Schema Registry
- IBM BigSQL
- Apache HAWQ
- Apache Calcite
- Apache Ambari
- Apache Oozie
- Apache ZooKeeper
- Apache Zeppelin
- IBM DSX
These projects all have huge ecosystems and a large number of users. When we factor all of this together, Apache Hadoop is huge and growing. If we look at Google Trends, we see that Apache Spark, Apache Hadoop, and Apache Kafka trend together since they should be thought of as a bundle of awesome big data services and projects.