Big Data is exploding and new projects are springing up daily from companies all over the world.
The good news is that all the technology is open source and available for you to start adopting today.
- Hadoop - Solid, enterprise-strength, and the basis for everything else. You need YARN, HDFS, and the Hadoop infrastructure to be your primary data store and to run your key Big Data servers and applications.
- Spark - Easy to use, supporting all the important Big Data languages (Scala, Python, Java, R), with a huge ecosystem, rapid growth, and straightforward support for micro-batching, batch processing, and SQL. This is another no-brainer.
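To give a flavor of why Spark is easy to use, here is a minimal Scala sketch of its DataFrame and SQL APIs, assuming Spark 2.x on the classpath; the sample data and column names are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

object SparkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sketch")
      .master("local[*]")   // local mode here; on a cluster YARN would manage resources
      .getOrCreate()
    import spark.implicits._

    // Build a small DataFrame and query it both with the API and with plain SQL.
    val events = Seq(("click", 3), ("view", 10), ("click", 7)).toDF("event", "count")
    events.groupBy("event").sum("count").show()

    events.createOrReplaceTempView("events")
    spark.sql("SELECT event, SUM(count) AS total FROM events GROUP BY event").show()

    spark.stop()
  }
}
```

The same DataFrame code runs unchanged whether the master is `local[*]` or a YARN cluster, which is a big part of Spark's appeal.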
- NiFi - The tool out of the NSA that allows easy data ingestion, storage, and processing from many sources with minimal coding and a slick UI. It supports dozens of sources: social media, JMS, NoSQL, SQL, REST/JSON feeds, AMQP, SQS, FTP, Flume, ElasticSearch, S3, MongoDB, Splunk, email, HBase, Hive, HDFS, Azure Event Hub, Kafka, and more. If there isn't a source or sink you need, it's straightforward Java code to write your own Processor. Another great Apache project for your toolbox - this is the Swiss Army knife of Big Data tools.
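As a rough idea of what "write your own Processor" looks like, here is a hedged Scala sketch against the `nifi-api`, assuming it is on the classpath. Real processors also carry annotations (`@Tags`, `@CapabilityDescription`) and property descriptors, omitted here, and the processor class name is made up:

```scala
import java.util.Collections
import org.apache.nifi.processor.{AbstractProcessor, ProcessContext, ProcessSession, Relationship}

class TagFlowFileProcessor extends AbstractProcessor {
  // A single outgoing relationship for FlowFiles this processor has handled.
  val RelSuccess: Relationship = new Relationship.Builder()
    .name("success")
    .description("FlowFiles tagged by this processor")
    .build()

  override def getRelationships: java.util.Set[Relationship] =
    Collections.singleton(RelSuccess)

  // Called on each scheduled run: take a FlowFile, stamp an attribute, route it on.
  override def onTrigger(context: ProcessContext, session: ProcessSession): Unit = {
    var flowFile = session.get()
    if (flowFile != null) {
      flowFile = session.putAttribute(flowFile, "tagged.by", "TagFlowFileProcessor")
      session.transfer(flowFile, RelSuccess)
    }
  }
}
```

Packaged as a NAR and dropped into NiFi's lib directory, a processor like this shows up in the UI alongside the built-in ones.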
- Apache Hive 2.1 - Apache Hive has been the SQL solution on Hadoop forever. With the latest release, performance and feature enhancements keep Hive the solution for SQL on Big Data.
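Applications typically reach Hive through HiveServer2's JDBC endpoint. A minimal Scala sketch, assuming the `hive-jdbc` driver is on the classpath; the host, credentials, and table are placeholders:

```scala
import java.sql.DriverManager

object HiveQuerySketch {
  def main(args: Array[String]): Unit = {
    // Standard JDBC against HiveServer2 (default port 10000).
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://hive-host:10000/default", "user", "")
    try {
      val rs = conn.createStatement().executeQuery(
        "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")
      while (rs.next()) println(s"${rs.getString("page")}\t${rs.getLong("hits")}")
    } finally conn.close()
  }
}
```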
- Kafka - The choice for asynchronous, distributed messaging between Big Data systems, and it comes baked into most stacks. From Spark to NiFi to third-party tools to Java to Scala, it is great glue between systems. This needs to be in your stack.
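The producer side of that glue is a few lines of code. A hedged Scala sketch of the Kafka producer API, assuming `kafka-clients` on the classpath and a broker at `localhost:9092`; the topic name and payload are made up:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object KafkaGlueSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // Asynchronous send; the broker decouples this producer from any downstream consumer.
    producer.send(new ProducerRecord[String, String]("events", "user-42", """{"action":"click"}"""))
    producer.close()
  }
}
```

Any consumer - a Spark Streaming job, a NiFi ConsumeKafka processor, a plain Java service - can then pick the messages up on its own schedule.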
- Phoenix on HBase - HBase is open source's BigTable, with tons of companies working on it and making it scale huge: NoSQL backed by HDFS and well integrated with all the tools. The steadily maturing Phoenix layer on top is making this the go-to for NoSQL, adding SQL, JDBC, OLTP, and operational analytics to HBase.
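What "adding SQL and JDBC to HBase" looks like in practice - a minimal Scala sketch, assuming the Phoenix client jar is on the classpath; the ZooKeeper quorum and table are placeholders:

```scala
import java.sql.DriverManager

object PhoenixSketch {
  def main(args: Array[String]): Unit = {
    // Phoenix connects through the ZooKeeper quorum of the HBase cluster.
    val conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181")
    val stmt = conn.createStatement()
    stmt.execute(
      "CREATE TABLE IF NOT EXISTS metrics (host VARCHAR NOT NULL PRIMARY KEY, cpu DOUBLE)")

    // Phoenix uses UPSERT rather than INSERT, and commits are explicit by default.
    stmt.executeUpdate("UPSERT INTO metrics VALUES ('web01', 0.75)")
    conn.commit()

    val rs = stmt.executeQuery("SELECT host, cpu FROM metrics")
    while (rs.next()) println(s"${rs.getString(1)} -> ${rs.getDouble(2)}")
    conn.close()
  }
}
```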
- Zeppelin - An easy, integrated notebook tool for working with Hive, Spark, SQL, Shell, Scala, Python, and a ton of other data exploration and machine learning tools. It's very easy to work with and a great way to explore and query data. The tool is gaining support and features; it just needs to up its charting and mapping.
- Sparkling Water - H2O fills the gaps in Spark's machine learning and just works. It does all the machine learning you need.
- Apache Beam - The unified framework for data processing pipeline development in Java. A Beam pipeline can run on Spark and Flink alike; as other frameworks come online, you won't have to learn too many of them.
- Stanford CoreNLP - Natural Language Processing is huge and still growing, and Stanford keeps improving its framework.
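CoreNLP's annotation pipeline is easy to drive from the JVM languages above. A hedged Scala sketch, assuming the `stanford-corenlp` jar and its English models are on the classpath; the sample sentence is made up:

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.ling.CoreAnnotations

object CoreNlpSketch {
  def main(args: Array[String]): Unit = {
    // Configure which annotators run, in order.
    val props = new Properties()
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma")
    val pipeline = new StanfordCoreNLP(props)

    val doc = new Annotation("Hadoop stores the data and Spark crunches it.")
    pipeline.annotate(doc)

    // Print each token with its part-of-speech tag.
    for (token <- doc.get(classOf[CoreAnnotations.TokensAnnotation]).asScala)
      println(s"${token.word()} / ${token.tag()}")
  }
}
```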
Obviously, there is a huge set of Big Data projects, so your best option is to start with a base distribution that incorporates and tests the various versions of the projects and ensures they work together smoothly, with security and management. I recommend using Hortonworks Connected Data Platforms as your base. There are a few more projects I would add if we were doing a top 20, notably Storm, SOLR, Apache Oozie, and Apache HAWQ. There's also a lot of great technology underneath that, for the most part, you don't see or know about, like Apache Tez (though you need to configure it when running Hive), Apache Calcite, Apache Slider, Apache ZooKeeper, and Livy. These projects are essential for running a Big Data infrastructure.
Interesting frameworks and tools to evaluate: