This week we have SQL, Deep Learning, NiFi, big clusters, and more.
Leveraging Smart Meter Data for Electric Utilities with Spark SQL and Hive. The speed improvements in Spark 2.0 enable Hitachi to process data from 10,000,000 smart meters in only 30 minutes on four nodes.
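To put those numbers in perspective, here is a quick back-of-the-envelope calculation. It uses only the three figures quoted above and assumes the work is spread evenly across the four nodes, which the talk summary does not actually state. The function name is made up for illustration.

```python
# Figures quoted in the summary above.
METERS = 10_000_000
MINUTES = 30
NODES = 4


def throughput(meters: int, minutes: float, nodes: int) -> dict:
    """Rough throughput, ASSUMING an even split of meters across nodes."""
    seconds = minutes * 60
    per_second = meters / seconds
    return {
        "meters_per_second": per_second,
        "meters_per_second_per_node": per_second / nodes,
        "meters_per_node": meters / nodes,
    }


if __name__ == "__main__":
    for name, value in throughput(METERS, MINUTES, NODES).items():
        print(f"{name}: {value:,.0f}")
```

Under that even-split assumption, each node handles about 2.5 million meters, or roughly 1,400 meters per second.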
Networking for Large Scale Hadoop Clusters. Yahoo Japan faces special difficulties because of its amazingly large Hadoop cluster. They are using an IP Clos network, a design also used by other mega Internet companies like Google, Facebook, and Amazon. Yahoo Japan runs 75 petabytes with 10Gb NICs. They are researching the new HDFS erasure coding for future clusters.
Coca-Cola East Japan uses Hadoop. They run at a huge scale, adding 5GB of data and processing over 300GB in memory for one project. They have many data sets from Salesforce, SAP, Teradata, SQL Server, MySQL, Oracle, flat files, and others that they move via Apache NiFi.
Why is My Hadoop* Cluster Slow. A nice overview of the new metrics, monitoring, logging, tracing, and analytics available for investigating cluster health.
SQL 2011 Support in Hive. Hive keeps adding features from SQL:2011 and is nearing full support of the standard, including UNION, INTERSECT, and more complete set operations. This will allow more users, tools, and analytics to run directly on top of data in Hadoop clusters.
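For anyone who has not used these operators, here is a minimal sketch of what they do. It runs against an in-memory SQLite database from the Python standard library as a stand-in, not against Hive, so exact Hive syntax and version support may differ. The table and column names are hypothetical, and EXCEPT is included only as another standard set operator; the summary above does not say which Hive release supports it.

```python
import sqlite3


def set_operation_demo() -> dict:
    """Run UNION, INTERSECT, and EXCEPT on two tiny hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE web_users (user_id INTEGER);
        CREATE TABLE app_users (user_id INTEGER);
        INSERT INTO web_users VALUES (1), (2), (3);
        INSERT INTO app_users VALUES (2), (3), (4);
        """
    )
    queries = {
        # Rows that appear in either table, duplicates removed.
        "union": "SELECT user_id FROM web_users UNION "
                 "SELECT user_id FROM app_users ORDER BY user_id",
        # Rows that appear in both tables.
        "intersect": "SELECT user_id FROM web_users INTERSECT "
                     "SELECT user_id FROM app_users ORDER BY user_id",
        # Rows in the first table that are not in the second.
        "except": "SELECT user_id FROM web_users EXCEPT "
                  "SELECT user_id FROM app_users ORDER BY user_id",
    }
    results = {
        name: [row[0] for row in conn.execute(sql)]
        for name, sql in queries.items()
    }
    conn.close()
    return results


if __name__ == "__main__":
    for name, rows in set_operation_demo().items():
        print(name, rows)
```

Without these operators, the same results usually require joins or subqueries, which is one reason tools that generate standard SQL benefit from them.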
Streamlining Hadoop DevOps with Apache Ambari. DevOps in Big Data has always been hard, but the new updates to Ambari, including Grafana dashboards, enhanced metrics, monitoring views, alerts, rolling updates, management packs, and blueprints, make it easier.
Speaking of Apache NiFi, here is a quick tutorial on integrating Apache Pig scripts into your Big Data pipeline via the Swiss Army knife processor (ExecuteProcess). This integration is like peanut butter (Pig) and chocolate (NiFi): alone they are good, but together they are greater than the sum of their parts. Yet another use case for NiFi.
Three Quick NiFi Tutorials
Another interesting Big Data development coming from Asia is the Paddle Paddle library. Parallel Distributed Deep Learning comes from Baidu scientists and is yet another awesome Deep Learning library. As if it needs to be said, it supports Python and C++. It seems only DeepLearning4J is really pushing a mega JVM Deep Learning library. TensorFrames is just a wrapper, not a JVM candy bar.
Paddle Paddle Deep Learning Library
Apache Beam for unifying batch and streaming APIs across Flink, Spark, Google Cloud Dataflow, and more. One API that everyone agrees on and works with would be a very nice way to consolidate all the options out there and make it easier to try different solutions for specific problems.
There is a lot going on in the world of Big Data, Hadoop, Spark, IoT, Deep Learning, and Machine Learning. I'll be adding another article soon on interesting developments and some information from talks I've had with SAP and others this week.
Another cool library enters the Apache Incubator: Hivemall (GitHub, Wiki, Docs, Intro), and there is a nice presentation on it. It is an ML library in SQL, built on Hive UDFs, that can be used with Spark SQL, Pig, and Hive. A great example to start with is the classic MovieLens.