During the week of February 10, 2017, a lot was going on in the world of big data around the Hadoop ecosystem, including updates in cloud, security, deep learning, visualization, and data warehousing at scale.
My two Apache Spark things of the week are this awesome article on Spark 2.1.0 with blacklisting and this GitHub project for a fluent Java API with Spark. It's a very cool way of calling the Spark REST server via a Spark client.
I can highly recommend checking out a deep learning article called Training a Deep Learning Model to Steer a Car. The pace at which self-driving is advancing is amazing. Forecasts of 80-90% of all vehicles in the United States being autonomous in the next 10-15 years seem conservative. Uber, Google, Amazon, Tesla, Apple, all the major car manufacturers, and countless startups are working on this. Public courses are available, and the technology already feels almost commonplace. TensorFlow has a Dev Summit live streaming on February 15, 2017. Since it's Google's library and it streams via YouTube, the recording will be available after the live stream ends.
TensorFlow is trying to do everything including art! Magenta is really cool and I am hoping to work with it once I have enough horsepower and disk space to download it all. Combining deep learning and art is very exciting and a personal favorite use case.
Another deep learning library has gained some major focus: MXNet. Amazon is pushing it big time, and it has just entered the Apache Incubator. I have an article coming out tomorrow about using it to process images from a Raspberry Pi camera.
For businesses, data warehouses are a real need. For many, the scale and performance are adequate, but they are lacking in certain analytics, the speed of data ingest, and support for non-relational data sources like sensors, drones, social media, logs, and the like. It also doesn't hurt that the major enterprise data warehouse (EDW) vendors charge such high premiums for these offerings that companies are looking elsewhere. Previously, rolling an EDW on Hadoop took some effort and required working with a few moving parts. Hortonworks has partnered with AtScale and Syncsort to release a bundled solution that includes all the software, services, and support you need to build an affordable, massively scalable, forward-facing data warehouse. Gone is the heavy lifting and deep knowledge required to custom-build your own Hadoop data warehouse; in a few short weeks, you are up and running. The best part is that you can still leverage all the SQL skills and tools that your team has now.
I am a big fan of using Apache Zeppelin as a free data exploration and data visualization tool. The new version really ups the enterprise-level reporting features. I am very excited about Helium, which adds new visualizations and opens Zeppelin up to the community and third-party developers to contribute more charts and visualizations. You just need to install a JSON file. Can't get much easier than that. Here is one for adding bubble charts:
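As an illustrative sketch of the Helium package format (the artifact name and version below are assumptions, so check the package's own page for the exact values):

```json
{
  "type": "VISUALIZATION",
  "name": "zeppelin-bubblechart",
  "description": "Animated bubble chart visualization",
  "license": "Apache-2.0",
  "artifact": "zeppelin-bubblechart@0.0.4",
  "icon": "<i class='fa fa-circle-o'></i>"
}
```

Dropping a descriptor like this into Zeppelin's Helium directory makes the visualization show up in the Helium menu, where you can enable it with one click.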
Quick Apache NiFi tip: if you want to grab all the hashtags from a tweet, make sure your EvaluateJsonPath processor has its Return Type set to json and the attribute for hashtags mapped to $.entities.hashtags[*].text. The same pattern works for other array values in JSON.
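To see what that JSONPath pulls out, here is a small Python sketch (the sample tweet payload is made up, trimmed to just the fields the path touches) that extracts the same array values by walking the parsed JSON:

```python
import json

# A trimmed, made-up tweet payload with the fields the JSONPath touches.
tweet_json = """
{
  "text": "Loving #ApacheNiFi and #BigData this week",
  "entities": {
    "hashtags": [
      {"text": "ApacheNiFi", "indices": [7, 18]},
      {"text": "BigData", "indices": [23, 31]}
    ]
  }
}
"""

def extract_hashtags(raw):
    """Plain-Python equivalent of the JSONPath $.entities.hashtags[*].text."""
    tweet = json.loads(raw)
    return [tag["text"] for tag in tweet.get("entities", {}).get("hashtags", [])]

print(extract_hashtags(tweet_json))  # ['ApacheNiFi', 'BigData']
```

Because Return Type is json, NiFi writes the matched array back as a JSON list in the flowfile attribute, which is exactly what the list comprehension above produces.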
That's it for this week. Check out my other articles and deeper dives into MXNet, TensorFlow, NiFi, and other technologies!
As my daughter says: