
Cool Projects in Big Data, Machine Learning, and Apache NiFi

A recap of the week's top news on NLP, Sentiment Analysis, Deep Learning, Big Data, and Streaming Analytic insights.

This week, data is becoming knowledge: all of the streams of data are converging and leading me to the conclusion that every business needs the same types of data, tools, and results. From payment processing to media to rentals to retail to finance to big pharma, the same question keeps coming up: "How do I ingest all kinds of data (variety) that is constantly changing (agile) and often broken (flexible, schemaless, or schema-flexible), do some transformations in stream, and land it in my big data environment (Hadoop, with some flavor of NoSQL or a data warehouse such as SAP HANA, SQL Server+, or Oracle X on the side)?"

Oh, and it has to be fast, scalable, and easy to use, with a UI that my intern and data engineers can work with. Some of the data comes from IoT devices, cameras, beacons, web logs, Twitter, Facebook, third-party paid feeds, free feeds from NOAA, government and partner data sources, and legacy systems.

So what can support text files, JSON, JMS, MQTT, REST, XML, MongoDB, S3, and a host of other sources and formats? Only Apache NiFi comes to mind. Think Big Analytics has given us a preview of February's amazing Kylo, an open source tool that works with Apache NiFi to add data wrangling and discovery features on top. Kylo will be open source, and the GitHub repository is built and ready for the first release. I will announce it here on DZone when it comes out, as it will be a major boost for enterprises.

Getting started with Hortonworks HDF (Apache NiFi + Storm + Kafka) to ingest these formats is really easy, as I have shown in previous articles. If you are in the New Jersey area, come by our meetup, where we'll be doing hands-on training with Apache NiFi. It's 100% open source and open for extension. I am working on a NiFi processor for NLP tasks like name recognition and sentiment analysis. My current flows call Python scripts that use NLTK, but in the processor I will do that work with Apache OpenNLP and Java 8.

Preview header from that open source processor:

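import org.apache.nifi.annotation.behavior.ReadsAttribute;
import org.apache.nifi.annotation.behavior.ReadsAttributes;
import org.apache.nifi.annotation.behavior.WritesAttribute;
import org.apache.nifi.annotation.behavior.WritesAttributes;
import org.apache.nifi.annotation.documentation.CapabilityDescription;
import org.apache.nifi.annotation.documentation.SeeAlso;
import org.apache.nifi.annotation.documentation.Tags;
import org.apache.nifi.processor.AbstractProcessor;
@Tags({"nlpprocessor"})
@CapabilityDescription("Run OpenNLP Name Finder and Sentiment Analysis")
@SeeAlso({})
@ReadsAttributes({@ReadsAttribute(attribute="", description="")})
@WritesAttributes({@WritesAttribute(attribute="", description="")})
public class NLPProcessor extends AbstractProcessor {
    //...
}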
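
As a rough illustration of the kind of result a sentiment step might attach to each piece of text, here is a minimal plain-Java sketch. It is not the processor above and it does not use OpenNLP: it swaps in a toy word-list scorer so that it runs with only the JDK. The class name ToySentiment, the word lists, and the attribute names sentiment.score and sentiment.label are all assumptions invented for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy lexicon-based sentiment scorer. A stand-in for illustration, NOT OpenNLP. */
final class ToySentiment {
    // Hypothetical word lists, chosen only for this example.
    private static final Set<String> POSITIVE =
            new HashSet<>(Arrays.asList("good", "great", "amazing", "nice", "cool"));
    private static final Set<String> NEGATIVE =
            new HashSet<>(Arrays.asList("bad", "broken", "slow", "awful"));

    /** Returns the count of positive words minus the count of negative words. */
    static int score(String text) {
        int score = 0;
        // Lowercase, then split on anything that is not a letter a-z.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (POSITIVE.contains(token)) {
                score++;
            } else if (NEGATIVE.contains(token)) {
                score--;
            }
        }
        return score;
    }

    /** Packages the score and a label as string key/value pairs. */
    static Map<String, String> toAttributes(String text) {
        int s = score(text);
        String label = s > 0 ? "positive" : s < 0 ? "negative" : "neutral";
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("sentiment.score", Integer.toString(s)); // hypothetical attribute name
        attrs.put("sentiment.label", label);               // hypothetical attribute name
        return attrs;
    }

    public static void main(String[] args) {
        System.out.println(toAttributes("NiFi is a great tool with an amazing UI"));
    }
}
```

A real flow would replace the word lists with a trained model; the point here is only the shape of the output, a small set of string attributes per input text.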

The SentimentAnalysis Parser from USC Data Science combines Apache Tika and Apache OpenNLP, a really powerful combination. Be warned that the Maven build takes some serious power and a while to run, perhaps hours depending on your machine, so start it before heading out to a long lunch.

Oliver Meyn has written a pretty amazing article on using Spark 2.0 Streaming with SSL, Kerberos, and Kafka, hosted on HDP 2.4 YARN. It includes all the build scripts, configuration, and source code you need to do this yourself. The code can easily be adapted to HDP 2.5 and Spark 2.1. Very nice.
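
To give a feel for the client-side configuration that a Kerberos plus SSL Kafka setup typically involves, here is a minimal sketch in plain Java. It is not taken from Oliver Meyn's article. The property keys are standard Apache Kafka client settings, but the exact set you need depends on your Kafka and HDP versions, and the class name SecureKafkaProps, the broker address, the truststore path, and the password are hypothetical placeholders.

```java
import java.util.Properties;

/** Builds Kafka client properties for a Kerberos (SASL) + SSL cluster. Sketch only. */
final class SecureKafkaProps {

    static Properties build(String brokers, String truststorePath, String truststorePassword) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", brokers);
        // SASL for authentication (Kerberos), SSL for encryption on the wire.
        props.setProperty("security.protocol", "SASL_SSL");
        // Must match the service principal name the brokers run as; "kafka" is an assumption.
        props.setProperty("sasl.kerberos.service.name", "kafka");
        // Truststore holding the certificate(s) used to verify the brokers.
        props.setProperty("ssl.truststore.location", truststorePath);
        props.setProperty("ssl.truststore.password", truststorePassword);
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical values; a JAAS login configuration is also needed for Kerberos
        // and is usually supplied through the java.security.auth.login.config
        // system property.
        Properties props = build("broker1.example.com:6667",
                "/etc/security/truststore.jks", "changeit");
        props.list(System.out);
    }
}
```

The resulting Properties object can be converted to the parameter map that a Kafka consumer or a Spark Streaming Kafka source expects.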

Great Links of the Week

  • 52 Technologies from 2016: great documentation and code in this GitHub repository, including Sentiment Analysis.

  • Huge Open Data resources from Deeplearning4j.

  • You can never have enough data for testing, and for a constant stream of data, Twitter is usually your only option. But what if you want a different schema? Try ACES, Inc.'s JSON Data Generator. It is well documented and comes with full source code; run it locally and get a nice stream of JSON data.

  • A cool class going on right now is CS224n: Natural Language Processing with Deep Learning. Course materials will be provided online for free as they become available.

  • Another Stanford class I will be watching closely is CS20SI: TensorFlow for Deep Learning Research (GitHub).
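
If you only need a trickle of test records with your own schema, the idea behind a JSON data generator can be sketched in a few lines of plain Java. This is not the ACES, Inc. tool and does not reproduce its configuration format; the class name ToyJsonGenerator, the field names, and the sensor list are assumptions made up for illustration.

```java
import java.util.Locale;
import java.util.Random;

/** Emits small random JSON records. A toy stand-in, not the ACES JSON Data Generator. */
final class ToyJsonGenerator {
    // Hypothetical device types used to fill the "sensor" field.
    private static final String[] SENSORS = {"beacon", "camera", "thermostat"};

    /** Builds one JSON object as a single line of text. */
    static String nextRecord(Random rng, long sequence) {
        String sensor = SENSORS[rng.nextInt(SENSORS.length)];
        double value = rng.nextDouble() * 100.0; // reading in the range [0, 100)
        // Locale.ROOT keeps the decimal separator a dot regardless of system locale.
        return String.format(Locale.ROOT,
                "{\"id\":%d,\"sensor\":\"%s\",\"value\":%.2f}", sequence, sensor, value);
    }

    public static void main(String[] args) {
        Random rng = new Random(42L); // fixed seed so test runs are repeatable
        for (long i = 0; i < 5; i++) {
            System.out.println(nextRecord(rng, i));
        }
    }
}
```

Writing one record per line keeps the output easy to split into individual events downstream.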

Deep Learning Presentations of Note

Topics:
big data, hadoop, spark, machine learning, deep learning, nlp, apache nifi
