
DataWorks Summit Munich Report: MiNiFi, Spark Text Mining, and More

Get some updates about what went on at DataWorks Summit Munich and learn about the new features for MiNiFi and MLlib 2.1.


DataWorks Summit Munich was a couple of weeks ago, but the information is still relevant and interesting! Let's take a look. Day 1 was Wednesday, April 5, 2017.

You can still watch the keynotes here. It's about two hours of free content.

New Items

The Hortonworks Hadoop distribution has been updated to include Spark 2.1, with an upgraded Zeppelin and greatly enhanced Livy support for executing Spark jobs via a secure REST interface.

HDP 2.6 is available on HDC for AWS and Azure HDInsight, as well as for installation anywhere of your choosing, including IBM Power8, Intel, AMD, and your favorite Linux distribution.

MiNiFi Features

  • Event-level access to interactions with flow files: Create, Receive, Fetch, Send, Download, Drop, Expire, Fork, and Join.
  • Capture of associated attributes/metadata at the time of the event.

MiNiFi Philosophy

  • Go small — Java, write once, run anywhere, reuse core NiFi libraries.
  • Go smaller — C++, write once, run anywhere.

MiNiFi is coming to mobile! If you want to listen, parse, route, transmit, execute, filter, and prioritize, think: edge processing on MiNiFi in a constrained environment, like a car. MiNiFi command and control!

There was also a session on text mining in Spark by Yanbo Liang of Hortonworks.

Using Spark and Python, you can do sentiment analysis or fraud detection on a single machine and then scale to clusters of thousands of servers. You can take the same Python libraries you use for small data, like NLTK and scikit-learn, and run them at scale.
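As a minimal sketch of the small-data starting point, here is a toy sentiment classifier using scikit-learn (an assumption on my part: the talk named the libraries but this code and data are illustrative, not from the session):

```python
# Toy sentiment-analysis sketch with scikit-learn: bag-of-words features
# feeding a Naive Bayes classifier. The training data is made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "great product, really happy",
    "excellent service, love it",
    "terrible experience, very unhappy",
    "awful quality, hate it",
]
train_labels = ["pos", "pos", "neg", "neg"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["really great service"])[0])  # pos
```

The same modeling idea carries over to Spark MLlib when the data no longer fits on one machine.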

Examples of data include sensor data, Twitter data, machine-generated data, and logs. The more data you have for Machine Learning, the better your results will be.

For text mining on Big Data, you need the combination of Apache Spark running on YARN with Apache Hadoop storing massive data distributed on HDFS and S3.

Data scientists and data engineers need to work together, using better algorithms and models to make predictions on streaming input and collaborating with tools like Apache Zeppelin.

Dataset > feature engineering > model training > model evaluation.

Spark Text Mining Algorithms

  • LDA is used for topic models.
  • Word2Vec maps words to feature vectors based on their meaning, in an unsupervised way.
  • CountVectorizer converts documents into vectors based on word counts.
  • HashingTF-IDF calculates the important words of a document with respect to the corpus.
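To make the TF-IDF idea concrete, here is a plain-Python sketch (not Spark's HashingTF, but the same smoothed IDF formula MLlib uses, idf = log((n + 1) / (df + 1)); the toy corpus is mine):

```python
import math

# TF-IDF highlights words that are frequent in one document but rare
# across the corpus. Toy three-document corpus, pre-tokenized.
corpus = [
    ["spark", "runs", "on", "yarn"],
    ["spark", "mllib", "does", "text", "mining"],
    ["hadoop", "stores", "data", "on", "hdfs"],
]

def idf(term, docs):
    # Smoothed inverse document frequency: log((n + 1) / (df + 1)).
    df = sum(term in doc for doc in docs)
    return math.log((len(docs) + 1) / (df + 1))

def tfidf(doc, docs):
    # Term frequency in this document, weighted by corpus-wide rarity.
    return {t: doc.count(t) * idf(t, docs) for t in set(doc)}

scores = tfidf(corpus[1], corpus)
# "mllib" (appears in one document) outranks "spark" (appears in two).
print(scores["mllib"] > scores["spark"])  # True
```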

Dataset > RegexTokenizer > StopWordsRemover > CountVectorizer or HashingTF > IDF > StringIndexer > NaiveBayes, logistic regression, SVM, or MLP.
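The front of that pipeline can be sketched in plain Python (the talk used the Spark ML classes RegexTokenizer, StopWordsRemover, and CountVectorizer; the tiny stopword list and example sentence here are mine):

```python
import re
from collections import Counter

# A hand-rolled stand-in for the first three pipeline stages.
STOPWORDS = {"the", "a", "an", "of", "on", "and", "is"}

def regex_tokenize(text):
    # Split on non-word characters, lowercase, drop empty tokens.
    return [t for t in re.split(r"\W+", text.lower()) if t]

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def count_vectorize(tokens, vocabulary):
    # One count per vocabulary term, in vocabulary order.
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

doc = "The cat sat on the mat, and the cat slept."
tokens = remove_stopwords(regex_tokenize(doc))
vocab = sorted(set(tokens))
print(vocab)                           # ['cat', 'mat', 'sat', 'slept']
print(count_vectorize(tokens, vocab))  # [2, 1, 1, 1]
```

The resulting count vectors are what the IDF, StringIndexer, and classifier stages consume downstream.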

GitHub: extract topics from unstructured documents, unsupervised.

Dataset > RegexTokenizer > Word2Vec > recommendation, basic classification/clustering.
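What makes Word2Vec features useful downstream is that similar words land close together in the vector space, so cosine similarity can drive recommendation and clustering. A small sketch with hand-made embeddings (the vector values are invented for illustration, not from a trained model):

```python
import math

# Hypothetical 3-dimensional embeddings; a real Word2Vec model learns
# these values from a corpus.
vectors = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.10, 0.20, 0.90],
}

def cosine(u, v):
    # Cosine similarity: dot product over the product of the norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Words with related meanings end up closer in the vector space.
print(cosine(vectors["king"], vectors["queen"])
      > cosine(vectors["king"], vectors["apple"]))  # True
```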

MLlib 2.1 Features

  • 30 feature transformers (e.g., Word2Vec, Tokenizer).
  • 25 models (classification, regression, clustering).
  • Model tuning and evaluation.



