
DataWorks Summit Munich Report: MiNiFi, Spark Text Mining, and More

Get some updates about what went on at DataWorks Summit Munich and learn about the new features for MiNiFi and MLlib 2.1.

Tim Spann
May. 09, 17 · Big Data Zone · Opinion

The DataWorks Summit took place a few weeks ago, but the information is still relevant and interesting. Let's take a look. Day 1 was Wednesday, April 5, 2017.

You can still watch the keynotes online. It's about two hours of free content.

New Items

  • Ambari 2.5.0.3

  • HDP 2.6 available

  • HDP 2.6 Performance

The Hortonworks Data Platform (HDP) has been updated to include Spark 2.1, with an upgraded Zeppelin and greatly enhanced Livy support for executing Spark jobs via secure REST.
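
To give a feel for what running a Spark job over REST looks like, here is a minimal sketch that builds a batch submission request for Livy's `POST /batches` endpoint. The host name, port, script path, and arguments are all assumptions made up for this example, and the request is only constructed, not sent. A secured cluster would also need authentication, which is left out here.

```python
import json
import urllib.request

# Assumed values for illustration only; none of them come from the summit talks.
LIVY_URL = "http://livy.example.com:8998"       # hypothetical Livy endpoint
APP_FILE = "hdfs:///apps/text_mining_job.py"    # hypothetical PySpark script


def build_batch_request(livy_url, app_file, args=None):
    """Build (but do not send) a POST /batches request for Livy."""
    payload = {"file": app_file, "args": list(args or [])}
    return urllib.request.Request(
        url=livy_url.rstrip("/") + "/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_batch_request(LIVY_URL, APP_FILE, args=["--date", "2017-04-05"])
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To actually submit the job: urllib.request.urlopen(request)
# Authentication for a secured cluster is deliberately omitted from this sketch.
```

Keeping the request construction separate from the network call makes it easy to inspect the payload before anything reaches the cluster.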

HDP 2.6 is available on HDC for AWS and on Azure HDInsight, and you can install it wherever you choose, including IBM Power8, Intel, and AMD hardware running your favorite Linux.

MiNiFi Features

  • Event-level access to interactions with flow files: create, receive, fetch, send, download, drop, expire, fork, and join.
  • Capture associated attributes/metadata at the time of the event.
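
The event types and the idea of capturing attributes at the time of the event can be modeled roughly like this. This is a plain-Python sketch, not MiNiFi's actual data model or API: the `ProvenanceEvent` record, the `record_event` helper, and the sample attributes are all hypothetical names chosen for illustration.

```python
import copy
import time
from dataclasses import dataclass, field
from enum import Enum


class EventType(Enum):
    # The event kinds named in the feature list above.
    CREATE = "CREATE"
    RECEIVE = "RECEIVE"
    FETCH = "FETCH"
    SEND = "SEND"
    DOWNLOAD = "DOWNLOAD"
    DROP = "DROP"
    EXPIRE = "EXPIRE"
    FORK = "FORK"
    JOIN = "JOIN"


@dataclass(frozen=True)
class ProvenanceEvent:
    # Hypothetical record shape, not MiNiFi's real data model.
    event_type: EventType
    flow_file_id: str
    attributes: dict
    timestamp: float = field(default_factory=time.time)


def record_event(log, event_type, flow_file_id, attributes):
    """Append an event holding a snapshot of the attributes as they are right now."""
    event = ProvenanceEvent(event_type, flow_file_id, copy.deepcopy(attributes))
    log.append(event)
    return event


log = []
attrs = {"filename": "sensor-001.json", "mime.type": "application/json"}
record_event(log, EventType.RECEIVE, "ff-1", attrs)
attrs["route"] = "high-priority"   # attribute added after the first event
record_event(log, EventType.SEND, "ff-1", attrs)

for event in log:
    print(event.event_type.value, sorted(event.attributes))
```

Copying the attributes when the event is recorded is the important part: later changes to the flow file do not rewrite what an earlier event saw.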

MiNiFi Philosophy

  • Go small: Java, write once, run anywhere, reuse core NiFi libraries.
  • Go smaller: C++, write once, run anywhere.

MiNiFi is coming to mobile! If you want to listen, parse, route, transmit, execute, filter, and prioritize, think of edge processing on MiNiFi in a constrained environment, like a car. Also on the list: MiNiFi Command and Control!

There was also a session on text mining in Spark by Yanbo and Hortonworks.

Using Spark and Python, you can do sentiment analysis and fraud detection on a single machine, then scale to clusters of thousands of servers. You can take the same Python libraries you use for small data, like NLTK and scikit-learn, and run them at scale.

Examples of data include sensor data, Twitter data, machine-generated data, and logs. In general, the more data you have for machine learning, the better your results tend to be.

For text mining on big data, you need the combination of Apache Spark running on YARN and Apache Hadoop storing massive, distributed data on HDFS and S3.

Data scientists and data engineers need to work together, using better algorithms and models to make predictions on streaming input and collaborating through tools like Apache Zeppelin.

The workflow is: dataset > feature engineering > model training > model evaluation.

Spark Text Mining Algorithms

  • LDA (latent Dirichlet allocation) is used for topic modeling.
  • Word2Vec turns words into feature vectors based on their meaning, in an unsupervised way.
  • CountVectorizer converts documents into vectors based on word counts.
  • HashingTF with IDF calculates the important words of a document with respect to the corpus.
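
To make the last two items concrete, here is a small plain-Python sketch of what count vectorization, hashed term frequencies, and IDF weighting compute. It is not the Spark API. The sample documents are made up, `zlib.crc32` is a stand-in hash picked for this sketch, and the IDF uses the smoothed form log((N + 1) / (df + 1)) that Spark's documentation describes for its IDF.

```python
import math
import zlib
from collections import Counter

# Made-up, already tokenized documents for illustration.
docs = [
    ["spark", "runs", "on", "yarn"],
    ["spark", "text", "mining"],
    ["text", "mining", "with", "spark", "and", "spark", "mllib"],
]


def count_vectorize(docs):
    """Learn a vocabulary, then map each document to per-term counts."""
    vocab = sorted({term for doc in docs for term in doc})
    index = {term: i for i, term in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0] * len(vocab)
        for term, n in Counter(doc).items():
            vec[index[term]] = n
        vectors.append(vec)
    return vocab, vectors


def hashing_tf(doc, num_features=16):
    """Map terms to buckets by hashing, so no vocabulary has to be stored.

    zlib.crc32 is a stand-in chosen for this sketch; different terms can
    collide in the same bucket.
    """
    vec = [0] * num_features
    for term in doc:
        vec[zlib.crc32(term.encode("utf-8")) % num_features] += 1
    return vec


def idf_weights(vectors):
    """Smoothed inverse document frequency: log((N + 1) / (df + 1))."""
    n_docs = len(vectors)
    weights = []
    for j in range(len(vectors[0])):
        df = sum(1 for vec in vectors if vec[j] > 0)
        weights.append(math.log((n_docs + 1) / (df + 1)))
    return weights


vocab, counts = count_vectorize(docs)
idf = idf_weights(counts)
tfidf = [[tf * w for tf, w in zip(vec, idf)] for vec in counts]

for term, weight in zip(vocab, idf):
    print(f"{term:8s} idf={weight:.3f}")
```

A word that appears in every document, like "spark" here, gets an IDF of zero, which is how the weighting pushes corpus-wide filler down and document-specific words up.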

Dataset > RegexTokenizer > StopWordsRemover > CountVectorizer and HashingTF > IDF > StringIndexer > NaiveBayes, logistic regression, SVM, MLP
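
Here is a rough single-machine sketch of part of that pipeline: a regex tokenizer, a stop-word remover, term counts, and a naive Bayes classifier. It is plain Python rather than Spark code, it skips the IDF and string indexing stages, and the training sentences and stop-word list are made up for this example.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stop-word list, not a real one.
STOP_WORDS = {"the", "a", "is", "this", "and", "to", "it", "was", "i"}

# Made-up training examples, only for this sketch.
TRAIN = [
    ("I love this phone, it is great", "pos"),
    ("Great battery and a great screen", "pos"),
    ("This was a terrible purchase", "neg"),
    ("The screen is awful and the battery is terrible", "neg"),
]


def tokenize(text):
    """Regex tokenizer: lowercase, then split on non-word characters."""
    return [t for t in re.split(r"\W+", text.lower()) if t]


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def featurize(text):
    """Tokenize, drop stop words, and count the remaining terms."""
    return Counter(remove_stop_words(tokenize(text)))


def train_naive_bayes(examples):
    """Multinomial naive Bayes with add-one smoothing."""
    label_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        features = featurize(text)
        term_counts[label].update(features)
        vocab.update(features)
    return label_counts, term_counts, vocab


def predict(model, text):
    label_counts, term_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        total_terms = sum(term_counts[label].values())
        score = math.log(n_docs / total_docs)
        for term, n in featurize(text).items():
            if term in vocab:  # ignore terms never seen in training
                p = (term_counts[label][term] + 1) / (total_terms + len(vocab))
                score += n * math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)


model = train_naive_bayes(TRAIN)
print(predict(model, "great phone"))
print(predict(model, "terrible screen and awful battery"))
```

The stages are separate functions on purpose, mirroring how each step in the pipeline consumes the output of the one before it.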

GitHub: extract topics from unstructured documents, unsupervised.

Dataset > RegexTokenizer > Word2Vec > recommendation, basic classification, or clustering
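
Once words have vectors, a simple way to get from there to recommendations is to average the word vectors of each document and compare documents by cosine similarity. The sketch below shows that idea in plain Python. The three-dimensional vectors and document names are invented for illustration; a trained Word2Vec model would supply real vectors with many more dimensions.

```python
import math

# Hypothetical 3-dimensional word vectors; a trained model would supply these.
WORD_VECTORS = {
    "spark":  [0.9, 0.1, 0.0],
    "hadoop": [0.8, 0.2, 0.1],
    "yarn":   [0.7, 0.3, 0.0],
    "soccer": [0.0, 0.1, 0.9],
    "goal":   [0.1, 0.0, 0.8],
}


def doc_vector(tokens):
    """Average the vectors of the known words in a document."""
    known = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    if not known:
        return [0.0, 0.0, 0.0]
    return [sum(dim) / len(known) for dim in zip(*known)]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(query_tokens, docs):
    """Return the name of the document whose vector is closest to the query."""
    q = doc_vector(query_tokens)
    return max(docs, key=lambda name: cosine(q, doc_vector(docs[name])))


docs = {
    "big-data-post": ["spark", "hadoop", "yarn"],
    "sports-post": ["soccer", "goal"],
}
print(most_similar(["spark", "yarn"], docs))
```

The same document vectors could feed a classifier or a clustering algorithm instead of a nearest-neighbor lookup.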

MLlib 2.1 Features

  • 30 feature transformers (Word2Vec, Tokenizer).
  • 25 models (classification, regression, clustering).
  • Model tuning and evaluation.

Great Resources on Spark

  • Spark VLBFGS GitHub
  • Large Scale ADS With Spark and Deep Learning
  • Scaling Apache Spark MLlib to Billions of Parameters

Other Recent Cool Stuff

  • Graph Transformations for TensorFlow