The DataWorks Summit was a couple of weeks ago, but the information is all still relevant and interesting! Let's take a look. Day 1 was Wednesday, April 5, 2017.
You can still watch the keynotes here. It's about two hours of free content.
Hortonworks Hadoop Distribution has been updated to include Spark 2.1, with upgraded Zeppelin and greatly enhanced Livy support for executing Spark jobs via secure REST.
HDP 2.6 is available on HDC for AWS and Azure HDInsight, and it can also be installed anywhere you choose, including IBM Power8, Intel, AMD, and your favorite Linux distribution.
- Event-level access to interactions with flow files.
- Capture of associated attributes/metadata at the time of the event.
- Go small: Java, write once, run anywhere, reusing core NiFi libraries.
- Go smaller: C++, write once, run anywhere.
MiNiFi is coming to mobile! If you want to listen, parse, route, transmit, execute, filter, and prioritize, think edge processing with MiNiFi in a constrained environment, like a car. MiNiFi command and control!
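To make the listen/filter/prioritize/transmit idea concrete, here is a toy Python sketch of an edge-side buffer that keeps only the most urgent events when capacity is constrained. This is purely illustrative, not MiNiFi code (MiNiFi agents are Java and C++), and every name in it is hypothetical.

```python
import heapq

class EdgeBuffer:
    """Toy sketch of edge-side prioritize-and-transmit logic (not MiNiFi code)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []       # min-heap ordered by priority (higher number = more urgent)
        self._counter = 0     # tie-breaker so heapq never compares payloads directly

    def listen(self, priority, event):
        """Filter and buffer an incoming event; drop the least urgent on overflow."""
        if event is None:     # filter step: ignore empty readings
            return
        heapq.heappush(self._heap, (priority, self._counter, event))
        self._counter += 1
        if len(self._heap) > self.capacity:
            heapq.heappop(self._heap)   # evict the lowest-priority event

    def transmit(self):
        """Drain buffered events, most urgent first."""
        return [event for _, _, event in sorted(self._heap, reverse=True)]
```

In a car-like setting, a safety-critical event (say, a brake reading) would get a high priority number and survive eviction, while routine telemetry gets dropped first when the buffer is full.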
There was also a session on text mining in Spark by Yanbo of Hortonworks.
Using Spark and Python, you can do sentiment analysis and fraud detection on a single machine, then scale to clusters of thousands of servers. You can take the same Python libraries you use for small data, such as NLTK and scikit-learn, and run them at scale.
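As a small-data starting point, sentiment analysis can be sketched with a plain-Python lexicon scorer. This is a toy illustration of the idea only; a real single-machine pipeline would use NLTK (e.g., its VADER analyzer) or scikit-learn, and the word lists here are made up for the example.

```python
# Toy lexicon-based sentiment scorer. The approach mirrors what richer
# libraries do: tokenize, look each token up in a sentiment lexicon,
# and aggregate the scores into a label.
POSITIVE = {"great", "love", "excellent", "good"}
NEGATIVE = {"bad", "awful", "hate", "terrible"}

def sentiment(text):
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"
```

The same tokenize-and-score logic can later be wrapped in a Spark UDF or map function to run across a cluster.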
Examples of data include sensor data, Twitter data, machine-generated data, and logs. The more data you have for Machine Learning, the better your results will be.
For text mining on Big Data, you need Apache Spark running on YARN, with Apache Hadoop storing massive datasets distributed across HDFS and S3.
Data scientists and data engineers need to work together, using better algorithms and models to make predictions on streaming input and collaborating with tools like Apache Zeppelin.
Dataset > feature engineering > model training > model evaluation.
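The four stages above can be sketched end to end in plain Python with a toy nearest-centroid model. Everything here is illustrative and the documents are invented; in Spark ML, each stage maps to a DataFrame transform, an `Estimator.fit()` call, and an `Evaluator`.

```python
# 1. Dataset: (text, label) pairs
dataset = [("spark hadoop yarn", "bigdata"), ("goal match score", "sports"),
           ("hdfs cluster spark", "bigdata"), ("team match win", "sports")]

# 2. Feature engineering: bag-of-words counts
def featurize(text):
    counts = {}
    for tok in text.split():
        counts[tok] = counts.get(tok, 0) + 1
    return counts

# 3. Model training: sum the feature counts per label into a centroid
def train(data):
    centroids = {}
    for text, label in data:
        c = centroids.setdefault(label, {})
        for tok, n in featurize(text).items():
            c[tok] = c.get(tok, 0) + n
    return centroids

def predict(centroids, text):
    feats = featurize(text)
    score = lambda c: sum(c.get(t, 0) * n for t, n in feats.items())
    return max(centroids, key=lambda label: score(centroids[label]))

# 4. Model evaluation: accuracy on held-out examples
model = train(dataset)
heldout = [("spark on yarn", "bigdata"), ("match score", "sports")]
accuracy = sum(predict(model, t) == y for t, y in heldout) / len(heldout)
```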
Spark Text Mining Algorithms
- LDA (Latent Dirichlet Allocation) builds topic models.
- Word2Vec maps words to feature vectors based on their meaning, in an unsupervised way.
- CountVectorizer converts documents into vectors based on word counts.
- HashingTF-IDF calculates the important words of a document with respect to the corpus.
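The hashing-TF and IDF steps are easy to sketch in plain Python. This is a minimal illustration of what Spark's `HashingTF` and `IDF` compute, not Spark code; the toy documents are made up, and MLlib's IDF formula is log((n + 1) / (df + 1)).

```python
import math

docs = [["spark", "runs", "on", "yarn"],
        ["spark", "stores", "data", "on", "hdfs"],
        ["yarn", "schedules", "spark", "jobs"]]

# Hashing TF: map each term to a fixed-size bucket by hashing,
# so no vocabulary needs to be built or stored.
def hashing_tf(tokens, num_features=16):
    vec = [0] * num_features
    for tok in tokens:
        vec[hash(tok) % num_features] += 1
    return vec

# IDF: down-weight terms that appear in many documents.
def idf(docs):
    n = len(docs)
    df = {}
    for d in docs:
        for tok in set(d):
            df[tok] = df.get(tok, 0) + 1
    # log((n + 1) / (df + 1)): a term in every doc gets weight log(1) = 0
    return {tok: math.log((n + 1) / (c + 1)) for tok, c in df.items()}

weights = idf(docs)
# "spark" appears in all three docs, so its IDF is 0 (uninformative);
# "hdfs" appears in one doc, so it carries real weight.
```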
Dataset > RegexTokenizer > StopWordsRemover > CountVectorizer or HashingTF > IDF > StringIndexer > NaiveBayes, logistic regression, SVM, or MLP.
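The pipeline above ends in a classifier. As a concrete illustration of the NaiveBayes stage, here is a pure-Python multinomial Naive Bayes with Laplace smoothing over word counts; the training documents are invented, and a real pipeline would use Spark ML's `NaiveBayes` estimator on vectorized features.

```python
import math
from collections import Counter, defaultdict

# Tokenized, labeled training docs (tokenizing and stop-word removal already done)
train_docs = [(["spark", "mllib", "model"], "tech"),
              (["hadoop", "spark", "cluster"], "tech"),
              (["goal", "match", "team"], "sports"),
              (["team", "win", "score"], "sports")]

def train_nb(docs):
    """Collect per-label term frequencies, label counts, and the vocabulary."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for tokens, label in docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab

def predict_nb(model, tokens):
    """Pick the label with the highest smoothed log-posterior."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best, best_lp = None, -math.inf
    for label in label_counts:
        lp = math.log(label_counts[label] / total)            # log prior
        denom = sum(word_counts[label].values()) + len(vocab)  # Laplace denominator
        for tok in tokens:
            lp += math.log((word_counts[label][tok] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best
```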
GitHub example: extract topics from unstructured documents, unsupervised.
Dataset > RegexTokenizer > Word2Vec > recommendation, basic classification/clustering.
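The Word2Vec-based flow can be sketched with averaged word vectors and cosine similarity. The vectors below are hypothetical hand-made stand-ins (a real Word2Vec model learns them from the corpus), but the averaging step matches what Spark's Word2Vec transform does to turn a document into a single vector.

```python
import math

# Hypothetical pre-trained word vectors; a real Word2Vec model learns these
vectors = {
    "spark":  [0.9, 0.1],
    "hadoop": [0.8, 0.2],
    "soccer": [0.1, 0.9],
    "goal":   [0.2, 0.8],
}

def doc_vector(tokens):
    """Average the word vectors (assumes at least one token is in the vocabulary)."""
    vecs = [vectors[t] for t in tokens if t in vectors]
    return [sum(component) / len(vecs) for component in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# A big-data query document sits close to other big-data documents and
# far from a sports document, which is the basis for recommendation.
query = doc_vector(["spark", "hadoop"])
```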
MLlib 2.1 Features
- 30 feature transformers (Word2Vec, Tokenizer, etc.).
- 25 models (classification, regression, clustering).
- Model tuning and evaluation.
Great Resources on Spark
- Spark VLBFGS on GitHub
- Large-Scale ADS with Spark and Deep Learning
- Scaling Apache Spark MLlib to Billions of Parameters