At Hadoop Summit San Jose, the Data Science, Analytics and Spark track is sure to be packed. Ram Sriharsha, Product Manager for Apache Spark at Databricks, summarizes the 16 sessions in the track as providing technical guidance around:
Leveraging Hadoop for analytics is a key use case across industries and represents a critical value proposition for Hadoop. This track will include introductory to advanced sessions on applications, tools, algorithms and emerging research topics that extend the Hadoop platform for data science. Sessions will include examples of innovative analytics applications and systems, data visualization, statistics and machine learning. You will hear from leading data scientists, analysts and practitioners who are driving innovation by extracting valuable insights from data at rest as well as data in motion.
If you could attend only three sessions, which would the committee recommend? Glad you asked!
Combining Machine Learning Frameworks With Apache Spark
Speaker: Timothy Hunter from Databricks
Machine Learning (ML) workflows involve a sequence of processing and learning stages. Realistic workflows combine specialized libraries with more general data management workflows. Apache Spark is well-known as a powerful platform to perform iterative computations required for ML. This talk presents how to combine the strengths of Spark’s ML library (MLlib) with popular packages such as CoreNLP, scikit-learn, and TensorFlow. CoreNLP is a comprehensive language processing package, scikit-learn is the de facto standard ML library for Python, and TensorFlow is a library for deep learning recently open-sourced by Google. We use a running example of image processing with deep learning models in order to demonstrate how Spark’s flexible API naturally integrates specialized libraries. We also discuss the improvements of MLlib in Spark 2.0 and the future of MLlib’s API. On the roadmap are both more algorithms and features for users, and more utilities and abstractions to aid developers.
Building A Scalable Data Science Platform with R
Speaker: Mario Inchiosa from Microsoft
Hadoop is famously scalable. Cloud Computing is famously scalable. R – the thriving and extensible open source Data Science software – not so much. But what if we seamlessly combined Hadoop, Cloud Computing, and R to create a scalable Data Science platform? Imagine exploring, transforming, modeling, and scoring data at any scale from the comfort of your favorite R environment. Now, imagine calling a simple R function to operationalize your predictive model as a scalable, cloud-based Web Services API. Come learn how to leverage the magic of Hadoop on-premises or in the cloud to run your R code, thousands of open source R extension packages, and distributed implementations of the most popular machine learning algorithms at scale.
Application of Active Learning for Fraud Labeling at PayPal
Speaker: Venkatesh Ramanathan from PayPal
Active learning is a machine learning technique that has been shown to yield superior performance with less labeling cost. In this talk, I will present how active learning can be applied to fraud labeling. I will present results from experiments conducted on a very large data set containing over 100 million examples and thousands of features. I will summarize some of the challenges in applying this technique to this real-world application.
We hope to see you at the sessions. Remember that you need to register to attend Hadoop Summit San Jose.