Scala Days NYC 2016: Highlights
Sessions, content, slides from Scala Days NYC 2016 (May 10-11, 2016).
Scala Days started off great, with a smooth check-in and an awesome venue at the New World Stages. A nice breakfast was laid out, and vendors were already there spreading Scala love. Tapad provided a cool lounge with phone and electric hookups. Snacks, coffee, and drinks were plentiful, top shelf, and readily available. There were sessions on everything Scala, from microservices to Big Data. The vendors were the Scala-est of Scalaists: really cool companies that are very active in the Scala community.
One great talk was Martin Odersky's "Scala: The Road Ahead." He was also signing his updated book, "Programming in Scala, 3rd Edition," updated for 2.12.
Another great talk was "Scala: The Unpredicted Lingua Franca for Data Science" by Dean Wampler (@deanwampler) and Andy Petrella (@noootsab). (Github) The examples were all in a Spark Notebook, which is Andy's project, similar to Zeppelin. His toolkit was:
Spark Notebook with Scala 2.11.7, Spark 1.5.2, Hadoop 2.7.1, Hive, Parquet
The talk highlighted the new distributed data science on big data, enabled by Hadoop and Spark, with Scala and notebooks as the primary interface for data exploration and science experimentation. Spark and Hadoop are enterprise-ready open source implementations that make this possible, with simple APIs for languages like JavaScript, Ruby, and Python. Scala is perfect for this new distributed science with its basis in functional programming with objects.
Interesting Quotes
"Spark with Java makes no sense"
"In Java it's just noise"
"Spark collections are lazy by default."
"Spark's RDD API is inspired by Scala collections API."
Tuple syntax is common in Spark's Scala RDD API for key-value pairs (see the sketch after these notes).
Write an expressive DSL with groupBy().count.orderBy() instead of embedding SQL strings for Spark SQL.
Scala is the primary API for Spark, has all of the features, and is enterprise ready.
The new Big Data world requires a Big Data Team: IT and Data Scientists working together as one team to release something deployable on a production enterprise cluster.
Tooling, porting existing models, and inventing new models
Streaming Clustering: G-Stream, Mean-Shift-LSH, SOM-MR
Spark Packages for Machine Learning
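Going back to the tuple and DSL notes above, here is a minimal word-count sketch of both ideas, assuming a Spark 1.5-era setup like the one in the notebook; the app name and input file are hypothetical:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("TupleAndDslSketch").setMaster("local[*]")) // local master for a quick test
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Key-value pairs are plain Scala tuples in the RDD API
val pairs = sc.textFile("input.txt")      // hypothetical input file
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))                 // (key, value) tuple
val countsRdd = pairs.reduceByKey(_ + _)  // aggregate by key

// The same aggregation with the DataFrame DSL instead of an embedded SQL string
val countsDf = pairs.toDF("word", "one")
  .groupBy("word")
  .count()
  .orderBy($"count".desc)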
Scala with devices: Chariot Solutions had a booth showing a really cool demo with a number of Raspberry Pis running JVMs with Scala and Akka, talking to a server, with a few dashboards updating in real time.
A Literally-Hands-on Introduction to Distributed Systems Concepts with Akka Clustering
Keywords: Raft, gossip protocols, distributed consensus, and Akka Cluster connecting to Twitter streams
8 Fallacies of Distributed Systems
(Akka helps with these)
Network is reliable
Latency is zero
Bandwidth is infinite
Topology doesn't change
Network is homogeneous
(Akka doesn't help here)
There is one administrator
Transport cost is zero
Network is secure
CAP Theorem
Consistency (all nodes see the same data at the same time), Availability (every request gets a timely response), Partition tolerance (the system keeps working when the network splits). Unfortunately, this buffet only lets you pick two of the three. You cannot provide all three guarantees at once: you always need partition tolerance, so you pick either consistency or availability.
Gossip Protocols
Each node holds the state of the cluster and tells its neighbors about it; the alternatives, Paxos or a centralized state database, don't scale well. Gossip protocols are difficult for real-time and fast data. Akka uses convergence to make decisions about managing the cluster, and it detects when a node fails using a failure detector that checks whether neighbors are reachable. What can happen is that you get network partitions: islands of clustered nodes communicate and spread bad data amongst themselves, cut off from the rest of the cluster. You could try human intervention with monitoring, but that doesn't scale well, and who wants that job? Check out more Akka articles from the authors' company, Hootsuite.
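To make the failure detector and gossip-driven membership a bit more concrete, here is a minimal sketch of an Akka Cluster listener; the actor and log messages are illustrative, not from the talk:

import akka.actor.{Actor, ActorLogging}
import akka.cluster.Cluster
import akka.cluster.ClusterEvent._

// Subscribes to the membership events that gossip and the failure detector produce.
class ClusterListener extends Actor with ActorLogging {
  private val cluster = Cluster(context.system)

  override def preStart(): Unit =
    cluster.subscribe(self, initialStateMode = InitialStateAsEvents,
      classOf[MemberUp], classOf[UnreachableMember], classOf[MemberRemoved])

  override def postStop(): Unit = cluster.unsubscribe(self)

  def receive = {
    case MemberUp(member)          => log.info("Member up: {}", member.address)
    case UnreachableMember(member) => log.info("Failure detector flagged {} as unreachable", member.address)
    case MemberRemoved(member, _)  => log.info("Member removed: {}", member.address)
  }
}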
Akka insights: Distributed Data (experimental in Akka 2.4) lets you maintain state across a cluster with merge-friendly data types, using conflict-free replicated data types (CRDTs).
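As a minimal sketch of the Distributed Data idea, assuming a clustered ActorSystem; the counter name and actor are hypothetical:

import akka.actor.Actor
import akka.cluster.Cluster
import akka.cluster.ddata.{DistributedData, GCounter, GCounterKey}
import akka.cluster.ddata.Replicator.{Update, WriteLocal}

// Keeps a cluster-wide hit counter as a CRDT: each node increments locally,
// and replicas gossip and merge without conflicts.
class HitCounter extends Actor {
  implicit val cluster = Cluster(context.system)
  private val replicator = DistributedData(context.system).replicator
  private val CounterKey = GCounterKey("hits")

  def receive = {
    case "hit" =>
      replicator ! Update(CounterKey, GCounter.empty, WriteLocal)(_ + 1)
  }
}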
The Split Brain Resolver (in Akka 2.3 with a Lightbend subscription) makes decisions based on a quorum size or by keeping the oldest nodes, working from the last known state of the cluster.
Willem Meints, Technical Evangelist at Info Support, @willem_meints, Microsoft MVP
For doing recommendations in Spark, Alternating Least Squares (ALS) is your primary option. An interesting Spark tip from Willem: always name your application so you can find it in the Spark UI (port 4040) or the Spark History UI:
val config = new SparkConf().setAppName("MyUniqueSparkAppName")
Cool Quote: "Scala is the most optimum language since you Write once, Read never."
What makes recommendations hard is that you have a sparse matrix of ratings (R ≈ U × I) that can be quite massive and constantly growing. There is no way to optimally solve for the user factors and the item factors at the same time; solving it exactly is an NP-hard problem. You fill U with random values, then solve for the item factors I with a least-squares cost function (approximate: minimize the error between the expected and real values), then fix I and solve for U the same way, alternating back and forth. You need to run many iterations, which is where Spark on a YARN cluster is great. You also need to tweak the number of iterations, the rank, and lambda many times to get results you like.
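Here is a hedged sketch of training an ALS model with Spark MLlib (the Spark 1.x API); the file name, rank, iteration count, and lambda are placeholder values you would tune:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val sc = new SparkContext(new SparkConf().setAppName("AlsRecommenderSketch"))

// Hypothetical ratings file with "user,item,rating" lines
val ratings = sc.textFile("ratings.csv").map { line =>
  val Array(user, item, rating) = line.split(',')
  Rating(user.toInt, item.toInt, rating.toDouble)
}

// Train the model; the knobs you keep tweaking between runs:
val model = ALS.train(ratings, 10, 10, 0.01)   // rank, iterations, lambda

// e.g. the top 5 recommended items for user 42
val recommendations = model.recommendProducts(42, 5)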
(Github)
Step 1: Store recommendations
Step 2: Measure what items the user clicks on in your system
Step 3: Crunch the numbers
They use continuous deployment with many, many iterations, and Akka microservices to kick off asynchronous Spark jobs for a constant stream of runs.
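One way that pattern could look (all names here are assumptions, not from the talk) is an Akka actor using Spark's SparkLauncher to fire off a job without blocking:

import akka.actor.Actor
import org.apache.spark.launcher.SparkLauncher

// Kicks off a Spark job asynchronously on YARN when asked.
class SparkJobActor extends Actor {
  def receive = {
    case "run" =>
      new SparkLauncher()
        .setAppResource("recommender-assembly.jar")   // assumed application jar
        .setMainClass("com.example.RecommendJob")     // assumed main class
        .setMaster("yarn-cluster")
        .launch()                                     // returns a Process handle immediately
  }
}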
Quick Summaries
Intel has some open source benchmarks for Big Data and Scala. One key takeaway: you need to pair the best version of the JDK/JVM with the optimal version of Scala for your use case, and that requires benchmarking.
(Github)