Scala Days started off great, with a smooth check-in and an awesome venue at New World Stages. There was a nice breakfast, and the vendors were already set up and spreading Scala love. Tapad provided a cool lounge with phone and electric hookups. Snacks, coffee, and drinks were plentiful, top shelf, and readily available. There were sessions on everything Scala, from microservices to Big Data. The vendors were the Scala of Scalaists: really cool companies that are very active in the Scala community.
One great talk was Martin Odersky's "Scala: The Road Ahead". He was also signing his updated book, "Programming in Scala, 3rd Edition, Updated for 2.12".
Another great talk was “Scala: The Unpredicted Lingua Franca for Data Science” by Dean Wampler (@deanwampler) and Andy Petrella (@noootsab). (GitHub) The examples were all in Spark Notebook, Andy's project, which is similar to Zeppelin. The toolkit was:
Spark Notebook with Scala 2.11.7, Spark 1.5.2, Hadoop 2.7.1, Hive, Parquet
"Spark with Java makes no sense"
"In Java it's just noise"
"Spark collections are lazy by default."
"Spark's RDD API is inspired by Scala collections API."
Tuple syntax is common in Spark's Scala RDD API, where it is used for key-value pairs.
Write an expressive DSL with groupBy().count.orderBy() instead of embedding SQL strings for SparkSQL.
The Scala API for Spark is the most up to date, has all the features, and is enterprise ready.
The new Big Data world requires a Big Data Team: IT and Data Scientists working together as one team to release something deployable on a production enterprise cluster.
Tooling, Port Models, and Invent New Models
Streaming Clustering: G-Stream, Mean-Shift-LSH, SOM-MR
Spark Packages for Machine Learning
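The notes above on lazy collections, tuple syntax, and a groupBy-style DSL can be illustrated with plain Scala collections, which is the API that Spark's RDDs were said to be inspired by. This is a minimal sketch that does not use Spark at all; the object name and the sample data are made up for illustration, and a lazy `view` only stands in for Spark's deferred evaluation.

```scala
object CollectionsSketch {
  // Hypothetical sample data: (word, count) key-value pairs as tuples.
  val pairs: List[(String, Int)] =
    List(("spark", 2), ("scala", 3), ("spark", 5), ("akka", 1))

  // A view is lazy: map and filter are only recorded here, nothing runs yet.
  val lazyDoubled = pairs.view
    .map { case (word, n) => (word, n * 2) }
    .filter { case (_, n) => n > 2 }

  // Forcing the view with toList is what actually evaluates the pipeline.
  val evaluated: List[(String, Int)] = lazyDoubled.toList

  // Group, aggregate, and sort with method calls instead of a SQL string.
  val totals: List[(String, Int)] = evaluated
    .groupBy { case (word, _) => word }
    .map { case (word, group) => (word, group.map(_._2).sum) }
    .toList
    .sortBy { case (_, total) => -total }

  def main(args: Array[String]): Unit = totals.foreach(println)
}
```

Pattern matching on tuples (`case (word, n) => ...`) is what keeps the key-value code short, which is presumably the point behind the "in Java it's just noise" remark.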
Scala with Devices - Chariot Solutions had a booth showing a really cool demo: a number of Raspberry Pis running JVMs with Scala and Akka talked to a server, and a few dashboards updated in real time.
A Literally-Hands-on Introduction to Distributed Systems Concepts with Akka Clustering
Keywords: Raft, gossip protocols, distributed consensus, Akka Cluster to connect to Twitter streams
8 Fallacies of Distributed Systems
(Akka helps with these)
Network is reliable
Latency is zero
Bandwidth is infinite
Topology doesn't change
Network is homogeneous
(Akka doesn't help with these)
There is one administrator
Transport cost is zero
Network is secure
The CAP theorem: Consistency (all nodes see the same data at the same time), Availability (every request gets a timely response), and Partition tolerance (the system keeps working when the network splits). Unfortunately, this buffet only lets you pick 2 of 3. You cannot provide all three guarantees: a distributed system always needs P, so you pick either C or A.
Each node holds the state of the cluster and tells its neighbors about it; the alternatives, Paxos or a centralized state database, don't scale well. Gossip protocols are a difficult fit for real-time and fast data. Akka uses gossip convergence to make decisions about managing the cluster, and it detects node failures with a failure detector that determines whether neighbors are reachable. What can still happen is a network partition: islands of clustered nodes, cut off from the rest of the cluster, communicate and spread bad data amongst themselves. You could try human intervention with monitoring, but that doesn't scale well, and who wants that job? Check out some more Akka articles from the author's company, HootSuite.
Akka Insights: Distributed Data (experimental in Akka 2.4) lets you maintain state across a cluster with merge-friendly data types, known as conflict-free replicated data types (CRDTs).
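To make "merge-friendly" concrete, here is a minimal sketch of one of the simplest CRDTs, a grow-only counter. It is written from scratch for illustration and is not Akka's Distributed Data implementation; the object name and the node names are hypothetical.

```scala
object GCounterSketch {
  // Grow-only counter: each node only ever increments its own slot.
  final case class GCounter(counts: Map[String, Long] = Map.empty) {
    def increment(node: String, by: Long = 1L): GCounter =
      GCounter(counts.updated(node, counts.getOrElse(node, 0L) + by))

    // The counter's value is the sum over all nodes' slots.
    def value: Long = counts.values.sum

    // Merge takes the per-node maximum, so it is commutative, associative,
    // and idempotent: replicas converge whatever order merges happen in.
    def merge(other: GCounter): GCounter =
      GCounter((counts.keySet ++ other.counts.keySet).map { node =>
        node -> math.max(counts.getOrElse(node, 0L), other.counts.getOrElse(node, 0L))
      }.toMap)
  }

  // Two replicas (hypothetical node names) updated independently.
  val onA: GCounter = GCounter().increment("node-a").increment("node-a")
  val onB: GCounter = GCounter().increment("node-b", 3L)

  def main(args: Array[String]): Unit =
    println(onA.merge(onB).value)
}
```

Because merging never loses an increment and never double counts one, two nodes can accept writes during a partition and still agree once they exchange state again.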
Split Brain Resolver is available in Akka 2.3 with a Lightbend subscription. It makes decisions based on the last known state of the cluster, either by quorum size or by keeping the side with the oldest node.
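The quorum-size idea can be sketched as a small pure function: each side of a partition looks at the nodes it can still reach and survives only if they form a quorum. This is an illustration under that assumption, not the Split Brain Resolver's actual code or configuration; all names below are hypothetical.

```scala
object QuorumSketch {
  sealed trait Decision
  case object DownUnreachable extends Decision // this side keeps running
  case object DownSelf extends Decision        // this side shuts itself down

  // Decide from this node's last known view of the cluster:
  // survive only if the nodes it can still reach form a quorum.
  def decide(reachable: Set[String], quorumSize: Int): Decision =
    if (reachable.size >= quorumSize) DownUnreachable else DownSelf

  def main(args: Array[String]): Unit = {
    // Hypothetical 5-node cluster split 3 / 2, with a quorum size of 3.
    println(decide(Set("n1", "n2", "n3"), quorumSize = 3))
    println(decide(Set("n4", "n5"), quorumSize = 3))
  }
}
```

With a quorum size above half the cluster, at most one side of any split can survive, which is what prevents two islands from both carrying on as "the" cluster.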
Willem Meints, Technical Evangelist @ Info Support, Microsoft MVP (@willem_meints)
For doing recommendations in Spark, Alternating Least Squares (ALS) is your primary option. An interesting Spark tip that Willem gave is to always name your application so you can find it in the Spark UI on port 4040 or in the Spark History UI.
Cool Quote: "Scala is the most optimum language since you Write once, Read never."
What makes recommendations hard is that you have a sparse matrix of reviews (R ≈ U × I, users by items) that can be quite massive and is constantly growing. There is no way to solve optimally for both the user factors and the item factors at once; this is an NP-hard problem. So you pick a random number for every cell in the matrix U, hold it fixed, and fit the item factors I with a least-squares cost function (approximate, minimizing the error between expected and real values). Then you swap: hold I fixed and fit U the same way. You need to run many iterations, which is where Spark on a YARN cluster is great. You need to tweak the number of iterations, the rank, and lambda many times to get results you like.
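The alternating idea can be shown on a toy scale without Spark. The sketch below is a hand-rolled rank-1 version in plain Scala, not Spark MLlib's ALS: the ratings are made up, the rank is fixed at 1, and the factors start at 1.0 rather than at random values so that runs are repeatable.

```scala
object AlsSketch {
  // Observed ratings as (user, item, rating); every other cell is unknown.
  // Hypothetical toy data, not from the talk.
  val ratings: Seq[(Int, Int, Double)] = Seq(
    (0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
    (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0))
  val numUsers = 3
  val numItems = 3
  val lambda = 0.1 // regularization strength

  // Regularized squared error over the observed cells only.
  def cost(u: Vector[Double], v: Vector[Double]): Double =
    ratings.map { case (i, j, r) => math.pow(r - u(i) * v(j), 2) }.sum +
      lambda * (u.map(x => x * x).sum + v.map(x => x * x).sum)

  // Hold the item factors fixed; each user factor then has a
  // closed-form least-squares solution.
  def solveUsers(v: Vector[Double]): Vector[Double] =
    Vector.tabulate(numUsers) { i =>
      val mine = ratings.filter(_._1 == i)
      mine.map { case (_, j, r) => r * v(j) }.sum /
        (lambda + mine.map { case (_, j, _) => v(j) * v(j) }.sum)
    }

  // The mirror image: hold the user factors fixed and solve for the items.
  def solveItems(u: Vector[Double]): Vector[Double] =
    Vector.tabulate(numItems) { j =>
      val mine = ratings.filter(_._2 == j)
      mine.map { case (i, _, r) => r * u(i) }.sum /
        (lambda + mine.map { case (i, _, _) => u(i) * u(i) }.sum)
    }

  // Alternate the two solves for a given number of iterations.
  def run(iterations: Int): (Vector[Double], Vector[Double]) = {
    val u0 = Vector.fill(numUsers)(1.0) // fixed start instead of random
    val v0 = Vector.fill(numItems)(1.0)
    (1 to iterations).foldLeft((u0, v0)) { case ((_, v), _) =>
      val u = solveUsers(v)
      (u, solveItems(u))
    }
  }

  def main(args: Array[String]): Unit = {
    val (u, v) = run(20)
    println(s"cost after 20 iterations: ${cost(u, v)}")
  }
}
```

Each half-step exactly minimizes the cost for one set of factors while the other is held fixed, so the cost never goes up from one iteration to the next. The three knobs mentioned above (iterations, rank, lambda) map onto `run`'s argument, the fixed rank of 1, and `lambda` here.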
Step 1: Store recommendations
Step 2: Measure which items the user clicks on in your system
Step 3: Crunch the numbers
Use continuous deployment with many, many iterations, and use Akka microservices to kick off asynchronous Spark jobs for a constant stream of runs.
Intel has some open source benchmarks for Big Data and Scala. One key point is that you need to pair your best version of the JDK/JVM with the optimal version of Scala for your problem. This requires benchmarking.