
Scala Days NYC 2016: Highlights


Sessions, content, slides from Scala Days NYC 2016 (May 10-11, 2016).


Scala Days started off great, with a smooth check-in and an awesome venue at the New World Stages. A nice breakfast was ready, and vendors were already set up and spreading Scala love. Tapad provided a cool lounge with phone and electric hookups. Snacks, coffee, and drinks were plentiful, top shelf, and readily available. There were sessions on everything Scala, from microservices to Big Data. The vendors were the Scala of Scalaists: really cool companies that are very active in the Scala community.



One great talk was Martin Odersky's "Scala: The Road Ahead". He was also signing his updated book, "Programming in Scala, 3rd Edition", updated for Scala 2.12.

Another great talk was by Dean Wampler (@deanwampler) and Andy Petrella (@noootsab): “Scala: The Unpredicted Lingua Franca for Data Science” (GitHub). The examples were all in a Spark Notebook, Andy's project, which is similar to Apache Zeppelin. His toolkit was:

Spark Notebook with Scala 2.11.7, Spark 1.5.2, Hadoop 2.7.1, Hive, Parquet

The talk highlighted the new distributed data science on Big Data, enabled by Hadoop and Spark, with Scala and notebooks as the primary interface for data exploration and science experimentation. Spark and Hadoop are enterprise-ready open source implementations that make this possible, with APIs as simple as those of scripting languages like JavaScript, Ruby, and Python. Scala is perfect for this new distributed science, with its basis in functional programming combined with objects.

Interesting Quotes

  • "Spark with Java makes no sense"

  • "In Java it's just noise" 

  • "Spark collections are lazy by default."

  • "Spark's RDD API is inspired by Scala collections API."

  • Tuple syntax is common with Spark's Scala RDDs for key-value pairs.

  • Write an expressive DSL with groupBy().count.orderBy() instead of embedding SQL strings for Spark SQL.

  • Scala is the most up-to-date API for Spark, with all the features, and is enterprise-ready.

  • The new Big Data world requires a Big Data Team: IT and Data Scientists working together as one team to release something deployable on a production enterprise cluster.
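Two of the quotes above, lazy collections and the RDD API mirroring Scala's collections, can be illustrated with plain Scala (no Spark required; the sample data and function names below are illustrative):

```scala
// groupBy plus an aggregation, the same shape as rdd.groupByKey / counts:
def countByKey(xs: Seq[(String, Int)]): Map[String, Int] =
  xs.groupBy(_._1).map { case (k, vs) => (k, vs.size) }

// Laziness: a view defers work until the result is forced,
// much like Spark collections are lazy until an action runs.
def lazySquares(xs: Seq[Int]): List[Int] =
  xs.view.map(x => x * x).filter(_ % 2 == 0).toList
```

The Spark RDD versions read almost identically, which is the point of the "inspired by Scala collections" quote.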

Tooling, Port Models, and Invent New Models

Scala with Devices: Chariot Solutions had a booth showing a really cool demo with a number of Raspberry Pis running JVMs with Scala and Akka, talking to a server, with a few dashboards updating in real time.

A Literally-Hands-on Introduction to Distributed Systems Concepts with Akka Clustering

David Russell, Software Developer @Hootsuite Media Inc.

@Davidgraig

Keywords: Raft, gossip protocols, distributed consensus, Akka Cluster connecting to Twitter streams

8 Fallacies of Distributed Systems

(Akka helps with these)

  • Network is reliable

  • Latency is zero

  • Bandwidth is infinite

  • Topology doesn't change

  • Network is homogenous

(Akka doesn't help here)

  • There is one administrator

  • Transport cost is zero

  • Network is secure

CAP Theorem

Consistency (all nodes see the same data at the same time), Availability (every request gets a timely response), and Partition tolerance (the system survives nodes being cut off from each other). Unfortunately, this buffet only lets you pick two of the three. You cannot provide all three guarantees; since partitions will happen, you always need P, so in practice you choose between C and A.

Gossip Protocols

Each node holds the state of the cluster and tells its neighbors about it; the alternatives, Paxos or a centralized state database, don't scale as well. Gossip protocols are difficult for real-time and fast data. Akka waits for gossip convergence before making decisions about managing the cluster, and it detects when a node fails using a failure detector that checks whether neighbors are reachable. What can happen is a network partition: islands of clustered nodes communicate and spread stale data among themselves, cut off from the rest of the cluster. You could try human intervention with monitoring, but that doesn't scale well, and who wants that job? Check out some more Akka articles by the author's company, Hootsuite.

Akka insight: Distributed Data (experimental in Akka 2.4) lets merge-friendly data types maintain state across a cluster using conflict-free replicated data types (CRDTs).
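As a rough sketch of what makes a data type "merge friendly", here is a minimal grow-only counter (G-Counter), the classic introductory CRDT, in plain Scala rather than Akka's actual Distributed Data API:

```scala
// Each node only increments its own slot; merging takes the per-node
// maximum, so replicas converge regardless of the order in which
// gossip messages arrive (merge is commutative, associative, idempotent).
type GCounter = Map[String, Long]

def increment(c: GCounter, node: String): GCounter =
  c.updated(node, c.getOrElse(node, 0L) + 1L)

def merge(a: GCounter, b: GCounter): GCounter =
  (a.keySet ++ b.keySet).map { k =>
    k -> math.max(a.getOrElse(k, 0L), b.getOrElse(k, 0L))
  }.toMap

def value(c: GCounter): Long = c.values.sum
```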

Split Brain Resolver (in Akka 2.3, with a Lightbend subscription) makes decisions based on quorum size or by keeping the oldest nodes; it decides based on the last known state of the cluster.
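The quorum-size strategy boils down to a simple rule (this is a simplified stand-in, not the actual resolver logic): a partition keeps running only if it can still see a strict majority of the last known cluster.

```scala
// A partition that can reach a strict majority of the last known
// cluster size survives; any minority partition downs itself, so two
// islands can never both keep running.
def hasQuorum(reachableNodes: Int, lastKnownClusterSize: Int): Boolean =
  reachableNodes * 2 > lastKnownClusterSize
```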

See also:

Redis Plugin for Akka

Akka on Raspberry Pi

Build a Recommender System in Apache Spark and Integrate It Using Akka

Willem Meints, Technical Evangelist @ Info Support (@willem_meints), Microsoft MVP

For doing recommendations in Spark, Alternating Least Squares (ALS) is your primary option. An interesting Spark tip that Willem gave: always name your application, so you can find it in the Spark UI on port 4040 or in the Spark History UI:

config.setAppName("MyUniqueSparkAppName")
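In context, that is a one-line addition when building the SparkConf (a configuration sketch; it assumes Spark on the classpath, and the app name is just an example):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The name set here is what appears in the UI on port 4040
// and in the Spark History UI after the job finishes.
val conf = new SparkConf().setAppName("MyUniqueSparkAppName")
val sc = new SparkContext(conf)
```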

Cool Quote: "Scala is the most optimum language since you Write once, Read never."

What makes recommendations hard is that you have a sparse matrix of reviews (R = U * I) that can be quite massive and constantly growing. There is no way to solve optimally for the user factors and the item factors at the same time, so you alternate: pick a random number for every cell in U, then fit the item factors with a least-squares cost function (approximate, minimizing the error between expected and real values). Then you swap, holding I fixed and refitting U, and repeat; the exact problem is NP-hard. You need to run many iterations, which is where Spark with a YARN cluster is great. You also need to tweak the number of iterations, the rank, and lambda many times to get results you like.
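The alternating idea can be shown at toy scale with a rank-1 factorization in plain Scala (a deterministic sketch for illustration, not Spark MLlib's ALS):

```scala
// Rank-1 ALS on a small dense matrix: hold the item vector v fixed and
// solve each user weight by regularized least squares, then swap roles.
// lambda is the regularization term; iters is the number of sweeps.
def alsRank1(r: Array[Array[Double]], lambda: Double, iters: Int): (Array[Double], Array[Double]) = {
  val users = r.length
  val items = r(0).length
  var u = Array.fill(users)(1.0) // fixed init here, random in practice
  var v = Array.fill(items)(1.0)
  for (_ <- 1 to iters) {
    val vNorm = v.map(x => x * x).sum + lambda
    u = Array.tabulate(users)(i => (0 until items).map(j => r(i)(j) * v(j)).sum / vNorm)
    val uNorm = u.map(x => x * x).sum + lambda
    v = Array.tabulate(items)(j => (0 until users).map(i => r(i)(j) * u(i)).sum / uNorm)
  }
  (u, v)
}

// Root-mean-square error between the ratings and the reconstruction u * v.
def rmse(r: Array[Array[Double]], u: Array[Double], v: Array[Double]): Double = {
  val errs = for (i <- r.indices; j <- r(0).indices) yield math.pow(r(i)(j) - u(i) * v(j), 2)
  math.sqrt(errs.sum / errs.size)
}
```

Each sweep drives the reconstruction error down; the real algorithm does the same with rank-k factors distributed across the cluster.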

(Github)

Step 1: Store recommendations

Step 2: Measure which items the user clicks on in your system

Step 3: Crunch the numbers

Use continuous deployment with many, many iterations, and Akka microservices to kick off asynchronous Spark jobs for a constant stream of runs.
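A minimal sketch of kicking jobs off asynchronously, with plain Scala Futures standing in for the Akka microservice and Spark submission (runCrunchJob is a hypothetical placeholder, not a real API):

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import ExecutionContext.Implicits.global

// Placeholder for a real Spark job submission; returns when the run completes.
def runCrunchJob(iteration: Int): Future[String] =
  Future { s"iteration-$iteration-done" }

// Fire all runs without blocking between them, then gather the results.
def runAll(iterations: Int): Seq[String] = {
  val jobs = (1 to iterations).map(runCrunchJob)
  Await.result(Future.sequence(jobs), 10.seconds)
}
```

In the talk's setup the caller is an Akka actor rather than a blocking Await, so new runs can be scheduled continuously.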

Quick Summaries

Intel has some open source benchmarks for Big Data and Scala. One key point is that you need to pair your best version of the JDK/JVM with the optimal version of Scala for your issue, and that requires benchmarking.

(GitHub)

