The Barclays Data Science Hackathon: Using Apache Spark and Scala for Rapid Prototyping

DZone 's Guide to

The Barclays Data Science Hackathon: Using Apache Spark and Scala for Rapid Prototyping

In this post, members of the Barclays Advanced Data Analytics Team describe the results of an offsite hackathon to develop a recommendation system using Apache Spark.

· Big Data Zone ·
Free Resource

In the depths of the cold, wet British winter, the Advanced Data Analytics team from Barclays escaped to a villa on Lanzarote, Canary Islands, for a week to collaboratively solve a key business problem: how to design a better customer experience. We framed the problem in the context of using customer shopping behavior data to build a personalized recommender system.

This was a great opportunity to enjoy some sun, surf, and good food, as well as to have fun building a useful application in a short timeframe. The idea was to work collaboratively in a Kaggle-like competition. Each team’s goal was to build an MVP and then repeatedly iterate on it using common interfaces defined by a framework specifically built for our contest.

Our competition operated within the following ground rules.

  • Build, test, and refactor quickly.
  • Work with common structures without constraining individual initiative and innovation.
  • Design for deployment to production on a multi-tenant cluster.

Previously in this blog, we have written about our general experiences with Apache Spark. In this post, we’ll explain how we used Spark (via the Scala API) to rapidly create prototypes in this hackathon-like environment, and the lessons learned about rapid prototyping (as well as building simple machine-learning models).

Building Prototypes That Scale

We wanted to build fast, but we also wanted to build an application that could go straight into production and scale accordingly. Fortunately, Spark allows you to write jobs that can run at small scale on a laptop and at large scale on a cluster with hardly any code changes. So during the hackathon, we could run tests on our laptops (using the Local Spark Context) and then ship the same code to a cluster if we needed more computational power.

Moreover, with Spark you can generate elastic jobs where the computation is optimized based on the available resources (CPU and memory). That means you can theoretically process more data than fits in memory in local mode or in a small cluster because Spark will automatically spill data to disk and then load that data when the corresponding task is invoked. In fact, during the competition, we faced scalability issues even with a sample of data (for example matrix factorization) that would have caused memory issues without Spark.

Consequently, we didn’t have to build “toy” models that were then rebuilt to scale. Rather, we could work directly on the full data set with production-quality code.

Doing Type-safe ETL

As the same dataset was shared across teams, we needed to be confident that it was valid. Spark provided the following ETL functions:

  • Connecting to the data source
  • Extracting and transforming the data
  • Casting data into typed Scala case classes
  • Persisting data efficiently as serialized objects

This approach was simple and efficient and provided great transparency across the process. Thanks to the type safety, we could write comprehensive tests for any transformations or joins, and we didn’t need to worry about whether the data was consistent–we could just get on with our work.

Easy Collaboration on the Same API

We needed easy sharing of components so that each model could run on the same basis with no duplication of effort.

By using Spark’s Scala API, we could define flexible APIs based on domain-specific data types (rather than having to use vectors or the kinds of fixed APIs found in machine learning and statistical packages). This approach was easy and quick, and because the APIs were typed, we were confident that each model followed the defined APIs at compile-time, thereby reducing errors in code.

Here is our basic model API (you can find the rest of the code on Github at https://github.com/gm-spacagna/lanzarote-awesomeness):

case class AnonymizedRecord(maskedCustomerId: Long, generalizedCategoricalGroup: GeneralizedCategoricalBucketGroup,
                            dayOfWeek: Int, merchantCategoryCode: String,
                            businessName: String, businessTown: String, businessPostcode: Option[String])

trait Recommender extends Serializable {
  // returns customerId -> List[(merchantName, merchantTown)]
 def recommendations(customers: RDD[Long], n: Int): RDD[(Long, List[(String, String)])]

trait RecommenderTrainer {
  def train(data: RDD[AnonymizedRecord]): Recommender

Some notes on the above:

  • RecommenderTrainer receives the raw data and has to perform the feature engineering tailored for the specific implementation and return a RecommenderModel instance.
  • The Recommender instance takes an RDD of customer ids and a number N and returns at most N recommendations for each customer.
  • We used the pair (MerchantName, MerchantTown) to represent the unique business we want to recommend.

Common Evaluation Framework

A shared evaluation framework is the key to any successful data science project, so we built a simple one on Spark for evaluating recommenders. It turns out that it is often easier to build such a framework specifically for your use case than to try to force your use case onto a general framework. It was quick and easy to do that in Spark.

This process set the baseline for our competition. When we found half-way through the week that we needed more functionality to evaluate one of our methodologies, we refactored it without affecting any of the previous implementations.

case class RecommenderEvaluation(@transient sc: SparkContext) {

  def splitData(data: RDD[AnonymizedRecord],
                fraction: Double = 0.8): (RDD[AnonymizedRecord], RDD[AnonymizedRecord])
  def evaluate(recommenderTrainer: RecommenderTrainer,
               trainingData: RDD[AnonymizedRecord],
               testData: RDD[AnonymizedRecord],
               n: Int = 20,
                              evaluationSamplingFraction: Double): Double


The evaluate method computes the Mean Average Precision (MAP) at N (default is 20). We quickly implemented our own version combining a bunch of map and fold functions (code available at https://github.com/gm-spacagna/sparkz/blob/master/src/main/scala/sparkz/evaluation/MAP.scala).

How Scala Helped Our Cause

When you are trying to innovate, it is important to focus on the business logic rather than the details of implementation. Spark allowed us to do that because it encapsulates much of the functionality within higher-order functions that can be composed straightforwardly. There are no loops or state to cloud the issues, so if you can understand each function, you can understand the code as a whole. It really helps the creative process to be able to think about and express your ideas with clarity: used appropriately, Scala and Spark encourage and facilitate that.

There was also minimum boilerplate code getting in the way of clarity. So, we were easily able to collaborate, reviewing and commenting on each other’s code with minimal learning curve. We wanted to be able to work on lots of ideas at once, while getting feedback from each other without having to spend time explaining what the code was trying to do.

Here is our “Random Recommender” that recommends random businesses, excluding from the training data the ones where the customer has already spent time:

import scala.util.Random

case class RandomRecommender(@transient sc: SparkContext) extends RecommenderTrainer {
  def train(data: RDD[AnonymizedRecord]): Recommender = {
    val businessesBV = sc.broadcast(data.map(_.businessKey).distinct().collect().toSet)

    val customerIdToTrainingBusinesses = sc.broadcast(
      .reduceByKey(_ ++ _).collect().toMap

    new Recommender {
      // returns customerId -> List[(merchantName, merchantTown)]
      def recommendations(customers: RDD[Long], n: Int): RDD[(Long, List[(String, String)])] =
        customers.map(customerId => customerId -> customerIdToTrainingBusinesses.value.getOrElse(customerId, Set.empty))
        .mapValues(trainingBusinesses => (businessesBV.value -- trainingBusinesses).take(n).toList)

And the following is a better baseline recommending the most popular businesses ranked by number of unique customers:

case class MostPopularItemsRecommender(@transient sc: SparkContext,
                                       maxNumBusinesses: Int) extends RecommenderTrainer {
  def train(data: RDD[AnonymizedRecord]): Recommender = {
    val customerIdToTrainingBusinesses = sc.broadcast(
      .reduceByKey(_ ++ _).collect().toMap

    val rankedBusinessesByNumberOfCustomersBV: Broadcast[List[(String, String)]] = sc.broadcast(
      data.map(record => (record.businessKey, record.maskedCustomerId)).distinct()
      .mapValues(_ => 1).reduceByKey(_ + _).sortBy(_._2, ascending = false).keys.take(maxNumBusinesses).toList

    new Recommender {
      // returns customerId -> List[(merchantName, merchantTown)]
      override def recommendations(customers: RDD[Long], n: Int): RDD[(Long, List[(String, String)])] =
        customers.map(customerId => customerId -> rankedBusinessesByNumberOfCustomersBV.value.filter(business =>

Most of our models used this simple utils function for selecting the top-N elements of a collection without having to do a global sort:

import scala.reflect.ClassTag

object TopElements {
  def topN[T: ClassTag](elems: Iterable[T])(scoreFunc: T => Double, n: Int): List[T] =
    elems.foldLeft((Set.empty[(T, Double)], Double.MaxValue)) {
      case (accumulator@(topElems, minScore), elem) =>
        val score = scoreFunc(elem)
        if (topElems.size < n)
          (topElems + (elem -> score), math.min(minScore, score))
        else if (score > minScore) {
          val newTopElems = topElems - topElems.minBy(_._2) + (elem -> score)
          (newTopElems, newTopElems.map(_._2).min)
        else accumulator

Further Observations

As mentioned previously, although Spark was not the obvious choice for rapid prototyping, we were pleased with the results. However, we have some opinions about some current limitations with respect to building models:

  • ML and MLlib are not always sufficiently flexible and need some extra development. In particular, a number of important statistics (such as the confusion matrix or the score predictions of the classifiers) are hidden within components as private members, which means that you often need to re-implement some of their functionality.
  • The linear algebra libraries available to MLlib are quite limited, and it took us a while to learn how to optimize them for practical use.
  • There are certain inconsistencies between Scala and Spark that can be confusing, such as different implementations of methods like fold, collect, mapValues, groupBy, and so on. Another example is the for comprehension, where you cannot simply iterate through two RDDs as you would with two Lists. When you mix your logic of Scala collections and Spark RDDs, it can be easy to get confused. The strong type system helps to spot those errors, at least.
  • Moreover, in common with many machine-learning libraries, ML and MLlib are based on vectors and do not allow the ad-hoc definition of data types based on the business context. This is a challenge because different models may use very different representations of the data, even for the same prediction problem. Rather, we wanted to express our train and predict methods as functions of our custom types.See this deck for details about how we solve a similar problem for predicting financial products buyers using our own machine-learning pipelines framework.

Models and Results

Over the week we built five types of recommendation models (where we tried to predict what businesses a customer would visit):

  • Business-to-business similarity models: Each pair of businesses is characterized by a similarity coefficient based on the portion of common customers. There are several ways to calculate “similarity” and we experimented with a couple of them (conditional probability and Tanimoto coefficient). This is probably the most scalable of all models since we have relatively fewer business pairs for each large city (say 20 billion for city the size of Bristol) compared to customer pairs (say 100 trillion).
  • Customer-to-customer similarity models: Each pair of customers is characterized by a similarity metric based on common behavior (geographical and temporal shopping journeys) and demographic information (age, residential area, gender, job position…). This is often known as a “nearest neighbors” model. It would be intractable for a large number of customers, but we implemented an useful data structure (from the Vantage Point Tree), which allows us to be very efficient in storing only the relationships between close-ish neighbors.
  • Matrix factorization models: We factorize a matrix of Customers-to-Businesses into two matrices of Customers-to-Topics and Topics-to-Businesses (e.g. Latent Semantic Analysis, SVD). This approach is quite efficient but is essentially an averaging approach—meaning, it is easy to get reasonable results but hard to get brilliant ones.
  • Random Forest models: We built a separate Random Forest model for each business. This process would take a long time to build for every business, but it is quite easy to do (and reasonably effective) for just the top-n ones. Although this model is flexible as well because each business can be treated a bit differently, it doesn’t scale so well (imagine making different assumptions for every business customer) and we are not sure how to order results between businesses.
  • Ensemble models: Where a number of approaches are brought together and we let the machine choose the best features of each.

Although not all of the implementations were computationally feasible given our available resources and time, we manage to have 6 models evaluated end-to-end.


All of these models achieved a MAP@20 of around 12%, which is a pretty good result for a recommender. (Remember, for every national retail chain where you have a lot of customers, you have a lot of local niche businesses where only a small proportion of the customer base will ever business–and these are pretty hard to predict.) The winning model was based on a hybrid approach of k-neighbors and item-to-item models, which scored 16% MAP@20—which proved out for us that simple solutions made of counts and divisions often outperform more advanced ones.


In summary, even if it seems unusual, Spark and Scala were excellent tools for rapid prototyping during the weekespecially for bespoke ad-hoc algorithms. The success of the event was not solely down to technology, however: Innovation requires an environment where great people can connect, where they can set clear ambitious business goals, where they can work together free of distractions and external dependencies, and where the pressure to deliver incredible results comes from the group–where you can fail safely, go to bed, and wake up the next day ready to try something else.

Barclays Advanced Data Analytics is a team of data scientists that builds scalable machine learning and data-driven applications in Spark and Scala.

apache spark, api, big data, data science, etl, hackathon, prototyping, scala

Published at DZone with permission of Justin Kestelyn . See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}