It was great talking with Frank Evans, Data Scientist at Exaptive, about the state of data science with Spark.
Q: What are the keys to a successful data science strategy with Spark?
A: Start by figuring out if Spark is the best tool to accomplish your objective. While it is certainly one of the hottest tools in data science, it’s not necessarily the best solution for every situation and simply using Spark does not ensure the success of your data science initiative.
Understand the business problem you are trying to solve. Spark is right for jobs that require computationally complex work, performed quickly, on really big data — multiple servers' worth. If you don't have that much data or those computationally complex challenges, you will spend a lot of time and money getting Spark up and running and feel like you've wasted both.
I used to be a data scientist at Sonic, the quick-serve restaurant chain. We initially had standard enterprise analytics without computationally complex challenges. When we introduced interactive menu boards, we started generating a tremendous amount of clickstream data that we wanted to use to improve our targeted marketing efforts, enable A/B testing, improve the customer experience, and inform our research and development efforts. This created an initial use case for Spark, and we found many more once we began fulfilling the needs of marketing.
Q: How can companies get more out of data science with Spark?
A: Stay abreast of all the changes taking place with Big Data and Spark. Spark and other Big Data tools are difficult to learn but extremely effective once you've learned them. Tools like Hive with Stinger and Spark SQL have also become much easier to use in a short period of time.
Get these tools into the hands of the people who understand the domain, not just one or two people who know Big Data; otherwise those people become a bottleneck. Bring in interactive data applications like Exaptive, Platfora, and Datameer to build interactive visuals so people can drill down into the data to answer their questions or explore hypotheses. Empower everyone who understands the domain to access the data they need to make informed decisions.
Q: How has Spark changed in the past year? Why did it replace R as the “Big Data” architecture?
A: I view this as three different elements. Big Data is about volume: it is not necessarily computationally complex, and having it does not by itself yield analytic insights. Data science involves intensive machine learning on data, though not necessarily big data. Big Data science combines the two: computationally complex work across multiple servers' worth of data.
R wasn't a Big Data tool; it's more of an interaction language. The R environment doesn't scale to big data, but analytically it can accomplish your goals. Spark is closely related to Scala and Java, and it serves as the underlying engine beneath interaction languages like Java, R, and SQL. You can use R as your interaction language on top of Spark. Mid-level technical data scientists will gravitate toward Spark and interact with it via R or Scala. R is also becoming the language enterprises bring in to write code against SQL Server tables.
Q: What real-world problems are your clients solving with data science and Spark?
A: We worked with the University of Oklahoma on text analytics for an academic research corpus: 25 years of congressional hearing transcripts. We enabled exploration of the texts without anyone having to read all 20,000 entries, which ranged from five to 100+ pages each.
We helped test a thesis about how international topics were discussed in Congress over those 25 years, and how the tone of the conversation changed over time and by party. We used Spark to explore 25,000 documents, building topic models tied back to metadata based on key terms used by committees, and following how those terms evolved over time. The Spark engine split the data across its pool of memory to build the different models and then gave us a mechanism for exploring the data set. This let us apply the analysis wholesale to a large body of text combined with its metadata.
Q: What are the most common issues you see preventing companies from realizing the benefits of data science with Spark?
A: First is the ability to find people who know what they are doing and are knowledgeable about the technologies: technical experts who can build functionality and create the translation layer between the business side and the technical professionals. The tools are becoming easier to use, and a lot of companies are out there working on this problem.
Second is having the experts to set up the environments and infrastructure, which can take six months. The upfront complexity can lead to failure and to wasted time and money. While the tools are becoming easier to use, they are still harder to integrate than they need to be.
Third is the disparity of data, coming from different places in different formats. However, this is a much more solvable problem than the first two.
Q: Where do you see the biggest opportunities in the continued evolution of data science with Spark?
A: The tools are 80 percent there. Python bindings, R bindings, and Spark SQL are making it easier to build an interaction layer, and tools can now map applications and visuals onto SQL queries. Spark's machine learning tools are good. If you understand the high-level libraries of existing tools, Spark makes sense: it plugs naturally into initiatives like Hive with Stinger, where code behind the scenes embeds Spark engines and capabilities.
R, Python, and SQL have to be compiled down to lower-level languages, and the tools that do this continue to get better and more effective. Spark has changed more in the last two years than Oracle has in the past ten. Because the Spark engine can now be embedded in the tools we see, you don't have to learn everything from scratch; increasingly, it is being embedded invisibly into enterprise-class tools.
Q: What skills do developers need to have to work on data science projects with Spark?
A: It depends on where their interests lie. 1) If they want to apply Spark with SQL, they should get a small Spark environment up and running, play with it, submit queries, and get reports back. 2) If the developer is interested in building a translation layer, they need to understand how Spark solves problems. The code to do this is very straightforward: break a problem into a series of pieces, solve each piece individually, and combine those solutions into an answer to the larger problem. Once you've done that, the code is clear, easy to walk through, and easy to test assumptions against.
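That "break the problem into pieces" pattern can be sketched without a cluster at all. The plain Python below stands in for what Spark's RDD API does across many machines: map each piece independently, then reduce the partial results into one answer. The sample lines are made up for illustration.

```python
from collections import Counter
from functools import reduce

lines = ["spark splits work", "work into pieces", "pieces combine"]

# Piece 1: count words in each line independently.
# In Spark, each partition of a much larger data set would do this in parallel.
partials = [Counter(line.split()) for line in lines]

# Piece 2: merge the partial counts into a single result.
# In Spark, this is the reduce step that combines results across servers.
totals = reduce(lambda a, b: a + b, partials)
```

Each component is trivially testable on its own, which is what makes the resulting Spark code easy to walk through and check assumptions against.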
Q: What have I failed to ask that you think developers need to know about data science and Spark?
A: Clean data is important. When you get to very large data sets, it becomes less necessary to make very complex computations. Pattern recognition comes from straightforward analytics, with insights separating themselves from the noise.