In this new DZone series, Coffee With a Data Scientist, we interview various data scientists working on projects in machine learning, deep learning, data analytics, and/or big data in an effort to learn more about the work from the people who know it best. Oh yeah, and the coffee aspect of it all... we like to buy all of our interviewees a coffee. So, if you're a data scientist who is up for talking with us and you like coffee, please get in touch.
To start the series off, we spoke with Rob Hickey, VP of Engineering at DataRobot, to ask him about his work as a data scientist and learn more about the DataRobot platform.
DZone: Tell us a bit about your background and your data science journey.
Rob: I am relatively new to the data science field. My background is in software development and my domain experience ranges from application firewalls and load balancing to secure video delivery platforms. I developed an affinity for data science working with some of the leading premium content providers, using data gleaned from the video platform we developed to provide a richer user experience. Determining what content might be most appealing to users based on seemingly disparate data like location, device, and time of day was eye-opening. During this time, I was introduced to Jeremy Achin and Tom de Godoy, the data scientists who founded DataRobot. That was when my education began in earnest. We worked together informally, and I eventually joined the company in September of 2015. Since then, I have had the privilege to work with, and learn from, some of the leading data scientists in the world.
Tell us about DataRobot and explain a bit about how the platform helps businesses and data scientists.
DataRobot offers an enterprise machine learning platform that empowers users of all skill levels to automatically build accurate predictive models to make better data-driven decisions. Incorporating a library of hundreds of the most powerful open source machine learning algorithms, the DataRobot platform automates, trains, evaluates, and deploys predictive models in parallel, delivering more accurate predictions at scale. DataRobot provides the fastest path to data science success for organizations of all sizes.
How was the DataRobot platform built and how does it integrate with existing business systems?
The initial implementation was a cloud-based application that allowed users to upload datasets for processing. That initial implementation has been enhanced to allow the platform to be used by enterprise customers, either in their private cloud or on their own physical servers. The application supports a REST-based API that allows customers to integrate with their ETL systems for data acquisition, BI tools for visualization, and with various other clients when using the DataRobot application to generate either batch-mode or single row, low-latency predictions.
How does DataRobot scale its performance? Can you give an overview on how it optimizes/automates the model building process?
DataRobot application workers are allocated to an organization, user and project and those workers process models against a given dataset. Additional workers can be allocated to a project on a per-user or per-organization basis to further parallelize the processing. The system can be configured to autoscale additional workers if demand for workers exceeds a specified threshold. The system scales linearly with the addition of workers.
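As a rough illustration of the threshold-based autoscaling and linear scaling described above, here is a minimal Python sketch. It is not DataRobot code: the `WorkerPool` class, its fields, and the "queued jobs per worker" trigger are all hypothetical names and assumptions chosen for the example, and the timing estimate assumes perfectly linear scaling.

```python
from dataclasses import dataclass
import math


@dataclass
class WorkerPool:
    """Hypothetical pool of modeling workers assigned to one project."""
    workers: int          # workers currently allocated
    max_workers: int      # assumed hard cap for the organization
    queue_threshold: int  # assumed queued jobs per worker that triggers scaling

    def autoscale(self, queued_jobs: int) -> int:
        """Add workers when queued jobs per worker exceed the threshold."""
        if queued_jobs > self.workers * self.queue_threshold:
            needed = math.ceil(queued_jobs / self.queue_threshold)
            self.workers = min(needed, self.max_workers)
        return self.workers


def estimated_minutes(total_model_minutes: float, workers: int) -> float:
    """Idealized linear scaling: wall-clock time is total work divided by workers."""
    return total_model_minutes / workers


pool = WorkerPool(workers=2, max_workers=10, queue_threshold=4)
print(pool.autoscale(queued_jobs=6))         # 6 <= 2 * 4, so the pool stays at 2
print(pool.autoscale(queued_jobs=30))        # 30 > 8, so the pool grows to ceil(30 / 4) = 8
print(estimated_minutes(240, pool.workers))  # 240 / 8 = 30.0
```

In this sketch the pool only grows; a real system would also need a rule for releasing idle workers.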
How do you evaluate if the insight you obtain from DataRobot's automated machine learning is relevant to the problem domain?
This is one of the areas where DataRobot really shines. The application gives the user many ways to evaluate the predictive accuracy of every model as well as to learn insights from the model. This includes all of the common model evaluation metrics and visualizations, along with some important proprietary visualizations to understand model insights and performance. Users can evaluate which variables are most important according to each model, and our model X-ray shows how changing the value of each feature would change predictions. DataRobot uses word clouds based on text mining to show insights on text variables. While analysts and data scientists expect models to predict what will happen, we have features like reason codes that can explain the "why" behind every prediction we make. Users apply these insights both to tweak their models for greater accuracy and to increase their confidence about the models they use for important business problems.
What technical challenges did your team encounter while developing DataRobot and how did you overcome them?
The team is continually looking to optimize the product and we inevitably encounter challenges as we develop new features. One area that we focus on is performance, especially for the modeling time of our in-memory models and the speed and scale of our low-latency predictions. We want to generate accurate models as quickly as possible and that often requires us to optimize models to run in an embedded environment. Similarly, we want to process predictions as quickly as possible and at scale. Often, products that are developed for data scientists lack enterprise-class scale and reliability, while products that demonstrate scale and reliability don’t perform as well as hand-built models. DataRobot has married a reliable, extensible platform with the latest data science technology, tuned for high performance and scale at the same time.
Is there anything in particular that a Java/R/Python developer can work on using the DataRobot platform? How can one begin learning DataRobot?
R and Python users can use the DataRobot client SDKs to integrate with the DataRobot application. Java and Python users can use code generated by the DataRobot application to integrate into their stand-alone applications. The application makes it easy to add models and keep up with the latest technical advancements.
Learning DataRobot is best done by attending a DataRobot University (DRU) class. Classes are offered around the world and both the seasoned data scientist and the data analyst will find the classes invaluable. We also offer a class tailored specifically for executives to provide insight into the capabilities of the tool and exposure to data science at a high level.
We hear a lot about data science automation. What's your take on it?
DataRobot is, at its core, an automation platform. Building bespoke predictive models requires highly skilled data scientists to work for long periods of time to produce good results that aren’t always easy to deploy. As businesses become more interested in capitalizing on their data, the conventional process simply does not scale.
With DataRobot, choosing the right model is much easier since we automate best practices around model validation, model selection, feature selection, etc. We also implement hundreds of different machine learning approaches, more than any data scientist could ever implement on any project. This allows users to review many algorithms in an accelerated fashion and quickly settle on the most accurate model. In addition, business analysts and other non-data-scientist users can take advantage of the same advanced machine learning algorithms as the data scientists, without having to learn all of the maths and statistics. Automation is coming to data science, and it's the forward-looking data scientists who will realize that tools like DataRobot will help them attack more predictive analytics challenges with more accuracy and in much less time.
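To make the idea of automated model validation and selection concrete, here is a minimal, standard-library Python sketch. It does not show DataRobot's implementation. It assumes a toy one-feature regression problem, two hypothetical candidate models (`fit_mean` and `fit_line`), and k-fold cross-validated mean squared error as the selection metric. Every candidate is scored the same way, and the one with the lowest error is kept.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean."""
    m = mean(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Candidate: ordinary least squares on a single feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def cv_mse(fit, xs, ys, k=5):
    """Mean squared error averaged over k held-out folds."""
    fold_errors = []
    for fold in range(k):
        train = [i for i in range(len(xs)) if i % k != fold]
        test = [i for i in range(len(xs)) if i % k == fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        fold_errors.append(mean([(model(xs[i]) - ys[i]) ** 2 for i in test]))
    return mean(fold_errors)


def select_model(candidates, xs, ys, k=5):
    """Score every candidate identically and return (best_name, scores)."""
    scores = {name: cv_mse(fit, xs, ys, k) for name, fit in candidates.items()}
    return min(scores, key=scores.get), scores


# Toy data (an assumption for the example): y is roughly 3x + 2 with a small wobble.
xs = [float(i) for i in range(20)]
ys = [3 * x + 2 + (0.5 if i % 2 else -0.5) for i, x in enumerate(xs)]

best, scores = select_model({"mean": fit_mean, "line": fit_line}, xs, ys)
print(best)  # prints "line": the linear fit has the lower cross-validated error
```

Scoring on held-out folds rather than on the training data is what keeps the comparison fair, since a more flexible model cannot win just by memorizing the rows it was fit on.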
Technology is changing at a rapid pace. What is the future of data science in the next 5 to 10 years as you see it?
The gap between what can and can’t be learned by ML will continue to narrow. There will also be more automation that focuses on ease of use. The democratization of machine learning will happen in the near future. It can’t happen through education alone; it also requires making applications like DataRobot accessible to an ever-increasing group of users.
When you are not working with DataRobot, what do you like to do in your free time?
I like to spend time with my family and enjoy anything that involves being outdoors and active.
What books or reading material would you suggest to aspiring, young data scientists?
There are many good books on machine learning and statistical learning. My favorite is The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman. A more accessible version of this book is An Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani. The online documentation of the popular scikit-learn machine learning toolkit has lots of practical tips and example code to get started.
Have you ever come across DZone? As an expert data scientist, what are your suggestions for improving our coverage of Machine Learning and Data Science to meet the needs of data professionals?
I was not familiar with DZone before, but I am impressed with the breadth of topics being covered as it relates to Big Data in general and Machine Learning in particular. It would be good to keep focusing on process automation and on making Machine Learning more accessible to people.
Is there anything I haven't asked you about that you'd like to add? (Cool things companies or individuals are doing with DataRobot, Interesting happenings in Machine Learning that you want to mention, etc.)
Customers are doing some very cool things with DataRobot. The application is helping in almost every market and the insights generated by the application are proving to be very valuable. Our customers want to do more with their data and we are happy to help them innovate. Universities are using the application to further their research and to help educate the next generation of Data Scientists. In the future we will see the continued buildout of the product into tangential markets, new ways to provide access to the application for individuals and improvements in scale and ease of use. It is an exciting time to be involved in Data Science and Automated Machine Learning, in particular.
Thanks for the interview, Rob.