Following the Anaconda Path to Success for Enterprise Data Science
The adoption of open-source tools for data science is accelerating both for individuals and enterprises. And the reason is Anaconda.
It was great speaking with Ian Stokes-Rees, a computational scientist at Continuum Analytics, about their recent release of Anaconda version 4.4. Continuum claims that Anaconda is one of the leading platforms for “open data science,” designed to address the challenge of “doing data science” in an easy, reliable, reproducible, and collaborative fashion. With reports of over a million downloads a month and a global community of millions of users, it appears there is some substance to that claim.
Recent research finds that 89% of companies have at least one data scientist, but fewer than half have a data science team, with the job of “doing data science” often spread across an organization. Finding ways in which those individuals can come together, collaborate with others in the organization, and provide meaningful analysis and insights from rapidly growing business data is a significant challenge. A new generation of open-source tools and software has emerged to address this need, and it is seeing adoption across the spectrum: from sophisticated financial services firms to enterprises that have been slow to formally adopt data science. Ian and I discussed how data science can up the enterprise game and add a layer of intelligence critical for further success.
According to Ian, Anaconda has a thriving community of both commercial and personal users thanks to its availability for Windows, Mac, and Linux, its modularity, and its regular quarterly releases. To support commercial users, Continuum has developed Anaconda Enterprise, which layers on top of existing data science capabilities within an organization to create a complete platform — a rich ecosystem built around Python and R that links together data services, storage systems, and compute resources. Ian described how Anaconda was designed to make it easy to transition analysis from a laptop to a server-based “data lab” and then into workflows that can execute automatically and in parallel on massive compute clusters.
Prior to Anaconda, every open-source data science platform was built by hand from bits and pieces of software downloaded from places such as GitHub, causing a maintenance nightmare. Anaconda brought the pieces together into an integrated and optimized system, allowing anyone to get up and running in minutes with hundreds of tools and libraries for data processing, analysis, and visualization. Ian reports that Continuum curates and manages about 2,000 of these packages, while another 100,000 are community-contributed and available as part of the larger Anaconda ecosystem.
What are the keys to a successful big data and data science strategy?
At Continuum, we see a common three-step path to success: empower individuals to access and analyze data; empower teams to collaborate; empower governance and reproducibility. Anaconda has been designed to play a part in all three of these steps. But if there is one critical piece, it is the “team collaboration” part. We see this anchored around a central “Data Lab” that gets people working together, connected to enterprise data sources, supported by high-performance infrastructure, developing analyses that can deliver real business value.
How can companies get more out of their data with data science?
The Data Lab is central to this. It enables exploratory data science where a team can partner with different parts of the business to understand data sources, identify problems to be solved, and work towards analyses that can inform business strategy and operations. Without the Data Lab, individuals will work in isolation with only subsets of data, or teams will spend weeks to months slowly iterating on “full scale” analysis workflows.
What are the most common issues you see preventing companies from realizing the benefits of big data and data science?
Let me provide three. Probably the leading issue comes from technology-led rather than business-led data science initiatives. For example, we’ve seen organizations launch big Hadoop clusters without knowing anything about the data that is going to go in, the business problem to be solved, or the people who are going to use the system. A lot of companies will try Hadoop and walk away after a few months, having wasted a lot of time and money.
The next most common is the disconnect between data science teams and IT operations: an analysis routine is developed in R or Python by a single person on a subset of data, but then an operations team needs to do an automated, parallel production deployment to run the analysis workflow at scale. There are a lot of things that can go wrong in that handover. Anaconda has been designed to address both of these challenges, allowing data science teams to work closely with business experts and smoothing the transition from local systems into production systems.
The third obstacle I’d offer is that corralling business data into a “data lake” is just the first step in being able to derive value from it. If not done carefully, this can create more problems than it solves. It is essential to think through the process and ensure it empowers people to be able to use that data efficiently.
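To make the handover mentioned above a little more concrete, here is a minimal sketch, not anything Continuum or Anaconda ships. It assumes a hypothetical routine `analyze` that is first tried on a small sample and then applied, unchanged, to many partitions in parallel. The function names and the data are invented for illustration, and a local thread pool stands in for whatever cluster scheduler a real deployment would use.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def analyze(partition):
    """Hypothetical analysis routine: the mean of one partition of values."""
    return mean(partition)


def run_at_scale(partitions, max_workers=4):
    """Apply the same routine to every partition concurrently.

    A thread pool is only a stand-in here (an assumption for this sketch);
    results come back in the same order as the input partitions.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyze, partitions))


# Step 1: one person explores on a small sample (made-up numbers).
sample = [1.0, 2.0, 3.0]
sample_result = analyze(sample)

# Step 2: the identical function is run over all partitions.
partitions = [[1.0, 2.0, 3.0], [10.0, 20.0], [5.0]]
all_results = run_at_scale(partitions)
```

The point of the sketch is that the routine itself does not change between the two steps; only the code that drives it does, which is one way to reduce what can go wrong in the handover.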
What are some real-world problems your clients can solve with Anaconda?
We have a lot of customers in financial services who were previously writing their quantitative trading algorithms in C or C++. With Anaconda, they’ve been able to transition to Python, as it can now match the performance expectations of this industry while also offering a syntax that encourages rapid development. On top of that, it comes packed with hundreds of high-quality numerical algorithms, data processing routines, and visualization tools. With the Anaconda Enterprise platform, they are then able to move their algorithms from development to production quickly and reliably.
What do developers need to be successful working on big data and data science projects?
To differentiate themselves, aspiring data scientists require an understanding of machine learning models as they apply to big data. A lot of data scientists today don't understand enough statistics or the foundational structure of certain techniques. TensorFlow is a great example: it seems everyone is trying to use it everywhere, even if a neural network is the wrong modeling approach and a simple linear model would make more sense. The next piece that we bring to data science projects is professional engineering. Individuals can stand out through discipline around code management, testing, and documentation. We see organizations suffering from poorly developed analysis routines every day, whether they were written in SAS, MATLAB, C, R, or Python, although we have to admit it seems like SAS code is especially easy to make incomprehensible. Finally, success will come from being able to embrace open-source technology. There is no question that the most exciting things happening in the world of data science are happening in open-source software.
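As a rough illustration of the "simple linear model" point, the sketch below fits an ordinary least-squares line using only the Python standard library and reports R², the kind of baseline one might check before reaching for a neural network. The helper names and the data points are invented for this example; they are not from the interview.

```python
from statistics import mean


def fit_linear_baseline(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Fraction of the variance in ys explained by the fitted line."""
    y_bar = mean(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Made-up, roughly linear data, purely for illustration.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

slope, intercept = fit_linear_baseline(xs, ys)
print(f"slope={slope:.3f} intercept={intercept:.3f} "
      f"R^2={r_squared(xs, ys, slope, intercept):.3f}")
```

If a baseline like this already explains most of the variance, that is a reasonable signal that a more complex model may not be needed.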
What does the future hold for data science?
As tools mature, we see more and more people becoming involved in the data science process either directly or indirectly through collaboration. We see people who call themselves “data scientists” becoming more disciplined and adopting established software engineering techniques. To date, many organizations have seen their data science expenditures as an “innovation investment” but increasingly, these budgets will need to be justified by cost savings and revenue generation. We believe Hadoop investments will especially need to prove their value. There is certainly a demand for more graphical interfaces for advanced analytics in data science, and it may be that in a few years Tableau, Power BI, Cognos, or some new player will emerge that can deliver on that. In the meantime, you won’t be surprised to hear me say that we see more people turning to Python and Anaconda as the foundation of their data science.