The Managed Data Lake: A Strong Foundation for Data Science
What is Data Science? The jury is still out on a precise definition. This probably has to do with the reality that the field is constantly evolving as the types of data and the tools we have to extract value from data also evolve. A Booz Allen Hamilton guide says that data science is about turning data into action and delivering this actionable intelligence in an understandable way to business end users. Data science pioneers Thomas H. Davenport and D.J. Patil aptly described the iterative nature of data science in a Harvard Business Review blog post, stating “data scientists help decision makers shift from ad hoc analysis to an ongoing conversation with data.” Others specify that unlike business intelligence, data science uses complex algorithms and machine-learning/predictive analytics to not only look for answers, but discover new questions to ask.
At Zaloni, we see data science as an umbrella term encompassing many data-related activities that have been going on for some time. The distinction is that true data science brings formerly siloed disciplines (and people) together. Data science teams must not only understand data and develop technology and algorithms, but also collaborate effectively with business partners and ultimately solve problems that create value for enterprises. They must be able to put those problems into an analytics framework, apply mathematical techniques, and then translate the findings back into business results. Today, many successful data scientists come from fields with a strong data, statistical and computational focus – anything from astrophysics to systems biology.
Building a Strong Foundation for Data Science
Data science is what helps us find new ways to discover, combine and manipulate Big Data to make analysis possible as the volume of data balloons and the variety of data becomes more complex. The data science process involves four basic steps: 1) data discovery and acquisition, 2) data preparation, 3) data analytics and modeling, and 4) delivering actionable insights or product deployments that solve real business problems. Although analytics, modeling and business insights are the more exciting parts of this process, it’s important to note that they are unachievable without a solid investment in steps one and two.
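The four steps above can be sketched as a small pipeline. This is an illustrative Python sketch only: the function names, the sample records, and the per-customer average standing in for a "model" are all assumptions made for the example, not part of any particular platform.

```python
from statistics import mean

# Hypothetical raw records, standing in for data landed from several sources.
RAW_SOURCES = {
    "web": [{"customer": "a", "spend": "10.5"}, {"customer": "b", "spend": ""}],
    "store": [{"customer": "a", "spend": "4.5"}, {"customer": "c", "spend": "20"}],
}

def discover_and_acquire(sources):
    """Step 1: gather raw records from every source, tagging their origin."""
    return [dict(rec, source=name) for name, recs in sources.items() for rec in recs]

def prepare(records):
    """Step 2: drop records with a missing spend value and convert types."""
    return [dict(r, spend=float(r["spend"])) for r in records if r["spend"]]

def analyze(records):
    """Step 3: a stand-in for modeling: average spend per customer."""
    by_customer = {}
    for r in records:
        by_customer.setdefault(r["customer"], []).append(r["spend"])
    return {customer: mean(values) for customer, values in by_customer.items()}

def deliver(insights):
    """Step 4: turn the result into a plain-language summary for business users."""
    top = max(insights, key=insights.get)
    return f"Highest average spend: customer {top} ({insights[top]:.2f})"

def run_pipeline(sources):
    """Chain the four steps; each one consumes the previous step's output."""
    return deliver(analyze(prepare(discover_and_acquire(sources))))

if __name__ == "__main__":
    print(run_pipeline(RAW_SOURCES))
```

Even in this toy version, steps three and four only produce a sensible answer because step two removed the record with the missing value, which is the point about investing in the early steps.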
How should an enterprise start on its journey to developing a data science capability? The concept of the data lake is emerging as one of the best tools to enable data scientists to prepare data for analysis and get the most value from it. A data lake is a Hadoop-based unified data repository in which raw data from multiple sources can be stored. With all data in one place rather than in separate silos, it’s far easier for data science teams to discover the data they need and to get a more complete picture for a particular use case.
Further, building a data lake in the cloud allows enterprises to create a federated Hadoop data management platform and data catalog spanning on-premise and cloud-based computing. A managed data lake – data lake plus data management platform – helps data science teams increase productivity, reducing much of the time spent on data acquisition and data preparation. Additionally, a data management platform provides a programmatic and repeatable way for data scientists to access and manipulate data – necessary to support the iterative process of data science.
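To make "programmatic and repeatable" access concrete, here is a minimal in-memory sketch of a data catalog in Python. It is a toy under stated assumptions: the `CatalogEntry` and `DataCatalog` classes, the dataset names, and the location labels are hypothetical and do not represent any vendor's API.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    name: str
    location: str  # hypothetical label, e.g. "on-premise" or "cloud"
    tags: set = field(default_factory=set)

class DataCatalog:
    """Toy catalog: register datasets once, then find them the same way every time."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Registering under an existing name replaces the earlier entry.
        self._entries[entry.name] = entry

    def search(self, tag):
        """Return the names of datasets carrying the tag, sorted for repeatable output."""
        return sorted(e.name for e in self._entries.values() if tag in e.tags)

# Hypothetical datasets spanning on-premise and cloud storage.
catalog = DataCatalog()
catalog.register(CatalogEntry("orders_raw", "on-premise", {"sales", "raw"}))
catalog.register(CatalogEntry("clickstream_raw", "cloud", {"web", "raw"}))
catalog.register(CatalogEntry("orders_clean", "cloud", {"sales"}))

if __name__ == "__main__":
    print(catalog.search("sales"))
```

Because lookups go through one catalog rather than through knowledge of where each file lives, the same search can be rerun on every iteration of an analysis regardless of whether the data sits on-premise or in the cloud.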
Derive the Most Value From a Data Lake
The ability to leverage Big Data is a significant competitive advantage for most enterprises, whether business insights are used to improve operational efficiencies or develop new products and services. We think there are two key considerations to derive maximum value from a data lake.
First, while many companies are investing or considering investing in a data lake, most do not have the internal skill set to use it for data science projects. Therefore, it’s critical for enterprises to consider early in the process how they will build a data science capability so that they can deliver on the expected ROI of a data lake deployment.
Second, not everyone who wants to use data in an enterprise can or needs to be a data scientist. That’s why it’s important for enterprises to develop strategies that “democratize” access to data. With today’s Big Data management and self-service data preparation tools, business users can discover and prepare data for analytics without needing to involve data scientists and/or IT at every step. Taking this multi-pronged approach to the data lake gives enterprises more opportunities to advance their maturity in Big Data analytics and data science.
Published at DZone with permission of Satish Vutukuru.