Data Science in a Box With Dataiku
In this article, we explore an application that makes it easy for developers to play the role of data scientist, and we interview a product manager at the company.
Data science is the new hotness, with thousands of job postings (some of which really aren’t data science) and dozens of platforms promising to help professionals in the field work more effectively. In typical fashion, not all of these tools are new; many are repurposed for new use cases, with tools such as Python, R, and Hadoop experiencing fresh surges in interest thanks to the ‘new’ field of data science.
One of the best-conceived and most cohesive tools I’ve seen is Dataiku. It aims to package all the tools that a data scientist, and the teams that work with them, might need into one application.
Dataiku consists of a handful of open-source components (many of which you might recognize) bound together with proprietary code, with free and enterprise editions that you can install locally or in the cloud. For this review, I use the Mac version of the free desktop client.
Download the application, run it, and your browser will automatically open to http://localhost:11200. Then head over to the New project section and choose one of the helpers to get started; I chose the ‘Tutorial 101 Starting project.’
You can import data from a local or server file system, Hadoop, a variety of SQL and NoSQL sources, cloud storage providers, and further options provided by plugins. After scanning your data, Dataiku shows a preview and some options for tweaking the import and schema; then you’re ready to create your dataset by clicking the green create button.
Next, you will see the Data exploration screen, where you can view, filter, sort, and analyze your data (the analyze option provides a column-based overview). There are also processors for certain data types, for example, geocoding location data. You can create a wide variety of charts by dragging and dropping fields, or switch between chart types for a preview.
Useful so far, but you can also mix and match the GUI with Python, R, and SQL; if you have ever used Jupyter notebooks, the style will be familiar to you. I’m no Python programmer, but thankfully there’s also a built-in console and debugger to help me figure out what the problem is.
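To give a flavor of the notebook-style code you might write at this step, here is a plain-pandas sketch of a small transformation cell. The column names and data are invented for illustration; inside Dataiku, a dataset would normally be loaded through the product’s own Python integration rather than inlined like this.

```python
import pandas as pd

# Toy stand-in for a dataset loaded in a notebook cell.
df = pd.DataFrame({
    "fare": [5.0, 12.5, 7.25],
    "tip": [1.0, 2.0, 0.0],
})

# Derive a new column, as you might in a Python recipe.
df["tip_pct"] = (df["tip"] / df["fare"] * 100).round(1)
print(df["tip_pct"].tolist())
```

The same transformation could equally be written as SQL or R in the corresponding notebook type.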
For non-coders, Dataiku offers built-in machine learning models for predicting and clustering data, plus the ability to create and train your own models. Again, creating your own is a matter of clicking, dragging, and selecting options; for example, I created a model to show me which taxi pickups fell on weekends and public holidays in the US.
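The weekend/holiday feature behind that example can be derived with a few lines of pandas. This is a minimal sketch with invented column names and a toy holiday list, not Dataiku’s internal implementation:

```python
import pandas as pd

# Hypothetical taxi-trip data; in Dataiku this would come from a dataset.
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2017-07-01 09:15",  # Saturday
        "2017-07-03 08:30",  # Monday
        "2017-07-04 10:00",  # Tuesday, a US public holiday (July 4th)
    ]),
})

# Toy holiday calendar; a real pipeline would use a proper holiday source.
us_holidays = {pd.Timestamp("2017-07-04")}

# dayofweek: Monday=0 .. Sunday=6, so >= 5 means weekend.
trips["is_weekend"] = trips["pickup_datetime"].dt.dayofweek >= 5
trips["is_holiday"] = trips["pickup_datetime"].dt.normalize().isin(us_holidays)
```

Flags like these are typical inputs to the prediction and clustering models the GUI builds.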
And finally, assembling all these components together is the workflow section, where you can define which steps to run and in what order, triggered manually or programmatically via a REST API.
This only scratches the surface of what anyone needing to process and analyze large data sets can accomplish with Dataiku. You can find more details on their website, or listen to the interview I conducted with Claude Perdigou, a product manager with the company.
Opinions expressed by DZone contributors are their own.