Data Science in a Box With Dataiku

In this article, we explore a new application that makes it easy for devs to play the role of data scientist and interview a PM at the company.

By Chris Ward, CORE · Sep. 06, 17 · Review

Data science is the new hotness, with everything from thousands of job postings (some of which really aren’t data science) to dozens of platforms promising to help professionals in the field do their jobs more effectively. In typical fashion, not all of these tools are new. Many are repurposed for new use cases, and tools such as Python, R, and Hadoop are experiencing new surges in interest thanks to the ‘new’ field of data science.

One of the most well-conceived and cohesive tools I’ve seen is Dataiku. It aims to package all the tools that data scientists, and the teams that work with them, might need into one application.

To experiment with Dataiku, you will need a decently sized dataset. I opted for the time-honored NYC Taxi trip records but, for the sanity of my laptop, used only a couple of gigabytes of the data.
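If you also want to trim a large CSV export down to something laptop-friendly before importing it, one option is to keep only every Nth row. The following is a minimal sketch that uses only the Python standard library. The function name `sample_csv` and the `keep_every` parameter are hypothetical names made up for this illustration, and this is not necessarily how the subset used in this review was produced.

```python
import csv
from typing import TextIO


def sample_csv(src: TextIO, dst: TextIO, keep_every: int = 10) -> int:
    """Copy the header and every `keep_every`-th data row from src to dst.

    Returns the number of data rows written (the header is not counted).
    """
    if keep_every < 1:
        raise ValueError("keep_every must be at least 1")

    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")

    header = next(reader, None)
    if header is None:
        return 0  # empty input, nothing to copy
    writer.writerow(header)

    written = 0
    for index, row in enumerate(reader):
        # Rows are streamed one at a time, so memory use stays flat
        # no matter how large the source file is.
        if index % keep_every == 0:
            writer.writerow(row)
            written += 1
    return written
```

To use it on real files, open the source and destination with `open(..., newline="")` and pass both file objects in. A systematic sample like this keeps the rows in their original order, which is convenient for time-based data but is not a random sample.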

Dataiku consists of a handful of open-source components (many of which you might recognize), but the software itself is closed source, with those components bound together by proprietary code. It comes in free and enterprise editions that you can install locally or in the cloud. For this review, I will use the Mac version of the free desktop client.

Download the application and run it, and your browser will automatically open to http://localhost:11200. Then head over to the New project section and choose one of the helpers to get you started. I chose the ‘Tutorial 101 Starting project.’

New template project

You can import data from a local or server file system, Hadoop, a variety of SQL and NoSQL sources, cloud storage providers, and further options provided by plugins. After scanning your data, Dataiku provides a preview and some options for tweaking the import and schema. You’re then ready to create your dataset by clicking the green create button.

Import data

Next, you will see the Data exploration screen, where you can view, filter, sort, and analyze your data (the analyze option provides a column-based overview). There are also processors for certain data types, for example, geocoding location data. You can create a wide variety of charts by dragging and dropping fields, and switch between chart types to preview them.

Data Explorer

Charts

Useful so far, but you can also mix and match the GUI with Python, R, and SQL. If you have ever used Jupyter notebooks, the style will be familiar to you. I’m no Python programmer, but thankfully there’s also a built-in console and debugger to help me figure out what went wrong.

For the non-coders, Dataiku offers built-in machine learning models for prediction and clustering of data, and the ability to create and train your own models. Again, creating your own is a matter of clicking, dragging, and selecting options. For example, I created a model to show me which taxi pickups fell on weekends and public holidays in the US.
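For comparison, the weekend and holiday question can also be expressed as a plain rule in Python. The sketch below is a hand-written check rather than a trained model, and it is not what Dataiku generates. The holiday list is a small illustrative subset of 2016 US federal holidays that I am assuming for the example, and the names `US_HOLIDAYS_2016` and `is_weekend_or_holiday` are hypothetical.

```python
from datetime import date, datetime

# Illustrative subset of 2016 US federal holidays. This list is an
# assumption for the sketch; it does not come from Dataiku or from
# the taxi dataset, and it is deliberately incomplete.
US_HOLIDAYS_2016 = frozenset({
    date(2016, 1, 1),    # New Year's Day
    date(2016, 1, 18),   # Martin Luther King Jr. Day
    date(2016, 7, 4),    # Independence Day
    date(2016, 11, 24),  # Thanksgiving Day
    date(2016, 12, 26),  # Christmas Day (observed)
})


def is_weekend_or_holiday(
    pickup: datetime,
    holidays: frozenset[date] = US_HOLIDAYS_2016,
) -> bool:
    """Return True if the pickup falls on a Saturday, Sunday, or listed holiday."""
    # datetime.weekday() numbers Monday as 0, so 5 and 6 are the weekend.
    return pickup.weekday() >= 5 or pickup.date() in holidays
```

Applied to each pickup timestamp, this yields a boolean column that you could chart or filter on, which is roughly the kind of output the visual approach is aiming for.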

Analysis

Anomaly detection

Finally, the workflow section ties all these components together. There you define which steps to run and in what order, and the workflow can be triggered manually or programmatically via a REST API.
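To make the idea of "which steps to run, and in what order" concrete, here is a small sketch of a dependency-ordered step runner built on Python's standard `graphlib` module (Python 3.9 or later). It illustrates the general concept only. It is not Dataiku's scheduler or its REST API, and `run_workflow` and the step names in the usage note are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Callable


def run_workflow(
    steps: dict[str, Callable[[], None]],
    depends_on: dict[str, set[str]],
) -> list[str]:
    """Run every step after all of its dependencies; return the order used."""
    # Map each step to the set of steps that must finish before it starts.
    graph = {name: set(depends_on.get(name, set())) for name in steps}

    # Reject dependencies that point at steps we were never given.
    unknown = {dep for deps in graph.values() for dep in deps} - steps.keys()
    if unknown:
        raise ValueError(f"unknown steps in dependencies: {sorted(unknown)}")

    # static_order() raises graphlib.CycleError if the steps depend on
    # each other in a loop, so a broken workflow fails before anything runs.
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        steps[name]()
    return order
```

For example, with three hypothetical steps named `import`, `clean`, and `train`, where `clean` depends on `import` and `train` depends on `clean`, the runner executes them in that order regardless of how the dictionary was written.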

This only scratches the surface of what anyone needing to process and analyze large datasets can accomplish with Dataiku. You can find more details on the Dataiku website, or listen to the interview I conducted with Claude Perdigou, a product manager with the company.

Opinions expressed by DZone contributors are their own.
