Data Science Fun at Velocity Amsterdam

Check out this summary of Bart Devylder's Python data science workshop in Amsterdam and the IPython notebook setup behind it.

Last week I gave a hands-on Python data science workshop at Velocity Amsterdam together with my colleague Pieter Buteneers. The purpose was to introduce techniques for visualizing large datasets, finding correlations between metrics, applying machine learning, detecting anomalies, and forecasting data. With 54 active participants and quite a bit of positive feedback, I think it was a success (but I might be biased), and I would like to share some of our experiences and the tools we used.
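To give a flavor of two of these techniques, here is a minimal sketch of correlating two metrics and flagging anomalies with a z-score. It is not taken from the workshop notebooks: the sample numbers are made up, the function names are hypothetical, and it uses only the standard library rather than the full Python data science stack.

```python
from math import sqrt
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length (at least 2 points)")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def zscore_anomalies(values, threshold=3.0):
    """Indices of points lying more than `threshold` std devs from the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # a constant series has no outliers
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Made-up sample metrics, for illustration only.
requests = [100, 120, 130, 150, 170, 180, 200, 210]
cpu_load = [20, 24, 26, 30, 34, 36, 40, 42]      # exactly requests / 5
latency_ms = [50, 52, 49, 51, 50, 250, 48, 51]   # one obvious spike

print(round(pearson(requests, cpu_load), 3))      # perfectly correlated metrics
# In such a tiny sample a single spike inflates the standard deviation,
# so a lower threshold than the usual 3.0 is used here.
print(zscore_anomalies(latency_ms, threshold=2.0))
```

On real monitoring data you would more likely compute this over a rolling window, so that a slow drift in the metric does not hide or fake an anomaly.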

One of the challenges we faced in preparing for the workshop was finding a convenient way to let everyone participate without having to worry about whether they had a compatible version of the Python data science stack installed on their laptops. We decided to give the tutorial in an IPython notebook, which runs in the browser and lets you execute code and display graphical output. This meant participants would not have to install anything, provided that we ran a server hosting the notebooks on their behalf.

One very promising service that offers this is mybinder.org, which spins up fully functional notebooks from any public GitHub repository. It requires no setup whatsoever, as the notebooks run on a Kubernetes cluster operated by the Freeman lab. However, for running the tutorial for a relatively large audience at a very specific time, we felt it was too risky to rely on. We had no way of knowing how much capacity would be available when we needed it, and we had also observed occasional downtime.

Therefore, we decided to use JupyterHub, a service that dynamically spins up a Jupyter notebook server for each user. The installation and configuration went pretty smoothly thanks to the many documented examples. We installed it on a rather powerful machine on Azure (as it was configured to run everything locally), and it handled the load well during the workshop.
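For readers who have not seen a JupyterHub setup, a minimal `jupyterhub_config.py` might look roughly like the sketch below. It is an illustration under assumptions, not the workshop's actual configuration: the notebook directory is a hypothetical placeholder, and option names have changed between JupyterHub releases, so check the documentation for the version you install.

```python
# jupyterhub_config.py: illustrative sketch only, not the workshop's actual file.
# JupyterHub executes this file itself and injects get_config(), so it is
# meant to be passed to `jupyterhub -f jupyterhub_config.py`, not run directly.
c = get_config()  # noqa: F821

# Accept connections on all network interfaces; 8000 is JupyterHub's default port.
c.JupyterHub.ip = "0.0.0.0"
c.JupyterHub.port = 8000

# Start every user's notebook server as a local process on the hub machine,
# which matches the "run everything locally" setup described above.
c.JupyterHub.spawner_class = "jupyterhub.spawner.LocalProcessSpawner"

# Hypothetical directory in each user's home where the notebooks would live.
c.Spawner.notebook_dir = "~/workshop"
```

Because every notebook server runs on the same machine in this kind of setup, memory and CPU are shared by all participants, which is why a large instance is needed.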

If you would like to (re)try the tutorial, you can check out the code on GitHub or jump directly to a fully functional notebook on myBinder (available most of the time). Have fun, and let us know your feedback!

Topics:
data science, python, machine learning, big data

Published at DZone with permission of
