
Consider Introducing Docker to Your Data Science Workflow


Docker is an important tool for every data scientist to deploy and share work. It allows you to reproduce the exact environment you used during your development process.


One of the big issues for a data scientist is configuring the data science environment correctly. Sometimes this means installing a lot of packages, waiting for the packages to compile, handling obscure errors, and setting everything up to work correctly... and most of the time, this is a pain. But configuring the environment correctly is necessary to reproduce an analysis and share work with others.

For these reasons, I introduced Docker in my data science workflow.

What Is Docker?

Docker is a tool that simplifies the installation process for software engineers. To explain it in a very simple way (sorry, Docker gurus, for this definition), Docker creates a super-lightweight virtual machine that can be started in a few milliseconds and contains everything we need to run our environment in the right way.

If you would like to read more, see the official Docker website.

The goal of this post is to create an environment to run a very simple Jupyter notebook.

First of all, we need to install Docker for the correct platform.
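To check that the installation went well, a quick way is to print the version and run Docker's own test image:

# print the installed Docker version
docker --version

# run a minimal test container (downloads the hello-world image on first use)
docker run hello-world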

Now, we can start to create our environment. Actually, we could just pull a ready-to-use image for this: on Docker Hub, there are a lot of ready-to-use images. For example:

  • dataquestio/python3-starter
  • dataquestio/python2-starter
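If you just want to get going, pulling and running one of these images only takes two commands; for example (the exact run options may vary, so check the image's page on Docker Hub):

# download the ready-to-use image from Docker Hub
docker pull dataquestio/python3-starter

# run it, mapping Jupyter's default port to the host
docker run -p 8888:8888 dataquestio/python3-starter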

But my goal is to create my own environment from scratch!

Open your favorite text editor and start to create the Dockerfile. A Dockerfile is a text file that describes how the container image will be built:

# base image
FROM python:3.6

# updating the package index
RUN apt-get update

# copying requirements.txt file
COPY requirements.txt requirements.txt

# pip install
RUN pip install --no-cache-dir -r requirements.txt

# exposing port 8888
EXPOSE 8888

# Running jupyter notebook
# --NotebookApp.token='mynotebook' sets the access token
CMD ["jupyter","notebook","--no-browser","--ip=0.0.0.0","--allow-root","--NotebookApp.token='mynotebook'"]

Line by line, this Dockerfile does the following:

  1. Start from a simple Python 3 image based on Debian.
  2. Update the package index of the base system.
  3. Copy the requirements.txt file that describes all the Python packages we need for our data science environment.
  4. Install all of those packages with pip.
  5. Expose the port Jupyter listens on.
  6. Run the command that starts the Jupyter notebook server.

Now, it’s time to write the requirements.txt. This file lists all the Python packages we need, and pip will use it to install them correctly.

bleach==1.5.0
certifi==2016.2.28
cycler==0.10.0
decorator==4.1.2
entrypoints==0.2.3
html5lib==0.9999999
ipykernel==4.6.1
ipython==6.2.1
ipython-genutils==0.2.0
ipywidgets==7.0.3
jedi==0.11.0
Jinja2==2.9.6
jsonschema==2.6.0
jupyter==1.0.0
jupyter-client==5.1.0
jupyter-console==5.2.0
jupyter-core==4.3.0
Markdown==2.6.9
MarkupSafe==1.0
matplotlib==2.1.0
mistune==0.7.4
nbconvert==5.3.1
nbformat==4.4.0
networkx==2.0
notebook==5.2.0
numpy==1.13.3
olefile==0.44
opencv-python==3.3.0.10
pandocfilters==1.4.2
parso==0.1.0
pexpect==4.2.1
pickleshare==0.7.4
Pillow==4.3.0
prompt-toolkit==1.0.15
protobuf==3.4.0
ptyprocess==0.5.2
Pygments==2.2.0
pyparsing==2.2.0
python-dateutil==2.6.1
pytz==2017.2
PyWavelets==0.5.2
pyzmq==16.0.2
qtconsole==4.3.1
scikit-image==0.13.1
scikit-learn==0.19.1
scipy==0.19.1
simplegeneric==0.8.1
six==1.11.0
terminado==0.6
testpath==0.3.1
tornado==4.5.2
traitlets==4.3.2
wcwidth==0.1.7
Werkzeug==0.12.2
widgetsnbextension==3.0.6
pandas>=0.22.0 

OK, we are ready to build our image. The command is:

docker build -t your_image_name .

With the -t option, we can tag our image; for example:

docker build -t datascience_env .
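If you want to check that the image was actually created, it will show up in the local image list:

# list local images; datascience_env should appear here
docker images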

When the build process is finished, we can run our container:

docker run --name datascience_env -p 8887:8888 -v /path_your_machine/notebook_folder/:/Documents -it datascience_env

With the -v option, /path_your_machine/notebook_folder/ on the host will be mounted into the Docker container at the /Documents path. With the --name option, we give the container a fixed name that we can use later to stop or restart it.

This is useful to save your work and to keep the environment separate from the notebooks. I prefer to organize my work this way instead of creating a Docker image that contains both the environment and the notebooks.
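Following the same idea, nothing stops us from mounting more than one folder, for example a separate one for datasets (the /data path and data_folder name here are just an illustration):

# mount notebooks and datasets as two separate volumes
docker run --name datascience_env -p 8887:8888 \
    -v /path_your_machine/notebook_folder/:/Documents \
    -v /path_your_machine/data_folder/:/data \
    -it datascience_env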

When the container is up, we can open the Jupyter web interface:

http://127.0.0.1:8887

and when the token is asked for, enter 'mynotebook' (or whatever you set in your Dockerfile), and that's all! Now we can work in our new data science environment.
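If you forget the token, you can read it back from the container logs, since Jupyter's startup message contains the full access URL:

# print the container logs, including Jupyter's startup message
docker logs datascience_env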

Click on Documents and you will find all your notebooks!

Note: every change is saved in the mounted folder, so your work is preserved even after the container is stopped.

To test this environment, I used the DBSCAN example found on the scikit-learn website.
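In case you cannot reach that page, this is a minimal version of the DBSCAN example (condensed from the scikit-learn documentation) that you can paste into a notebook cell:

# toy clustering with DBSCAN, condensed from the scikit-learn docs
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# generate three Gaussian blobs as sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers,
                            cluster_std=0.4, random_state=0)
X = StandardScaler().fit_transform(X)

# cluster the points; samples labeled -1 are treated as noise
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
labels = db.labels_

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters)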

When our work is finished, we can stop the container with the command:

docker stop datascience_env
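Because the container was created with a name, you can bring the same environment back later, or delete it when you no longer need it:

# restart the stopped container (the work in the mounted folder is still there)
docker start datascience_env

# or remove the container for good (the image itself is kept)
docker rm datascience_env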

I think Docker is a very important tool for every developer and every data scientist to deploy and share work. From my point of view, the most important innovation Docker has introduced is a way to describe, with a Dockerfile, how to correctly recreate the environment where your code can run. In this way, you can reproduce, every time, the exact environment you used during development, and you can share the built container with everyone.
