Giving a one-day tutorial on data science is something I’ve been considering in different contexts from time to time, but for various reasons it never really happened. Finally, last Friday, the tutorial took place as a workshop at the data2day conference, and I think it went pretty well. In this post I’d like to talk a bit about our approach and our experiences.
The conference was organized by the heise publisher, well known in Germany for their print magazines c’t and iX, which have been household names in IT since the eighties. It was the first conference in the Big Data/Data Science context organized by them, but it already brought together over 150 participants.
For the workshop, I was happy to team up with Jan Müller and Paul Bünau from idalab. In fact, Paul and I had developed a similar kind of hands-on introduction to data analysis a few years ago while he was working on his PhD at TU Berlin. Designed as a summer-long course, the idea was to have students implement a number of machine learning algorithms themselves. Each method would first be presented by focussing on the main ideas, without going into the theory too much. Then, the students would have two to three weeks to implement the method and play around with it on some toy data. During that phase, we would have a weekly office hour where we would go around and talk to the students individually to help them where they got stuck.
This course seemed to be quite popular with the students. We would still randomly get praise for the course years later, with students telling us that this was among the courses where they learned the most.
So when designing this one-day workshop, the idea was from the beginning to keep these two ingredients: a focus on main ideas and context, and a hands-on approach.
It was particularly important to us to not just go through a bunch of learning algorithms, but also to stress how important it is to know what you are doing. As I have discussed before, it is too easy to put together some data analysis pipeline and then not evaluate it properly. Everything looks great, but in the end you have only looked at the training error, and the result is really bad performance on future data.
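To make the point about training error concrete, here is a minimal sketch. It is not taken from the workshop material: it assumes scikit-learn's small bundled digits data set (`load_digits`) as a stand-in for real data, and the split size and random seeds are arbitrary choices for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small bundled digits data set (8x8 images), used here only as a stand-in.
X, y = load_digits(return_X_y=True)

# Hold out 30% of the data that the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unrestricted decision tree can fit the training data almost perfectly.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # accuracy on data the tree has seen
test_acc = tree.score(X_test, y_test)     # accuracy on held-out data

print(f"training accuracy: {train_acc:.3f}")
print(f"held-out accuracy: {test_acc:.3f}")
```

The training accuracy is the overly optimistic number; the held-out accuracy is the one that says something about future data.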
For the hands-on part, we chose to work with IPython notebooks. IPython is available on all major operating systems, notebooks can be saved and loaded easily, it integrates with plotting, and so on. Toolwise we chose to work with numpy, pandas, scikit-learn, and matplotlib. Originally the plan was to have one session where we go through the basics of the tools and then two use cases, but while putting the material together it became apparent that there wasn’t enough time for two use cases, so we just stuck with a simple example based on MNIST character recognition and decision trees.
So in the end the course went like this:
about one hour of introduction to what data science/machine learning is, covering things like supervised vs. unsupervised learning, evaluation, cross-validation, etc.
one hour of going through the basics of numpy and pandas in an interactive IPython session
one hour of doing some exercises with numpy and pandas
another hour of going through an example with scikit-learn
two hours of doing the use case
The notebooks from the example sessions were handed out at the beginning of the exercises, and the exercises were prepared as IPython notebooks themselves with free cells where you could put down your solutions.
As it is with all such things, you never know whether you thought of everything, but all in all, we felt the workshop went very well. With three of us, there was enough time to help each of the participants individually, including fixing issues like finding out where IPython was keeping its files under Windows, dealing with oddities of Python’s indexing scheme, and so on.
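Which indexing oddities came up is not recorded here, so the following is only an assumed illustration of the usual suspects: zero-based positions, slices that exclude their end point, and the difference between position-based and label-based slicing in pandas.

```python
import numpy as np
import pandas as pd

a = np.arange(10, 15)   # array with the values 10, 11, 12, 13, 14

first = a[0]            # indexing starts at 0, so this is 10
middle = a[1:3]         # the end of a slice is excluded: 11, 12
last = a[-1]            # negative indices count from the end: 14

# A Series whose labels happen to be the integers 1..5.
s = pd.Series(a, index=[1, 2, 3, 4, 5])

by_position = s.iloc[1:3]  # positions 1 and 2 (end excluded): 11, 12
by_label = s.loc[1:3]      # labels 1 to 3 (end included): 10, 11, 12

print(first, middle, last)
print(by_position.tolist(), by_label.tolist())
```

The last two lines are the ones that tend to surprise people: the same-looking slice returns different elements depending on whether it is interpreted as positions or as labels.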
In the end, all participants had a running notebook which loaded the MNIST data and learned a decision tree whose hyperparameter was adjusted by cross-validation, giving them about 83% accuracy. Of course that is not optimal, but already pretty good for a few lines of code. Most importantly, everyone now has a complete framework from which they can start exploring other approaches, try out new methods, and so on.
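For readers who want a feel for what such a notebook boils down to, here is a minimal sketch under stated assumptions. It is not the workshop notebook: it uses scikit-learn's small bundled digits data set instead of the full MNIST data, and it assumes that the tuned hyperparameter is the tree depth (`max_depth`) with an arbitrary grid of candidate values, so the accuracy will differ from the figure quoted above.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: 8x8 digit images bundled with scikit-learn (not full MNIST).
X, y = load_digits(return_X_y=True)

# Keep a test set aside that is used neither for training nor for tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Assumed hyperparameter and candidate values, chosen for illustration only.
param_grid = {"max_depth": [2, 4, 6, 8, 10, 12]}

# 5-fold cross-validation on the training set picks the best depth,
# then the tree is refit on the whole training set with that depth.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=5
)
search.fit(X_train, y_train)

best_depth = search.best_params_["max_depth"]
test_acc = search.score(X_test, y_test)  # accuracy of the refit tree

print(f"selected max_depth: {best_depth}")
print(f"held-out accuracy:  {test_acc:.3f}")
```

Swapping `DecisionTreeClassifier` and `param_grid` for another estimator and its parameters is all it takes to try out a different method within the same evaluation setup.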
Next time, we would probably intersperse the background talk with the solutions, so that there isn’t such a monolithic block at the beginning, and be more careful with Python 3 vs. Python 2. But overall I think our approach worked out very well (also based on the feedback we got).
The workshop also showed that there is a real need for teaching people the more high-level concepts like proper validation. Unfortunately, even at universities, the focus is too much on the methods themselves. Students often learn the process and things like proper validation only when they work on their master’s thesis. On the other hand, for doing robust and reliable data analyses, these things are absolutely essential.