Review of the Coursera Machine Learning Course
I've just finished going through the Machine Learning course on Coursera, and this is a brief review.
The scope of the course is quite broad, covering a lot of topics from both supervised and unsupervised machine learning. "Practical advice" parts are sprinkled throughout the lectures and are extremely useful - how to evaluate your algorithm, how to know what to focus on, practical tips for improving speed and accuracy, etc.
Machine Learning is one of the oldest courses on Coursera, and is extremely polished by now. In the lectures, Andrew Ng foresees the questions that may arise and answers them beforehand. The programming exercises are very well organized, with very little detail left to chance. They're quite sizable too. I made extra work for myself by using Python instead of Octave (if you use Octave or Matlab, a lot gets set up for you by the code provided with the assignment), but that's probably not the bulk of the code. ~2200 lines of dense numerical code (in throwaway homework mode - barely any comments and few unit tests) is quite a lot.
One thing I liked somewhat less is the lack of mathematical depth. I realize the reason for this is the relatively low barrier to entry they want to set for the course; I'm sure that Stanford students taking the real course are expected to know some linear algebra and probability, so that the formulae can be developed rather than just presented as an act of god. This was especially noticeable in the lecture on PCA, where Prof. Ng just dropped the eigenvector approach, and even went so far as to provide the exact Octave function to call, without much mathematical background or reasoning. Indeed, it is quite possible to implement the PCA part of the programming exercises without really understanding how PCA works.
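To make the eigenvector approach a bit more concrete, here is a minimal Numpy sketch of one common formulation of PCA: center the data, build the covariance matrix, and take its SVD (for a symmetric positive semi-definite matrix this coincides with the eigendecomposition). The function name `pca`, the toy data and the variable names are my own illustration, not the course's code.

```python
import numpy as np


def pca(X, k):
    """Project the rows of X onto the top-k principal components.

    Returns the projected data Z, the matrix U whose columns are the
    principal directions, and the corresponding variances S.
    """
    # PCA assumes zero-mean features, so center each column first.
    Xc = X - X.mean(axis=0)
    # Covariance matrix of the features (n_features x n_features).
    sigma = Xc.T @ Xc / X.shape[0]
    # sigma is symmetric PSD: columns of U are its eigenvectors and S holds
    # the eigenvalues, sorted in descending order.
    U, S, _ = np.linalg.svd(sigma)
    # Keep only the k directions with the largest variance.
    Z = Xc @ U[:, :k]
    return Z, U, S


# Toy data (an assumption for illustration): points close to the line y = 2x.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + 0.01 * rng.normal(size=200)])

Z, U, S = pca(X, k=1)
retained = S[0] / S.sum()  # fraction of variance kept by one component
```

On data like this, almost all of the variance lies along a single direction, which is why dropping the second component loses very little. Seeing that the eigenvalues are just the variances along each principal direction is the piece of reasoning the lecture skips over.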
Which brings me to a related topic. The course is way too easy to provide a meaningful certificate. The review quizzes are trivial and require very little thought. The programming exercises, while demanding in terms of invested time, aren't going deep either, and don't deviate from the lecture materials - you basically follow very detailed instructions and transcribe formulae into code.
Overall, I enjoyed the course, and I would highly recommend it to anyone interested in getting into machine learning (or, by its new-age name, big data).
Finally, an interesting observation I had while working through the programming assignments in Python (using Numpy, Scipy and Scikit-learn). Domain-specific languages (like Matlab/Octave for general numerics and R for statistics) are often hailed as eye-openers because they come equipped with some facilities tailored to scientific programming (like a convenient notation for defining constant matrices), but I don't think this is the right approach. Writing Python with Numpy et al., some basic things may take a few more keystrokes to achieve, but eventually you end up with very similar code. Moreover, it's largely the same Fortran-written LAPACK running under the hood anyway, so performance is comparable.
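As a small illustration of how close the two end up being, here is a Numpy snippet with the rough Octave equivalents in comments. The matrix values are arbitrary and picked only for the example.

```python
import numpy as np

# Octave:  A = [1 2; 3 4];
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# Octave:  b = [5; 6];
b = np.array([5.0, 6.0])

# Octave:  x = A \ b;
# For a square, non-singular A this corresponds to np.linalg.solve.
x = np.linalg.solve(A, b)

# Octave:  G = A' * A;
G = A.T @ A
```

The matrix literal is a little noisier in Python, but the rest of the code reads almost the same.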
However, expressing a constant matrix in the most concise way possible is not the end of the story when writing a machine learning system. It's nice to have an actual programming language in your hands, with all that entails - community, libraries, etc. It was curious to notice the impact of this in the spam classification assignment: it needs tons of textual preprocessing, for which Octave is not well suited, and some NLP algorithms, for which Octave has no libraries. In Python it was all a breeze, of course, including importing NLTK to take care of any NLP needs.
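To give a flavor of the kind of textual preprocessing involved, here is a rough sketch using only the standard library's `re` module. The function name `normalize_email`, the placeholder tokens and the sample message are assumptions made for illustration, not the assignment's actual specification; a stemming step (for example with NLTK's `PorterStemmer`) would typically follow.

```python
import re


def normalize_email(text):
    """Lower-case an email body and replace variable parts with placeholders.

    Returns a list of alphanumeric tokens.
    """
    text = text.lower()
    # Strip HTML tags.
    text = re.sub(r"<[^<>]+>", " ", text)
    # Collapse every URL and email address into a single placeholder token,
    # so the classifier sees "a URL was present" rather than the URL itself.
    text = re.sub(r"https?://\S+", " httpaddr ", text)
    text = re.sub(r"\S+@\S+", " emailaddr ", text)
    # Same idea for numbers and dollar signs.
    text = re.sub(r"\d+", " number ", text)
    text = re.sub(r"[$]+", " dollar ", text)
    # Keep alphanumeric runs only; punctuation acts as a separator.
    return re.findall(r"[a-z0-9]+", text)


sample = "Visit <b>http://example.com</b> NOW, only $100! Mail me@example.com"
tokens = normalize_email(sample)
```

The order of the substitutions matters: URLs and addresses are replaced before numbers, so digits inside a URL don't get turned into separate tokens.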
I'm not saying that Python is the best tool for every job. But for exploratory scientific programming, it seems like the strongest option out there. Numpy and its kin are very mature, fast and well-supported. Plotting with matplotlib is great (and I heard that the legendary ggplot from R now has Python bindings as well). And when you need to step outside the narrow domain of computations, you have the full programming language with all its support structure at your command.
Published at DZone with permission of Eli Bendersky. See the original article here.