Book Review: Data Analysis with Open Source Tools
Before I get to the book review, I wanted to mention a quick note about book reviews in general. In the past I have reviewed books in a less than traditional manner, focusing on how they could be used in a startup. I will probably be receiving more titles to review, so more traditional reviews are going to become common, and they will likely be shorter posts as well. Each will have a short summary followed by more detail, so that you can focus on whichever section you like. Now, onto the review.
In this post, we look at Data Analysis with Open Source Tools by Philipp K. Janert. Given the rise of big data on the web, this book is going to be very handy for a lot of people. However, having the right expectations for this book is important. If you are expecting a book filled with examples of big data tools and NoSQL databases like Hadoop and Cassandra, you are definitely going to be disappointed. The key with this book is to look at the cover. Data analysis is the main point of this book, and open source tools are really just a nice sidebar. The data analysis information is fairly solid and ranges from some basic methods, through statistics, to some machine learning methods like clustering and categorization. The open source tools portions of the book consist of examples of the various analysis methods, but do not delve too deeply into how the tools work.
First, this book is on the longer side, with a lot of content and some lengthy appendices. Also, this book is not for beginners, but it is not terribly advanced either. The author tries to limit the amount of mathematics involved, as that can quickly complicate a book like this. However, in many cases there are explanations using statistical formulas and occasionally some calculus. Thankfully, many of those complicated examples are illustrated with graphs to simplify the explanation. The one problem is that the author struggles to make the topic approachable for people without a statistical and mathematical background.
From the data analysis perspective, the book is thorough and includes a lot of content. Some people may think that some important statistical methods are missing, but the author follows each chapter with recommended resources. These recommendations add up to a huge collection of excellent books that you could consult for deeper treatment of the various topics. This is also one of the major benefits of the book: you get a solid overview, and then you know where to look for more information.
From a software engineer’s perspective, the lack of information about current open source tools is disappointing. However, the examples include important tools like gnuplot, Python and NumPy, and the R language. Even more information is given in the first appendix, which talks about programming environments for scientific computing.
This leads to my next point: the appendices are fantastic. The first talks more about programming tools, the second gives a nice overview of some of the calculus used in the book, and the third covers where to get the data you want to work with and how to prepare it, such as cleaning and normalizing it. In some cases, you may even want to start with the appendices before getting into the meat of the book.
Publisher Link: Data Analysis With Open Source Tools by Philipp K. Janert