Hi Mikio! Thanks for talking to us. Could you tell us a little bit about your work with machine learning?
I started with machine learning as a graduate student back in 2000. I guess my original motivation was to build algorithms which are able to grasp the meaning of data to assist us humans in various ways, sifting through data, filtering and aggregating, or generally having an accurate understanding of "what's happening."
I started out as a "methods guy" -- that is, someone who develops new methods -- but I have also worked on quite a few application areas, including bioinformatics, analysis of brain signals, and natural language processing. I've worked on clustering and kernel methods (like the support vector machine), and wrote a pretty theoretical dissertation about convergence of eigenvalues.
About three years ago I became interested in social media analysis, which eventually led to a shift of focus towards real-time analysis of event data and technologies which let you analyze a huge amount of event data without having to rent a huge cluster in the cloud. I'm currently working on launching streamdrill, a real-time analysis engine based on the technology we've developed over the last few years for our startup TWIMPACT.
What do you think were some of the most important developments in the data science space in the last year or so?
I really think that large scale and big data are finally arriving in the machine learning community. Don't get me wrong, large scale learning was always a topic, but more in the sense of finding fast optimization algorithms. Data always had to reside completely in memory, or be streamed through the algorithm, but it was not processed in a distributed fashion as in MapReduce. Platforms like Hadoop and SciPy are also great because they allow people to focus on the abstract side of the analysis methods without getting bogged down in infrastructure and low-level database code.
There has also been a lot of work done on learning complex models like latent Dirichlet allocation, which you can use for topic detection, in an online fashion. I think streaming and real-time will be much more important in the coming years, and that's definitely a great step in that direction.
On the other hand, I've closely followed the ML community for the past 13 years and at some point you see patterns emerge. New methods are developed or re-discovered from statistics, physics, or other sources, and become hot for 2 or 3 years, after which other topics take over. I still think there are some fundamental problems yet to be solved. Currently, finding the right representation for your data is still crucial and requires a lot of human intervention. ML methods are good at crunching numbers and finding statistical connections, but they still cannot really "understand" the data.
Are there any particular developer tools or resources you couldn't live without?
I've become a big fan of Scala, and couldn't live without the IntelliJ IDEA IDE. For smaller editing jobs, I favor Emacs over Vim. Lately, I've been trying out Sublime Text, which I also like a lot. In terms of build infrastructure, we've stuck with Maven over the more Scala-centric build tools like sbt. It's not always pretty, but it's mature and integrates nicely with everything.
Do you have any favorite open source projects you've contributed to recently?
I have one OSS project, jblas, which is a matrix library for Java based on the Fortran LAPACK and BLAS libraries for the actual number crunching. Usage seems to be on the rise, because I'm getting more and more bug reports lately, which is a good thing. This was one of my first larger projects in Java, so I'd probably design things differently now, but I tried to do the simplest thing which works and I think that went quite well. The build process is really a nightmare: I have Ruby scripts which parse Fortran sources to automatically generate the JNI stub files, more Ruby to generate code for float classes from double classes, and every time I make a major release, I have to recompile stuff on a number of operating systems. Luckily, there is virtualization.
Do you follow any blogs or Twitter feeds that you would recommend to developers?
That's hard to say. I follow about 500 people on Twitter, which provides me with a constant stream of interesting posts. I also follow a bunch of blogs, but personal blogs seem to be on the decline. Aggregators like DZone (obviously ;)) or getprismatic.com are also a great source of news and relevant stuff.
Anything else you'd like to mention?
Yeah, we're coming to the Bay Area in the week of April 22 to April 26 to meet with people and talk about streamdrill and real-time event analysis. If you're interested in talking to us, contact me over Twitter.