I recently read the white paper “Challenges and Opportunities with Big Data” published by the Computing Community Consortium of the CRA. It was an interesting read, but I found it a bit too database-centric, which is probably no surprise given its set of authors.
White papers are an interesting class of documents. They’re usually written from a high-level point of view and are often used to represent opinions. I often wonder what their purpose is. At least one purpose is to provide a citable piece of authority which others can then use to support whatever agenda they’re pursuing. For example, McKinsey’s Big Data study is continually cited to say that Big Data will have a huge economic impact over the next few years.
The CCC report of course also believes that Big Data is The Next Big Thing, and discusses what it believes are the upcoming challenges. It first defines a Big Data Analysis Pipeline consisting of the steps
- Analysis/Modeling, and
as well as overall properties like Heterogeneity, Scale, Timeliness, Privacy, and Human Collaboration (i.e. crowdsourcing). The report then discusses each of these aspects by giving a brief overview of the state of the art and then discussing current and future challenges.
It becomes clear pretty quickly that most of the authors come from a database background, because the report outlines an approach similar to what has worked so well in the database world. Basically, Big Data will be solved by the following two steps:
- Transform raw data into a format which is more suitable for automated analysis, also dealing with noisy or incomplete data.
- Separate the “what” from the “how”, just as a SQL query declaratively says what to compute, leaving it up to the database to break this down into an optimal set of primitive search operations, which we would then know how to scale.
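To make the “what” versus “how” distinction concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `sales` table, its columns, and the toy rows are made up for illustration and do not come from the report. The first function only states the desired result and leaves the execution plan to the database; the second spells out every step by hand.

```python
import sqlite3

# Hypothetical toy data: (region, amount) pairs, invented for this sketch.
ROWS = [("north", 10.0), ("south", 5.0), ("north", 2.5), ("east", 7.0)]


def totals_declarative(rows):
    """State *what* we want; SQLite decides how to scan and group."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT region, SUM(amount) FROM sales "
            "GROUP BY region ORDER BY region"
        )
        return cursor.fetchall()
    finally:
        conn.close()


def totals_imperative(rows):
    """Spell out *how*: loop over the rows, accumulate, then sort."""
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0.0) + amount
    return sorted(totals.items())


if __name__ == "__main__":
    print(totals_declarative(ROWS))
    print(totals_imperative(ROWS))
```

Both functions return the same per-region totals. The point of the declarative version is that the query text would not have to change if the database switched to a different grouping strategy or, in principle, to a distributed one.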
I have my doubts concerning both steps. Don’t get me wrong, having a declarative query language which automatically scales to petabytes of data would be really cool, as would automated ways to represent data effectively.
But from an ML point of view, data is always noisy and also highly unstructured, in particular in a way we don’t yet know how to interpret automatically. For example, if you take text, it’s pretty clear to us as humans that there is certain information encoded in the text. We have taught computers to understand some of that information using representations like n-grams, some robust form of parsing, and so on, but it is also pretty clear that there is a lot of information these representations do not capture. In other words, we just don’t know how to preserve the information contained in the data for many media types, and existing techniques will likely destroy a substantial part of that information.
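A small sketch of how a representation can destroy information, using nothing but whitespace tokenization (an assumption made for brevity; real tokenizers are more involved). The function name `ngram_counts` and the example sentences are invented for illustration.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count word n-grams in `text` using naive whitespace tokenization."""
    tokens = text.lower().split()
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


if __name__ == "__main__":
    a = "dog bites man"
    b = "man bites dog"
    # Same unigram counts: who bit whom is lost.
    print(ngram_counts(a, 1) == ngram_counts(b, 1))
    # Bigrams keep some local order, so the two sentences differ again.
    print(ngram_counts(a, 2) == ngram_counts(b, 2))
```

With unigram counts the two sentences become indistinguishable, even though they mean opposite things. Bigrams recover this particular distinction, but any fixed n still discards structure that spans more than n words.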
Concerning a declarative query language, I think database people probably don’t have a good idea of just how diverse Machine Learning methods can be. They are inspired by biology, physics, and statistics, usually use vectorial representations and linear algebra for computations, or run complex Monte Carlo simulations, none of which can be naturally expressed using JOINs and similar operations in a relational database system. So when the report complains that
Today’s analysts are impeded by a tedious process of exporting data from the database, performing a non-SQL process and bringing the data back.
the main reason is that the non-SQL process is so different from a relational world view that this is actually the most effective way to do the computation.
Moreover, ML algorithms are often iterative and have a lot of state, meaning that you don’t just run a query of sorts, but rather complex computations which update a lot of state again and again until your results have converged.
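As a minimal illustration of what “iterative with state” means, here is a sketch of plain gradient descent fitting a line `y = w * x + b` by least squares. The function name, learning rate, and tolerance are assumptions chosen for this toy example; real ML algorithms carry far more state than two numbers.

```python
def fit_line(xs, ys, learning_rate=0.05, tolerance=1e-10,
             max_iterations=10_000):
    """Fit y = w * x + b by gradient descent on the mean squared error.

    The pair (w, b) is the state that is updated again and again until
    the updates become smaller than `tolerance`.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for iteration in range(1, max_iterations + 1):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        step_w = learning_rate * grad_w
        step_b = learning_rate * grad_b
        w -= step_w
        b -= step_b
        # Stop once the state has (numerically) stopped changing.
        if max(abs(step_w), abs(step_b)) < tolerance:
            break
    return w, b, iteration


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [2 * x + 1 for x in xs]  # toy data on the line y = 2x + 1
    print(fit_line(xs, ys))
```

Each pass reads the whole data set and depends on the state left behind by the previous pass, which is quite different from issuing a single query and reading back a result.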
While it’s probably possible to build such DSLs for subgroups of algorithms (for example, Bayesian algorithms, distributed linear algebra libraries, etc.), I’m unsure whether there is a general query language which is not just a generic cluster job system. Existing approaches like MapReduce or stream processing make very few assumptions about your data and mostly manage just the parallelization aspect of your computations.
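To show how few assumptions the MapReduce model makes, here is a single-process sketch of its map, shuffle, and reduce phases. This is not any real framework's API; `map_reduce`, `word_count_mapper`, and `sum_reducer` are hypothetical names, and a real system would distribute the work across machines.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over all records, group values by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record may emit any number of (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle phase: collect all values belonging to the same key.
            groups[key].append(value)
    # Reduce phase: collapse each key's values into a single result.
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    """Add up all counts emitted for one word."""
    return sum(values)


if __name__ == "__main__":
    lines = ["big data is big", "data is data"]
    print(map_reduce(lines, word_count_mapper, sum_reducer))
```

The framework part, `map_reduce`, knows nothing about text or counting. It only prescribes how work is split and recombined, and everything specific to the analysis lives in the user-supplied functions.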
Still, an interesting paper from an infrastructure point of view, but I feel like they’re missing the point when it comes to the data analysis part.