This month’s Wired Magazine features a story on the roots of Hadoop at Yahoo and the three companies vying to drive its commercial frontiers farther forward faster: Hortonworks (Apache Lucene Eurocon Barcelona Keynote Video now available, see below), MapR, and Cloudera. MapR CEO John Schroeder sums it up:
“If I can get a terabyte drive for $100, or less if I buy in bulk, and I can get cheap processing power and network bandwidth to get to that drive, why wouldn’t I just keep everything?” he says. “Hadoop lets you keep all your raw data and ask questions of it in the future.”
Yahoo, while otherwise lamented in the press for its business model woes, has done this with an array of applications from spam-hunting (retraining the model every few hours) to auto-categorization and user content mapping, running 5 million jobs a month across more than 40 thousand servers and 170 petabytes of storage. At $100 per terabyte, that is a mere $17M worth of disk, enough to keep at most maybe a half-dozen enterprise storage sales guys busy; multi-billion-dollar enterprise storage companies are in a tizzy. With the leverage this affords, it’s no surprise that eBay has increased its Hadoop footprint 5x to over 2500 servers in the last year. Nor is it surprising that Eric Baldeschwieler, keynote speaker at Apache Lucene Eurocon 2011 in Barcelona last week, predicts that 50% of the world’s data will be stored on Hadoop within 5 years:
So step one: store it all, and map/reduce to your heart’s content, cranking through key-value abstractions that produce insights you just couldn’t get running it in and out of a relational database (though with HDFS and Hive, the constructs of filesystem and query retrieval from the conventional data world are not out of reach). At Lucid, we’ve helped streamline that process, for example, with built-in HDFS connectors from LucidWorks.
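For readers who have not seen the key-value style up close, here is a minimal sketch in plain Python of the map, shuffle, and reduce steps over a few raw records. It is an illustration only, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample records are made up for this example, and everything runs in one process rather than across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit (key, value) pairs for one raw record: here, (word, 1)."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse the values for one key into a single result: here, a count."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence over an iterable of records."""
    pairs = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical raw records standing in for data kept on cheap disk.
raw_logs = ["spam offer spam", "offer accepted", "spam"]
counts = run_job(raw_logs)
print(counts)
```

Because the raw records are kept, a new question later on only means writing a different `map_phase` and `reduce_phase`; the stored data does not have to change.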
But that doesn’t answer the question of how to animate the virtuous cycle of insights available once you get all that data stored. Here’s where the search equation gets interesting. If you know exactly what you are looking for every time, it’s one thing to write some jobs that extract a particular trend or insight. But when you keep everything, can you know everything a priori? Of course not. Grant Ingersoll’s talk sets forth a powerful portfolio of tools centered on Lucene/Solr.
Between them, these two talks will give you a solid foundation for why applying search to big data matters to end users and businesses alike. The payoff is better awareness, driven by search and backed by real data, combined with developers who can fine-tune access and retrieval, and the agility to fill the white spaces in the relationships between available information: what you didn’t know you didn’t know.
More talks from Barcelona are here. We’ll touch on the talk from Michael Busch of Twitter soon.