Jon Gifford of Loggly on SolrCloud, Near-Real Time Search, and Twitter's Secrets
Jon Gifford has spent his days coercing Solr into playing nice with the cloud, and with high-volume real-time data streams. An active user and frequent hacker of Lucene since 2004, he's happy to let Solr take care of some of the hard work for a change. Prior to Loggly, he spent more than a decade working on search systems at Minimal Loop, Scout Labs, Technorati and LookSmart. He says he is concerned that his near-complete web-anonymity is under threat. :)
DZone: You were at the 2010 Lucene Revolution conference and you gave a talk about near real-time indexing and searching of log data. What will you discuss in your 2011 talk?
Jon Gifford: We've been using SolrCloud for over a year, and loving it. I'll be talking about how we're using it, how we've extended it to handle our specific use-cases, and what hiccups we've had along the way.
DZone: What made you choose Solr as the search platform for the Loggly service architecture?
Jon: Having been involved in building and running a number of different systems, I've learned that it's a lot easier to use the highest-level open-source system available, to minimize the amount of time you spend reinventing the wheel, or revisiting decisions when the time for upgrading comes. At Technorati, for example, we built a number of things that eventually (several years later) came standard in Solr/Lucene, and every upgrade was a much longer process than I'd have liked because a lot of our changes were deep in the Lucene internals.
This isn't a panacea, of course, but by using the work that had already been done on SolrCloud (for example), building the first version of our system was a lot easier than it would have been if we had implemented the same stuff ourselves. It's a simple numbers game - there are always more smart people not working for you than there are working for you.
DZone: Could you give a brief description of what Loggly does in terms of indexing and searching logs?
Jon: Loggly lets anyone send all of their logs to a single location that provides archive, search, and (soon) live tail and (later) large-scale analytics. As logs are received, we route them directly to Solr indexers, where we index them in near-real time. Within 15-20 seconds of receiving a log event, it's available for search.
Our current indexing approach is to treat all logs as simple text, which is fast and flexible - we'll accept events in any format at all. We create one "index" per customer, sharding by time, and this allows us to grow a user's index as large as they need.
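As a rough illustration of what "one index per customer, sharded by time" can look like, the sketch below maps an event to a shard name. The one-hour shard width, the naming scheme, and the function name `shard_name` are all assumptions made for the example; the interview does not describe Loggly's actual layout.

```python
from datetime import datetime, timezone

SHARD_SECONDS = 3600  # assumed shard width (one hour); not stated in the interview


def shard_name(customer: str, event_time: datetime,
               shard_seconds: int = SHARD_SECONDS) -> str:
    """Return a hypothetical shard name for one customer's event."""
    epoch = int(event_time.timestamp())
    # Round the timestamp down to the start of its time bucket.
    bucket_start = epoch - (epoch % shard_seconds)
    start = datetime.fromtimestamp(bucket_start, tz=timezone.utc)
    return f"{customer}-{start:%Y%m%dT%H%M}"


# Two events in the same hour share a shard; a later event starts a new one.
a = shard_name("acme", datetime(2011, 5, 1, 10, 5, tzinfo=timezone.utc))
b = shard_name("acme", datetime(2011, 5, 1, 10, 55, tzinfo=timezone.utc))
c = shard_name("acme", datetime(2011, 5, 1, 11, 1, tzinfo=timezone.utc))
```

Under a scheme like this, only the newest shard for a customer receives writes, and a customer's index grows simply by adding shards.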
We're exposing the full power of Solr/Lucene's query formulation to our users, so you can formulate arbitrarily complex search queries to extract everything you need from your logs.
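To give a flavor of the kind of query this allows, here is a small sketch that assembles a boolean query string in the standard Lucene query syntax (AND, OR, NOT, quoted phrases, parentheses). The helper names `phrase`, `all_of`, and `any_of` are hypothetical and exist only for this example; the search terms are made up.

```python
def phrase(text: str) -> str:
    """Quote a phrase, escaping embedded backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def all_of(*clauses: str) -> str:
    """Join clauses so that every one of them must match."""
    return "(" + " AND ".join(clauses) + ")"


def any_of(*clauses: str) -> str:
    """Join clauses so that at least one of them must match."""
    return "(" + " OR ".join(clauses) + ")"


# Events mentioning "error" and either "timeout" or the phrase
# "connection refused", excluding anything that mentions "debug".
query = all_of(
    "error",
    any_of("timeout", phrase("connection refused")),
    "NOT debug",
)
```

The resulting string could be passed as the `q` parameter of a Solr search request.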
Coming later this year, we'll also support flat JSON indexing and searching, using Solr 3.1. Simple text search is great for searching unstructured data, but there are a lot of cases where logs contain important tracking, metrics or performance data.
DZone: What's the difference between how you build indexes in Loggly, vs. some of the Hadoop-based systems?
Jon: Our approach is to try and be as close to real-time as we can, and at the time we started development none of the Hadoop-based systems looked like they could get as close to real-time as we would like to be. Since we're streaming data into our indexers, rather than batching, we think we can continue to improve how close we get to real time as Solr/Lucene evolves towards fully real-time.
When the system is operating properly, we never need to update our index shards, so the ability to rebuild our entire index is not something we need. Hadoop-based systems do this very well, and we do use Hadoop for rebuilds if we lose a machine in our cluster, but this is a very rare occurrence.
Overall then, the strengths of the Hadoop-based Solr/Lucene systems don't fit well with how we use Solr.
DZone: What kinds of things does a system like yours need to do to be Near Real-Time and fully Real-Time?
Jon: We need to be able to:
1) stream data directly to an indexer with minimum overhead. We use 0MQ to do this.
2) index as quickly as possible, which is why we chose a simple Analyzer that treats the events we're indexing as plain text.
3) distribute the indexing load across multiple machines so that we can handle large volumes of data. We've implemented a custom router using 0MQ and ZooKeeper to distribute this load automatically.
4) make the index available as quickly as possible after we add an event. We're currently doing this with frequent commits on small indices, which works, but isn't particularly nice. We're looking forward to more of the NRT and Twitter work being exposed in Solr.
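The routing and frequent-commit steps above can be sketched in a few lines. This is a pure-Python stand-in under stated assumptions, not Loggly's router: it uses no 0MQ or ZooKeeper, the round-robin policy is swapped in for whatever the real router does, and the class names and the 10-second commit interval are made up for the example.

```python
import itertools
import time


class Indexer:
    """Stand-in for a Solr indexer: buffers events and commits on a timer."""

    def __init__(self, name, commit_interval=10.0, clock=time.monotonic):
        self.name = name
        self.commit_interval = commit_interval  # assumed value, in seconds
        self.clock = clock
        self.pending = []     # added but not yet visible to searches
        self.searchable = []  # visible to searches after a commit
        self.last_commit = clock()

    def add(self, event):
        self.pending.append(event)
        # Commit once enough time has passed since the previous commit.
        if self.clock() - self.last_commit >= self.commit_interval:
            self.commit()

    def commit(self):
        self.searchable.extend(self.pending)
        self.pending.clear()
        self.last_commit = self.clock()


class Router:
    """Round-robin router (assumed policy) over a fixed list of indexers."""

    def __init__(self, indexers):
        self._cycle = itertools.cycle(indexers)

    def route(self, event):
        indexer = next(self._cycle)
        indexer.add(event)
        return indexer.name


# Drive the sketch with a fake clock so the behavior is deterministic.
now = [0.0]
indexers = [Indexer(f"indexer-{i}", clock=lambda: now[0]) for i in range(2)]
router = Router(indexers)
targets = [router.route(f"event {n}") for n in range(4)]
now[0] = 12.0             # more than one commit interval later
router.route("event 4")   # lands on indexer-0 and triggers its commit
```

The gap between `pending` and `searchable` is the latency that frequent commits on small indices keep short: an event only becomes visible once the indexer holding it commits.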
DZone: Is Loggly anticipating heavy usage of the new features coming in Solr 3.1?
Jon: We're currently on a pretty old version of Trunk (November 2009), which is working well for us for now, but we will be upgrading in the next few months. The specific features we expect to use heavily are numeric facets and JSON indexing.
DZone: What is your "wishlist" for new features in future versions of Solr and Lucene?
Jon: The major wishlist item for us is for Twitter to release their index changes, since (as you can imagine) we have a very similar problem to them, and the performance improvements they've made with time-based search look very promising.
DZone: You said previously that you chose ØMQ for messaging because it was fast and lightweight. How has that open source library progressed over the past year? Is it still meeting Loggly's needs?
Jon: Over the last year we've only done a couple of 0MQ updates, since we're using a fairly restricted set of the complete 0MQ functionality. We've updated primarily to "stay in touch" with the ongoing development, rather than to get new features. For us, 0MQ has been very stable, and we're very, very happy with it. Considering the tiny size of the team responsible for its development, I'm amazed at the overall quality and performance. It's one of those pieces of our toolset that we almost never even think about - it just works.
DZone: How was the audience at Lucene Revolution 2010? Were they very interactive? How about the speakers besides yourself?
Jon: The audience for my talk was great. Lots of good questions during and after the talk, and lots of discussions about how we're doing things here at Loggly. It's nice to be able to talk purely about the technical side of search, with people who understand the problems and have enough experience in solving them to shine a new light on things.
The other sessions were interesting for the breadth of problems being tackled, and the variety of ways in which Solr/Lucene was being used to solve them. I liked the balance of the sessions too - the deep technical talks were food for thought on how we're doing things inside and around Solr, the more general talks were interesting as a precis of the ecosystem, and the "future of..." sessions were useful in terms of planning what we should be doing ourselves vs waiting for version X.Y.
Check the Lucene Revolution site for the conference agenda and pricing; some early bird discounts and training specials may still be available. Don't miss this once-a-year opportunity!