For Every Evernote User, Its Own Lucene Index!
Today I found an interesting blog post on the architecture, built largely on open source components, behind one of the most popular note-taking services around: Evernote.
One of the architectural techniques they use to keep notes convenient (they are instantly shared and organized by the system) is sharding: user data is partitioned across shards, and each shard runs three well-known open source technologies: MySQL, Tomcat, and Lucene.
The graphic shows how each shard holds three different stores, one each for Metadata, Resources, and Searchable Text.
"All of the metadata about each note goes into structured tables in MySQL. And by “metadata”, I mean all of the fields in the data model structures for a Note and its Resources, except for the Resource’s raw data body and any recognition/alternate data files.
Those Resource files are de-duplicated in software on each shard (using MD5+length) and then stored on a relatively simple hierarchical file system using a folder tree derived from the MD5 checksum.
The combination of MySQL and the file system allows us to store the full contents of the data model and support the vast majority of our API calls. Text-based searches on our servers require some sort of Full-Text Search (FTS) engine to provide any sort of usable performance across large data sets."
--Dave Engberg, Evernote
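The de-duplication scheme Engberg describes (MD5 plus length, with a folder tree derived from the checksum) can be sketched roughly as follows. This is a minimal illustration, not Evernote's code: the `ResourceStore` class, the two-level folder depth, and the `digest-length` file name are all assumptions made for the example.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Tuple


class ResourceStore:
    """Toy content-addressed store keyed by (MD5 hex digest, byte length)."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def path_for(self, data: bytes) -> Path:
        digest = hashlib.md5(data).hexdigest()
        # Assumed layout: the first two hex pairs of the digest become nested
        # folders, and the file name combines the digest with the length.
        return self.root / digest[:2] / digest[2:4] / f"{digest}-{len(data)}"

    def put(self, data: bytes) -> Tuple[Path, bool]:
        """Store data; return (path, True if written, False if deduplicated)."""
        path = self.path_for(data)
        if path.exists():
            # Same checksum and same length: treat as the same resource.
            return path, False
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
        return path, True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = ResourceStore(Path(tmp))
        first, created = store.put(b"photo bytes")
        second, created_again = store.put(b"photo bytes")
        print(first == second, created, created_again)
```

Storing the same bytes twice yields the same path and skips the second write, which is the whole point of keying files by their checksum.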
Evernote initially used MyISAM's built-in FTS engine within MySQL to index the searchable text in notes. They tried several approaches with MyISAM, including batch updates, but eventually gave up and switched to Apache Lucene, a proven search library.
Why did they make the change? Evernote had high standards: "When users create or update notes, they expect those notes to immediately match any text searches," said Dave Engberg, the author of the post. Of the options they tried, Lucene was the one that gave them virtually synchronous text indexing as soon as a note is created or updated.
When you use Evernote, your account now has its own Lucene search index, occupying a separate directory on the shard's file system.
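The one-index-per-directory layout, combined with indexing at save time, can be sketched with a toy stand-in for Lucene. Nothing here is Lucene's API or Evernote's code: `OwnerIndex`, `index_for`, the `owner_id` key, and the JSON file format are hypothetical names chosen for the illustration, and a real system would also validate `owner_id` before using it as a directory name.

```python
import json
import re
import tempfile
from pathlib import Path
from typing import Dict, List, Set


class OwnerIndex:
    """Toy inverted index persisted as one JSON file in its own directory."""

    def __init__(self, directory: Path) -> None:
        directory.mkdir(parents=True, exist_ok=True)
        self.file = directory / "index.json"
        self.postings: Dict[str, Set[str]] = {}
        if self.file.exists():
            raw = json.loads(self.file.read_text())
            self.postings = {term: set(ids) for term, ids in raw.items()}

    def add(self, note_id: str, text: str) -> None:
        # Drop old postings first so an updated note leaves no stale matches.
        for ids in self.postings.values():
            ids.discard(note_id)
        for term in re.findall(r"\w+", text.lower()):
            self.postings.setdefault(term, set()).add(note_id)
        # Persist before returning, so the note is searchable immediately
        # (no batch job between saving a note and finding it).
        data = {t: sorted(ids) for t, ids in self.postings.items() if ids}
        self.file.write_text(json.dumps(data))

    def search(self, term: str) -> List[str]:
        return sorted(self.postings.get(term.lower(), set()))


def index_for(root: Path, owner_id: str) -> OwnerIndex:
    """Open the index stored under root/owner_id (one directory per owner)."""
    return OwnerIndex(root / owner_id)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        index_for(root, "alice").add("n1", "Lucene tuning notes")
        index_for(root, "bob").add("n2", "Grocery list")
        print(index_for(root, "alice").search("lucene"))  # ['n1']
        print(index_for(root, "bob").search("lucene"))    # []
```

Because each index lives in its own directory, a search only ever touches one owner's data, and one index can be rebuilt or moved without affecting the others.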
Maintaining the level of performance they wanted was not simple, however: Lucene, MySQL, and even the hardware all needed tuning. Go ahead and read the post if you're interested in the gory details of how they made Lucene work well for them.
Before you do, let's hear some of your thoughts. Do you think Evernote has the right idea? Lucene is currently making twice as many I/O operations as MySQL, but they expect to bring that down with further tuning. Do you think it would be worth the uncertainty and effort to try putting newer, less mature technologies into the solution, such as NoSQL stores or Elasticsearch?