Michael McCandless

Developer at IBM

Lexington, US

Joined Apr 2011

About

Michael loves building software; he's been building search engines for more than a decade, and has been working on Lucene as a committer, PMC member and Apache member for the past few years. He's co-author of the recently published Lucene in Action, 2nd edition. In his spare time Michael enjoys building his own computers, writing software to control his house (mostly in Python), encoding videos and tinkering with all sorts of other things.

Stats

Reputation: 26
Pageviews: 454.1K
Articles: 5
Comments: 0

Articles

Apache Lucene: Fast Range Faceting Using Segment Trees and the Java ASM Library
In Lucene's facet module we recently added support for dynamic range faceting, to show how many hits match each of a dynamic set of ranges. For example, the Updated drill-down in the Lucene/Solr issue search application uses range facets. Another example is distance facets (< 1 km, < 2 km, etc.), where the distance is dynamically computed based on the user's current location. Price faceting might also use range facets, if the ranges cannot be established during indexing.

To implement range faceting, for each hit we first calculate the value (the distance, the age, the price) to be aggregated, and then look up which ranges match that value and increment their counts. Today we use a simple linear search through all ranges, which has O(N) cost, where N is the number of ranges. But this is inefficient!

Segment Trees

There are fun data structures like segment trees and interval trees with O(log(N) + M) cost per lookup, where M is the number of ranges that match the given value. I chose to explore segment trees, as Lucene only requires looking up by a single value (interval trees can also efficiently look up all ranges overlapping a provided range) and also because all the ranges are known up front (interval trees also support dynamically adding or removing ranges). If the ranges will never overlap, you can use a simple binary search; Guava's ImmutableRangeSet takes this approach. However, Lucene's range faceting allows overlapping ranges, so we can't do that.

Segment trees are simple to visualize: you "project" all ranges down on top of one another, creating a one-dimensional Venn diagram, to define the elementary intervals. This classifies the entire range of numbers into a minimal number of distinct ranges, the elementary intervals, such that all points in each elementary interval always match the same set of input ranges. The lookup process is then a binary search to determine which elementary interval a point belongs to, recording the matched ranges as you recurse down the tree.

Consider these ranges; the lower number is inclusive and the upper number is exclusive:

  0: 0 – 10
  1: 0 – 20
  2: 10 – 30
  3: 15 – 50
  4: 40 – 70

The elementary intervals (think Venn diagram!) are:

  -∞ – 0
  0 – 10
  10 – 15
  15 – 20
  20 – 30
  30 – 40
  40 – 50
  50 – 70
  70 – ∞

Finally, you build a binary tree on top of the elementary intervals, and then add output range indices to both internal nodes and the leaves of that tree, necessary to prevent adversarial cases that would require too much (O(N^2)) space. During lookup, as you walk down the tree, you gather up the output ranges (indices) you encounter; for our example, each elementary interval is assigned the following range indices as outputs:

  -∞ – 0 → (none)
  0 – 10 → 0, 1
  10 – 15 → 1, 2
  15 – 20 → 1, 2, 3
  20 – 30 → 2, 3
  30 – 40 → 3
  40 – 50 → 3, 4
  50 – 70 → 4
  70 – ∞ → (none)

Some ranges correspond to 1 elementary interval, while other ranges correspond to 2 or 3 or more, in general. Some elementary intervals, 2 in this example, have no matching input ranges at all.

Looking Up Matched Ranges

I've pushed all sources described below to a new Google code project; the code is still somewhat rough and exploratory, so there are likely exciting bugs lurking, but it does seem to work: it includes (passing!) tests and simple micro-benchmarks.
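
To make the elementary-interval construction concrete, here is a small illustrative sketch (not code from that project) that derives the interval boundaries for inclusive-lower/exclusive-upper long ranges simply by collecting the distinct endpoints; run on the five ranges above, it prints the nine intervals just listed:

import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Illustrative only: computes elementary interval boundaries for ranges of the
// form [lower, upper). The intervals are then (-inf, b0), [b0, b1), ..., [bLast, +inf).
class ElementaryIntervals {
  static long[] boundaries(long[] lowers, long[] uppers) {
    TreeSet<Long> points = new TreeSet<Long>();
    for (int i = 0; i < lowers.length; i++) {
      points.add(lowers[i]);
      points.add(uppers[i]);
    }
    long[] result = new long[points.size()];
    int upto = 0;
    for (long point : points) {
      result[upto++] = point;
    }
    return result;
  }

  public static void main(String[] args) {
    // The five example ranges: 0-10, 0-20, 10-30, 15-50, 40-70
    long[] lowers = {0, 0, 10, 15, 40};
    long[] uppers = {10, 20, 30, 50, 70};
    long[] b = boundaries(lowers, uppers);
    List<String> intervals = new ArrayList<String>();
    intervals.add("-inf - " + b[0]);
    for (int i = 0; i < b.length - 1; i++) {
      intervals.add(b[i] + " - " + b[i + 1]);
    }
    intervals.add(b[b.length - 1] + " - inf");
    System.out.println(intervals);  // the nine elementary intervals
  }
}
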
I started with a basic segment tree implementation as described on the Wikipedia page, for long values, called SimpleLongRangeMultiSet; here's the recursive lookup method:

private int lookup(Node node, long v, int[] answers, int upto) {
  if (node.outputs != null) {
    for (int range : node.outputs) {
      answers[upto++] = range;
    }
  }
  if (node.left != null) {
    if (v <= node.left.end) {
      upto = lookup(node.left, v, answers, upto);
    } else {
      upto = lookup(node.right, v, answers, upto);
    }
  }
  return upto;
}

This worked correctly, but I realized there must be non-trivial overhead for the recursion, checking for nulls, the for loop over the output values, etc. Next, I tried switching to parallel arrays to hold the binary tree (ArrayLongRangeMultiSet), where the left child of node N is at 2*N and the right child is at 2*N+1, but this turned out to be slower.

After that I tested a code-specializing implementation, first by creating dynamic Java source code from the binary tree. This eliminates the recursion and creates a single simple method that uses a series of if statements, specialized to the specific ranges, to do the binary search and record the range indices. Here's the resulting specialized code, compiled from the above ranges:

void lookup(long v, int[] answers) {
  int upto = 0;
  if (v <= 19) {
    if (v <= 9) {
      if (v >= 0) {
        answers[upto++] = 0;
        answers[upto++] = 1;
      }
    } else {
      answers[upto++] = 1;
      answers[upto++] = 2;
      if (v >= 15) {
        answers[upto++] = 3;
      }
    }
  } else {
    if (v <= 39) {
      answers[upto++] = 3;
      if (v <= 29) {
        answers[upto++] = 2;
      }
    } else {
      if (v <= 49) {
        answers[upto++] = 3;
        answers[upto++] = 4;
      } else {
        if (v <= 69) {
          answers[upto++] = 4;
        }
      }
    }
  }
}

Finally, using the ASM library, I compiled the tree directly to specialized Java bytecode, and this proved to be fastest (up to 2.5X faster in some cases). As a baseline, I also added the simple linear search method, LinearLongRangeMultiSet; as long as you don't have too many ranges (say 10 or fewer), its performance is often better than the Java segment tree. The implementation also allows you to specify the allowed range of input values (for example, maybe all values are >= 0 in your usage), which can save an if statement or two in the lookup method.

Counting All Matched Ranges

While the segment tree allows you to quickly look up all matching ranges for a given value, after a nice tip from fellow Lucene committer Robert Muir, we realized Lucene's range faceting does not need to know the matching ranges for each value; it only requires the aggregated counts for each range in the end, after seeing many values. This leads to an optimization: compute counts for each elementary interval and then, at the end, roll up those counts to get the count for each range. This only works for single-valued fields, since for a multi-valued field you'd need to carefully never increment the same range more than once per hit.

Based on that approach, I created a new LongRangeCounter abstract base class, the SimpleLongRangeCounter Java implementation, and the ASM specialized version, and the results are indeed faster (~20 to 50%) than using the lookup method to count; I'll use this approach with Lucene.
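
The count-then-roll-up idea is easy to sketch. The outline below is illustrative only: it is not the project's LongRangeCounter, it uses a plain binary search over the interval boundaries rather than a specialized tree, and it assumes the boundaries and a precomputed mapping from each range to the elementary intervals it covers:

// Illustrative sketch of "count elementary intervals per hit, roll up at the end".
class RollupRangeCounter {
  private final long[] boundaries;        // sorted distinct range endpoints
  private final int[][] rangeToIntervals; // for each range, the elementary intervals it covers
  private final int[] intervalCounts;     // interval 0 is (-inf, b0); interval i is [b(i-1), b(i))

  RollupRangeCounter(long[] boundaries, int[][] rangeToIntervals) {
    this.boundaries = boundaries;
    this.rangeToIntervals = rangeToIntervals;
    this.intervalCounts = new int[boundaries.length + 1];
  }

  // Called once per hit: O(log N) binary search, no per-hit range outputs at all.
  void add(long v) {
    int lo = 0, hi = boundaries.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (v < boundaries[mid]) {
        hi = mid;
      } else {
        lo = mid + 1;
      }
    }
    intervalCounts[lo]++;  // lo = number of boundaries <= v = elementary interval index
  }

  // Called once at the end: roll elementary-interval counts up to range counts.
  int[] rollup() {
    int[] rangeCounts = new int[rangeToIntervals.length];
    for (int r = 0; r < rangeToIntervals.length; r++) {
      for (int interval : rangeToIntervals[r]) {
        rangeCounts[r] += intervalCounts[interval];
      }
    }
    return rangeCounts;
  }
}
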
Segment trees are normally "perfectly" balanced, but one final twist I explored was to use a training set of values to bias the order of the if statements. For example, if your ranges cover a tiny portion of the search space, as is the case for the Updated drill-down, then it should be faster to use a slightly unbalanced tree, by first checking if the value is less than the maximum range. However, in testing, while there are some cases where this "training" is a bit faster, often it's slower; I'm not sure why.

Lucene

I haven't folded this into Lucene yet, but I plan to; all the exploratory code lives in the segment-trees Google code project for now. Results on the micro-benchmarks can be entirely different once the implementations are folded into a "real" search application. While ASM is a powerful way to generate specialized code, and it gives sizable performance gains at least in the micro-benchmarks, it is an added dependency and complexity for ongoing development, and many more developers know Java than ASM. It may also confuse hotspot, causing deoptimizations when there are multiple implementations for an abstract base class. Furthermore, if there are many ranges, the resulting specialized bytecode can become quite large (though still O(N*log(N)) in size), which may cause other problems. On balance, I'm not sure the sizable performance gains (on a micro-benchmark) warrant using ASM in Lucene's range faceting.
December 17, 2013
· 9,777 Views
Searching relational content with Lucene's BlockJoinQuery
Lucene's 3.4.0 release adds a new feature called index-time join (also sometimes called sub-documents, nested documents or parent/child documents), enabling efficient indexing and searching of certain types of relational content.

Most search engines can't directly index relational content, as documents in the index logically behave like a single flat database table. Yet relational content is everywhere! A job listing site has each company joined to the specific listings for that company. Each resume might have separate lists of skills, education and past work experience. A music search engine has an artist/band joined to albums and then joined to songs. A source code search engine would have projects joined to modules and then files. Perhaps the PDF documents you need to search are immense, so you break them up and index each section as a separate Lucene document; in this case you'll have common fields (title, abstract, author, date published, etc.) for the overall document, joined to the sub-document (section) with its own fields (text, page number, etc.). XML documents typically contain nested tags, representing joined sub-documents; emails have attachments; office documents can embed other documents. Nearly all search domains have some form of relational content, often requiring more than one join.

If such content is so common, then how do search applications handle it today? One obvious "solution" is to simply use a relational database instead of a search engine! If relevance scores are less important and you need to do substantial joining, grouping, sorting, etc., then using a database could be best overall. Most databases include some form of text search, some even using Lucene.

If you still want to use a search engine, then one common approach is to denormalize the content up front, at index-time, by joining all tables and indexing the resulting rows, duplicating content in the process. For example, you'd index each song as a Lucene document, copying over all fields from the song's joined album and artist/band. This works correctly, but can be horribly wasteful as you are indexing identical fields, possibly including large text fields, over and over.

Another approach is to do the join yourself, outside of Lucene, by indexing songs, albums and artist/band as separate Lucene documents, perhaps even in separate indices. At search-time, you first run a query against one collection, for example the songs. Then you iterate through all hits, gathering up (joining) the full set of corresponding albums, and then run a second query against the albums, with a large OR'd list of the albums from the first query, repeating this process if you need to join to artist/band as well. This approach will also work, but doesn't scale well as you may have to create possibly immense follow-on queries.

Yet another approach is to use a software package that has already implemented one of these approaches for you! elasticsearch, Apache Solr, Apache Jackrabbit, Hibernate Search and many others all handle relational content in some way.

With BlockJoinQuery you can now directly search relational content yourself! Let's work through a simple example: imagine you sell shirts online. Each shirt has certain common fields such as name, description, fabric, price, etc. For each shirt you have a number of separate stock keeping units or SKUs, which have their own fields like size, color, inventory count, etc. The SKUs are what you actually sell, and what you must stock, because when someone buys a shirt they buy a specific SKU (size and color). Maybe you are lucky enough to sell the incredible Mountain Three-wolf Moon Short Sleeve Tee, with these SKUs (size, color):

  • small, blue
  • small, black
  • medium, black
  • large, gray

Perhaps a user first searches for "wolf shirt", gets a bunch of hits, and then drills down on a particular size and color, resulting in this query:

name:wolf AND size=small AND color=blue

which should match this shirt. name is a shirt field, while size and color are SKU fields. But if the user drills down instead on a small gray shirt:

name:wolf AND size=small AND color=gray

then this shirt should not match, because the small size only comes in blue and black.

How can you run these queries using BlockJoinQuery? Start by indexing each shirt (parent) and all of its SKUs (children) as separate documents, using the new IndexWriter.addDocuments API to add one shirt and all of its SKUs as a single document block. This method atomically adds a block of documents into a single segment as adjacent document IDs, which BlockJoinQuery relies on. You should also add a marker field to each shirt document (e.g. type = shirt), as BlockJoinQuery requires a Filter identifying the parent documents.
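
In code, the indexing side might look roughly like the sketch below. This is illustrative only, using the 3.x Field API and assuming writer is your IndexWriter; the child (SKU) documents come first and the parent (shirt) document, carrying the marker field, goes last in its block:

// Illustrative sketch: index one shirt and its SKUs as a single document block.
List<Document> block = new ArrayList<Document>();

String[][] skus = {{"small", "blue"}, {"small", "black"},
                   {"medium", "black"}, {"large", "gray"}};
for (String[] sku : skus) {
  Document skuDoc = new Document();
  skuDoc.add(new Field("size", sku[0], Field.Store.YES, Field.Index.NOT_ANALYZED));
  skuDoc.add(new Field("color", sku[1], Field.Store.YES, Field.Index.NOT_ANALYZED));
  block.add(skuDoc);
}

Document shirt = new Document();
shirt.add(new Field("type", "shirt", Field.Store.NO, Field.Index.NOT_ANALYZED)); // parent marker
shirt.add(new Field("name", "Mountain Three-wolf Moon Short Sleeve Tee",
                    Field.Store.YES, Field.Index.ANALYZED));
block.add(shirt);  // parent document last in the block

writer.addDocuments(block);  // added atomically, as adjacent document IDs
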
To run a BlockJoinQuery at search-time, you'll first need to create the parent filter, matching only shirts. Note that the filter must use FixedBitSet under the hood, like CachingWrapperFilter:

Filter shirts = new CachingWrapperFilter(
    new QueryWrapperFilter(
        new TermQuery(
            new Term("type", "shirt"))));

Create this filter once, up front, and re-use it any time you need to perform this join. Then, for each query that requires a join, because it involves both SKU and shirt fields, start with the child query matching only SKU fields:

BooleanQuery skuQuery = new BooleanQuery();
skuQuery.add(new TermQuery(new Term("size", "small")), Occur.MUST);
skuQuery.add(new TermQuery(new Term("color", "blue")), Occur.MUST);

Next, use BlockJoinQuery to translate hits from the SKU document space up to the shirt document space:

BlockJoinQuery skuJoinQuery = new BlockJoinQuery(
    skuQuery, shirts, ScoreMode.None);

The ScoreMode enum decides how scores for multiple SKU hits should be aggregated to the score for the corresponding shirt hit. In this query you don't need scores from the SKU matches, but if you did, you could aggregate with Avg, Max or Total instead. Finally, you are now free to build up an arbitrary shirt query using skuJoinQuery as a clause:

BooleanQuery query = new BooleanQuery();
query.add(new TermQuery(new Term("name", "wolf")), Occur.MUST);
query.add(skuJoinQuery, Occur.MUST);

You could also just run skuJoinQuery as-is if the query doesn't have any shirt fields. Finally, just run this query like normal! The returned hits will be only shirt documents; if you'd also like to see which SKUs matched for each shirt, use BlockJoinCollector:

BlockJoinCollector c = new BlockJoinCollector(
    Sort.RELEVANCE, // sort
    10,             // numHits
    true,           // trackScores
    false           // trackMaxScore
    );
searcher.search(query, c);

The provided Sort must use only shirt fields (you cannot sort by any SKU fields). When each hit (a shirt) is competitive, this collector will also record all SKUs that matched for that shirt, which you can retrieve like this:

TopGroups hits = c.getTopGroups(
    skuJoinQuery,
    skuSort,
    0,    // offset
    10,   // maxDocsPerGroup
    0,    // withinGroupOffset
    true  // fillSortFields
    );

Set skuSort to the sort order for the SKUs within each shirt. The first offset hits are skipped (use this for paging through shirt hits). Under each shirt, at most maxDocsPerGroup SKUs will be returned. Use withinGroupOffset if you want to page within the SKUs. If fillSortFields is true, then each SKU hit will have values for the fields from skuSort.
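
To give a feel for consuming those grouped results, here is an illustrative sketch (not from the original post) that walks the returned groups; it assumes the grouping module's TopGroups/GroupDocs shapes, where each group value is the shirt's docID and the group's scoreDocs are its matching SKU hits:

// Illustrative only: iterate shirts (groups), then the SKU hits within each shirt.
for (GroupDocs shirtGroup : hits.groups) {
  int shirtDocID = (Integer) shirtGroup.groupValue;    // parent (shirt) docID
  Document shirtDoc = searcher.doc(shirtDocID);
  System.out.println("shirt: " + shirtDoc.get("name")
      + " (" + shirtGroup.totalHits + " matching SKUs)");
  for (ScoreDoc skuHit : shirtGroup.scoreDocs) {        // child (SKU) hits
    Document skuDoc = searcher.doc(skuHit.doc);
    System.out.println("  sku: size=" + skuDoc.get("size")
        + " color=" + skuDoc.get("color"));
  }
}
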
The hits returned by BlockJoinCollector.getTopGroups are SKU hits, grouped by shirt. You'd get the exact same results if you had denormalized up-front and then used grouping to group results by shirt. You can also do more than one join in a single query; the joins can be nested (parent to child to grandchild) or parallel (parent to child1 and parent to child2).

However, there are some important limitations of index-time joins:

  • The join must be computed at index-time and "compiled" into the index, in that all joined child documents must be indexed along with the parent document, as a single document block.
  • Different document types (for example, shirts and SKUs) must share a single index, which is wasteful as it means non-sparse data structures like FieldCache entries consume more memory than they would if you had separate indices.
  • If you need to re-index a parent document or any of its child documents, or delete or add a child, then the entire block must be re-indexed. This is a big problem in some cases; for example, if you index "user reviews" as child documents, then whenever a user adds a review you'll have to re-index that shirt as well as all its SKUs and user reviews.
  • There is no QueryParser support, so you need to programmatically create the parent and child queries, separating according to parent and child fields.
  • The join can currently only go in one direction (mapping child docIDs to parent docIDs), but in some cases you need to map parent docIDs to child docIDs. For example, when searching songs, perhaps you want all matching songs sorted by their title. You can't easily do this today because the only way to get song hits is to group by album or band/artist.
  • The join is a one (parent) to many (children), inner join.

As usual, patches are welcome! There is work underway to create a more flexible, but likely less performant, query-time join capability, which should address a number of the above limitations.

Source: http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html
January 9, 2012
· 12,252 Views
Lucene's near-real-time search is fast!
Lucene's near-real-time (NRT) search feature, available since 2.9, enables an application to make index changes visible to a new searcher with fast turnaround time. In some cases, such as modern social/news sites (e.g., LinkedIn, Twitter, Facebook, Stack Overflow, Hacker News, DZone, etc.), fast turnaround time is a hard requirement.

Fortunately, it's trivial to use. Just open your initial NRT reader, like this:

// w is your IndexWriter
IndexReader r = IndexReader.open(w, true);

(That's the 3.1+ API; prior to that, use w.getReader() instead.) The returned reader behaves just like one opened with IndexReader.open: it exposes the point-in-time snapshot of the index as of when it was opened. Wrap it in an IndexSearcher and search away!

Once you've made changes to the index, call r.reopen() and you'll get another NRT reader; just be sure to close the old one. What's special about the NRT reader is that it searches uncommitted changes from IndexWriter, enabling your application to decouple fast turnaround time from index durability on crash (i.e., how often commit is called), something not previously possible.

Under the hood, when an NRT reader is opened, Lucene flushes indexed documents as a new segment, applies any buffered deletions to in-memory bit-sets, and then opens a new reader showing the changes. The reopen time is in proportion to how many changes you made since last reopening that reader.

Lucene's approach is a nice compromise between immediate consistency, where changes are visible after each index change, and eventual consistency, where changes are visible "later" but you don't usually know exactly when. With NRT, your application has controlled consistency: you decide exactly when changes must become visible.

Recently there have been some good improvements related to NRT:

  • A new default merge policy, TieredMergePolicy, which is able to select more efficient non-contiguous merges and favors segments with more deletions.
  • NRTCachingDirectory takes load off the IO system by caching small segments in RAM (LUCENE-3092).
  • When you open an NRT reader you can now optionally specify that deletions do not need to be applied, making reopen faster for those cases that can tolerate temporarily seeing deleted documents returned, or have some other means of filtering them out (LUCENE-2900).
  • Segments that are 100% deleted are now dropped instead of inefficiently merged (LUCENE-2010).

How fast is NRT search? I created a simple performance test to answer this. I first built a starting index by indexing all of Wikipedia's content (25 GB plain text), broken into 1 KB sized documents. Using this index, the test then re-indexes all the documents again, this time at a fixed rate of 1 MB/second plain text. This is a very fast rate compared to the typical NRT application; for example, it's almost twice as fast as Twitter's recent peak during this year's Super Bowl (4,064 tweets/second), assuming every tweet is 140 bytes, and assuming Twitter indexed all tweets on a single shard. The test uses updateDocument, replacing documents by randomly selected ID, so that Lucene is forced to apply deletes across all segments. In addition, 8 search threads run a fixed TermQuery at the same time. Finally, the NRT reader is reopened once per second.

I ran the test on modern hardware, a 24 core machine (dual X5680 Xeon CPUs) with an OCZ Vertex 3 240 GB SSD, using Oracle's 64 bit Java 1.6.0_21 and Linux Fedora 13. I gave Java a 2 GB max heap, and used MMapDirectory.
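
The once-per-second reopen in this test is just the basic NRT pattern in a loop. Here is an illustrative sketch using the pre-4.0 API shown above; w is your IndexWriter, stillRunning is an assumed shutdown flag, and exception handling is omitted:

// Illustrative sketch of a periodic NRT reopen loop (3.1-era API).
IndexReader r = IndexReader.open(w, true);
IndexSearcher searcher = new IndexSearcher(r);

while (stillRunning) {
  Thread.sleep(1000);                  // reopen once per second
  IndexReader newReader = r.reopen();  // sees uncommitted changes from w
  if (newReader != r) {
    r.close();                         // always close the old reader
    r = newReader;
    searcher = new IndexSearcher(r);
  }
  // Search threads use the current searcher; indexing threads keep calling
  // w.updateDocument(...) concurrently.
}
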
The test ran for 6 hours 25 minutes, since that's how long it takes to re-index all of Wikipedia at a limited rate of 1 MB/sec. Here's the resulting QPS and NRT reopen delay (milliseconds) over that time (the interactive Dygraph is on the original post): the search QPS is green and the time to reopen each reader (NRT reopen delay in milliseconds) is blue; on the original post you can zoom in to any interesting region by clicking and dragging, and apply smoothing by entering a window size into the text box in the bottom left part of the graph.

Search QPS dropped substantially with time. While annoying, this is expected, because of how deletions work in Lucene: documents are merely marked as deleted, and thus are still visited but then filtered out during searching. They are only truly deleted when the segments are merged. TermQuery is a worst-case query; harder queries, such as BooleanQuery, should see less slowdown from deleted, but not reclaimed, documents. Since the starting index had no deletions, and then picked up deletions over time, the QPS dropped. It looks like TieredMergePolicy should perhaps be even more aggressive in targeting segments with deletions; however, finally around 5:40 a very large merge (reclaiming many deletions) was kicked off. Once it finished, the QPS recovered somewhat. Note that a real NRT application with deletions would see a more stable QPS, since the index in "steady state" would always have some number of deletions in it; starting from a fresh index with no deletions is not typical.

Reopen delay during merging

The reopen delay is mostly around 55-60 milliseconds (mean is 57.0), which is very fast (i.e., only a 5.7% "duty cycle" of the every-1.0-second reopen rate). There are random single spikes, which are caused by Java running a full GC cycle. However, large merges can slow down the reopen delay (once around 1:14, again at 3:34, and then the very large merge starting at 5:40). Many small merges (up to a few 100s of MB) were done but don't seem to impact reopen delay. Large merges have been a challenge in Lucene for some time, also causing trouble for ongoing searching.

I'm not yet sure why large merges so adversely impact reopen time; there are several possibilities. It could be simple IO contention: a merge keeps the IO system very busy reading and writing many bytes, thus interfering with any IO required during reopen. However, if that were the case, NRTCachingDirectory (used by the test) should have prevented it, but didn't. It's also possible that the OS is [poorly] choosing to evict important process pages, such as the terms index, in favor of IO caching, causing the term lookups required when applying deletes to hit page faults; however, this also shouldn't be happening in my test since I've set Linux's swappiness to 0. Yet another possibility is that Linux's write cache becomes temporarily too full, thus stalling all IO in the process until it clears; in this case perhaps tuning some of Linux's pdflush tunables could help, although I'd much rather find a Lucene-only solution so this problem can be fixed without users having to tweak such advanced OS tunables, even swappiness.

Fortunately, we have an active Google Summer of Code student, Varun Thacker, working on enabling Directory implementations to pass appropriate flags to the OS when opening files for merging (LUCENE-2793 and LUCENE-2795).
From past testing I know that passing O_DIRECT can prevent merges from evicting hot pages, so it's possible this will fix our slow reopen time as well, since it bypasses the write cache. Finally, it's always possible other OSs do a better job managing the buffer cache and wouldn't see such reopen delays during large merges. This issue is still a mystery, as there are many possibilities, but we'll eventually get to the bottom of it. It could be we should simply add our own IO throttling, so we can control the net MB/sec read and written by merging activity. This would make a nice addition to Lucene!

Except for the slowdown during merging, the performance of NRT is impressive. Most applications will have a required indexing rate far below 1 MB/sec per shard, and for most applications reopening once per second is fast enough. While there are exciting ideas to bring true real-time search to Lucene, by directly searching IndexWriter's RAM buffer as Michael Busch has implemented at Twitter with some cool custom extensions to Lucene, I doubt even the most demanding social apps actually truly need better performance than we see today with NRT.

NIOFSDirectory vs MMapDirectory

Out of curiosity, I ran the exact same test as above, but this time with NIOFSDirectory instead of MMapDirectory (again, the graph is on the original post). There are some interesting differences. The search QPS is substantially slower -- starting at 107 QPS vs 151 -- though part of this could easily be from getting different compilation out of hotspot; for some reason TermQuery, in particular, has high variance from one JVM instance to another. The mean reopen time is slower: 67.7 milliseconds vs 57.0, and the reopen time seems more affected by the number of segments in the index (this is the saw-tooth pattern in the graph, matching when minor merges occur). The takeaway message seems clear: on Linux, use MMapDirectory, not NIOFSDirectory!

Optimizing your NRT turnaround time

My test was just one datapoint, at a fixed fast reopen period (once per second) and at a high indexing rate (1 MB/sec plain text). You should test specifically for your use-case what reopen rate works best. Generally, the more frequently you reopen, the faster the turnaround time will be, since fewer changes need to be applied; however, frequent reopening will reduce the maximum indexing rate. Most apps have relatively low required indexing rates compared to what Lucene can handle and can thus pick a reopen rate to suit the application's turnaround time requirements.

There are also some simple steps you can take to reduce the turnaround time:

  • Store the index on a fast IO system, ideally a modern SSD.
  • Install a merged segment warmer (see IndexWriter.setMergedSegmentWarmer). This warmer is invoked by IndexWriter to warm up a newly merged segment without blocking the reopen of a new NRT reader. If your application uses Lucene's FieldCache or has its own caches, this is important, as otherwise that warming cost will be spent on the first query to hit the new reader.
  • Use only as many indexing threads as needed to achieve your required indexing rate; often 1 thread suffices. The fewer threads used for indexing, the faster the flushing, and the less merging (on trunk).
  • If you are using Lucene's trunk, and your changes include deleting or updating prior documents, then use the Pulsing codec for your id field, since this gives faster lookup performance, which will make your reopen faster.
  • Use the new NRTCachingDirectory, which buffers small segments in RAM to take load off the IO system (LUCENE-3092).
  • Pass false for applyDeletes when opening an NRT reader, if your application can tolerate seeing deleted docs from the returned reader.
  • While it's not clear that thread priorities actually work correctly (see this Google Tech Talk), you should still set your thread priorities properly: the thread reopening your readers should be highest; next should be your indexing threads; and finally, lowest should be all searching threads. If the machine becomes saturated, ideally only the search threads should take the hit.

Happy near-real-time searching!
July 11, 2011
· 18,340 Views
Lucene's indexing is fast!
Wikipedia periodically exports all of the content on their site, providing a nice corpus for performance testing. I downloaded their most recent English XML export: it uncompresses to a healthy 21 GB of plain text! Then I fully indexed this with Lucene's current trunk (to be 4.0): it took 13 minutes and 9 seconds, or 95.8 GB/hour -- not bad!

Here are the details: I first pre-process the XML file into a single-line file, whereby each doc's title, date, and body are written to a single line, and then index from this file, so that I measure "pure" indexing cost. Note that a real app would likely have a higher document creation cost here, perhaps having to pull documents from a remote database or from separate files, run filters to extract text from PDFs or MS Office docs, etc. I use Lucene's contrib/benchmark package to do the indexing; here's the alg I used:

analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
content.source = org.apache.lucene.benchmark.byTask.feeds.LineDocSource
docs.file = /lucene/enwiki-20100904-pages-articles.txt
doc.stored = true
doc.term.vector = false
doc.tokenized = false
doc.body.stored = false
doc.body.tokenized = true
log.step.AddDoc=10000
directory=FSDirectory
compound=false
ram.flush.mb = 256
work.dir=/lucene/indices/enwiki
content.source.forever = false

CreateIndex
{ "BuildIndex"
  [ { "AddDocs" AddDoc > : * ] : 6
  - CloseIndex
}
RepSumByPrefRound BuildIndex

There is no field truncation taking place, since this is now disabled by default -- every token in every Wikipedia article is being indexed. I tokenize the body field, and don't store it, and don't tokenize the title and date fields, but do store them. I use StandardAnalyzer, and I include the time to close the index, which means IndexWriter waits for any running background merges to complete. The index only has 4 fields -- title, date, body, and docid.

I've done a few things to speed up the indexing:

  • Increase IndexWriter's RAM buffer from the default 16 MB to 256 MB
  • Run with 6 threads
  • Disable compound file format
  • Reuse document/field instances (contrib/benchmark does this by default)

Lucene's wiki describes additional steps you can take to speed up indexing.

Both the source lines file and the index are on an Intel X25-M SSD, and I'm running it on a modern machine, with dual Xeon X5680s, overclocked to 4.0 GHz, with 12 GB RAM, running Fedora Linux. Java is 64-bit 1.6.0_21-b06, and I run as java -server -Xmx2g -Xms2g. I could certainly give it more RAM, but it's not really needed. The resulting index is 6.9 GB.
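
Outside of contrib/benchmark, that tuning translates roughly into the IndexWriter setup below. This is an illustrative sketch, not the benchmark's own code: the Version constant should match your release, and the exact way to disable the compound file format varies by version (shown here via LogByteSizeMergePolicy):

// Illustrative sketch of the equivalent IndexWriter tuning.
Directory dir = FSDirectory.open(new File("/lucene/indices/enwiki"));

IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_CURRENT,
    new StandardAnalyzer(Version.LUCENE_CURRENT));
iwc.setRAMBufferSizeMB(256);              // up from the 16 MB default

LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy();
mp.setUseCompoundFile(false);             // disable compound file format
iwc.setMergePolicy(mp);

IndexWriter writer = new IndexWriter(dir, iwc);
// ...then feed documents from 6 concurrent threads, each calling
// writer.addDocument(doc), and finally writer.close() (waits for merges).
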
Out of curiosity, I made a small change to contrib/benchmark to print the ingest rate over time (over a 100-second window). A large part (slightly over half!) of the time, the ingest rate is 0; this is not good! This happens because the flushing process, which writes a new segment when the RAM buffer is full, is single-threaded and blocks all indexing while it's running. This is a known issue, and is actively being addressed under LUCENE-2324. Flushing is CPU intensive -- the decode and re-encode of the great many vInts is costly. Computers usually have big write caches these days, so the IO shouldn't be a bottleneck. With LUCENE-2324, each indexing thread state will flush its own segment, privately, which will allow us to make full use of CPU concurrency, IO concurrency, and concurrency across CPUs and the IO system.

Once this is fixed, Lucene should be able to make full use of the hardware, i.e., fully saturate either concurrent CPU or concurrent IO, such that whichever is the bottleneck in your context gates your ingest rate. Then maybe we can double this already fast ingest rate!
May 15, 2011
· 14,888 Views
Just say no to swapping!
Imagine you love to cook; it's an intense hobby of yours. Over time, you've accumulated many fun spices, but your pantry is too small, so you rent an off-site storage facility and move the less frequently used spice racks there. Problem solved!

Suddenly you decide to cook this great new recipe. You head to the pantry to retrieve your Saffron, but it's not there! It was moved out to the storage facility and must now be retrieved (this is a hard page fault). No problem -- your neighbor volunteers to go fetch it for you. Unfortunately, the facility is ~2,900 miles away, all the way across the US, so it takes your friend 6 days to retrieve it!

This assumes you normally take 7 seconds to retrieve a spice from the pantry; that your data was in main memory (~100 nanoseconds access time), not in the CPU's caches (which'd be maybe 10 nanoseconds); that your swap file is on a fast (say, WD Raptor) spinning-magnets hard drive with 5 millisecond average access time; and that your neighbor drives non-stop at 60 mph to the facility and back.

Even worse, your neighbor drives a motorcycle, and so he can only retrieve one spice rack at a time. So, after waiting 6 days for the Saffron to come back, when you next go to the pantry to get some Paprika, it's also "swapped out" and you must wait another 6 days! It's possible that first spice rack also happened to have the Paprika, but it's also likely it did not; that depends on your spice locality. Also, with each trip, your neighbor must pick a spice rack to move out to the facility, so that the returned spice rack has a place to go (it is a "swap", after all), so the Paprika could have just been swapped out! Sadly, it might easily be many weeks until you succeed in cooking your dish.

Maybe in the olden days, when memory itself was a core of little magnets, swapping cost wasn't so extreme, but these days, as memory access time has improved drastically while hard drive access time hasn't budged, the disparity is now unacceptable. Swapping has become a badly leaking abstraction. When a typical process (say, your e-mail reader) has to "swap back in" after not being used for a while, it can hit 100s of such page faults before finishing redrawing its window. It's an awful experience, though it has the fun side effect of letting you see, in slow motion, just what precise steps your email reader goes through when redrawing its window.

Swapping is especially disastrous with JVM processes. See, the JVM generally won't do a full GC cycle until it has run out of its allowed heap, so most of your heap is likely occupied by not-yet-collected garbage. Since these pages aren't being touched (because they are garbage and thus unreferenced), the OS happily swaps them out. When GC finally runs, you have a ridiculous swap storm, pulling in all these pages only to then discover that they are in fact filled with garbage and should be discarded; this can easily make your GC cycle take many minutes! It'd be better if the JVM could work more closely with the OS so that GC would somehow run on-demand whenever the OS wants to start swapping so that, at least, we never swap out garbage. Until then, make sure you don't set your JVM's heap size too large!

Just use an SSD...

These days, many machines ship with solid state disks, which are an astounding (though still costly) improvement over spinning magnets; once you've used an SSD you can never go back; it's just one of life's many one-way doors.
You might be tempted to declare that this problem is solved, since SSDs are so blazingly fast, right? Indeed, they are orders of magnitude faster than spinning magnets, but they are still 2-3 orders of magnitude slower than main memory or CPU cache. The typical SSD might have 50 microseconds access time, which equates to ~58 total miles of driving at 60 mph. Certainly a huge improvement, but still unacceptable if you want to cook your dish on time!

Just add RAM...

Another common workaround is to put lots of RAM in your machine, but this can easily backfire: operating systems will happily swap out memory pages in favor of caching IO pages, so if you have any processes accessing lots of bytes (say, mencoder encoding a 50 GB Blu-ray movie, maybe a virus checker or backup program, or even Lucene searching against a large index or doing a large merge), the OS will swap your pages out. This then means that the more RAM you have, the more swapping you get, and the problem only gets worse! Fortunately, some OSs let you control this behavior: on Linux, you can tune swappiness down to 0 (most Linux distros default this to a highish number); Windows also has a checkbox, under My Computer -> Properties -> Advanced -> Performance Settings -> Advanced -> Memory Usage, that lets you favor Programs or System Cache, which is likely doing something similar. There are low-level IO flags that these programs are supposed to use so that the OS knows not to cache the pages they access, but sometimes the processes fail to use them or cannot use them (for example, they are not yet exposed to Java), and even if they do, sometimes the OS ignores them!

When swapping is OK

If your computer never runs any interactive processes, i.e., a process where a human is blocked (waiting) on the other end for something to happen, and only runs batch processes which tend to be active at different times, then swapping can be an overall win, since it allows the process which is active to make nearly-full use of the available RAM. Net/net, over time, this will give greater overall throughput for the batch processes on the machine. But remember that the server running your web site is an interactive process; if your server processes (web/app server, database, search server, etc.) are stuck swapping, your site has for all intents and purposes become unusable to your users.

This is a fixable problem

Most processes have known data structures that consume substantial RAM, and in many cases these processes could easily discard and later regenerate their data structures in much less time than even a single page fault. Caches can simply be pruned or discarded since they will self-regenerate over time. These data structures should never be swapped out, since regeneration is far cheaper. Somehow the OS should ask each RAM-intensive and least-recently-accessed process to discard its data structures to free up RAM, instead of swapping out the pages occupied by the data structure. Of course, this would require a tighter interaction between the OS and processes than exists today; Java's SoftReference is close, except it only works within a single JVM and does not interact with the OS.

What can you do?

Until this problem is solved for real, the simplest workaround is to disable swapping entirely and stuff as much RAM as you can into the machine. RAM is cheap, memory modules are dense, and modern motherboards accept many modules. This is what I do. Of course, with this approach, when you run out of RAM stuff will start failing.
If the software is well written, it'll fail gracefully: your browser will tell you it cannot open a new window or visit a new page. If it's poorly written it will simply crash, thus quickly freeing up RAM and hopefully not losing any data or corrupting any files in the process. Linux takes the simple draconian approach of picking a memory-hogging process and SIGKILL'ing it. If you don't want to disable swapping you should at least tell the OS not to swap pages out for IO caching.

Just say no to swapping!
April 27, 2011
· 25,417 Views
