Apache Lucene and Solr 3.6 Release! New Language Analysis, Joins, and Finite-State APIs
Lucene / Solr 3.6 has been released and is available for download.
As release manager, here’s my take on the new features:
- Language analysis:
- Newly added morphological analysis and part-of-speech tagger for the Japanese language, geared toward search, contributed by Christian Moen.
- CJK analysis improvements inspired by the folks at HathiTrust, who are indexing terabytes of text across hundreds of languages with Apache Solr. I encourage you to read their blog if you are interested in large-scale search challenges.
- Lucene/Solr analysis for many languages was tuned and simplified.
For instance, to get started with the new Japanese capabilities described above, simply use the text_ja field type defined in Solr’s example schema.xml. We configured field types like this for 30 languages out of the box.
- Joins:
- Ability to do index-time block-joins in the opposite direction, useful when you have indexed a parent-child relationship already but sometimes want ungrouped child documents as the result.
- Addition of query-time joins: an alternative when index-time joins are not feasible.
- Important bugfixes to index-time joins.
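To make the idea of a query-time join concrete, here is a minimal conceptual sketch in plain Python. It is not the Lucene or Solr API: the function name, the document dictionaries, and the field names are all hypothetical, chosen only to illustrate the two-pass idea of collecting join keys from one set of documents and then matching them against another field.

```python
def query_time_join(docs, from_query, from_field, to_field):
    """Return docs whose `to_field` value appears in the `from_field`
    of any doc matched by `from_query` (a predicate on a doc)."""
    # Pass 1: collect join keys from the "from" side of the join.
    join_keys = {
        doc[from_field]
        for doc in docs
        if from_field in doc and from_query(doc)
    }
    # Pass 2: keep "to"-side docs whose key was collected in pass 1.
    return [doc for doc in docs if doc.get(to_field) in join_keys]


# Hypothetical data: products and offers stored as independent documents.
docs = [
    {"id": "p1", "type": "product", "name": "laptop"},
    {"id": "p2", "type": "product", "name": "phone"},
    {"id": "o1", "type": "offer", "product_id": "p1", "price": 900},
    {"id": "o2", "type": "offer", "product_id": "p2", "price": 300},
    {"id": "o3", "type": "offer", "product_id": "p1", "price": 850},
]


def cheap_offer(doc):
    return doc.get("type") == "offer" and doc["price"] < 500


# Find products that have at least one offer under 500.
matches = query_time_join(docs, cheap_offer, "product_id", "id")
```

Because the relationship is resolved at search time, the documents can be indexed in any order, which is the case where index-time block-joins are not feasible. The trade-off is the extra pass over the matching documents on every query.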
- Auto-suggest and finite-state APIs:
- New Weighted FST suggester that offers more fine-grained ranking for suggestions.
- FST APIs were extended to support reverse-lookups for monotonically increasing outputs, and support n-shortest-path algorithms by weight.
- Improved suggester API that exploits our incremental automata construction to build suggester FSTs from huge amounts of data.
- FST compression support, based on research by Lucene/Solr committer Dawid Weiss.
- Additions to Apache Solr for easier integration of phrase-based auto-suggest, e.g. for previous phrases recorded from query logs.
- A new index pruning module with configurable policies supports faster and smaller indexes that give similar relevance to a complete index.
- Added a phonetic analysis module for sounds-like search, supporting the different algorithms and languages provided by Apache’s commons-codec project.
- Performance improvements for index splitter tools.
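As a rough illustration of what sounds-like search means, the sketch below implements classic American Soundex, one of the encoders commons-codec provides. This is a from-scratch Python version written for this example, not the commons-codec code or the Lucene module, so treat its edge-case behavior as an assumption.

```python
# Consonant groups used by American Soundex; vowels, H, W and Y get no code.
CODES = {}
for digit, letters in {
    "1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R",
}.items():
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Encode a word as one letter followed by three digits."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    encoded = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        # Emit a digit only when it differs from the previous code.
        if code and code != prev:
            encoded += code
        # H and W do not separate letters that share a code; vowels do.
        if ch not in "HW":
            prev = code
    # Pad with zeros or truncate to exactly four characters.
    return (encoded + "000")[:4]


# Words that sound alike collapse to the same key.
same_key = soundex("Robert") == soundex("Rupert")
```

Indexing the phonetic key alongside (or instead of) the original token lets a query for one spelling match documents containing another.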
- Solr improvements:
- Better defaults and configuration for multi-term queries. Queries such as wildcard queries have better interaction with the analysis chain, especially regarding case- or accent-insensitivity.
- Distributed date and number range-faceting support.
- Improved concurrency control for distributed search.
- SolrJ support for latest HttpComponents release.
- Clustering improvements: new support for clustering multilingual search results and for clustering on multiple fields.
- Upgraded Tika integration to 1.0, with improved RTF, Word, and PDF parsing support.
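The multi-term query item above is easier to picture with a small example. The sketch below is a conceptual Python model, not Solr code: `normalize` stands in for a lowercasing and accent-folding analysis chain, and `prefix_search` is a hypothetical helper showing what changes when the same normalization is applied to a wildcard prefix.

```python
import unicodedata


def normalize(term):
    """Lowercase and strip combining accents, roughly what a
    lowercase + accent-folding analysis chain does at index time."""
    decomposed = unicodedata.normalize("NFKD", term)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower()


def prefix_search(indexed_terms, prefix, analyze_prefix):
    """Match indexed terms by prefix, optionally normalizing the prefix."""
    needle = normalize(prefix) if analyze_prefix else prefix
    return sorted(t for t in indexed_terms if t.startswith(needle))


# Terms are normalized when indexed.
indexed_terms = {normalize(w) for w in ["Café", "Cafeteria", "Caffeine", "Résumé"]}

# A user types the wildcard query "Café*".
raw_hits = prefix_search(indexed_terms, "Café", analyze_prefix=False)
analyzed_hits = prefix_search(indexed_terms, "Café", analyze_prefix=True)
```

Without normalization the prefix keeps its capital letter and accent, so it matches nothing in the folded index. With normalization it becomes `cafe` and matches as a user would expect.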
- Highlighting improvements:
- A new HTMLStripCharFilter implementation that is faster and more reliable for matching result snippets to the underlying raw HTML.
- Performance improvements for FastVectorHighlighter.
- Bugfixes to many analysis components that would cause corner-case highlighting bugs.
If you want to hear more about these features, many of the committers who worked on them will be giving talks at Lucene Revolution in Boston, including:
- Mark Miller will explain the SolrCloud architecture for distributed indexing.
- Grant Ingersoll will tie together Solr, Hadoop, and Mahout.
- Martijn van Groningen will be giving a talk about grouping and join features.
- Erick Erickson will talk about SolrCloud from the user perspective.
- Christian Moen will be giving a talk introducing Lucene/Solr’s new Japanese language capabilities.
- Andrzej Bialecki will share adventures into Lucene 4.0’s codec APIs, including updateable fields.
- Simon Willnauer will discuss some of the challenges of implementing high-performance search in Java.
- Uwe Schindler will be talking about refactoring of the upcoming Lucene 4.0 IndexReader API.
- Mike McCandless and I will discuss current and future improvements related to finite-state technology.
- Chris Hostetter will play the chump; please try to stump him with your questions!
Hope to see you there!