Apache Lucene and Solr 3.6 Release! New Language Analysis, Joins, and Finite-State APIs

Lucene / Solr 3.6 has been released and is available for download.

As release manager, here’s my take on the new features:

  • Language analysis:
    • New morphological analysis and part-of-speech tagging for the Japanese language, geared toward search, contributed by Christian Moen.
    • CJK analysis improvements inspired by the folks at HathiTrust, who are indexing terabytes of text across hundreds of languages with Apache Solr. I encourage you to read their blog if you are interested in large-scale search challenges.
    • Lucene/Solr analysis for many languages was tuned and simplified. For instance, to get started with the new Japanese capabilities described above, simply use the text_ja field type defined in Solr’s example schema.xml. We configured field types like this for 30 languages out of the box.
  • Joins:
    • Ability to do index-time block-joins in the opposite direction, useful when you have indexed a parent-child relationship already but sometimes want ungrouped child documents as the result.
    • Addition of query-time joins: an alternative when index-time joins are not feasible.
    • Important bugfixes to index-time joins.
  • Auto-suggest and finite-state APIs:
    • New Weighted FST suggester that offers more fine-grained ranking for suggestions.
    • FST APIs were extended to support reverse-lookups for monotonically increasing outputs, and support n-shortest-path algorithms by weight.
    • Improved suggester API that exploits our incremental automata construction to build suggester FSTs from huge amounts of data.
    • FST compression support, based on research by Lucene/Solr committer Dawid Weiss.
    • Additions to Apache Solr for easier integration of phrase-based auto-suggest, e.g. for previous phrases recorded from query logs.
  • Miscellaneous:
    • A new index pruning module with configurable policies supports faster and smaller indexes that give similar relevance to a complete index.
    • Added a phonetic analysis module for sounds-like search; it supports the algorithms and languages provided by Apache’s commons-codec project.
    • Performance improvements for index splitter tools.
  • Solr improvements:
    • Better defaults and configuration for multi-term queries. Queries such as wildcard queries now interact better with the analysis chain, especially regarding case- or accent-insensitivity.
    • Distributed date and number range-faceting support.
    • Improved concurrency control for distributed search.
    • SolrJ support for latest HttpComponents release.
    • Clustering improvements: new support for clustering multilingual search results and for clustering on multiple fields.
    • Upgraded Tika integration to 1.0, with improved RTF, Word, and PDF parsing support.
  • Highlighting improvements:
    • A new HTMLStripCharFilter implementation that is faster and more reliable for matching result snippets to the underlying raw HTML.
    • Performance improvements for FastVectorHighlighter.
    • Bugfixes to many analysis components that would cause corner-case highlighting bugs.
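
To make the query-time join item above more concrete, here is a minimal conceptual sketch in plain Java. It is not the Lucene or Solr API: the class name, the `join` method, and the sample fields (`productId`, `rating`, and so on) are all hypothetical, and the "index" is just a list of in-memory maps. It only illustrates the general two-pass idea behind a query-time join, assuming the join works by collecting values from one field and matching them against another.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

/** Conceptual two-pass query-time join over in-memory "documents" (not the Lucene API). */
public class QueryTimeJoinSketch {

    /**
     * Pass 1: collect the values of fromField from every document matching fromQuery.
     * Pass 2: return the documents whose toField holds one of the collected values.
     */
    public static List<Map<String, String>> join(
            List<Map<String, String>> docs,
            Predicate<Map<String, String>> fromQuery,
            String fromField,
            String toField) {
        Set<String> joinValues = new HashSet<>();
        for (Map<String, String> doc : docs) {
            String value = doc.get(fromField);
            if (value != null && fromQuery.test(doc)) {
                joinValues.add(value);
            }
        }
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            String value = doc.get(toField);
            if (value != null && joinValues.contains(value)) {
                result.add(doc);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: two products and one review for each.
        List<Map<String, String>> docs = List.of(
            Map.of("id", "p1", "type", "product", "name", "laptop"),
            Map.of("id", "p2", "type", "product", "name", "phone"),
            Map.of("id", "r1", "type", "review", "productId", "p1", "rating", "5"),
            Map.of("id", "r2", "type", "review", "productId", "p2", "rating", "2"));

        // Find products that have at least one 5-star review.
        List<Map<String, String>> hits =
            join(docs, d -> "5".equals(d.get("rating")), "productId", "id");
        System.out.println(hits.get(0).get("name")); // prints "laptop"
    }
}
```

Because nothing about the relationship is baked in at index time, the two sides can be indexed independently, which is why this style of join is an alternative when index-time block joins are not feasible. The cost is the extra pass at query time.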

If you want to hear more about these features, many of the committers who worked on them will be giving talks at Lucene Revolution in Boston, including:

  • Mark Miller will explain the SolrCloud architecture for distributed indexing.
  • Grant Ingersoll will tie together Solr, Hadoop, and Mahout.
  • Martijn van Groningen will be giving a talk about grouping and join features.
  • Erick Erickson will talk about SolrCloud from the user perspective.
  • Christian Moen will be giving a talk introducing Lucene/Solr’s new Japanese language capabilities.
  • Andrzej Bialecki will share adventures into Lucene 4.0’s codec APIs, including updateable fields.
  • Simon Willnauer will discuss some of the challenges of implementing high-performance search in Java.
  • Uwe Schindler will be talking about refactoring of the upcoming Lucene 4.0 IndexReader API.
  • Mike McCandless and I will discuss current and future improvements related to finite-state technology.
  • Chris Hostetter will play the chump; please try to stump him with your questions!

Hope to see you there!

Published at DZone with permission of
