How can you efficiently index content without polluting your business logic or tightly coupling it to a search engine API?
Crawling is one option, but it is not necessarily efficient, nor does it give you fine-grained control over the fields stored in the search index documents. For one customer project we built an elegant asynchronous, event-driven indexing mechanism.
Event-driven architecture
Event-Driven Architecture is an architectural pattern that promotes loose coupling between applications. It is an applied form of the classic 'Gang of Four' Observer pattern. In essence, an application emits events about things that have happened, and other applications that are interested in these events register as event listeners.
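The emit-and-listen relationship can be sketched in a few lines of plain Java. This is only an in-process illustration of the Observer idea, not the project's code: the names `SimpleEventBus` and `FundraiserRegistered` are hypothetical, and a real deployment would put a messaging system between emitter and listeners.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical event type: an immutable record of something that has already happened.
final class FundraiserRegistered {
    final String fundraiserId;

    FundraiserRegistered(String fundraiserId) {
        this.fundraiserId = fundraiserId;
    }
}

// Minimal in-process event bus (an assumption for illustration).
// The emitter only knows the bus; it never references a listener directly.
final class SimpleEventBus {
    private final List<Consumer<Object>> listeners = new ArrayList<>();

    void register(Consumer<Object> listener) {
        listeners.add(listener);
    }

    void emit(Object event) {
        // Deliver the event to every registered listener, in registration order.
        for (Consumer<Object> listener : listeners) {
            listener.accept(event);
        }
    }
}

final class ObserverSketch {
    public static void main(String[] args) {
        SimpleEventBus bus = new SimpleEventBus();
        List<String> indexedIds = new ArrayList<>();

        // A listener that only cares about fundraiser registrations.
        bus.register(event -> {
            if (event instanceof FundraiserRegistered) {
                indexedIds.add(((FundraiserRegistered) event).fundraiserId);
            }
        });

        // The business logic emits the event without knowing who listens.
        bus.emit(new FundraiserRegistered("f-1"));
        System.out.println(indexedIds);
    }
}
```

Adding a second listener requires no change to the emitting code, which is the loose coupling the pattern is meant to provide.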
Asynchronous messaging, such as JMS publish-subscribe Topics, can be used to implement a scalable event-driven architecture.
This can be illustrated using the de-facto Enterprise Integration Patterns language (ref: Enterprise Integration Patterns, Gregor Hohpe & Bobby Woolf, Addison Wesley, ISBN 0321200683) and the Hohpe EIP Visio stencil.
Loose coupling is promoted because the event-emitting application is unaware of the event subscribers. In practice, the event message still acts as a contract, so it should remain stable or backwards compatible to avoid breaking downstream systems.
Complex Event Processing
Another trend in event-driven architecture is to apply complex event processing (CEP) to the event streams. One example from financial services is correlating different events to identify potentially fraudulent trades.
Solr in an event-driven architecture
Solr is a search solution based on Lucene. It provides an indexing & search service along with powerful faceted search.
The combination of Solr, the Spring Framework and JMS was successfully used on the Virgin Money Giving project (medallist in the BCS 2010 Computing awards) to provide event-driven index updates.
This was achieved by using Spring application events together with an internal application listener that transmits the events via JMS. Business logic methods emit events at key stages (e.g. New Fundraiser Registration). The internal application listener receives all application events and filters out, by class, those it is not interested in. The events of interest are converted to a serializable form and published as a JMS Message.
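The filter-by-class and convert-to-serializable steps could look roughly like the sketch below. It is written in plain Java so that it runs on its own: the `MessagePublisher` interface stands in for the JMS publishing code, and every class name here (`NewFundraiserRegistration`, `PasswordChanged`, `IndexEventMessage`, `IndexingEventListener`) is an assumption for illustration rather than the project's actual code. In a Spring application the `onEvent` method would correspond to an application listener callback.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical business events; only the first is relevant to the search index.
final class NewFundraiserRegistration {
    final String fundraiserId;
    final String displayName;

    NewFundraiserRegistration(String fundraiserId, String displayName) {
        this.fundraiserId = fundraiserId;
        this.displayName = displayName;
    }
}

final class PasswordChanged {
    final String userId;

    PasswordChanged(String userId) {
        this.userId = userId;
    }
}

// Serializable form of an event, suitable as the body of a message.
final class IndexEventMessage implements Serializable {
    private static final long serialVersionUID = 1L;
    final String operation;
    final String documentId;
    final HashMap<String, String> fields;

    IndexEventMessage(String operation, String documentId, HashMap<String, String> fields) {
        this.operation = operation;
        this.documentId = documentId;
        this.fields = fields;
    }
}

// Stand-in for the code that would publish to a JMS topic (assumption).
interface MessagePublisher {
    void publish(Serializable payload);
}

// Receives every application event, keeps only those with a registered converter.
final class IndexingEventListener {
    private final Map<Class<?>, Function<Object, IndexEventMessage>> converters = new HashMap<>();
    private final MessagePublisher publisher;

    IndexingEventListener(MessagePublisher publisher) {
        this.publisher = publisher;
    }

    <E> void handle(Class<E> type, Function<E, IndexEventMessage> converter) {
        converters.put(type, event -> converter.apply(type.cast(event)));
    }

    // Returns true if the event was of interest and has been published.
    boolean onEvent(Object event) {
        Function<Object, IndexEventMessage> converter = converters.get(event.getClass());
        if (converter == null) {
            return false; // filtered out by class
        }
        publisher.publish(converter.apply(event));
        return true;
    }
}

final class ListenerSketch {
    public static void main(String[] args) {
        List<Serializable> sent = new ArrayList<>();
        IndexingEventListener listener = new IndexingEventListener(sent::add);
        listener.handle(NewFundraiserRegistration.class, e -> {
            HashMap<String, String> fields = new HashMap<>();
            fields.put("name", e.displayName);
            return new IndexEventMessage("UPSERT", e.fundraiserId, fields);
        });

        listener.onEvent(new NewFundraiserRegistration("f-1", "Jane"));
        listener.onEvent(new PasswordChanged("u-1")); // ignored
        System.out.println(sent.size());
    }
}
```

Keeping the conversion in the listener means the business logic only ever deals with its own event classes and never sees JMS or search types.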
The JMS Message representing the business event is delivered to a Message-Driven POJO on the master indexer node, which is responsible for creating, updating or deleting the Lucene index document via the SolrJ client API. Indexing happens at each data centre site using clustered JMS.
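A message-driven POJO of this kind might be shaped as follows. This is a sketch under stated assumptions, not the project's implementation: `IndexCommand`, `SearchIndex`, `InMemorySearchIndex` and `IndexUpdateHandler` are hypothetical names, and the `SearchIndex` interface stands in for the SolrJ client so that the example runs without a Solr server. In the real system the three methods would delegate to the SolrJ client's add, delete-by-id and commit calls.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical payload extracted from the incoming JMS message.
final class IndexCommand {
    enum Operation { CREATE, UPDATE, DELETE }

    final Operation operation;
    final String documentId;
    final Map<String, String> fields;

    IndexCommand(Operation operation, String documentId, Map<String, String> fields) {
        this.operation = operation;
        this.documentId = documentId;
        this.fields = fields;
    }
}

// Stand-in for the search client (assumption); keeps the handler testable.
interface SearchIndex {
    void addOrReplace(String id, Map<String, String> fields);
    void deleteById(String id);
    void commit();
}

// In-memory implementation used only for this illustration.
final class InMemorySearchIndex implements SearchIndex {
    final Map<String, Map<String, String>> documents = new HashMap<>();
    int commits = 0;

    public void addOrReplace(String id, Map<String, String> fields) {
        documents.put(id, new HashMap<>(fields));
    }

    public void deleteById(String id) {
        documents.remove(id);
    }

    public void commit() {
        commits++;
    }
}

// The POJO itself: no JMS types in its signature. A listener container
// would unwrap the message body and call handleMessage.
final class IndexUpdateHandler {
    private final SearchIndex index;

    IndexUpdateHandler(SearchIndex index) {
        this.index = index;
    }

    void handleMessage(IndexCommand command) {
        switch (command.operation) {
            case CREATE:
            case UPDATE:
                // Adding a document with an existing id replaces the old one.
                index.addOrReplace(command.documentId, command.fields);
                break;
            case DELETE:
                index.deleteById(command.documentId);
                break;
        }
        // Committing per message keeps the sketch simple; batching commits
        // would usually be preferable under load.
        index.commit();
    }
}

final class HandlerSketch {
    public static void main(String[] args) {
        InMemorySearchIndex index = new InMemorySearchIndex();
        IndexUpdateHandler handler = new IndexUpdateHandler(index);
        handler.handleMessage(new IndexCommand(IndexCommand.Operation.CREATE, "f-1",
                Collections.singletonMap("name", "Jane")));
        System.out.println(index.documents);
    }
}
```

Because the handler depends only on the small `SearchIndex` interface, the messaging and search layers can be tested and replaced independently.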
The index updates are then replicated from the master indexer to the search nodes using Solr Snap-Pull replication.