Acquire Content and Feed it to Search Technologies with ManifoldCF
ManifoldCF is an interesting project currently warming up in Apache's incubation stage, and it was the subject of a presentation at ApacheCon NA 2011.
Don't know what ManifoldCF is? Think of search technologies like Solr. Now imagine you're working at the enterprise level and have multiple content repositories, all containing a ton of data, and you need to feed the data from those repositories into the search technology.
Enter ManifoldCF. ManifoldCF is a connector framework. It integrates with your content repositories, your search engine indexes, and even your authentication provider, so that users only see results for documents they have access to. Its plug-in architecture supports numerous commercial and open source data sources (e.g. Documentum and SharePoint). If your search technology or content repositories aren't currently supported, you can write your own custom connectors. For more on this and pretty much everything else about ManifoldCF, you might want to check out ManifoldCF in Action. If you're only interested in the quick and dirty, chapter one is available for free.
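To make the plug-in idea a little more concrete, here is a toy Python sketch of the pattern described above: repository connectors hand documents (with their access groups) to an index, and an authority connector decides which groups a user belongs to at query time. Every name in it (`RepositoryConnector`, `AuthorityConnector`, `ToyIndex`, `crawl`) is made up for illustration; this is not ManifoldCF's actual API, which is written in Java and is considerably more involved.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Protocol, Set


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    allowed_groups: frozenset  # access groups copied from the source repository


class RepositoryConnector(Protocol):
    """Hypothetical plug-in interface: anything that can yield documents."""
    def fetch(self) -> Iterable[Document]: ...


class AuthorityConnector(Protocol):
    """Hypothetical plug-in interface: maps a user to their access groups."""
    def groups_for(self, user: str) -> Set[str]: ...


class InMemoryRepository:
    """Stand-in for a real content repository."""
    def __init__(self, docs: Iterable[Document]) -> None:
        self._docs = list(docs)

    def fetch(self) -> Iterable[Document]:
        return iter(self._docs)


class StaticAuthority:
    """Stand-in for a real authentication provider."""
    def __init__(self, membership: Dict[str, Set[str]]) -> None:
        self._membership = membership

    def groups_for(self, user: str) -> Set[str]:
        return set(self._membership.get(user, ()))


class ToyIndex:
    """Stand-in for a search engine index; does naive substring matching."""
    def __init__(self) -> None:
        self._docs: Dict[str, Document] = {}

    def add(self, doc: Document) -> None:
        self._docs[doc.doc_id] = doc

    def search(self, term: str, user_groups: Set[str]) -> List[str]:
        # Return only documents that match AND share a group with the user.
        return sorted(
            d.doc_id
            for d in self._docs.values()
            if term.lower() in d.text.lower() and d.allowed_groups & user_groups
        )


def crawl(repositories: Iterable[RepositoryConnector], index: ToyIndex) -> int:
    """Feed every document from every repository into the index."""
    count = 0
    for repo in repositories:
        for doc in repo.fetch():
            index.add(doc)
            count += 1
    return count
```

In this sketch, supporting a new repository just means writing another class with a `fetch` method; the crawl loop and the index don't change. Filtering at query time by group membership is one simple way to model "users only see what they can access", assumed here purely for illustration.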
But I digress. Where were we? Ah, right, ManifoldCF at ApacheCon. Karl Wright, the project's founder, one of its principal committers and the author of the aforementioned ManifoldCF in Action, was on hand to explain how ManifoldCF connects source content repositories to target repositories:
I'll introduce ManifoldCF, and describe the general enterprise content acquisition and indexing problem which led to its development. I will discuss accessing multiple repositories, enforcing repository security, and incrementally keeping indexes up to date. I'll give an overview of its architecture, and demonstrate simple crawls and a secure integration with Apache Solr.
If by now you've worked up an appetite for more, why not listen to the entire presentation?