Before the start of the Lucene Revolution conference on May 23rd, an event dedicated to open source search, DZone had the opportunity to speak with Alberto Mijares, a software engineer at Canoo with more than 10 years of experience. He is a Scrum Master and an agile practitioner, with a significant background in Web technologies and Java, having participated in the past in W3C activities related to the Semantic Web.
In his talk at Lucene Revolution, he will explain how Canoo built a SaaS application that remotely extracts the content of multiple online newspaper articles, analyzes them, and classifies them to determine which articles are most similar to a given article. The application also integrates this information back into the article to provide the user with a "related articles" feature. Mijares will highlight how Apache Solr was used in this context.
Here were our questions for Alberto:
DZone: Tell me a bit about the type of SaaS application you will talk about at Lucene Revolution 2011.
Alberto Mijares: The application is basically Lucene's 'More Like This' on steroids. It regularly crawls the articles of different online newspapers from a media group and offers a REST service that, given an article, suggests the most similar ones. The secret of the superior result quality lies in the language tools that Canoo has developed over the years (WMTrans). With them, we essentially create Lucene analyzers and enrich the indexed articles with semantically relevant terms. By using categories, the newspaper can select which articles should be offered to the user. This enables a very important feature: cross-selling. With it, a news site can offer its readers related content from the other newspapers of the media group. We were pleased with the outcome of this project and included this functionality as an additional feature in our FindIT Search & Analysis Suite.
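To make the "More Like This" idea concrete for readers unfamiliar with it: the core technique ranks documents by how closely their weighted term vectors match those of a source document. The following is a minimal, self-contained sketch of that idea using plain TF-IDF cosine similarity — it is an illustration of the general concept, not Canoo's WMTrans pipeline or Lucene's actual implementation.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build simple TF-IDF vectors for a list of tokenized documents."""
    n = len(docs)
    df = Counter()  # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def more_like_this(docs, query_index):
    """Rank all other documents by similarity to docs[query_index]."""
    vectors = tfidf_vectors(docs)
    query = vectors[query_index]
    return sorted(
        (i for i in range(len(docs)) if i != query_index),
        key=lambda i: cosine(query, vectors[i]),
        reverse=True,
    )
```

In the application Mijares describes, the analyzers additionally inject semantically relevant terms into the index before this kind of comparison takes place, which is where the quality gain comes from.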
DZone: You described a feature that other products already offer, and I also wonder what you meant by "superior" quality. What makes this application successful, and how does your quality compare to other solutions?
Alberto: You are right that the concept is nothing new and, as I mentioned, it even exists within Lucene. However, it is important to note the following: it is Software as a Service, so the customer can test it at almost no cost. Normally we start with a short preliminary phase where, together with the customer, we decide which content to index and under which categories. This takes almost no time and requires few resources. The next step is supporting the customer in integrating the provided scripts into their test environment and then letting them test the quality of the results.
When the customer is convinced, the same scripts are integrated into the production environment and they start paying the service fees. About the quality: the results you get with Lucene's "More Like This" are, by default, not highly relevant, and their quality depends completely on the language. Our tools support several languages, and among them we provide the best analyzer tools for German. The open source options therefore cannot compete in quality for the languages we support, and commercial products of similar quality imply owning infrastructure and paying huge license fees. When such scenarios are analyzed, only the biggest companies can afford such an investment, or dare to start deploying something they could not try first.
DZone: What made Solr the best fit for this application?
Alberto: When we started the project, it was already clear that WMTrans (Canoo's language tools) should be integrated with a search engine library. In the Java world, it is well known that Apache Lucene is the best open source solution, if not simply the best solution. The remaining requirements were a Lucene-based language analysis pipeline, a web architecture with good scalability, and a container to schedule batch jobs for the crawling and extraction process. As an operational requirement, the possibility of getting support for the selected products was also quite important. We studied at least three possibilities, but after trying Apache Solr it was clear: Solr was exactly what we needed.
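For context, Solr exposes Lucene's More Like This functionality through a built-in request handler, which makes it easy to try the baseline behavior before layering custom analysis on top. A hedged illustration of such a request — the core name "articles" and the field names are hypothetical — might look like this:

```
# Ask Solr's MoreLikeThis handler for documents similar to article 123.
http://localhost:8983/solr/articles/mlt?q=id:123
    &mlt.fl=title,body   # fields to mine for "interesting" terms
    &mlt.mintf=2         # minimum term frequency in the source document
    &mlt.mindf=5         # minimum document frequency across the index
```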
DZone: You have a long history with semantic web standards. Tell me a bit about that history and about how people can harness Solr and various semantic knowledge bases for powerful search applications.
Alberto: I have been involved in different projects and initiatives related to the Semantic Web and semantics in general, and the topic is quite controversial. The expectations in this area are really high, and everybody expects to get a "perfect" search engine in a short period of time. It reminds me of what happened some years ago with Artificial Intelligence. Because we didn't build a robot that could think like a human in a short period of time, the research seemed a complete failure, and nowadays nobody wants to think about what we achieved and what we are still using from that research.
Semantics is really a complex topic and, above all, it is subjective. What "makes sense" to me can be completely wrong for another person (different knowledge, different experience, different context). What most people don't know is that Semantic Web technologies perform very well when applied to data integration. The number of databases being made public and, most importantly, integrated under the umbrella of the "Linking Open Data" initiative is growing exponentially.
The immediate application in the information retrieval and search fields is the possibility of enriching search indexes with this semantically structured and curated information. Two things still need to be decided here: how to bridge text-based search and semantic search (fuzzy or analog vs. exact and reasoned), and how many resources to invest in inference (which is still a difficult implementation problem today because of its unmanageable complexity). One simple example that comes up recurrently in research papers is using the categories of a knowledge base like Wikipedia to classify textual content.
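The category-classification example mentioned above can be sketched very simply: assign a text to the knowledge-base category whose representative terms it overlaps with most. The category names and keyword sets below are hypothetical, hand-built stand-ins for terms that would in practice be mined from a resource like Wikipedia's category graph.

```python
def classify(text, categories):
    """Assign the category whose keyword set overlaps most with the text.

    `categories` maps a category name (e.g. a Wikipedia category) to a
    set of representative terms. Returns None when nothing matches.
    """
    tokens = set(text.lower().split())
    scores = {name: len(tokens & terms) for name, terms in categories.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Real systems would of course use proper tokenization, weighting, and a far richer term source, but the shape of the idea — grounding free text in a curated semantic structure — is the same.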
DZone: Do you have a "wishlist" for new features in future versions of Solr and Lucene?
Alberto: One feature I would like to have in Solr is a document-level security layer. Indexing information allows people to find it, but it is critical to control who finds it. Other things I can imagine are either not so relevant or already in the pipeline.
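In the absence of such a built-in layer, a common workaround — not something Mijares proposes here, just a widely used pattern — is to index an access-control field per document and restrict results to the requesting user's groups, for example with a Solr filter query such as `fq=acl:(staff OR admin)`. A minimal sketch of the filtering step, with a hypothetical result schema:

```python
def filter_by_acl(results, user_groups):
    """Keep only documents whose ACL intersects the user's groups.

    Each result is a dict carrying an "acl" list of allowed groups --
    a hypothetical schema mirroring the pattern of indexing permissions
    alongside the document and filtering at query time.
    """
    groups = set(user_groups)
    return [doc for doc in results if groups & set(doc.get("acl", []))]
```

Doing the filtering inside the search engine (via a filter query) rather than after the fact is preferable in practice, since it keeps facet counts and pagination consistent with what the user is allowed to see.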
DZone: How did you profit from using Solr as a platform for the project?
Alberto: Having a well-documented platform whose development is led by experts allowed us to get up to speed in record time and drastically reduced the resources we needed for development. Deploying the application was easy, and we quickly came to appreciate the high flexibility of Solr's architecture. After doing some workload tests, we had peace of mind and could concentrate on the project features. This positive feedback increased our confidence in the delivery dates. It is a win-win situation that lets us focus on the business model and improving the product.
Check the Lucene Revolution site for the conference agenda and pricing; some early-bird discounts and training specials may still be available. Don't miss this once-a-year opportunity!