Boris Aleksandrovsky works for Yammer, the Enterprise Social Network company, where they are trying to bring the benefits of social media to enterprises by creating discoverable knowledge bases. He specializes in solving problems of search, machine learning and data analysis at large scale by employing distributed and scalable software architectures. Boris has almost completed his PhD in Computer Science and Neuroscience at the University of California, Irvine.
Boris will be speaking this month at the Lucene Revolution conference in San Francisco. If you want to go see his talk and a host of other highly technical talks around Lucene, Solr, and Enterprise Search, there are still deals available.
DZone: What will be the topics of your Lucene Revolution 2011 talk?
Boris Aleksandrovsky: First and foremost, this talk is about architecture and scalability. I will also dedicate time to explaining why we built the system the way we did for Yammer. I'll explain what the performance bottlenecks and operational characteristics were, and the lessons we learned. I will also focus on the measurability and testability aspects of our system: how we know when something goes wrong and how we can prevent that from happening. This talk is about how you can build a scalable real time search system, with the business requirements and constraints we had, in the shortest possible time.
DZone: How important is search for Yammer's product? What are the use cases for search at Yammer?
Boris: From the perspective of search, people use Yammer today in two modes. First, they simply want to recover information which might have scrolled out of view in their Yammer feed. This is very similar to Twitter: I check it once in a while, but what have I missed since the last time? For this use case we want to present search results in reverse chronological order and answer simple queries. The second mode is the knowledge exploration mode. Yammer is a knowledge base created by interactions between colleagues over time within a company. Yammer can help with the on-boarding process, FAQs, tips, computer setup, company procedures and processes, practices and culture. For this, search is an entry point and quite possibly the most important interaction element. We need to answer complicated queries and present results based on textual similarity, popularity, engagement and social distance.
DZone: What technologies are used in Yammer's search architecture?
Boris: Yammer's search system is built on top of Lucene and Zoie real time search (an open source project provided by LinkedIn). We added a transactional and distribution layer on top of Zoie which allows us to index events in a transactional fashion and distribute indexes across multiple hosts for scalability and high availability. We use distributed queueing systems to push events from the application server, which handles Yammer business logic, to the search index. Since those systems do not guarantee in-order delivery, we have built a somewhat complicated system for conflict resolution centered on out-of-order deliveries. We have also given a lot of thought to ease of maintenance; it is possible to extend the Yammer index schema to new types of objects with relative ease. This will help us down the line when we build more complicated analytics on top of that index.
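To illustrate the out-of-order delivery problem described above, here is a minimal sketch of one common approach: each event carries a version number and the index ignores any event that is not newer than what it has already applied. This is not Yammer's actual conflict resolution system, which the interview does not detail. The `Event` fields, the `VersionedIndex` class and the in-memory dictionaries are all hypothetical stand-ins, and no Lucene or Zoie API is used.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Event:
    doc_id: str           # hypothetical identifier of the indexed object
    version: int          # hypothetical version, assumed to increase with every change
    body: Optional[str]   # None marks a delete


class VersionedIndex:
    """Toy in-memory stand-in for a search index (assumption: not Lucene or Zoie)."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}
        self._versions: Dict[str, int] = {}

    def apply(self, event: Event) -> bool:
        """Apply the event unless an equal or newer version was already applied."""
        seen = self._versions.get(event.doc_id, -1)
        if event.version <= seen:
            return False  # stale or duplicate delivery: ignore it
        self._versions[event.doc_id] = event.version
        if event.body is None:
            # Delete the document but keep its version as a tombstone,
            # so an older update arriving later cannot resurrect it.
            self._docs.pop(event.doc_id, None)
        else:
            self._docs[event.doc_id] = event.body
        return True

    def get(self, doc_id: str) -> Optional[str]:
        return self._docs.get(doc_id)
```

With this sketch, applying `Event("msg-1", 2, "edited")` and then a late `Event("msg-1", 1, "original")` leaves the edited text in place, because the second call is rejected as stale. Keeping the version after a delete is the design choice that makes deletes safe against late updates.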
DZone: What are some of the challenges in the search architecture of Yammer? Are there similarities between Yammer and other social networking architectures like Twitter?
Boris: The biggest challenges for search at Yammer are the real time nature of the information and the complicated relevancy story.
Information on Yammer should be indexed and available for users to search in real time, virtually in less than a second. This makes the Yammer indexing system similar to Twitter, where tweets are indexed in real time. Search results likewise are available in reverse chronological order, which is based on the assumption that for certain types of events, timeliness is the most pertinent characteristic. This maps really well onto types of content like news, where relevancy declines fairly rapidly as time passes, or onto types of content which are more transient in nature, like events and meetings.
There are other types of content where the relationship between the creator of the content and the searcher is important, and where the sheer popularity of the content is also important. This is more of a Facebook newsfeed case, which tries to present content from the people you value or interact with most. A good example would be communications from your boss, or an expert opinion you trust. Popular discussion threads which capture the attention of the company are important to find since they usually encompass the "company culture".
There are, however, other types of content that are much more knowledge heavy, and for their retrieval textual similarity, reputation and potential for engagement are more important than timeliness. For instance, when a sales representative is searching for a relevant approach to a particular client industry, he would be interested in the experiences of all other salespeople who tried to sell to that industry, and he would want to look back as far as the records go. This is a case where Yammer's search system is trying to act more like Google's search system.
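The two ranking styles contrasted above can be sketched side by side: a pure reverse chronological ordering, and a weighted blend of the signals named in the interview. This is only an illustration under stated assumptions. The `Hit` fields, the normalization of every signal to the range 0 to 1, and the weights are all hypothetical, and nothing here reflects how Yammer actually combines its signals.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Hit:
    doc_id: str
    created_at: float         # Unix timestamp
    text_similarity: float    # all signals assumed normalized to [0, 1]
    popularity: float
    engagement: float
    social_closeness: float   # 1.0 = closest colleague, 0.0 = no connection


# Hypothetical weights, chosen only for illustration.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "text_similarity": 0.5,
    "popularity": 0.2,
    "engagement": 0.2,
    "social_closeness": 0.1,
}


def rank_by_recency(hits: List[Hit]) -> List[Hit]:
    """Feed-style ordering: newest first, ignoring every other signal."""
    return sorted(hits, key=lambda h: h.created_at, reverse=True)


def blended_score(hit: Hit, weights: Dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted sum of the relevance signals; timeliness is deliberately left out."""
    return sum(weight * getattr(hit, name) for name, weight in weights.items())


def rank_by_relevance(hits: List[Hit],
                      weights: Dict[str, float] = DEFAULT_WEIGHTS) -> List[Hit]:
    """Knowledge-exploration ordering: highest blended score first."""
    return sorted(hits, key=lambda h: blended_score(h, weights), reverse=True)
```

An old but highly relevant message would sort last under `rank_by_recency` and first under `rank_by_relevance`, which is the difference between the "what did I miss" mode and the knowledge exploration mode.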
DZone: What are some of the things you are working on in terms of data analysis and machine learning?
Boris: I have always been interested in clustering and collocation analysis. Clustering is the identification of regularities in data in an unsupervised way, i.e., without labeling the cluster with a specific name. This is opposed to the process of classification, where the name of the class is known a priori, e.g., spam email or political news. There is an interesting class of algorithms, known as topic analysis, which allows the clustering of information according to topics of interest. I am also exploring the possibilities of identifying a relationship between entities based on their mutual collocation in the same content or topic. For instance, Yahoo, Google, Bing and entity X might be mentioned a lot together, which might indicate some kind of relationship between them. It might be useful to present this relationship to the user for purposes of research and data exploration.
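As a rough illustration of the collocation idea, the sketch below scores entity pairs by pointwise mutual information (PMI), a standard measure of how much more often two items appear together than chance would predict. The interview does not say which measure is used at Yammer, so PMI is a swapped-in assumption, as is the input format (one set of already extracted entity names per document) and the function name `pmi_scores`.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]


def pmi_scores(documents: Iterable[Set[str]]) -> Dict[Pair, float]:
    """Return PMI (in bits) for every entity pair that co-occurs at least once.

    Each document is a set of entity names. Probabilities are estimated as
    document frequencies, so PMI = log2(P(a, b) / (P(a) * P(b))).
    """
    docs = [set(d) for d in documents]
    n = len(docs)
    entity_counts: Counter = Counter()
    pair_counts: Counter = Counter()
    for entities in docs:
        entity_counts.update(entities)
        # Sort so that each pair has one canonical (alphabetical) key.
        pair_counts.update(combinations(sorted(entities), 2))

    scores: Dict[Pair, float] = {}
    for (a, b), joint in pair_counts.items():
        # (joint / n) / ((count_a / n) * (count_b / n)) simplifies to the ratio below.
        scores[(a, b)] = math.log2(joint * n / (entity_counts[a] * entity_counts[b]))
    return scores
```

A positive score means the pair appears together more often than independent mentions would suggest, for example if Yahoo, Google and Bing keep showing up in the same messages. Pairs with very few occurrences produce unstable scores, so a real system would likely apply a minimum count threshold before presenting relationships to a user.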
DZone: You've nearly completed your PhD in Comp. Sci and Neuroscience at Irvine. How does Neuroscience influence your work in programming?
Boris: I think there are two main skills one learns in graduate school: how to perform experiments and which methodologies to use in research. These skills are invaluable in computer science, especially in the field of information retrieval. The skills one obtains working on interdisciplinary research (how to research, integrate and relate multiple disciplines) come in handy when navigating the world of open source systems, where new approaches or systems are seemingly being offered to the world every week. When working on experiments in the "wet lab", it is very important to understand what to measure and the statistical significance of the results once the measurement is obtained. The same is true in IR, where one is trying to understand the influence of one of many (dozens or hundreds of) signals for relevance ranking, or trying to understand the performance characteristics of distributed systems under heavy load.
And lastly, knowing a little about the brain and its amazing capacities, which so far have not been even closely approximated by computers, one becomes humble and tolerant of machine (and programming) failure, and one is inspired to create more intelligent systems.
Check the Lucene Revolution site for the conference agenda and pricing; some early bird discounts and training specials may still be available. Don't miss this once-a-year opportunity!