This article is a continuation from Part 1.
Infrastructure: A 10k-Foot View
Considering the relevant search goals described earlier, the model of the knowledge graph designed for the specific use case, and the type and amount of data to be processed together with the related data flows and requirements in terms of textual search capabilities, we defined the following architecture described in high-level terms:
The data flow is composed of several data sources like product information sources, product offers, sellers, click streams, feedbacks, and more. All of these data items flow from outside to inside the data architecture using Apache Kafka as the queue and streaming platform for building real-time data pipelines. Information goes through a multi-step process where it is transformed before being stored in Neo4j, the main database of the infrastructure.
In order to avoid concurrency issues, only one microservice reads from the queue and writes to Neo4j. The raw sources are then enriched and processed and new relationships between objects are created. In this way, the knowledge graph is created and maintained.
Many machine learning tools and data mining algorithms- as well as Natural Language Processing operations- are applied to the graph and new relationships are inferred and stored. In order to process this huge amount of data, an Apache Spark cluster is seamlessly integrated into the architecture through the Neo4j-Spark connector.
At this point, data is transformed to several document types and sent to an Elasticsearch cluster where it is stored as documents. In Elasticsearch, these documents are analyzed and indexed for providing text search. The front-end interacts with Neo4j for providing advanced features that require graph queries that cannot be expressed using documents or simple text searches.
We will now describe in more detail the two core elements of the infrastructure.
The Neo4j Roles
Neo4j is the core of the architecture – it is the main database, the “single source of truth” of the product catalog since it stores the entire knowledge graph on which all the searches and navigations are performed. It is a viable tool in a relevant search ecosystem offering not only a suitable model for representing complex data (text, user models, business goals, and context information), but also for providing efficient ways of navigating this data in real time.
Moreover, at an early stage of the “search improvement process,” Neo4j can help relevance engineers to identify salient features describing the content, the user, or the search query. Later on, it will help to find a way to instruct the search engine about those features through extraction and enrichment.
Once the data is stored in Neo4j, it goes through a process of enrichment that comprises three main categories: cleansing, existing data augmentation, and data merging.
First, cleansing. It’s usually well worth the time to parse through documents, look for mistakes such as misspellings and document duplications, and correct them. Otherwise, users might not find a document because it contains a misspelling of the query term. Or they may find twenty duplicates of the same document, which would have the effect of pushing other relevant documents off the end of the search results page. Neo4j and the GraphAware NLP framework provide features that find duplicates or synonyms and relate them, as well as search for misspellings of words.
Second, the existing data is post-processed to augment the features already there. For instance, machine learning techniques can be used to classify or cluster documents. The possibilities are endless. After this new metadata is attached to the documents, it can serve as a valuable feature for users to search through.
Finally, new information is merged into the documents from external sources. In our eCommerce use case, the products being sold often come from external vendors. Product data provided by the vendors can be sparse: for instance, missing important fields such as the product title. In this case, additional information can be appended to documents. The existing product codes can be used to look up product titles, or missing descriptions can be written in by hand. The goal is to provide users with every possible opportunity to find the document they’re looking for, and that means more and richer search features.
Elasticsearch is for SEARCH
“Lucene, which Elasticsearch is built on, has a notion of transactions," says Alex Brasetvik in Elasticsearch as a NoSQL Database. "Elasticsearch, on the other hand, does not have transactions in the typical sense. There is no way to rollback a submitted document, and you cannot submit a group of documents and have either all or none of them indexed. What it does have, however, is a write-ahead-log to ensure the durability of operations without having to do an expensive Lucene-commit. You can also specify the consistency level of index-operations, in terms of how many replicas must acknowledge the operation before returning. […] Visibility of changes is controlled when an index is refreshed, which by default is once per second, and happens on a shard-by-shard-basis.”
Nonetheless, in our opinion, it is used as the main data store too often. This approach could lead to a lot of issues in terms of data consistency and ease of managing the data stored in the documents.
Elasticsearch can provide an efficient interface for accessing catalog information, providing not only advanced text search capability but also any sort of aggregation and faceting. Faceting is the capability of grouping search results for providing predefined filters and allowing users to simply refine the search. This is an example from Amazon:
It can also be useful for providing analytics as well as other capabilities like collapsing results. The latter, added in the latest version (5.3.x at the time of writing), allows you to group results and avoid having to show the same results when they only contain minor differences.
Those among many others are the reasons why in the proposed architecture Elasticsearch is used as the front-end search interface to provide product details as well as autocomplete and suggestion functionalities.
This post continues our series on advanced applications and features built on top of a graph database using Neo4j. The knowledge graph plays a fundamental role since it gathers and represents all the data needed in an organic and homogeneous form and allows access, navigation and extensibility in an easy and performant way.
It is the core of a complex data flow with multiple sources and feeds the Elasticsearch front-end. In this blog post, our presentation of a complete end-to-end architecture illustrates an advanced search infrastructure that powers highly relevant searches.
If you believe GraphAware NLP framework or our expertise with knowledge graphs would be useful for your project or organization, please drop us an email specifying “Knowledge Graph” in the subject and one of our GraphAware team members will get in touch.
D. Turnbull, J. Berryman –
Relevant Search, Manning.
A. L. Farris, G. S. Ingersoll, and T. S. Morton –
Taming Text, Manning.
Google Knowledge Graph –
L. Del Corro – Knowledge graphs: Encyclopaedias for machines –
E. Gabrilovich, N. Usunier –
Constructing and Mining Web-Scale Knowledge Graphs, SIGIR ’16 Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval.