
Using NLP and Neo4j for a Social Media Recommendation Engine


Combining NLP and graph databases to create recommendation engines is a powerful concept. Read on to find out how this can be accomplished.


Introduction

In recent years, the rapid growth of social media communities has produced a vast number of digital documents on the web. Recommending relevant documents to users is strategically important for customer engagement, but it is not a trivial problem.

In a previous blog post, we introduced the GraphAware Natural Language Processing (NLP) plugin. It provides a basis for building more complex applications that leverage text analysis and offer enhanced functionality to end users.

An interesting use case is combining content-based recommendations with a collaborative filtering approach to deliver high-quality “suggestions.” This scenario suits any application that combines user-generated content, such as social media posts, with user reactions such as tags and likes.

To this end, we start from the ideas presented in the paper "Social-Aware Document Similarity Computation for Recommender Systems," which we implemented as part of the GraphAware Enterprise Reco plugin for Neo4j, a recommendation engine that uses a combination of similarities as its model to provide high-quality recommendations.

Document Modeling

In a social community, a document (which could be a post, tweet, blog, etc.) can be characterized by three elements:

  1. The document's internal content and the tags extracted from it.

  2. Tags that users associate with it.

  3. The readers’ interactions (e.g., view, comment, tag, like) with the document.

The internal content of the document is static over time. However, the tags and users associated with the document are community-driven. They reflect the attitude of the community towards the document and can change over time.
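The three elements above can be pictured as a small data structure. The following is only a minimal sketch: the class, its field names, and the `tag` helper are hypothetical and are not part of the GraphAware plugins.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Document:
    """A document in the social space (all names here are illustrative)."""
    doc_id: str
    content: str  # static internal content
    # Tags extracted automatically from the content.
    content_tags: List[str] = field(default_factory=list)
    # Community-driven: tag -> number of times users attached it.
    social_tags: Dict[str, int] = field(default_factory=dict)
    # Community-driven: user -> list of interaction types.
    interactions: Dict[str, List[str]] = field(default_factory=dict)

    def tag(self, user: str, tag: str) -> None:
        """Record that `user` attached `tag` to this document."""
        self.social_tags[tag] = self.social_tags.get(tag, 0) + 1
        self.interactions.setdefault(user, []).append("tag")


doc = Document("d1", "Report on the conflict", content_tags=["war"])
doc.tag("alice", "violence")
doc.tag("bob", "violence")
```

The content and its extracted tags never change after creation, while `social_tags` and `interactions` grow as the community reacts to the document.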

With traditional information retrieval techniques, the internal contents of the document are indexed. The index is then used to help users search for documents of their interest.

These techniques are still popular in many information retrieval systems. However, using only the document's content may miss meaning carried by tags and users. Recognizing the importance of tags as a supplement to internal content indexing, some systems use tags as external document metadata. This type of metadata helps users browse or navigate document databases.

GraphAware Enterprise Reco uses a combined approach to computing document similarity when building recommender systems. The idea is that the meaning of a document is derived not only from its content but also from its associated tags and user interactions.

“These three factors are viewed as three dimensions of a document in social space, named as Content, Tag, and User. Each dimension provides a different view of the document. In Content dimension, the meaning of the document is given by its author(s). However, in the Tag dimension, the meaning of the document is what it is perceived by the community. Each user may provide a different view of the document by tagging it. This view can be far different from the initial intention of the document’s author(s). In User dimension, the meaning of the document is exposed via its readers’ activities in the community.”

Moreover, when analyzing “static” content and social tags, ontologies and semantics can be used to extract hierarchies of concepts. This extension makes it possible to find relationships between tags and, in this way, to discover hidden relationships between apparently unrelated documents. For instance, suppose one document is tagged (automatically from its content or by a user) with the tag violence while another is tagged with the tag war. At first glance they could appear unrelated, but after analyzing the semantic hierarchy of the word violence (with ConceptNet 5, for instance), the system can reveal a relation between them.
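The idea of relating tags through a concept hierarchy can be sketched as follows. The hierarchy below is a toy, hand-written dictionary used purely for illustration; it is not real ConceptNet 5 data, and the function names are hypothetical.

```python
from typing import Dict, List, Set

# Toy, hand-written hierarchy (tag -> broader concepts); NOT real ConceptNet data.
TOY_HIERARCHY: Dict[str, List[str]] = {
    "war": ["conflict"],
    "conflict": ["violence"],
}


def expand(tag: str, hierarchy: Dict[str, List[str]]) -> Set[str]:
    """Return the tag together with all of its ancestors in the hierarchy."""
    seen: Set[str] = set()
    stack = [tag]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(hierarchy.get(current, []))
    return seen


def related(tag_a: str, tag_b: str, hierarchy: Dict[str, List[str]]) -> bool:
    """Two tags are considered related if their expanded concept sets overlap."""
    return bool(expand(tag_a, hierarchy) & expand(tag_b, hierarchy))


print(related("war", "violence", TOY_HIERARCHY))
print(related("war", "music", TOY_HIERARCHY))
```

With this toy data, war and violence are linked through the shared concept violence, while war and music share nothing.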

The designed schema for the database will appear as follows:

[Figure: database schema for the NLP and Neo4j social media recommendation engine]


This schema also shows how easily this complex model can be stored, and further extended, using graphs and Neo4j.

Similarity Computation

Using all the information stored, three different vectors will be created for each document:

Content- and ontology-based vector:

Ci = {wc(i,1), wc(i,2), …, wc(i,n)}, where n is the total number of tags in the database and wc(i,k) is the weight of the kth tag, which may appear in the document itself or in the hierarchy of one of its tags. The weight is computed as wc(i,k) = α * tf-idf(i,k), where α is a weight associated with the tag's position in the ontology hierarchy: it is equal to 1 if the tag is in the document or is a synonym of a tag in the document, and less than 1 otherwise.
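As a rough illustration of the content- and ontology-based vector, here is a minimal sketch. It rests on several assumptions that the article does not specify: tf-idf is taken as the raw count times log(N / df), a tag reached only through the hierarchy inherits the count of the tag it was derived from, and α is fixed at 0.5 for such tags. All function and parameter names are hypothetical.

```python
import math
from typing import Dict, List


def tf_idf(count: int, n_docs: int, doc_freq: int) -> float:
    """Raw term frequency times log inverse document frequency (assumed variant)."""
    return count * math.log(n_docs / doc_freq)


def content_vector(
    vocabulary: List[str],
    direct_counts: Dict[str, int],     # tags found in the document (or synonyms)
    inherited_counts: Dict[str, int],  # tags reached only through the hierarchy
    doc_freq: Dict[str, int],          # number of documents carrying each tag
    n_docs: int,
    alpha_hierarchy: float = 0.5,      # assumed value; the text only says "less than 1"
) -> List[float]:
    """Build Ci with one weight per tag in the vocabulary."""
    vector = []
    for tag in vocabulary:
        if tag in direct_counts:
            # alpha = 1 for tags present in the document
            weight = 1.0 * tf_idf(direct_counts[tag], n_docs, doc_freq[tag])
        elif tag in inherited_counts:
            weight = alpha_hierarchy * tf_idf(inherited_counts[tag], n_docs, doc_freq[tag])
        else:
            weight = 0.0
        vector.append(weight)
    return vector


vec = content_vector(
    vocabulary=["war", "violence", "music"],
    direct_counts={"war": 2},
    inherited_counts={"violence": 2},
    doc_freq={"war": 1, "violence": 2, "music": 4},
    n_docs=4,
)
```

Here the document mentions war directly, so that entry gets the full tf-idf weight, while violence is reached only through the hierarchy and is discounted by α.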

Social Tag-based vector:

Ti = {wt(i,1), wt(i,2), …, wt(i,p)}, where p is the total number of tags in the database and wt(i,k) is the weight of the kth tag for the document, defined as the frequency with which tag k is associated with document i.
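A minimal sketch of the social tag-based vector follows. It assumes that "association frequency" means the raw number of times users attached a tag to the document; a normalized frequency would work equally well. The function name and input format are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def tag_vector(vocabulary: List[str], taggings: List[Tuple[str, str]]) -> List[int]:
    """Build Ti from (user, tag) events recorded for one document."""
    counts = Counter(tag for _user, tag in taggings)
    # Counter returns 0 for tags that were never attached to the document.
    return [counts[tag] for tag in vocabulary]


t_vec = tag_vector(
    ["war", "violence", "music"],
    [("alice", "violence"), ("bob", "violence"), ("carol", "war")],
)
print(t_vec)  # [1, 2, 0]
```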

User vectors:

Ui = {wu(i,1), wu(i,2), …, wu(i,q)}, where q is the total number of users in the database and wu(i,k) is the weight of the kth user for the document. This weight can be computed in different ways, depending on the level of interest each kind of interaction expresses. Moreover, more than one user vector can be used if different weights are needed for each component (for instance, one vector for likes, one for ratings, and so on).
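One possible way to compute a single user vector is sketched below. The interaction weights are invented for illustration only; the article leaves the weighting scheme open, and the names used here are hypothetical.

```python
from typing import Dict, List

# Assumed weights: stronger signals of interest get larger values.
INTERACTION_WEIGHTS: Dict[str, float] = {
    "view": 1.0,
    "comment": 2.0,
    "tag": 2.0,
    "like": 3.0,
}


def user_vector(
    users: List[str],
    interactions: Dict[str, List[str]],  # user -> interaction types on this document
    weights: Dict[str, float] = INTERACTION_WEIGHTS,
) -> List[float]:
    """Build Ui with one entry per user in the database."""
    return [
        sum(weights[kind] for kind in interactions.get(user, []))
        for user in users
    ]


u_vec = user_vector(
    ["alice", "bob", "carol"],
    {"alice": ["view", "like"], "carol": ["comment"]},
)
```

To keep separate vectors per interaction type instead, the same function could be called once per type with a weight table containing only that type.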

Using these three (or more) vectors, three (or more) different cosine similarities are computed and then the value for the combined similarity is calculated in the following way:

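CombinedSimilarity(i, j) = α * CosineSim(Ci, Cj) + β * CosineSim(Ti, Tj) + γ * CosineSim(Ui, Uj), where α + β + γ = 1. Here α, β, and γ weight the content, tag, and user dimensions respectively; this α is distinct from the ontology weight used in the content vector.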
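The combined similarity formula can be sketched in plain Python as follows. The default values for α, β, and γ are arbitrary placeholders, and the function names are hypothetical; only the formula itself comes from the text.

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]


def cosine_sim(a: Vector, b: Vector) -> float:
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def combined_similarity(
    doc_i: Tuple[Vector, Vector, Vector],  # (Ci, Ti, Ui)
    doc_j: Tuple[Vector, Vector, Vector],  # (Cj, Tj, Uj)
    alpha: float = 0.5,  # placeholder weights; they must sum to 1
    beta: float = 0.3,
    gamma: float = 0.2,
) -> float:
    """Weighted sum of the content, tag, and user cosine similarities."""
    if not math.isclose(alpha + beta + gamma, 1.0):
        raise ValueError("alpha + beta + gamma must equal 1")
    (c_i, t_i, u_i), (c_j, t_j, u_j) = doc_i, doc_j
    return (
        alpha * cosine_sim(c_i, c_j)
        + beta * cosine_sim(t_i, t_j)
        + gamma * cosine_sim(u_i, u_j)
    )


# Two documents with unrelated content and readers but identical social tags.
doc_a = ([1.0, 0.0], [1.0, 1.0], [1.0, 0.0])
doc_b = ([0.0, 1.0], [1.0, 1.0], [0.0, 1.0])
score = combined_similarity(doc_a, doc_b)
```

In this example only the tag dimension contributes, so the score is approximately β. This shows how social tags can connect documents whose content alone looks unrelated.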

It is worth noting that the computed similarity represents new knowledge extracted from the data available in the graph database. It is stored as a model for the recommendation engine and can be used in several ways to provide suggestions to users.
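As one example of how a stored similarity model might be used, the sketch below scores each unseen document by its total similarity to the documents a user already liked. This is only one simple strategy, not necessarily what GraphAware Enterprise Reco does; the function name and the in-memory dictionary standing in for the stored model are assumptions.

```python
from typing import Dict, List, Set, Tuple


def recommend(
    liked: Set[str],
    similarity: Dict[Tuple[str, str], float],  # stored model: (doc, doc) -> score
    k: int = 2,
) -> List[str]:
    """Return up to k unseen documents ranked by similarity to liked ones."""
    scores: Dict[str, float] = {}
    for (doc_a, doc_b), sim in similarity.items():
        # Similarity is symmetric, so check both directions of each pair.
        for seen, candidate in ((doc_a, doc_b), (doc_b, doc_a)):
            if seen in liked and candidate not in liked:
                scores[candidate] = scores.get(candidate, 0.0) + sim
    # Highest score first; ties broken by document id for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc for doc, _ in ranked[:k]]


model = {
    ("d1", "d2"): 0.9,
    ("d1", "d3"): 0.2,
    ("d2", "d3"): 0.4,
    ("d3", "d4"): 0.8,
}
print(recommend({"d1"}, model))  # ['d2', 'd3']
```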

Conclusion

In this use case, the GraphAware NLP Plugin is used to deliver high-quality recommendations to end users. The plugin provides content-based and ontology-based cosine similarities, which, together with the more classical “collaborative filtering” approach, enable new and more advanced functionality in a straightforward way.

The GraphAware NLP Plugin can be used with other plugins available on the GraphAware products page. In particular, using the Neo4j2Elastic plugin for Neo4j and the Graph-Aided Search plugin for Elasticsearch, it is possible to provide a complete, customized, end-to-end search framework.

References

Tran Vu Pham and Le Nguyen Thach, “Social-Aware Document Similarity Computation for Recommender Systems,” pp. 872-878, 2011, doi:10.1109/DASC.2011.147


Published at DZone with permission of Alessandro Negro, DZone MVB. See the original article here.
