I’ve been interning at GRAKN.AI for about three weeks now and have already gotten my hands dirty with some projects. One of them is a synthesis of GRAKN.AI and neural nets — keep your eyes peeled in a couple weeks for a blog post on that one!
Today, though, I will demonstrate a movie recommendations program that I wrote on top of Grakn. This is meant to be a command-line tool for you to play with and tweak as you like, so I am putting all of the code on GitHub and you can learn how to get it running by taking a look at the readme there. I will be explaining how to build such a recommendation system using only basic NumPy and Pandas capabilities and — of course — a Grakn graph.
To start off with, we need some data. The canonical, comprehensive online movie repository is MovieLens, which offers both a “small” and “full” dataset. These data come complete with user ratings, timestamps, movie genres, movie titles, and — in the case of the full dataset — a “tag genome” that calculates the predicted relevance of each of 1,128 tags to every one of the movies. For this project, we are going to focus on users, movies, and genres.
How Are Recommendations Made?
Typically, recommendation systems do their work either from contextual information about user behaviors or directly from knowledge of intrinsic truths about the object being recommended. The second option is a lot more involved, using things like audio analysis and facial recognition to answer the fundamental questions and often requiring deep domain knowledge (a typical exception to this is NLP, which has a comparatively low barrier to entry). The first approach, on the other hand, tends to ask collaborative or content-based questions and is a much simpler and faster way to provide answers. It is also the approach we will be investigating here.
So what questions do we want to have GRAKN.AI help us answer? Well, the collaborative questions we might want to be answered are questions like:
Who enjoys the same movies I do? Who dislikes the same movies I do? What other movies do these people like?
The content-based questions might take the following form:
What types of genres do I enjoy watching? What general characteristics of a movie might lead me to enjoy it?
Diagram representation of collaborative analysis.
I took a combined approach when writing this program and tried to consider both types of questions. Collaborative and content-based filtering, by the way, are both incredibly easy to implement within GRAKN.AI. In this post, I aim to show the following:
- How to construct meaningful Grakn ontologies and rulesets from an open-ended question.
- How to best query a graph in Grakn to get the results you need.
- How to think about recommendation systems graphically.
For this project, I started by creating an ontology with two entity types: user and movie. A user interacts with a movie insofar as he/she contributes a rating to that movie; a user can be said to “like” a movie if their rating for that movie is greater than or equal to their mean rating for all movies. Conversely, a user “dislikes” a movie if their rating for that movie is less than the mean. This is a tweakable parameter, by the way; if you wanted to change the threshold for “liking” a movie, you could do so in two lines of code:
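As a minimal sketch of this thresholding (not the project's actual code), the like/dislike split can be expressed with a pandas group-wise mean. The function name `label_likes` is hypothetical, and the column names `userId`, `movieId`, and `rating` are assumed to match the ratings file described later in this post.

```python
import pandas as pd


def label_likes(ratings: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `ratings` with a boolean `liked` column (hypothetical helper)."""
    out = ratings.copy()
    # Each user's mean rating, broadcast back onto every row belonging to that user.
    user_mean = out.groupby("userId")["rating"].transform("mean")
    # "Like" means the rating is at least the user's own mean.
    # Changing this comparison is how you would move the threshold.
    out["liked"] = out["rating"] >= user_mean
    return out


ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2],
    "movieId": [2968, 3671, 10, 17],
    "rating": [1.0, 3.0, 4.0, 5.0],
})
labeled = label_likes(ratings)
print(labeled)
```

To use a stricter threshold, you could for example compare against `user_mean + 0.5` instead of `user_mean`.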
Throughout this post, I will make references to situations where my choice of parameters is malleable and can lead to different results that may suit different endeavors better or worse.
I could have added a third entity corresponding to genre, and that was indeed my first approach, but I found it easier to simply assign each genre to a number between 0 and 18, inclusive, and encode each movie’s set of genres as a bit vector in a resource of datatype long that I assigned to the movie entity in the ontology. So for example, in the movies file downloaded from MovieLens:
...corresponds to the 1995 movie Pocahontas, which has five genres that I encoded in the long as:
The least significant bit corresponds to the first genre alphabetically in the dataset and it goes down the alphabet from there.
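The bit-vector encoding can be sketched as follows. This is an illustration under assumptions, not the project's code: `encode_genres` and `decode_genres` are hypothetical names, and `TOY_GENRES` is a made-up four-genre vocabulary standing in for the 19 genres in the dataset.

```python
def encode_genres(movie_genres, all_genres):
    """Pack a movie's genres into an integer bitmask (hypothetical helper).

    Bit 0 (the least significant bit) is the alphabetically first genre.
    """
    index = {genre: i for i, genre in enumerate(sorted(all_genres))}
    mask = 0
    for genre in movie_genres:
        mask |= 1 << index[genre]
    return mask


def decode_genres(mask, all_genres):
    """Recover the genre names whose bits are set in `mask`."""
    return [g for i, g in enumerate(sorted(all_genres)) if (mask >> i) & 1]


# Toy vocabulary for illustration only; the real dataset has 19 genres.
TOY_GENRES = ["Drama", "Animation", "Comedy", "Romance"]
mask = encode_genres(["Animation", "Romance"], TOY_GENRES)
print(bin(mask), decode_genres(mask, TOY_GENRES))
```

With 19 genres the largest possible mask is 2^19 - 1, which fits comfortably in a long.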
I included four relations in my ontology. Two of the relations are user-movie relations: user liking a movie and user disliking a movie. The other two are movie relations: recommended movie (which relates a given movie A to any other movie that was liked by at least one person who liked A) and neg-recommended movie (which relates a given movie B to any other movie that was liked by at least one person who disliked B). These latter relations are not hard-coded into the ontology and are instead produced through inference rules in the movieRules.gql file. These relationships can be visualized below:
You can think of this process as a basic clustering algorithm for binary movie classification.
Let’s Get Some Recommendations!
Now it’s time to see how the program actually works. The first step is to parse all the lines of the movies file — the schema for which I showed you above — and insert the relevant information into the Grakn graph. Step 2 is to ingest the ratings data, which comes in the form:
1,2968,1.0,1260759200
1,3671,3.0,1260759117
2,10,4.0,835355493
2,17,5.0,835355681
...where the first column is userId, the second column is movieId, the third column is the rating given by userId to movieId, and the fourth column is the rating’s timestamp.
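For illustration, rows in this format can be loaded with pandas as sketched below. The column name `timestamp` and the variable names are my assumptions, and the sample is read without a header row because the rows shown above have none; a file that starts with a header line would not need the `names` argument.

```python
import io

import pandas as pd

# The four sample rows from above, standing in for the ratings file.
SAMPLE = """1,2968,1.0,1260759200
1,3671,3.0,1260759117
2,10,4.0,835355493
2,17,5.0,835355681
"""

ratings = pd.read_csv(
    io.StringIO(SAMPLE),
    header=None,
    names=["userId", "movieId", "rating", "timestamp"],
)
# Timestamps are seconds since the Unix epoch; convert them for readability.
ratings["rated_at"] = pd.to_datetime(ratings["timestamp"], unit="s")
print(ratings)
```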
A recommender can only recommend if it is given some inputs on which to base its search. A (small) dataset to train on, if you will. The program I have written takes in user input one response at a time and stores the information, communicating with Grakn through a command-line Graql query after every input.
The program gives the "player" random movies from the movie dataset and allows them to respond in one of three ways. If the player likes the displayed movie, they should respond with a Y or a Yes. If the player dislikes the movie, they respond N or No. If the player does not know or have an opinion on the movie, they respond with ?. The player must give a yes/no response to n movies, at which point the engine calculates the recommendations.
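The prompt loop described above might look roughly like this. It is a sketch under assumptions: `parse_response` and `collect_responses` are hypothetical names, the `ask` parameter exists only so the loop can be driven without a terminal, and the Grakn query that the real program sends after each answer is left out.

```python
def parse_response(text):
    """Map raw input to True (like), False (dislike), or None (no opinion)."""
    answer = text.strip().lower()
    if answer in ("y", "yes"):
        return True
    if answer in ("n", "no"):
        return False
    if answer == "?":
        return None
    raise ValueError(f"unrecognized response: {text!r}")


def collect_responses(movie_stream, n, ask=input):
    """Keep asking until the player has given n yes/no answers."""
    liked, disliked = [], []
    for movie in movie_stream:
        if len(liked) + len(disliked) >= n:
            break
        verdict = parse_response(ask(f"Did you like {movie}? [Y/N/?] "))
        if verdict is True:
            liked.append(movie)
        elif verdict is False:
            disliked.append(movie)
        # A "?" answer does not count toward n, so the loop simply moves on.
    return liked, disliked


# Scripted answers in place of a real player, for demonstration.
answers = iter(["yes", "?", "N", "y"])
liked, disliked = collect_responses(
    ["A", "B", "C", "D", "E"], n=3, ask=lambda prompt: next(answers)
)
print(liked, disliked)
```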
The choice of n is somewhat arbitrary, and in the shell snippet below, I have set it to 10. Think about what happens when you increase or decrease that value, though. If n is too low, say 3, you don’t have a large enough sample size to pursue meaningful content-based filtering and you will be choosing with less refined search terms. If n is very high, say 50, then it will take a long time for the user to go through and respond to every suggestion and you tend to get many of the same movies recommended every time since they will have a lot of connections in the Grakn graph.
You can change the value of n in the
After each input, a query is sent to Grakn to randomly select up to 100 movies that have an inferred relationship with the input movie. If the user input is a yes, then Grakn returns inferred recommendations (movies liked by users that also liked the input movie). If the user input is a no, then Grakn returns inferred neg-recommendations (movies liked by users that also disliked the input movie).
The "maximum number of movies selected" parameter is also tweakable from fetch_limit. Setting a limit on the number of results does two things. First, it makes program execution faster since Grakn is not forced to return every result from a massively interconnected graph. Second, it introduces some randomness into the recommendation process, which means you will get diversity in the movies you are recommended.
By the end of this process, the program has a mapping of movieIds recommended by the inference engine to the number of times they were matched out of a total of n user inputs. The values, therefore, range from 1 to n. This is the collaborative filtering aspect of the recommendation engine.
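The tallying step can be sketched with a `Counter`. This is an assumption-laden illustration: `tally_matches` is a hypothetical name, and the random sampling done here in Python stands in for the limit that the real program applies in its Grakn query.

```python
import random
from collections import Counter


def tally_matches(matches_per_input, fetch_limit=100, rng=None):
    """Count how many user inputs each movieId was matched by.

    `matches_per_input` holds one collection of matched movieIds per input.
    """
    rng = rng or random.Random()
    counts = Counter()
    for matched_ids in matches_per_input:
        # Deduplicate so a movie counts at most once per input.
        unique_ids = list(set(matched_ids))
        if len(unique_ids) > fetch_limit:
            # Keep a random subset, mimicking the capped query result.
            unique_ids = rng.sample(unique_ids, fetch_limit)
        counts.update(unique_ids)
    return counts


counts = tally_matches([[10, 17, 10], [17, 32], [17]])
print(counts.most_common())
```

Because each input contributes at most one count per movie, every value in the mapping stays between 1 and n.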
We have also kept track of the movies that the user has liked and disliked, and this is translated into a length-19 (the number of genres in the data) vector corresponding to the total "genre-interest" score of the user. We add 1 to element i if the user "likes" a movie classified as genre i, and we subtract 1 from element i if the user "dislikes" a movie with that genre. Movies can have multiple genres. We turn this vector into a unit vector by dividing by its norm and are left with a set of genre weights. This is content-based filtering in action.
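Here is a small NumPy sketch of the genre-weight computation, assuming each movie's genres are stored as an integer bitmask as described earlier. The name `genre_weights` is hypothetical, and the guard against a zero norm is my addition for the case where likes and dislikes cancel out exactly.

```python
import numpy as np

NUM_GENRES = 19  # number of genres in the data


def genre_weights(liked_masks, disliked_masks, num_genres=NUM_GENRES):
    """Build a unit-length genre-interest vector from genre bitmasks."""

    def bits(mask):
        # Element i is 1.0 if bit i of the mask is set, else 0.0.
        return np.array([(mask >> i) & 1 for i in range(num_genres)], dtype=float)

    scores = np.zeros(num_genres)
    for mask in liked_masks:
        scores += bits(mask)  # +1 for every genre of a liked movie
    for mask in disliked_masks:
        scores -= bits(mask)  # -1 for every genre of a disliked movie

    norm = np.linalg.norm(scores)
    # Assumption: return the all-zero vector unchanged rather than divide by zero.
    return scores / norm if norm > 0 else scores


weights = genre_weights(liked_masks=[0b011, 0b001], disliked_masks=[0b100])
print(weights[:3])
```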
All that is left to do is combine the two methods. For each movie in the mapping, we obtain the bit vector corresponding to its genre classifications and convert it into a vector of 1s and 0s. Let's call this vector a and the vector of genre-weights b. We take the dot product of the two as such,
a · b = a_0 b_0 + a_1 b_1 + ... + a_18 b_18
...and are left with a scalar that represents the overall alignment of the genres of the movie with the genre affinity of the user. We multiply this scalar by the number of times Grakn counted an inferred relationship (the key’s value in the movie mapping). Doing this gives us a score for each movie in the mapping, and we rank scores in descending order. Here, I’ve chosen the top 10.
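The combined score can be sketched as follows. Everything named here is hypothetical: `rank_movies`, the `movie_masks` lookup from movieId to genre bitmask, and the toy three-genre weights are for illustration only and are not taken from the project.

```python
import numpy as np


def rank_movies(match_counts, movie_masks, weights, top_k=10):
    """Score each movie as (match count) * (genre vector . genre weights)."""
    num_genres = len(weights)
    scored = []
    for movie_id, count in match_counts.items():
        mask = movie_masks[movie_id]
        # Vector a: 1.0 where the movie has the genre, 0.0 elsewhere.
        a = np.array([(mask >> i) & 1 for i in range(num_genres)], dtype=float)
        scored.append((movie_id, count * float(a @ weights)))
    # Highest score first, then keep the top_k entries.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy data with three genres instead of 19.
weights = np.array([0.8, 0.6, 0.0])
match_counts = {10: 2, 17: 3, 32: 1}
movie_masks = {10: 0b001, 17: 0b010, 32: 0b011}
ranked = rank_movies(match_counts, movie_masks, weights)
print(ranked)
```

Note that a movie matched many times can still rank low if its genres align poorly with the user's weights, and the score goes negative when the movie's genres are mostly ones the user disliked.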
Above are the results of one round of recommendations. You will notice that there is an interesting blend of horror and children’s movies. I personally don’t know anyone with this kind of taste, but it makes sense given the responses I provided. I responded "yes" to movies such as Addams Family Reunion and Angels in the Outfield — typical children’s films — as well as skin-crawlers like Stonehearst Asylum, Intacto, and The Hunger, which Roger Ebert described as "an agonizingly bad vampire movie." If I actually liked the movies I just mentioned, then the recommendations above would be quite accurate.
There is a lot more you can leverage to refine this program, of course. For example, I didn’t filter movies by their average rating, so there is no guarantee that the recommended movies are well reviewed, beyond likely having been seen by many people. Some more possibilities are below:
- Machine learning for binary classification. You could train something like a support vector machine on top of the graph.
- Tags. The larger dataset has "tag genome" data, as I mentioned at the beginning. That would be a great addition to the content-based aspect of the recommender.
- IMDB links. The dataset also contains links to movie aggregation websites, which you could scrape to get more data — actors, director, writers, you name it.
- Rating strength. I only split ratings into "like" and "dislike," but you could certainly go further than that.
Thanks for reading! Hope you enjoyed this article and please hit recommend if you did. Be on the lookout for my next post on the topic of neural nets and GRAKN.AI, which should be out in a few weeks.