Building a Graph-Based Movie Recommender Engine
Join the DZone community and get the full member experience.
Join For Freea recommender engine helps a user find novel and interesting items within a pool of resources. there are numerous types of recommendation algorithms and a graph can serve as a general-purpose substrate for evaluating such algorithms. this post will demonstrate how to build a graph-based movie recommender engine using the publicly available movielens dataset , the graph database neo4j , and the graph traversal language gremlin . feel free to follow along in the gremlin console as the post will go step-by-step from data acquisition, to parsing, and ultimately, to traversing.
the movieratings dataset
the grouplens research group has made available a corpus of movie ratings. there are 3 versions of this dataset: 100 thousand, 1 million, and 10 million ratings. this post makes use of the 1 million ratings version of the dataset. the dataset can be downloaded from the movieratings website (~6 megs in size). the raw dataset is composed of three files: users.dat , movies.dat , and ratings.dat . more information about the particulars of each file is described in the respective readme.txt .
getting started with gremlin
all of the code examples can be cut and pasted into the gremlin console or into a groovy/java class within a larger application. for the sake of simplicity, simply follow along with the gremlin console. gremlin 1.3 is available for download at this location .marko$ ./gremlin.sh \,,,/ (o o) -----oooo-(_)-oooo----- gremlin>
generating a movieratings graph
before getting recommendations of which movies to watch, it is important to first parse the raw movielens data according to the graph schema defined above. while this is not the only way in which the data could be represented as a property graph , the representation above is natural and useful for the traversals presented later.
the data will be inserted into the graph database neo4j. the gremlin/ groovy code below creates a new neo4j graph, removes an unneeded default edge index, and sets the transaction buffer to 1000 mutations per commit. finally, to make the data easier to understand, a map is hardcoded which will allow for the representation of a user’s occupation as a string instead of its integer id.
g = new neo4jgraph('/tmp/movieratings') g.dropindex("edges") g.setmaxbuffersize(1000) occupations = [0:'other', 1:'academic/educator', 2:'artist', 3:'clerical/admin', 4:'college/grad student', 5:'customer service', 6:'doctor/health care', 7:'executive/managerial', 8:'farmer', 9:'homemaker', 10:'k-12 student', 11:'lawyer', 12:'programmer', 13:'retired', 14:'sales/marketing', 15:'scientist', 16:'self-employed', 17:'technician/engineer', 18:'tradesman/craftsman', 19:'unemployed', 20:'writer']
parsing movie data
the file movie.dat contains a list of movies. each row of the raw file has 3 columns: movieid , title , and genera s.
marko$ more -n6 movies.dat 1::toy story (1995)::animation|children's|comedy 2::jumanji (1995)::adventure|children's|fantasy 3::grumpier old men (1995)::comedy|romance 4::waiting to exhale (1995)::comedy|drama 5::father of the bride part ii (1995)::comedythe code to parse this data into neo4j and according to the diagrammed schema is presented below. the details of the code are left to focused eye of the interested reader. for others, simply cutting and pasting this code into the gremlin console will suffice to move forward.
new file('movies.dat').eachline {def line -> def components = line.split('::'); def movievertex = g.addvertex(['type':'movie', 'movieid':components[0].tointeger(), 'title':components[1]]); components[2].split('\\|').each { def genera -> def hits = g.idx(t.v)[[genera:genera]].iterator(); def generavertex = hits.hasnext() ? hits.next() : g.addvertex(['type':'genera', 'genera':genera]); g.addedge(movievertex, generavertex, 'hasgenera'); } }
parsing user data
the file users.dat contains a list of users. each row of the raw file has 5 columns: userid , gender , age , occupation , and zipcode . for schema simplicity, the zipcode field will be ignored.
ml-1m$ more -n6 users.dat 1::f::1::10::48067 2::m::56::16::70072 3::m::25::15::55117 4::m::45::7::02460 5::m::25::20::55455
new file('users.dat').eachline {def line -> def components = line.split('::'); def uservertex = g.addvertex(['type':'user', 'userid':components[0].tointeger(), 'gender':components[1], 'age':components[2].tointeger()]); def occupation = occupations[components[3].tointeger()]; def hits = g.idx(t.v)[[occupation:occupation]].iterator(); def occupationvertex = hits.hasnext() ? hits.next() : g.addvertex(['type':'occupation', 'occupation':occupation]); g.addedge(uservertex, occupationvertex, 'hasoccupation'); }
parsing ratings data
the file ratings.dat contains the ratings that a user provided for the movies they watched. each row of the raw file has 4 columns: userid , movieid , stars (1-5 rating scale), and timestamp . the timestamp field will be ignored.
ml-1m$ more -n6 ratings.dat 1::1193::5::978300760 1::661::3::978302109 1::914::3::978301968 1::3408::4::978300275 1::2355::5::978824291
given that there are 1 million ratings, this code fragment will take a minute or so to evaluate. for those dealing with large datasets and neo4j, its possible to use neo4jbatchgraph which is new with blueprints 1.0.
new file('ratings.dat').eachline {def line -> def components = line.split('::'); def ratededge = g.addedge(g.idx(t.v)[[userid:components[0].tointeger()]].next(), g.idx(t.v)[[movieid:components[1].tointeger()]].next(), 'rated'); ratededge.setproperty('stars', components[2].tointeger()); }to commit any data left over in the transaction buffer, successfully stop the current transaction. now the data is persisted to disk. if you plan on leaving the gremlin console, be sure to g.shutdown() the graph first.
g.stoptransaction(transactionalgraph.conclusion.success)
before moving onto recommender algorithms, lets make sure the graph looks correct.
gremlin> g.v.count() ==>9962 gremlin> g.e.count() ==>1012657 gremlin> g.v[[type:'movie']].count() ==>3883 gremlin> g.v[[type:'genera']].count() ==>18 gremlin> g.v[[type:'user']].count() ==>6040 gremlin> g.v[[type:'occupation']].count() ==>21 gremlin> occupations.size() ==>21
as a side : what is the distribution of occupations amongst the user population?
gremlin> m = [:] gremlin> g.v[[type:'user']].out('hasoccupation').occupation.groupcount(m) >> -1 ==>null gremlin> m.sort{a,b -> b.value <=> a.value} ==>college/grad student=759 ==>other=711 ==>executive/managerial=679 ==>academic/educator=528 ==>technician/engineer=502 ==>programmer=388 ==>sales/marketing=302 ==>writer=281 ==>artist=267 ==>self-employed=241 ==>doctor/health care=236 ==>k-12 student=195 ==>clerical/admin=173 ==>scientist=144 ==>retired=142 ==>lawyer=129 ==>customer service=112 ==>homemaker=92 ==>unemployed=72 ==>tradesman/craftsman=70 ==>farmer=17
what about the average age?
gremlin> g.v[[type:'user']].age.mean() ==>30.639238410596025
traversing the movielens graph
now that the movielens data has been parsed and represented as a graph,
the data is ready to be traversed (i.e. queried). there are two general
types of recommendation:
collaborative filtering
and content-based recommendation.
in
collaborative filtering
, the
rating/liking/preference behavior of users is correlated in order to
recommend the favorites of one user to another, similar user.
i like the movies you like, what other movies do you like that i haven’t seen?
with
content-based recommendation
, if a particular item is liked, then its features are analyzed in order to find other items with analogous features.
i like romantic zombie comedies, what other romantic zombie comedies are there?
basic collaborative filtering
this section will build up an ever more complex traversal in order to explain the foundation of graph-based collaborative filtering and hint at the variations that can be incorporated to meet domain specific requirements. lets start from toy story the movie.
gremlin> v = g.idx(t.v)[[title:'toy story (1995)']] >> 1 ==>v[1] gremlin> v.map() ==>movieid=1 ==>title=toy story (1995) ==>type=movie
- which users gave toy story more than 3 stars? (only return 5 results)
gremlin> v.ine('rated').filter{it.getproperty('stars') > 3}.outv.userid[0..4] ==>v[3902] ==>v[3912] ==>v[3916] ==>v[3918] ==>v[3920]
this traversal doesn’t yield useful information. however, when it is used within a larger path expression, collaborative filtering is effected.
- which users gave toy story more than 3 stars and what other movies did they give more than 3 stars to? (only return 5 results)
gremlin> v.ine('rated').filter{it.getproperty('stars') > 3}.outv.oute('rated').filter{it.getproperty('stars') > 3}.inv.title[0..4] ==>one flew over the cuckoo's nest (1975) ==>erin brockovich (2000) ==>bug's life, a (1998) ==>ben-hur (1959) ==>christmas story, a (1983)
- start from toy story — v
- get the incoming rated edges — ine(‘rated’)
- filter out those edges whose star property is less than 4 — filter{it.getproperty(‘stars’) > 3}
- get the tail user vertices of the remaining edges — outv
- get the rating edges of those user vertices — oute(‘rated’)
- filter out those edges whose star property is less than 4 — filter{it.getproperty(‘stars’) > 3}
- get the head movie vertices of the remaining edges — inv
- get the string title property of those movie vertices — title
- only emit the first 5 titles — [0..4]
what is the meaning of the aggregate of these atomic-steps? —
highly co-rated
. with gremlin, it is possible to work at a higher level of abstraction by making use of a
user defined steps
. the step defined below is called
corated
and it bundles all these atomic-steps together.
gremlin> gremlin.definestep('corated',[vertex,pipe], { def stars -> _().ine('rated').filter{it.getproperty('stars') > stars}.outv.oute('rated').filter{it.getproperty('stars') > stars}.inv}) ==>null
gremlin> v.corated(3).title[0..4] ==>one flew over the cuckoo's nest (1975) ==>erin brockovich (2000) ==>bug's life, a (1998) ==>ben-hur (1959) ==>christmas story, a (1983)
this traversal will return a list of movies. notice that this list
has many duplicates. this is due to the fact that users who like toy
story also like many of the same other movies. this similarity of human
behavior is what is capitalized on in recommendation algorithms.
- how many of toy story’s highly co-rated movies are unique?
gremlin> v.corated(3).count() ==>268493 gremlin> v.corated(3).uniqueobject.count() ==>3353
given that there are 268,493 highly rated paths from toy story to
other movies and only 3,353 of those movies are unique, it is possible
to use these duplicates as a ranking mechanism–ultimately, a
recommendation.
- which movies are most highly co-rated with toy story? (return the top 10)
gremlin> m = [:] gremlin> v.corated(3).title.groupcount(m) >> -1 ==>null gremlin> m.sort{a,b -> b.value <=> a.value}[0..9] ==>toy story (1995)=1655 ==>star wars: episode v - the empire strikes back (1980)=1000 ==>star wars: episode iv - a new hope (1977)=998 ==>american beauty (1999)=949 ==>matrix, the (1999)=925 ==>raiders of the lost ark (1981)=922 ==>silence of the lambs, the (1991)=887 ==>saving private ryan (1998)=878 ==>back to the future (1985)=876 ==>shawshank redemption, the (1994)=875
the highest ranked result makes sense. it says, people who like toy story, also like toy story. in order to remove these reflexive paths, simply filter out toy story.
gremlin> m = [:] gremlin> v.corated(3).filter{it != v}.title.groupcount(m) >> -1 ==>null gremlin> m.sort{a,b -> b.value <=> a.value}[0..9] ==>star wars: episode v - the empire strikes back (1980)=1000 ==>star wars: episode iv - a new hope (1977)=998 ==>american beauty (1999)=949 ==>matrix, the (1999)=925 ==>raiders of the lost ark (1981)=922 ==>silence of the lambs, the (1991)=887 ==>saving private ryan (1998)=878 ==>back to the future (1985)=876 ==>shawshank redemption, the (1994)=875 ==>toy story 2 (1999)=871
these are all fine movies and ones that are generally considered “good.” however, note that the recommendation traversal starts not from a particular user, but from a particular movie (i.e. toy story). if instead the traversal starts from a particular user, then basic collaborative filtering is implemented.
given a user, what movies do they like, who else likes those movies, and what other movies do those users like that are not already liked by that initial user.
expressing this in gremlin is left as an exercise to the more than casually interested reader. feel free to email me your solution for comments (or post it to the gremlin-users mailing list).
mixing in content-based recommendation
the previous section returned generally good movies. however, if i’m a
parent and i want to find a good movie like toy story for my children to
watch, i probably do not want them watching
silence of the lambs
.
to remedy this situation, it is possible to mix collaborative filtering
and content-based recommendation into a traversal that yields “toy
story good” movies of the same genera.
gremlin> v.out('hasgenera').genera ==>animation ==>children's ==>comedy
- which movies are most highly co-rated with toy story that share a genera with toy story? (return the top 10)
gremlin> m = [:] gremlin> x = [] as set gremlin> v.out('hasgenera').aggregate(x).back(2).corated(3).filter{it != v}.out('hasgenera').retain(x).back(2).title.groupcount(m) >> -1 ==>null gremlin> m.sort{a,b -> b.value <=> a.value}[0..9] ==>american beauty (1999)=949 ==>back to the future (1985)=876 ==>toy story 2 (1999)=871 ==>princess bride, the (1987)=851 ==>groundhog day (1993)=843 ==>shakespeare in love (1998)=807 ==>forrest gump (1994)=775 ==>men in black (1997)=747 ==>e.t. the extra-terrestrial (1982)=737 ==>bug's life, a (1998)=700
this ranking makes sense, but still, for the stiflingly worrisome parent, movies like
american beauty
(classified as a comedy) may not be considered appropriate for
children. how about only considering those movies that share all the
same generas with toy story?
- which movies are most highly co-rated with toy story that share all generas with toy story? (return the top 10)
gremlin> m = [:] gremlin> x = [] as set gremlin> v.out('hasgenera').aggregate(x).back(2).corated(3).filter{it != v}.filter{it.out('hasgenera')>>[] as set == x}.title.groupcount(m) >> -1 ==>null gremlin> m.sort{a,b -> b.value <=> a.value}[0..9] ==>toy story 2 (1999)=871 ==>bug's life, a (1998)=700 ==>chicken run (2000)=465 ==>american tail, an (1986)=140 ==>aladdin and the king of thieves (1996)=39 ==>american tail: fievel goes west, an (1991)=37 ==>rugrats movie, the (1998)=34 ==>adventures of rocky and bullwinkle, the (2000)=21 ==>saludos amigos (1943)=4
conclusion
recommendation is a popular technique used for separating the
wheat from the chaff
. for systems with numerous items, helping a user find items that not only satiate their tastes, but also their
momentary needs
,
the algorithms presented can be used. it is important to note that at
some level of abstraction, collaborative filtering and content-based
recommendation are the same. this idea is made salient when considering
that ratings are yet another feature of an item. more specifically,
highly co-rated
movies of a movie are features of that movie. this idea gets to the
flexibility of the property graph data structure and the notion of
abstract/inferred/derived relationships
.
the itemization below presents recommendation themes that can be
further explored by the interested reader. the graph allows an analyst
to slice and dice the data in various ways. with enough data (many types
of things and many types of relationships), there are numerous mappings
that can be made. that is the power of the
graph traversal pattern
.
- make use of genera taxonomies to explore genera generalization and specification.
- determine if gender is a significant factor in the preference of movies.
- recommend movies with filtering based on the occupation (or age) of the user.
- mix in other movie features such as director, actors, and motion picture ratings.
- makes use of a collection of start movies (e.g. “given toy story and big trouble in little china , …”). realize that a single user can be seen as a reference to such a collection (the movies they highly rated).
- “what do the users prefer who don’t like the same movies as me?”
- “what do the users dislike who like the movies i dislike?” (non-collaborative filtering)
- use gender, zipcode, and ratings to recommend users a movie watching date (“watch movie x with user y “).
-
use range filters (e.g.
[0..100]
) or
random walks
for fined grained control of the execution speed (i.e. path sampling).
related articles
rodriguez, m.a., watkins, j., “ faith in the algorithm, part 2: computational eudaemonics ,” proceedings of the international conference on knowledge-based and intelligent information & engineering systems, lecture notes in artificial intelligence, volume 5712, pp. 813–820, april 2009.
rodriguez, m.a., neubauer, p., “ the graph traversal pattern ,” graph data management: techniques and applications, eds. s. sakr, e. pardede, igi global, isbn:9781613500538, august 2011.
rodriguez, m.a., allen, d.w., shinavier, j., ebersole, g., “
a recommender system to support the scholarly communication process
,” krs-2009-02, april 2009.
source : http://markorodriguez.com/2011/09/22/a-graph-based-movie-recommender-engine/
Opinions expressed by DZone contributors are their own.
Comments