DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
View Events Video Library
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Integrating PostgreSQL Databases with ANF: Join this workshop to learn how to create a PostgreSQL server using Instaclustr’s managed service

Mobile Database Essentials: Assess data needs, storage requirements, and more when leveraging databases for cloud and edge applications.

Monitoring and Observability for LLMs: Datadog and Google Cloud discuss how to achieve optimal AI model performance.

Automated Testing: The latest on architecture, TDD, and the benefits of AI and low-code tools.

Related

  • Python Bags the TIOBE Language of the Year 2021 in a Row
  • Making the Switch to Python? Here's Everything You Need to Know.
  • Playing With Pandas DataFrames (With Missing Values Table Example)
  • Simplifying Access to Db2 Databases in Jupyter Notebook

Trending

  • Revolutionizing Software Testing
  • Agile Estimation: Techniques and Tips for Success
  • Monkey-Patching in Java
  • Demystifying Project Loom: A Guide to Lightweight Threads in Java
  1. DZone
  2. Data Engineering
  3. Databases
  4. Python's Pandas vs Neo4j's Cypher: Exploring Popular Phrases in How I Met Your Mother Transcripts

Python's Pandas vs Neo4j's Cypher: Exploring Popular Phrases in How I Met Your Mother Transcripts

Mark Needham user avatar by
Mark Needham
·
Feb. 24, 15 · Interview
Like (0)
Save
Tweet
Share
6.05K Views

Join the DZone community and get the full member experience.

Join For Free

I’ve previously written about extracting TF/IDF scores for phrases in documents using scikit-learn and the final step in that post involved writing the words into a CSV file for analysis later on.

I wasn’t sure what the most appropriate tool of choice for that analysis was so I decided to explore the data using Python’s pandas library and load it into Neo4j and write some Cypher queries.

To do anything with Neo4j we need to first load the CSV file into the database. The easiest way to do that is with Cypher’s LOAD CSV command.

First we’ll load the phrases in and then we’ll connect them to the episodes which were previously loaded:

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/tfidf_scikit.csv" AS row
MERGE (phrase:Phrase {value: row.Phrase});
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///Users/markneedham/projects/neo4j-himym/data/import/tfidf_scikit.csv" AS row
MATCH (phrase:Phrase {value: row.Phrase})
MATCH (episode:Episode {id: TOINT(row.EpisodeId)})
MERGE (phrase)-[:USED_IN_EPISODE {tfidfScore: TOFLOAT(row.Score)}]->(episode);

Now we’re ready to start writing some queries. To start with we’ll write a simple query to find the top 3 phrases for each episode.

In pandas this is quite easy – we just need to group by the appropriate field and then take the top 3 records in that grouping:

top_words_by_episode = df \
    .sort(["EpisodeId", "Score"], ascending = [True, False]) \
    .groupby(["EpisodeId"], sort = False) \
    .head(3)
 
>>> print(top_words_by_episode.to_string())
 
        EpisodeId              Phrase     Score
3976            1                 ted  0.262518
2912            1              olives  0.195714
2441            1            marshall  0.155515
8143            2                 ted  0.292184
5197            2              carlos  0.227454
7482            2               robin  0.195150
12551           3                 ted  0.232662
9040            3              barney  0.187255
11254           3              mcneil  0.170619
15641           4             natalie  0.562485
16763           4                 ted  0.191873
16234           4               robin  0.102671
20715           5            subtitle  0.310866
18121           5          coat check  0.181682
20861           5                 ted  0.169973
...

The cypher version looks quite similar, the main difference being that we use the COLLECT to generate an array of phrases by episode and then take the top 3:

MATCH (e:Episode)<-[rel:USED_IN_EPISODE]-(phrase)
WITH e, rel, phrase
ORDER BY e.id, rel.tfidfScore DESC
RETURN e.id, e.title, COLLECT({phrase: phrase.value, score: rel.tfidfScore})[..3]
ORDER BY e.id
 
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | e.id | e.title                                     | COLLECT({phrase: phrase.value, score: rel.tfidfScore})[..3]                                                                                                               |
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
==> | 1    | "Pilot"                                     | [{phrase -> "ted", score -> 0.2625177493269755},{phrase -> "olives", score -> 0.19571419072701732},{phrase -> "marshall", score -> 0.15551468983363487}]                  |
==> | 2    | "Purple Giraffe"                            | [{phrase -> "ted", score -> 0.292184496766088},{phrase -> "carlos", score -> 0.22745438090499026},{phrase -> "robin", score -> 0.19514993122773566}]                      |
==> | 3    | "Sweet Taste of Liberty"                    | [{phrase -> "ted", score -> 0.23266190616714866},{phrase -> "barney", score -> 0.18725456678444408},{phrase -> "officer mcneil", score -> 0.17061872221616137}]           |
==> | 4    | "Return of the Shirt"                       | [{phrase -> "natalie", score -> 0.5624848345525686},{phrase -> "ted", score -> 0.19187323894701674},{phrase -> "robin", score -> 0.10267067360622682}]                    |
==> | 5    | "Okay Awesome"                              | [{phrase -> "subtitle", score -> 0.310865508347106},{phrase -> "coat check", score -> 0.18168178787561182},{phrase -> "ted", score -> 0.16997258596683185}]               |
==> | 6    | "Slutty Pumpkin"                            | [{phrase -> "mike", score -> 0.2966610054610693},{phrase -> "ted", score -> 0.19333276951599407},{phrase -> "robin", score -> 0.1656172994411056}]                        |
==> | 7    | "Matchmaker"                                | [{phrase -> "ellen", score -> 0.4947912795578686},{phrase -> "sarah", score -> 0.24462913913669443},{phrase -> "ted", score -> 0.23728319597607636}]                      |
==> | 8    | "The Duel"                                  | [{phrase -> "ted", score -> 0.26713931416222847},{phrase -> "marshall", score -> 0.22816702335751904},{phrase -> "swords", score -> 0.17841675237702592}]                 |
==> | 9    | "Belly Full of Turkey"                      | [{phrase -> "ericksen", score -> 0.43145756691027665},{phrase -> "mrs ericksen", score -> 0.1939318283559959},{phrase -> "kendall", score -> 0.1846969793866628}]         |
==> | 10   | "The Pineapple Incident"                    | [{phrase -> "ted", score -> 0.439756993033922},{phrase -> "trudy", score -> 0.36367907631894536},{phrase -> "carl", score -> 0.16413071244131686}]                        |
==> | 11   | "The Limo"                                  | [{phrase -> "moby", score -> 0.48314164479037003},{phrase -> "party number", score -> 0.30458929780262456},{phrase -> "ranjit", score -> 0.1991061739767796}]             |
...
==> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

In the cypher version we get one row per episode whereas with the Python version we get 3 rows. It might be possible to achieve this effect with pandas too but I wasn’t sure how to do so.

Next let’s find the top phrases for a single episode – the type of query that might be part of an episode page on a How I met your mother wiki:

top_words = df[(df["EpisodeId"] == 1)] \
    .sort(["Score"], ascending = False) \
    .head(20)
 
>>> print(top_words.to_string())
 
      EpisodeId                Phrase     Score
3976          1                   ted  0.262518
2912          1                olives  0.195714
2441          1              marshall  0.155515
4732          1               yasmine  0.152279
3347          1                 robin  0.130418
209           1                barney  0.124412
2146          1                  lily  0.122925
3637          1                signal  0.103793
1366          1                goanna  0.098138
3524          1                 scene  0.095342
710           1                   cut  0.091734
2720          1              narrator  0.086462
1147          1             flashback  0.078296
1148          1        flashback date  0.070283
3224          1                ranjit  0.069393
4178          1           ted yasmine  0.058569
1149          1  flashback date robin  0.058569
525           1                  carl  0.058210
3714          1           smurf pen1s  0.054365
2048          1              lebanese  0.054365
MATCH (e:Episode {title: "Pilot"})<-[rel:USED_IN_EPISODE]-(phrase)
WITH phrase, rel
ORDER BY rel.tfidfScore DESC
RETURN phrase.value AS phrase, rel.tfidfScore AS score
LIMIT 20
 
==> +-----------------------------------------------+
==> | phrase                 | score                |
==> +-----------------------------------------------+
==> | "ted"                  | 0.2625177493269755   |
==> | "olives"               | 0.19571419072701732  |
==> | "marshall"             | 0.15551468983363487  |
==> | "yasmine"              | 0.15227880637176266  |
==> | "robin"                | 0.1304175242341549   |
==> | "barney"               | 0.12441175186690791  |
==> | "lily"                 | 0.12292497785945679  |
==> | "signal"               | 0.1037932464656365   |
==> | "goanna"               | 0.09813798750091524  |
==> | "scene"                | 0.09534236041231685  |
==> | "cut"                  | 0.09173366535740156  |
==> | "narrator"             | 0.08646229819848741  |
==> | "flashback"            | 0.07829592155397117  |
==> | "flashback date"       | 0.07028252601773662  |
==> | "ranjit"               | 0.06939276915589167  |
==> | "ted yasmine"          | 0.05856877168144719  |
==> | "flashback date robin" | 0.05856877168144719  |
==> | "carl"                 | 0.058210117288760355 |
==> | "smurf pen1s"          | 0.05436505297972703  |
==> | "lebanese"             | 0.05436505297972703  |
==> +-----------------------------------------------+

Our next query is a negation – find the episodes which don’t mention the phrase ‘robin’. In python we can do some simple set operations to work this out:

all_episodes = set(range(1, 209))
robin_episodes = set(df[(df["Phrase"] == "robin")]["EpisodeId"])
 
>>> print(set(all_episodes) - set(robin_episodes))
set([145, 198, 143])

In cypher land a query will suffice:

MATCH (episode:Episode), (phrase:Phrase {value: "robin"})
WHERE NOT (episode)<-[:USED_IN_EPISODE]-(phrase)
RETURN episode.id AS id, episode.season AS season, episode.number AS episode

And finally a mini recommendation engine type query – how many of the top phrases in Episode 1 were used in other episodes:

First python:

phrases_used = set(df[(df["EpisodeId"] == 1)] \
    .sort(["Score"], ascending = False) \
    .head(10)["Phrase"])
 
phrases = df[df["Phrase"].isin(phrases_used)]
 
print (phrases[phrases["EpisodeId"] != 1] \
    .groupby(["Phrase"]) \
    .size() \
    .order(ascending = False))

Here we’ve pulled it out into a few steps – first we identify the top phrases, then we find out where they occur across the whole data set and finally we filter out the occurrences in the first episode and count the other occurrences.

Phrase
marshall    207
barney      207
ted         206
lily        206
robin       204
scene        36
signal        4
goanna        3
olives        1

In cypher we can write a query to do this as well:

MATCH (episode:Episode {title: "Pilot"})<-[rel:USED_IN_EPISODE]-(phrase)
WITH phrase, rel, episode
ORDER BY rel.tfidfScore DESC
LIMIT 10
MATCH (phrase)-[:USED_IN_EPISODE]->(otherEpisode)
WHERE otherEpisode <> episode
RETURN phrase.value AS phrase, COUNT(*) AS numberOfOtherEpisodes
ORDER BY numberOfOtherEpisodes DESC
 
==> +------------------------------------+
==> | phrase     | numberOfOtherEpisodes |
==> +------------------------------------+
==> | "barney"   | 207                   |
==> | "marshall" | 207                   |
==> | "ted"      | 206                   |
==> | "lily"     | 206                   |
==> | "robin"    | 204                   |
==> | "scene"    | 36                    |
==> | "signal"   | 4                     |
==> | "goanna"   | 3                     |
==> | "olives"   | 1                     |
==> +------------------------------------+

Overall there’s not much in it – for some of the queries I found it easier in cypher and for others easier with pandas. It’s always useful to have multiple tools in the toolbox!

Database Phrase (software) Pandas Python (language) Neo4j

Published at DZone with permission of Mark Needham, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Python Bags the TIOBE Language of the Year 2021 in a Row
  • Making the Switch to Python? Here's Everything You Need to Know.
  • Playing With Pandas DataFrames (With Missing Values Table Example)
  • Simplifying Access to Db2 Databases in Jupyter Notebook

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends: