DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones AWS Cloud
by AWS Developer Relations
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Partner Zones
AWS Cloud
by AWS Developer Relations
Securing Your Software Supply Chain with JFrog and Azure
Register Today

Trending

  • Azure Virtual Machines
  • A Complete Guide to Agile Software Development
  • Building a Robust Data Engineering Pipeline in the Streaming Media Industry: An Insider’s Perspective
  • Authorization: Get It Done Right, Get It Done Early

Trending

  • Azure Virtual Machines
  • A Complete Guide to Agile Software Development
  • Building a Robust Data Engineering Pipeline in the Streaming Media Industry: An Insider’s Perspective
  • Authorization: Get It Done Right, Get It Done Early
  1. DZone
  2. Data Engineering
  3. Data
  4. Neo4j: Using Aliases to Handle Messy Data

Neo4j: Using Aliases to Handle Messy Data

Mark Needham user avatar by
Mark Needham
·
Dec. 11, 13 · Interview
Like (0)
Save
Tweet
Share
4.97K Views

Join the DZone community and get the full member experience.

Join For Free

One of the common problems when building data-heavy applications is that names of things in the domain are often named differently depending on which system you get the data from.

This means that we’ll typically end up running the data from different sources through a normalization process to ensure that we have consistent naming in the database:

Data nromalisation

I’ve recently started linking the football stadium a match was played in to the match in my football graph, but unfortunately different match compilers use different spellings or even names for the same stadium.

My first instinct was to write a normalization layer, but instead I decided to store the stadium names as they were but separately create an ‘alias’ relationship back in to the correct spelling.

The model looks like this:

2013 11 26 00 16 42

I add a ‘Stadium’ label to all the stadiums, which means that if we do a query to find the number of games at each stadium we’ll get back games at the misspelled versions too:

MATCH (stadium:Stadium)<-[:in_stadium]-(game)
RETURN stadium.name, count(game) AS games
ORDER BY games DESC
==> +-------------------------------+
==> | stadium.name          | games |
==> +-------------------------------+
==> | "Craven Cottage"      | 57    |
==> | "Villa Park"          | 57    |
==> | "Anfield"             | 57    |
==> | "Stamford Bridge"     | 57    |
==> | "Britannia Stadium"   | 57    |
==> | "Emirates Stadium"    | 57    |
==> | "Etihad Stadium"      | 57    |
==> | "Stadium of Light"    | 57    |
==> | "Old Trafford"        | 57    |
==> | "The Hawthorns"       | 57    |
==> | "White Hart Lane"     | 57    |
==> | "Goodison Park"       | 57    |
==> | "DW Stadium"          | 39    |
==> | "Molineux Stadium"    | 38    |
==> | "Liberty Stadium"     | 38    |
==> | "Ewood Park"          | 38    |
==> | "Carrow Road"         | 38    |
==> | "Reebok Stadium"      | 38    |
==> | "Loftus Road Stadium" | 37    |
==> | "St James' Park"      | 34    |
==> | "Upton Park"          | 33    |
==> | "Bloomfield Road"     | 19    |
==> | "Madejski Stadium"    | 19    |
==> | "St. James' Park"     | 19    |
==> | "St Andrews Stadium"  | 19    |
==> | "St. Mary's Stadium"  | 19    |
==> | "The DW Stadium"      | 18    |
==> | "Boleyn Ground"       | 5     |
==> | "St. James Park"      | 4     |
==> | "Loftus Road"         | 1     |
==> +-------------------------------+
==> 30 rows

We only want to show the proper spellings of stadiums which we can do with the following query:

MATCH (stadium:Stadium)<-[:alias_of*0..1]-()<-[:in_stadium]-(game)
WHERE NOT(stadium-[:alias_of]->())
return stadium.name, count(game) AS games
ORDER BY games DESC

Here we get all the stadiums, then get any incoming aliases and find the games played in those stadiums. We then filter out any stadium which has an outgoing ‘alias_of’ relationship because that indicates that the node isn’t the original node for the stadium.

If we run the query we’ll get the expected result:

==> +-------------------------------+
==> | stadium.name          | games |
==> +-------------------------------+
==> | "St James' Park"      | 57    |
==> | "Craven Cottage"      | 57    |
==> | "Villa Park"          | 57    |
==> | "Anfield"             | 57    |
==> | "Stamford Bridge"     | 57    |
==> | "Britannia Stadium"   | 57    |
==> | "Emirates Stadium"    | 57    |
==> | "Etihad Stadium"      | 57    |
==> | "Stadium of Light"    | 57    |
==> | "Old Trafford"        | 57    |
==> | "The Hawthorns"       | 57    |
==> | "White Hart Lane"     | 57    |
==> | "The DW Stadium"      | 57    |
==> | "Goodison Park"       | 57    |
==> | "Loftus Road Stadium" | 38    |
==> | "Molineux Stadium"    | 38    |
==> | "Upton Park"          | 38    |
==> | "Liberty Stadium"     | 38    |
==> | "Ewood Park"          | 38    |
==> | "Carrow Road"         | 38    |
==> | "Reebok Stadium"      | 38    |
==> | "Bloomfield Road"     | 19    |
==> | "Madejski Stadium"    | 19    |
==> | "St Andrews Stadium"  | 19    |
==> | "St. Mary's Stadium"  | 19    |
==> +-------------------------------+
==> 25 rows

The nice thing about this approach is that I only need to collect aliases in one place and I can just import the data as is from the source.

On the other hand it does add some complexity to queries as you need to take aliases into account each time.

I’d love to hear your thoughts on what you think of this approach.


Data (computing) Neo4j

Published at DZone with permission of Mark Needham, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Trending

  • Azure Virtual Machines
  • A Complete Guide to Agile Software Development
  • Building a Robust Data Engineering Pipeline in the Streaming Media Industry: An Insider’s Perspective
  • Authorization: Get It Done Right, Get It Done Early

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com

Let's be friends: