Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Data Profiling: A Holistic View of Data Using Neo4j

DZone's Guide to

Data Profiling: A Holistic View of Data Using Neo4j

Data profiling helps in understanding anomalies and assessing data quality. Read on for some practical data profiling techniques using the Cypher graph query language.

· Database Zone ·
Free Resource

RavenDB vs MongoDB: Which is Better? This White Paper compares the two leading NoSQL Document Databases on 9 features to find out which is the best solution for your next project.  

Data profiling is a widely used methodology in the relational database world to analyze the structure, contents, and metadata of a data source. Generally, data profiling consists a series of jobs executed upon the data source to collect statistics and produce informative summaries of the underlying data.

As a general rule, data evolves with time. After some years, the actual data stored and used in a database may vary significantly from what people think it is or what the database was designed for at the beginning. Data profiling helps not only to understand anomalies and assess data quality but also to discover, register, and assess enterprise metadata.

The Neo4j graph database is the best for analyzing connected, high-volume, and variably structured data assets, which makes data profiling more critical, as it would help us obtain a better understanding of the data, identify hidden patterns more easily, and potentially improve query performance.

This article will share practical data profiling techniques using the Cypher graph query language.

The following are system requirements:

  • Neo4j Graph Database version 3.2.x, either Community or Enterprise Edition and either Linux or Windows (I use Windows 10).
  • Internet browser to access the Neo4j Browser (I use Chrome).
  • A graph database. Here, I imported data from the Stack Overflow Questions dataset, which contains more than 31 million nodes, 77 million relationships, and 260 million properties. The total database size on Windows 10 is about 20 GB.

All of the Cypher scripts and outcomes — mostly screenshots — are executed inside the Neo4j Browser, unless specified otherwise.

Database Schema Analysis

Database schema analysis is usually the first step of data profiling. The simple purpose of it is to know what the data model looks like and what objects are available.

Note: Most of the scripts used in this section can be found in Neo4j Browser under Favorites > Data Profiling.

Show the Graph Data Model (metamodel)

Cypher script:

// Show what is related, and how (the meta graph model)
CALL db.schema()

The Stack Overflow database has three nodes:

  1. User

  2. Post

  3. Tag

And it has four relationships:

  1. User POSTED Post

  2. Post HAS_TAG Tag

  3. Post is PARENT_OF Post

  4. Post ANSWER another Post

Show Existing Constraints and Indexes

Cypher script:

// Display constraints and indexes
:schema

Indexes:

   ON :Post(answers) ONLINE
   ON :Post(createdAt) ONLINE
   ON :Post(favorites) ONLINE
   ON :Post(score) ONLINE
... ...

Constraints:

   ON ( post:Post ) ASSERT post.postId IS UNIQUE
   ON ( tag:Tag ) ASSERT tag.tagId IS UNIQUE
   ON ( user:User ) ASSERT user.userId IS UNIQUE

Indexes tell us the properties that will have the best query performance when used for matching.

Constraints tell us the properties that are unique and can be used to identify a node or relationship.

Show All Relationship Types

Cypher script:

// List relationship types
CALL db.relationshipTypes()

A list of available relationship types.

Show All Node Labels/Types

Cypher script:

// List node labels
CALL db.labels()

A list of available node labels/types.

Count All Nodes

Cypher script:

// Count all nodes
MATCH (n) RETURN count(n)

It only takes 1ms for Neo4j to count 31 million plus nodes.

Count All Relationships

Cypher script:

// Count all relationships
MATCH ()-[r]->() RETURN count(*)

Again, it only takes 1 ms for Neo4j to return the total number of relationships.

Show Data Storage Sizes

Cypher script:

// Data storage sizes
:sysinfo

Sample Data

Cypher script:

// What kind of nodes exist
// Sample some nodes, reporting on property and relationship counts per node.
MATCH (n) WHERE rand() <= 0.1
RETURN
DISTINCT labels(n),
count(*) AS SampleSize,
avg(size(keys(n))) as Avg_PropertyCount,
min(size(keys(n))) as Min_PropertyCount,
max(size(keys(n))) as Max_PropertyCount,
avg(size( (n)-[]-() ) ) as Avg_RelationshipCount,
min(size( (n)-[]-() ) ) as Min_RelationshipCount,
max(size( (n)-[]-() ) ) as Max_RelationshipCount

You may have noticed the first line of the script, MATCH (n) WHERE rand() <= 0.1, effectively chooses 10% (0.1) of the total nodes for sampling. Changing this value would change the sample size (i.e. using 0.01 uses 1%).

Node Analysis

Node analysis is more or less similar to table and column analysis for the profiling of a relational database (RDBMS). The purpose of node analysis is to reveal facts about nodes, as well as properties of nodes.

Count Nodes by Their Labels/Types

Cypher script:

// List all node types and counts
MATCH (n) RETURN labels(n) AS NodeType, count(n) AS NumberOfNodes;

Node counting gives a clearer idea of the volume of each type of the node in a database.

Property Analysis

List All Properties of a Node

Cypher script:

// List all properties of a node
MATCH (u:User) RETURN keys(u) LIMIT 1

List All Properties of a Relationship

Cypher script:

// List all properties of a relationship
MATCH ()-[t:POSTED]-() RETURN keys(t) LIMIT 1

There is no property for the relationship.

Uniqueness of the Property

Cypher script:

// Calculate uniqueness of a property
MATCH (u:User) 
RETURN count(DISTINCT u.name) AS DistinctName, 
       count(u.name) AS TotalUser, 
       100*count(DISTINCT u.name)/count(u.name) AS Uniqueness;

It seems 78% of the usernames are unique. A property having unique values can be a good candidate for the ID.

Nullability of the Property

Cypher script:

// Calculate nullability of a property
MATCH (u:User) WHERE u.name IS null RETURN count(u);

There is no empty value for property Name of node User.

Min, Max, Average, and Standard Deviation of the Values of a Property

Cypher script:

// Calculate min, max, average and standard deviation of the values of a property
MATCH (p:Post) 
RETURN min(p.favorites) AS MinFavorites, 
       max(p.favorites) AS MaxFavorites, 
       avg(p.favorites) AS AvgFavorites, 
       stDev(p.favorites) AS StdFavorites;

Occurrences of Values of a Property

Cypher script:

// Find out most often used values for a property
MATCH (p:Post)
RETURN p.answers AS Answers, count(p.answers) AS CountOfAnswers
ORDER BY Answers ASC;

From the results, there are 1.17 million posts that have zero answers, 4.66 million have one answer, and so on.

Node Rank (Centrality)

Importance of a User

Cypher script:

// Calculate node rank / Centrality of a node
// i.e., the relevance of a node by counting the edges from other nodes:
// in-degree, out-degree and total degree.

MATCH (u:User)
WITH u,size( (u)-[:POSTED]->()) AS OutDepth, size( (u)<-[:POSTED]-()) AS InDepth
ORDER BY OutDepth, InDepth
WHERE u.name STARTS WITH 'T'
RETURN u.name, min(OutDepth), max(OutDepth), min(InDepth), max(InDepth)

Calculate node rank / centrality of a node in Neo4j

For a user, the max(OutDepth) represents the max number of posts they have submit. When max(InDepth) is 0, it means there is no relationship ending at the User node.

The user who has the most OutDepth per post can be considered to be more important within the community.

Note: As this is a heavy query, make sure there is enough heap size (specified by dbms.memory.heap.max_size in the neo4j.conf file). Alternatively, use a filter to limit the scope of the query as shown in the sample, which only looks for users whose name starts with T.

Importance of a Post

By looking at which post has the most number of answers, we can tell the importance and/or received attention of the post.

Orphan Nodes

Cypher script:

// orphans: node has no relationship
match (u:User)
with u,size( (u)-[:POSTED]->()) as posts where posts = 0
return u.name, posts;

Find orphan nodes in Neo4j

These are users who have never submitted any post or answer.

Relationship Analysis

Relationship analysis focuses on relationships in a graph database. It can help us understand the completeness, integrity, and density of certain relationships between nodes.

What is unique to graph databases, compared to normal RDBMSs, is the powerful analyses available to reveal the hidden knowledge of connected data. One example is to find out the shortest path between two nodes. Another one is to identify relationship triangles.

Statistics on Relationships

Cypher script:

// Count relationships by type
match (u)-[p]-() with type(p) as RelationshipName, 
count(p) as RelationshipNumber 
return RelationshipName, RelationshipNumber;

Calculate data relationship statistics in Neo4j using Cypher

Display the total number of each relationship type in the database.

Another query to get similar results is given below; however, it takes much more time to complete:

MATCH ()-[r]->() RETURN type(r), count(*)

Find the Shortest Path Between Two Nodes

Cypher script:

// Find all shortest path between 2 nodes
MATCH path = 
      allShortestPaths((u:User {name:"Darin Dimitrov"})-[*]-(me:User {name:"Michael Hunger"}))
RETURN path;

Learn more about data profiling using the Neo4j graph database and the APOC library

The shortest path between the two chosen users is 6.

The shortest path between two users — highlighted by red arrows in the diagram above — tells us the two users are connected by posts having the same tags (red nodes). These are not necessarily the only paths, as users may post to answer each other’s questions or posts, but in this case, connections through the same tag — i.e. a common area of interest — are the fastest way to connect the two users.

In a large graph like this one, it may not be viable to calculate the shortest path between any two users. However, it may be valuable to check the connectivity among the most important people or among posts having the most interest.

Triangle Detection

Cypher script:

// Triangle detection:
match (u:User)-[p1:POSTED]-(x1),(u)-[p2:POSTED]-(x2),(x1)-[r1]-(x2) 
where x1 <> x2 return u,p1,p2,x1,x2 limit 10;

// Count all triangles in the graph
match (u:User)-[p1:POSTED]-(x1),(u)-[p2:POSTED]-(x2),(x1)-[r1]-(x2) 
where x1 <> x2 return count(p1);

How to detect triadic closures (triangles) in the Neo4j graph database using the Cypher query language

Triangles are another key concept in graph theory. Triangles are represented by three connected nodes, directional or uni-directional. Identifying triangles — or a lack of triangles — provides interesting insights into the underlying data asset.

Triangles are also referred as Triadic Closures, as per Graph Databases, 2nd Edition (O'Reilly Media):

A triadic closure is a common property of social graphs, where we observe that if two nodes are connected via a path involving a third node, there is an increased likelihood that the two nodes will become directly connected at some point in the future.

Putting this concept into our daily life, it’s a familiar social occurrence. If we happen to be friends with two people who don’t know one another, there’s an increased chance that those two people will become direct friends at some point in the future.

By discovering the existence of triangles in a graph database, we can create more efficient queries to avoid circular traversal.

Using the APOC Library

Since Neo4j 3.0, users can implement customized functionality using Java to extend Cypher for highly complex graph algorithms. This is the so-called concept of user-defined procedures.

The APOC library is one of the most powerful and popular Neo4j libraries. It consists of many procedures (about 300 at the time of writing) to help with many different tasks in areas like data integration, graph algorithms, or data conversion— and no surprise, it also has several functions for analyzing the metadata of the graph database.

To enable APOC in Neo4j 3.x, there are a few simple steps:

  1. Stop Neo4j service.
  2. Download and copy the most recent version of the APOC JAR file to the plugins folder under the database, i.e. graph.db\plugins
  3. Add the following line to the neo4j.conf file: dbms.security.procedures.unrestricted=apoc.*
  4. Start Neo4j service again.

The out-of-box functions for profiling are all under apoc.meta.*.

Below are some samples:

•  CALL apoc.meta.data()

This will list all nodes and relationships as well as properties of each.

List all nodes, relationships and properties in Neo4j using the APOC library

•  CALL apoc.meta.graph()

This is equivalent to CALL db.schema() (refer to section 1.1 above).

•  CALL apoc.meta.stats()

This will list statistics of nodes and relationships. It also shows the cardinality of each relationship by node types. For example, the following stats communicate the fact that the INTERACTS relationship is between the nodes of label INTERACTS

{
  "labelCount": 1,
  "relTypeCount": 1,
  "propertyKeyCount": 3,
  "nodeCount": 107,
  "relCount": 352,
  "labels": {
    "Character": 107
  },
  "relTypes": {
    "(:Character)-[:INTERACTS]->()": 352,
    "()-[:INTERACTS]->(:Character)": 352,
    "()-[:INTERACTS]->()": 352
  }
}
•  CALL apoc.meta.schema()

This will return metadata of all node labels, relationship types, and properties.

•  CALL apoc.meta.subGraph({labels:['Character'],rels:['INTERECTS']})

This is a very useful function especially for a very large graph, as it allows you to analyze a subset of nodes and relationships (subgraphs). The complete specification looks like this:

CALL apoc.meta.subGraph({labels:[labels],rels:[rel-types],excludes:[label,rel-type,…]})

Further Discussion

There is a huge advantage to storing data as a graph. The graph data model enables us to do much more powerful analysis on relationships over large amounts of data and unearths buried connections among the vast amount of individual data elements (nodes).

Data profiling on a graph database like Neo4j gives us more insightful understandings of the actual data we are working on. The results obtained can then be used for further detailed analysis, performance tuning, database schema optimization, and data migration.

As a native graph database, Neo4j provides native graph data storage and native query processing through the Cypher query language. Some of the most useful data profiling tasks can be easily done using Cypher, as shown in this article.

There are also extensions that support more complex and advanced graph analysis; for example, Betweenness Centrality and Connected Components. Neo4j graph algorithms is the one I’ve been using to perform more complex data profiling. Once installed, the functions can be called directly as part of the Cypher query and results visualized inside the Neo4j Browser. I plan to cover more on this in coming articles.

Get comfortable using NoSQL in a free, self-directed learning course provided by RavenDB. Learn to create fully-functional real-world programs on NoSQL Databases. Register today.

Topics:
relational database ,data profiling ,database ,neo4j ,graph database ,tutorial ,cypher

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}