Neo4j: How Do Null Values Even Work?
Importing CSVs is a great timesaver, but how do you get around the null values lurking within? Fortunately, we can work our way toward a query that handles them in Neo4j.
Join the DZone community and get the full member experience.
Join For FreeEvery now and then I find myself wanting to import a CSV file into Neo4j and I always get confused with how to handle the various null values that can lurk within.
Let’s start with an example that doesn’t have a CSV file in sight. Consider the following list and my attempt to only return null values:
WITH [null, "null", "", "Mark"] AS values
UNWIND values AS value
WITH value WHERE value = null
RETURN value
(no changes, no records)
Hmm, that’s weird. I’d have expected that at least keep the first value in the collection. What about if we do the inverse?
WITH [null, "null", "", "Mark"] AS values
UNWIND values AS value
WITH value WHERE value <> null
RETURN value
(no changes, no records)
Still nothing! Let’s try returning the output of our comparisons rather than filtering rows:
WITH [null, "null", "", "Mark"] AS values
UNWIND values AS value
RETURN value = null AS outcome
╒═══════╤═════════╕
│"value"│"outcome"│
╞═══════╪═════════╡
│null │null │
├───────┼─────────┤
│"null" │null │
├───────┼─────────┤
│"" │null │
├───────┼─────────┤
│"Mark" │null │
└───────┴─────────┘
Ok, so that isn’t what we expected. Everything has an ‘outcome’ of ‘null’! What about if we want to check whether the value is the string “Mark”?
WITH [null, "null", "", "Mark"] AS values
UNWIND values AS value
RETURN value = "Mark" AS outcome
╒═══════╤═════════╕
│"value"│"outcome"│
╞═══════╪═════════╡
│null │null │
├───────┼─────────┤
│"null" │false │
├───────┼─────────┤
│"" │false │
├───────┼─────────┤
│"Mark" │true │
└───────┴─────────┘
From executing this query, we learn that if one side of a comparison is null then the return value is always going to be null.So how do we exclude a row if it’s null? It turns out we have to use the ‘is’ keyword rather than using the equality operator. Let’s see what that looks like:
WITH [null, "null", "", "Mark"] AS values
UNWIND values AS value
WITH value WHERE value is null
RETURN value
╒═══════╕
│"value"│
╞═══════╡
│null │
└───────┘
And the positive case:
WITH [null, "null", "", "Mark"] AS values
UNWIND values AS value
WITH value WHERE value is not null
RETURN value
╒═══════╕
│"value"│
╞═══════╡
│"null" │
├───────┤
│"" │
├───────┤
│"Mark" │
└───────┘
What if we want to get rid of empty strings?
WITH [null, "null", "", "Mark"] AS values
UNWIND values AS value
WITH value WHERE value <> ""
RETURN value
╒═══════╕
│"value"│
╞═══════╡
│"null" │
├───────┤
│"Mark" │
└───────┘
Interestingly that also gets rid of the null value which I hadn’t expected. But if we look for values matching the empty string:
WITH [null, "null", "", "Mark"] AS values
UNWIND values AS value
WITH value WHERE value = ""
RETURN value
╒═══════╕
│"value"│
╞═══════╡
│"" │
└───────┘
It’s not there either! Hmm what’s going on here:
WITH [null, "null", "", "Mark"] AS values
UNWIND values AS value
RETURN value, value = "" AS isEmpty, value <> "" AS isNotEmpty
╒═══════╤═════════╤════════════╕
│"value"│"isEmpty"│"isNotEmpty"│
╞═══════╪═════════╪════════════╡
│null │null │null │
├───────┼─────────┼────────────┤
│"null" │false │true │
├───────┼─────────┼────────────┤
│"" │true │false │
├───────┼─────────┼────────────┤
│"Mark" │false │true │
└───────┴─────────┴────────────┘
Null values seem to get filtered out for every type of equality match unless we explicitly check that a value ‘is null’.
So how do we use this knowledge when we’re parsing CSV files using Neo4j’s LOAD CSV tool?
Let’s say we have a CSV file that looks like this:
$ cat nulls.csv
name,company
"Mark",
"Michael",""
"Will",null
"Ryan","Neo4j"
So none of the first three rows have a value for ‘company’. I don’t have any value at all, Michael has an empty string, and Will has a null value. Let’s see how LOAD CSV interprets this:
load csv with headers from "file:///nulls.csv" AS row
RETURN row
╒═════════════════════════════════╕
│"row" │
╞═════════════════════════════════╡
│{"name":"Mark","company":null} │
├─────────────────────────────────┤
│{"name":"Michael","company":""} │
├─────────────────────────────────┤
│{"name":"Will","company":"null"} │
├─────────────────────────────────┤
│{"name":"Ryan","company":"Neo4j"}│
└─────────────────────────────────┘
We’ve got the full sweep of all the combinations from above. We’d like to create a Person node for each row but only create a Company node and associated ‘WORKS_FOR’ relationship if an actual company is defined – we don’t want to create a null company.
So we only want to create a company node and ‘WORKS_FOR’ relationship for the Ryan row.
The following query does the trick:
load csv with headers from "file:///nulls.csv" AS row
MERGE (p:Person {name: row.name})
WITH p, row
WHERE row.company <> "" AND row.company <> "null"
MERGE (c:Company {name: row.company})
MERGE (p)-[:WORKS_FOR]->(c)
Added 5 labels, created 5 nodes, set 5 properties, created 1 relationship, statement completed in 117 ms.
And if we visualize what’s been created:
Perfect. Perhaps this behavior is obvious, but it always trips me up. Hopefully, it’ll be useful to someone else as well!
Published at DZone with permission of Mark Needham, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments