Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Neo4j: LOAD CSV - The Sneaky Null Character

DZone's Guide to

Neo4j: LOAD CSV - The Sneaky Null Character

· Java Zone ·
Free Resource

Microservices. Streaming data. Event Sourcing and CQRS. Concurrency, routing, self-healing, persistence, clustering...learn how Akka enables Java developers to do all this out of the box! Brought to you in partnership with Lightbend.

I spent some time earlier in the week trying to import a CSV file extracted from Hadoop into Neo4j using Cypher’s LOAD CSV command and initially struggled due to some rogue characters.

The CSV file looked like this:

$ cat foo.csv
foo,bar,baz
1,2,3

I wrote the following LOAD CSV query to extract some of the fields and compare others:

load csv with headers from "file:/Users/markneedham/Downloads/foo.csv" AS line
RETURN line.foo, line.bar, line.bar = "2"
==> +--------------------------------------+
==> | line.foo | line.bar | line.bar = "2" |
==> +--------------------------------------+
==> | <null>   | "2"     | false          |
==> +--------------------------------------+
==> 1 row

I had expect to see a “1” in the first column and a ‘true’ in the third column, neither of which happened.

I initially didn’t have a text editor with hexcode mode available so I tried checking the length of the entry in the ‘bar’ field:

load csv with headers from "file:/Users/markneedham/Downloads/foo.csv" AS line
RETURN line.foo, line.bar, line.bar = "2", length(line.bar)
==> +---------------------------------------------------------+
==> | line.foo | line.bar | line.bar = "2" | length(line.bar) |
==> +---------------------------------------------------------+
==> | <null>   | "2"     | false          | 2                |
==> +---------------------------------------------------------+
==> 1 row

The length of that value is 2 when we’d expect it to be 1 given it’s a single character.

I tried trimming the field to see if that made any difference…

load csv with headers from "file:/Users/markneedham/Downloads/foo.csv" AS line
RETURN line.foo, trim(line.bar), trim(line.bar) = "2", length(line.bar)
==> +---------------------------------------------------------------------+
==> | line.foo | trim(line.bar) | trim(line.bar) = "2" | length(line.bar) |
==> +---------------------------------------------------------------------+
==> | <null>   | "2"            | true                 | 2                |
==> +---------------------------------------------------------------------+
==> 1 row

…and it did! I thought there was probably a trailing whitespace character after the “2” which trim had removed and that ‘foo’ column in the header row had the same issue.

I was able to see that this was the case by extracting the JSON dump of the query via the Neo4j browser:

{  
   "table":{  
      "_response":{  
         "columns":[  
            "line"
         ],
         "data":[  
            {  
               "row":[  
                  {  
                     "foo\u0000":"1\u0000",
                     "bar":"2\u0000",
                     "baz":"3"
                  }
               ],
               "graph":{  
                  "nodes":[  
 
                  ],
                  "relationships":[  
 
                  ]
               }
            }
         ],
      ...
}

It turns out there were null characters scattered around the file so I needed to pre process the file to get rid of them:

$ tr < foo.csv -d '\000' > bar.csv

Now if we process bar.csv it’s a much smoother process:

load csv with headers from "file:/Users/markneedham/Downloads/bar.csv" AS line
RETURN line.foo, line.bar, line.bar = "2", length(line.bar)
==> +---------------------------------------------------------+
==> | line.foo | line.bar | line.bar = "2" | length(line.bar) |
==> +---------------------------------------------------------+
==> | "1"      | "2"      | true           | 1                |
==> +---------------------------------------------------------+
==> 1 row

Note to self: don’t expect data to be clean, inspect it first!


Microservices. Streaming data. Event Sourcing and CQRS. Concurrency, routing, self-healing, persistence, clustering...learn how Akka enables Java developers to do all this out of the box! Brought to you in partnership with Lightbend. 

Topics:

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}