DZone


Mark Needham

Software Engineer at StarTree

London, GB

Joined Feb 2009

http://www.markhneedham.com

About

Mark is a graph advocate and field engineer for Neo Technology, the company behind the Neo4j graph database. As a field engineer, Mark helps customers embrace graph data and Neo4j, building sophisticated solutions to challenging data problems. When he's not with customers, Mark is a developer on Neo4j and writes about his experiences as a graphista on a popular blog at http://markhneedham.com/blog. He tweets at @markhneedham.

Stats

Reputation: 2456
Pageviews: 5.3M
Articles: 92
Comments: 2
  • Articles
  • Comments

Articles

Neo4j and Cypher: Using MERGE with Schema Indexes/Constraints
I wrote about Cypher's MERGE clause a couple of weeks ago, and over the last few days, I've been exploring how it works with schema indexes and unique constraints. A common use case with Neo4j is to model users and events, where an event could be a tweet, Facebook post, or Pinterest pin.
The model might look like this:

We'd like to ensure that we don't get duplicate users or events, and MERGE provides the semantics to do this:

MERGE (u:User {id: {userId}})
MERGE (e:Event {id: {eventId}})
MERGE (u)-[:CREATED_EVENT]->(e)
RETURN u, e

MERGE ensures that a pattern exists in the graph: either the pattern already exists, or it needs to be created. To see how it behaves when several threads run the query at the same time, I wrote the following test:

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.neo4j.cypher.javacompat.ExecutionEngine;
import org.neo4j.cypher.javacompat.ExecutionResult;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;
import org.neo4j.helpers.collection.MapUtil;
import org.neo4j.kernel.impl.util.FileUtils;

public class MergeTime {
    public static void main(String[] args) throws Exception {
        String pathToDb = "/tmp/foo";
        FileUtils.deleteRecursively(new File(pathToDb));

        GraphDatabaseService db = new GraphDatabaseFactory().newEmbeddedDatabase(pathToDb);
        final ExecutionEngine engine = new ExecutionEngine(db);

        ExecutorService executor = Executors.newFixedThreadPool(50);
        final Random random = new Random();

        final int numberOfUsers = 10;
        final int numberOfEvents = 50;
        int iterations = 100;

        final List<Integer> userIds = generateIds(numberOfUsers);
        final List<Integer> eventIds = generateIds(numberOfEvents);

        List<Future<?>> merges = new ArrayList<>();
        for (int i = 0; i < iterations; i++) {
            Integer userId = userIds.get(random.nextInt(numberOfUsers));
            Integer eventId = eventIds.get(random.nextInt(numberOfEvents));
            merges.add(executor.submit(mergeAway(engine, userId, eventId)));
        }

        for (Future<?> merge : merges) {
            merge.get();
        }
        executor.shutdown();

        ExecutionResult userResult = engine.execute(
            "MATCH (u:User) RETURN u.id AS userId, COUNT(u) AS count ORDER BY userId");
        System.out.println(userResult.dumpToString());
    }

    private static Runnable mergeAway(final ExecutionEngine engine,
                                      final Integer userId, final Integer eventId) {
        return new Runnable() {
            @Override
            public void run() {
                try {
                    ExecutionResult result = engine.execute(
                        "MERGE (u:User {id: {userId}})\n" +
                        "MERGE (e:Event {id: {eventId}})\n" +
                        "MERGE (u)-[:CREATED_EVENT]->(e)\n" +
                        "RETURN u, e",
                        MapUtil.map("userId", userId, "eventId", eventId));

                    // exhaust the iterator so the query actually executes
                    for (Map<String, Object> row : result) {
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        };
    }

    private static List<Integer> generateIds(int amount) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 1; i <= amount; i++) {
            ids.add(i);
        }
        return ids;
    }
}

We create a maximum of 10 users and 50 events and then do 100 iterations of random (user, event) pairs with 50 concurrent threads. Afterward, we execute a query that checks how many users of each id have been created and get the following output:

+----------------+
| userId | count |
+----------------+
| 1      | 6     |
| 2      | 3     |
| 3      | 4     |
| 4      | 8     |
| 5      | 9     |
| 6      | 7     |
| 7      | 5     |
| 8      | 3     |
| 9      | 3     |
| 10     | 2     |
+----------------+
10 rows

Next, I added a schema index on users and events to see if that would make any difference, something Javad Karabi recently asked on the user group.
CREATE INDEX ON :User(id)
CREATE INDEX ON :Event(id)

We wouldn't expect this to make a difference, as schema indexes don't ensure uniqueness, but I ran it anyway and got the following output:

+----------------+
| userId | count |
+----------------+
| 1      | 2     |
| 2      | 9     |
| 3      | 7     |
| 4      | 2     |
| 5      | 3     |
| 6      | 7     |
| 7      | 7     |
| 8      | 6     |
| 9      | 5     |
| 10     | 3     |
+----------------+
10 rows

If we want to ensure the uniqueness of users and events, we need to add a unique constraint on the id of both of these labels:

CREATE CONSTRAINT ON (user:User) ASSERT user.id IS UNIQUE
CREATE CONSTRAINT ON (event:Event) ASSERT event.id IS UNIQUE

Now if we run the test, we'll only end up with one of each user:

+----------------+
| userId | count |
+----------------+
| 1      | 1     |
| 2      | 1     |
| 3      | 1     |
| 4      | 1     |
| 5      | 1     |
| 6      | 1     |
| 7      | 1     |
| 8      | 1     |
| 9      | 1     |
| 10     | 1     |
+----------------+
10 rows

We'd see the same result if we ran a similar query checking for the uniqueness of events.

As far as I can tell, this duplication of nodes that we merge on only happens if you try to create the same node twice concurrently. Once the node has been created, we can use MERGE with a non-unique index, and a duplicate node won't get created.

All the code from this post is available as a gist if you want to play around with it.
August 13, 2022
· 12,242 Views · 2 Likes
Leaflet: Fit Polyline in View
We take a look at using Leaflet.js to help visualize a route on a map with the ability to ensure that the map is zoomed to show all of the points on a given route.
January 3, 2018
· 9,381 Views · 4 Likes
Kubernetes 1.8: Using Cronjobs to Take Neo4j Backups
It's easier than ever to run Neo4j backup jobs against Kubernetes clusters. Check out how to use a Cronjob to execute a backup now that Kubernetes 1.8 has been released.
December 31, 2017
· 7,554 Views · 6 Likes
Solving a Polyglot Error in Python
I ran into some issues when I tried to analyze data about Russian Twitter trolls. Here's how I solved them.
November 30, 2017
· 9,838 Views · 1 Like
How to Solve a Python 3 TypeError
Learn how to solve an error in Python that says "TypeError: unsupported format string passed to numpy.ndarray.__format__."
November 26, 2017
· 17,193 Views · 2 Likes
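The fix for that error amounts to pulling a plain Python scalar out of the ndarray before applying a format spec. This is a minimal sketch with made-up values, not the code from the article:

```python
import numpy as np

values = np.array([3.14159])

# Applying a format spec to the array itself raises:
# "TypeError: unsupported format string passed to numpy.ndarray.__format__"
# "{:.2f}".format(values)

# Extracting the scalar first (via .item() or indexing) formats cleanly:
formatted = "{:.2f}".format(values.item())
print(formatted)  # 3.14
```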
Kubernetes: Copying a Dataset to a StatefulSet’s PersistentVolume
Explore the wonderful world of persistent storage as we learn how to copy datasets to a Kubernetes PersistentVolume using Neo4j as our sample DB.
November 21, 2017
· 16,872 Views · 2 Likes
How to Use Kubernetes to Quickly Deploy Neo4j Clusters
When we first created the Kubernetes templates, I wrote a blog post about it. In the comments, someone suggested that we should create a Helm package for Neo4j. 11 months later… we have it!
November 20, 2017
· 7,809 Views · 1 Like
Neo4j and Cypher: Deleting Duplicate Nodes
I accidentally ended up with a bunch of duplicate nodes on a graph that I'd failed to put unique constraints on. Oops! Here's how I fixed it.
October 11, 2017
· 6,236 Views · 1 Like
Python 3: Create Sparklines Using matplotlib
I had the code to create sparklines inside a Pandas DataFrame, but I had to tweak it a bit to get it to play nicely with Python 3.6. Here's what I did.
October 6, 2017
· 8,302 Views · 2 Likes
Serverless: AWS HTTP Gateway — 502 Bad Gateway
If you're running into 502 errors when making HTTP calls with Lambda, make sure you read the manual and are returning maps.
August 15, 2017
· 15,945 Views · 3 Likes
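The "return maps" advice refers to the response shape API Gateway's Lambda proxy integration expects: a map with a statusCode and a string body. A minimal sketch of a conforming handler (the handler name and payload are illustrative):

```python
import json

def handler(event, context):
    # The proxy integration requires "body" to be a JSON *string*;
    # returning a bare dict or a non-string body is a common cause
    # of 502 Bad Gateway responses.
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": "ok"}),
    }

response = handler({}, None)
print(response["statusCode"])  # 200
```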
Serverless and Python: ''Unable to Import Module 'Handler'''
If you're a Python fan who enjoys using the Serverless library and virtualenv, you might be running into a dependency error. Here's one solution to the problem.
Updated August 9, 2017
· 31,221 Views · 1 Like
Pandas: Find Rows Where Column/Field Is Null
I did some experimenting with a dataset I've been playing around with to find any columns/fields that have null values in them. Learn how I did it!
July 10, 2017
· 367,412 Views · 3 Likes
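The idea in that post can be sketched in a few lines of pandas. The DataFrame here is a made-up example, not the dataset from the article:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"name": ["a", "b", None], "score": [1.0, np.nan, 3.0]})

# Columns that contain at least one null value
null_columns = df.columns[df.isnull().any()]
print(list(null_columns))  # ['name', 'score']

# Rows where a particular column is null
print(df[df["score"].isnull()])
```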
Pandas/scikit-learn: get_dummies Test/Train Sets
In my time using get_dummies in pandas to generate dummy columns for categorical variables to use with scikit-learn, I realized it didn't always work. Here's why.
July 8, 2017
· 13,837 Views · 3 Likes
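The usual workaround for that mismatch is to reindex the test set's dummy columns against the training set's, filling missing categories with zeros. A small sketch with toy data (the column and category names are made up):

```python
import pandas as pd

train = pd.DataFrame({"colour": ["red", "green", "blue"]})
test = pd.DataFrame({"colour": ["green", "green", "red"]})

train_dummies = pd.get_dummies(train)
test_dummies = pd.get_dummies(test)

# The test set never saw "blue", so its dummy columns differ from the
# training set's. Reindexing against the training columns keeps both
# frames aligned for scikit-learn.
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_dummies.columns))  # ['colour_blue', 'colour_green', 'colour_red']
```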
Shell: Create a Comma Separated String
In this quick tutorial, we go over how to use Shell scripting in a pinch to create working strings of code that will help us make basic calculations.
July 5, 2017
· 10,924 Views · 1 Like
Error in PostgreSQL: Argument of WHERE Must Not Return a Set
A query that I initially wrote didn't work since jsonb_array_elements returns a set of boolean values. Instead, we can use a LATERAL subquery to achieve our goal.
May 6, 2017
· 7,411 Views
AWS Lambda: Programmatically Scheduling a CloudWatch Event
AWS Lambda is a solid serverless option, but setting up automatically scheduled events might not be intuitive. Let's see how CloudWatch can solve the problem.
May 1, 2017
· 20,499 Views · 4 Likes
Python: Flask – Generating a Static HTML Page
Read on for a quick and easy tutorial on how to use Python's Flask library to generate an HTML file and create a static page from there.
April 30, 2017
· 22,913 Views · 5 Likes
AWS Lambda: Programmatically Create a Python ‘Hello World’ Function
In this post we take a look at how to quickly create a Python function using AWS Lambda, including their configurations and uploading them to the service.
April 4, 2017
· 18,034 Views · 4 Likes
Neo4j: How Do Null Values Even Work?
Importing CSVs is a great timesaver, but how do you get around the null values lurking within? Fortunately, we can work our way toward a query that handles them in Neo4j.
February 24, 2017
· 6,904 Views · 5 Likes
Fixing the ''Unable to Query Docker Version'' Error
If you're unable to find your Docker machines and get an error with some confused certificates, try re-running the $ docker-machine env command.
December 29, 2016
· 11,459 Views · 3 Likes
Create Dynamic Relationships With APOC
See how the APOC library can be used to automatically create relationships when loading data.
November 1, 2016
· 9,055 Views · 2 Likes
Neo4j: Dynamically Add Property/Set Dynamic Property
In this post, we find out how to add dynamic properties to nodes in Neo4j via Cypher. Read on for more details!
October 30, 2016
· 10,722 Views · 2 Likes
Effective Bulk Data Import into Neo4j (Part 3)
Here we are, the finale! Today, we take a look at LOAD JSON, a piece of the import puzzle that converts JSON to a CSV.
August 7, 2016
· 5,319 Views · 1 Like
Hadoop: DataNode Not Starting
Running Hadoop and having problems with your DataNode? Read on to find out one possible solution.
July 26, 2016
· 22,183 Views · 1 Like
Unix: Find Files Greater Than Date
This quick tutorial shows you how the simple and versatile find command can help you sort through your data on Neo4j.
July 1, 2016
· 6,537 Views · 3 Likes
Python: Regex – Matching Foreign Characters/Unicode Letters
In the world of regular expressions, matching characters outside of the usual Latin character set can be a challenge. Read on to find out how author Mark Needham tackled this issue in Python.
June 22, 2016
· 11,235 Views · 1 Like
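In Python 3 this largely works out of the box, since str patterns are Unicode-aware by default and \w matches letters beyond the Latin set. A quick sketch (the names are arbitrary examples):

```python
import re

# \w+ matches words containing accented Latin and Greek letters
# without any special flags in Python 3.
words = re.findall(r"\w+", "Gabriel Batistuta, Zinédine Zidane, Λιονέλ Μέσι")
print(words)
```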
Python: Parsing a JSON HTTP Chunking Stream
How I parse a JSON HTTP chunking stream in Python using meetup.com's API to filter RSVPs for events I'm interested in.
December 4, 2015
· 22,952 Views · 4 Likes
jq: Cannot Iterate Over Number / String and Number Cannot Be Added
How to use jq (which is like sed for JSON data) to extract some information from a JSON file, and what to do when you encounter the error in the title.
December 2, 2015
· 22,629 Views · 2 Likes
Convert RDD to DataFrame with Spark
Learn how to convert an RDD to a DataFrame with the Databricks Spark CSV library.
August 7, 2015
· 66,954 Views · 1 Like
The Secret to More Efficient Data Science with Neo4j and R [OSCON Preview]
Written by Nicole White

It’s a sad but true fact: most data scientists spend 50-80% of their time cleaning and munging data and only a fraction of their time actually building predictive models. This is most often true in a traditional stack, where most of this data munging consists of writing lines upon lines of some flavor of SQL, leaving little time for model-building code in statistical programming languages such as R. These long, cryptic SQL queries not only slow development time but also prevent useful collaboration on analytics projects, as contributors struggle to understand each other’s SQL code.

For example, in graduate school, I was on a project team where we used Oracle to store Twitter data. The kinds of queries my classmates and I were writing were unmaintainable and impossible to understand unless the author was sitting next to you. No one worked on the same queries together because they were so unwieldy. This not only hindered our collaboration efforts but also slowed our progress on the project. If we had been using an appropriate data store (like a graph database), we would have spent significantly less time pulling our hair out over the queries.

Why Today’s Data Is Different

This data-munging problem has persisted in the data science field because data is becoming increasingly social and highly connected. Forcing this kind of interconnected data into an inherently tabular SQL database, where relationships are only abstract, leads to complicated schemas and overly complex queries. Yet several NoSQL solutions – specifically in the graph database space – exist to store today’s highly connected data; that is, data where relationships matter. A lot of data analysis today is performed in the context of better understanding people’s behavior or needs, such as:

  • How likely is this visitor to click on advertisement X?
  • Which products should I recommend to this user?
  • How are User A and User B connected?

People, as we know, are inherently social, so most of these questions can be answered by understanding the connections between people: User A is similar to User B, and we already know that User B likes this product, so let’s recommend this product to User A.

The Good News: Data-Munging No More

Data science doesn’t have to be 80% data munging. With the appropriate technology stack, a data scientist’s development process is seamless and short. It’s time to spend less time writing queries and more time building models by combining the flexibility of an open-source NoSQL graph database with the maturity and breadth of R – an open-source statistical programming language. The combination of Neo4j’s ability to store highly connected, possibly unstructured data and R’s functional, ad-hoc nature creates the ideal data analysis environment. You don’t have to spend an hour writing CREATE TABLE statements. You don’t have to spend all day on StackOverflow figuring out how to traverse a tree in SQL. Just Cypher and go.

Learn More at OSCON 2015

At my upcoming OSCON session, we will walk through a project in which we analyze #OSCON Twitter data in a reproducible, low-effort workflow without writing a single line of SQL. For this highly connected dataset, we will use Neo4j, an open-source graph database, to store and query the data while highlighting the advantages of storing such data in a graph versus a relational schema. Finally, we will cover how to connect to Neo4j from an R environment for the purposes of performing common data science tasks, such as analysis, prediction, and visualization.
June 30, 2015
· 1,345 Views

Comments

Shell: Create a Comma Separated String

Jul 10, 2017 · Jordan Baker

Even better!

NetBeans Platform: How to Hide the Source Packages Folder

Dec 29, 2013 · Mr B Loid

Hi Peter,

Thanks for your comments. I'll try to address them one at a time:

> 1.) You miss the link between players and "players in matches".
> 2.) goal property is missing.

Good catch, hadn't noticed that.

> For instance you could give us a query comparision of a more complicated use case, let's say
> "the average amount of goals scored by french players in european champions league per
> year". My hypothesis is that a SQL guru can write this down in less than 5 minutes in a single
> sql statement whereas you cannot do this in Neo.

I don't think it'd be too difficult to write a query like that in Neo but I'll give it a try:

// get the French players who scored goals in the champions league
MATCH (g:Game)-[:in_competition]->(c:Competition)
WHERE c.name = "Champions League"
MATCH (p:Player)-[:played]->(stats)-[:in]->(g)
MATCH (p)-[:comes_from]->(country)
WHERE country.name = "France"
RETURN p.name, SUM(stats.goals)

// Find all the goals scored by French players per season
MATCH (s:Season)-[:contains_match]->(g:Game)-[:in_competition]->(c:Competition)
WHERE c.name = "Champions League"
MATCH (p:Player)-[:played]->(stats)-[:in]->(g)
MATCH (p)-[:comes_from]->(country)
WHERE country.name = "France"
RETURN country.name, s.name, SUM(stats.goals)

Is that what you meant?

> In general what is disturbing me in NoSql discussions: The way to solve a problem the "IT-way"
> is most often "select the cool hyped tool -> apply awkward transformations to the problem to be
> able to solve it with the preselected tool" BUT IT should be "analyze all facets of the problem,
> i.e. layout and needed query-paths -

Fair enough. In this case I was just hacking on this for fun and someone asked me what it would look like if it used tables instead so that's how I ended up with the comparison. Would you model it differently than I did?

Regarding Sparql - the query language I showed here (cypher) is partly based on that and partly based on SQL so there is at least some inspiration.

Didn't know about dbpedia but that sounds like a good resource to link to from my football graph - perhaps I can pull in more information from there?

I've read a bit about the semantic web but I don't know that much so thanks for the links to the course. Hopefully I can do that when it next runs.

Cheers
Mark

