Modeling Events in Neo4J
Analyze customer data and decision making quickly and efficiently
Join the DZone community and get the full member experience.
Join For FreeAbout two years ago Instacart open-sourced “The Instacart Online Grocery Shopping Dataset 2017”. You can access it here. Read their blog post on the subject for more information. So what does this data look like:
Since this is anonymized data, we don’t have the actual dates on when these orders happened, but we do have a sequence number for each order allowing us to know what a User bought first, and next, and so on. We will augment our model with this information by adding previous relationships between orders.
Let’s start off by creating some Schema by running these one at a time:
CREATE CONSTRAINT ON (n:Aisle) ASSERT n.id IS UNIQUE;
CREATE CONSTRAINT ON (n:Aisle) ASSERT n.name IS UNIQUE;
CREATE CONSTRAINT ON (n:Department) ASSERT n.id IS UNIQUE;
CREATE CONSTRAINT ON (n:Department) ASSERT n.name IS UNIQUE;
CREATE CONSTRAINT ON (n:Product) ASSERT n.id IS UNIQUE;
CREATE CONSTRAINT ON (n:Product) ASSERT n.name IS UNIQUE;
CREATE CONSTRAINT ON (n:User) ASSERT n.id IS UNIQUE;
CREATE CONSTRAINT ON (n:Order) ASSERT n.id IS UNIQUE;
CREATE INDEX ON :Order(user_id, number);
We need to do a bit of data cleanup:
Replace: \"" with \" in the products.csv file, and copy everything into the "import" directory within Neo4j.
And now we can import some of this data into Neo4j. Run the following one at a time, mind you some will take a little while to finish:
LOAD CSV WITH HEADERS FROM 'file:///aisles.csv' AS csvLine
CREATE (a:Aisle {id: toInteger(csvLine.aisle_id), name:csvLine.aisle});
LOAD CSV WITH HEADERS FROM 'file:///departments.csv' AS csvLine
CREATE (d:Department {id: toInteger(csvLine.department_id), name:csvLine.department});
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///products.csv' AS csvLine
CREATE (p:Product {id: toInteger(csvLine.product_id), name:csvLine.product_name});
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///products.csv' AS csvLine
MATCH (p:Product {id: toInteger(csvLine.product_id)}), (a:Aisle {id: toInteger(csvLine.aisle_id)})
CREATE (p)-[:IN_AISLE]->(a);
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///products.csv' AS csvLine
MATCH (d:Department {id: toInteger(csvLine.department_id)}), (a:Aisle {id: toInteger(csvLine.aisle_id)})
MERGE (a)-[:IN_DEPARTMENT]->(d);
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///orders.csv' AS csvLine
WITH DISTINCT csvLine.user_id AS user_id
MERGE (u:User {id: toInteger(user_id)});
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///orders.csv' AS csvLine
CREATE (o:Order {id: toInteger(csvLine.order_id), set: csvLine.eval_set, user_id: toInteger(csvLine.user_id), number: toInteger(csvLine.order_number),
dow: toInteger(csvLine.order_dow), hour: toInteger(csvLine.order_hour_of_day) });
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///orders.csv' AS csvLine
MATCH (u:User {id: toInteger(csvLine.user_id)}), (o:Order {id: toInteger(csvLine.order_id)})
CREATE (u)<-[:ORDERED_BY]-(o);
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///orders.csv' AS csvLine
MATCH (o:Order {id: toInteger(csvLine.order_id)}), (po:Order {user_id: toInteger(csvLine.user_id), number: toInteger(csvLine.order_number) - 1})
CREATE (o)-[:PREV]->(po);
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///order_products__prior.csv' AS csvLine
MATCH (o:Order {id: toInteger(csvLine.order_id)}), (p:Product {id: toInteger(csvLine.product_id)})
CREATE (o)-[:HAS {order: toInteger(csvLine.add_to_cart_order), reordered: toBoolean(csvLine.reordered)} ]->(p);
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///order_products__train.csv' AS csvLine
MATCH (o:Order {id: toInteger(csvLine.order_id)}), (p:Product {id: toInteger(csvLine.product_id)})
CREATE (o)-[:HAS {order: toInteger(csvLine.add_to_cart_order), reordered: toBoolean(csvLine.reordered)} ]->(p);
Once we import this data, we will be able to see chains of orders:
MATCH (n:Order) RETURN n LIMIT 25
Ok, let’s explore this dataset. How many Aisles do we have?
MATCH (n:Aisle)
RETURN COUNT(n)
134. That is one way to look at the Product Breadth. Let’s look at Product Depth by seeing how much variety we have in those aisles by finding the number of products of each Aisle.
MATCH (n:Aisle)
RETURN n.name, SIZE((n)--()) AS number_of_products
ORDER BY number_of_products DESC
Our answer below shows us we have a bunch that don’t really have one, tons of chocolate, ice cream, vitamin supplements, etc and at the bottom bulk dried fruits and vegetables.
╒═══════════════════════════════╤════════════════════╕
│"n.name" │"number_of_products"│
╞═══════════════════════════════╪════════════════════╡
│"missing" │1259 │
├───────────────────────────────┼────────────────────┤
│"candy chocolate" │1247 │
├───────────────────────────────┼────────────────────┤
│"ice cream ice" │1092 │
├───────────────────────────────┼────────────────────┤
│"vitamins supplements" │1039 │
├───────────────────────────────┼────────────────────┤
│"yogurt" │1027 │
├───────────────────────────────┼────────────────────┤
│"chips pretzels" │990 │
├───────────────────────────────┼────────────────────┤
│"tea" │895 │
├───────────────────────────────┼────────────────────┤
...
├───────────────────────────────┼────────────────────┤
│"packaged produce" │33 │
├───────────────────────────────┼────────────────────┤
│"bulk grains rice dried goods" │27 │
├───────────────────────────────┼────────────────────┤
│"bulk dried fruits vegetables" │13 │
└───────────────────────────────┴────────────────────┘
Ok, let’s now look at just one product. I’m going to pick Organic Bananas since I buy these all the time.
MATCH (productA:Product {name: "Bag of Organic Bananas"})
RETURN productA
We have an id property of 13176, from here on out we will use that to make the queries shorter.
╒════════════════════════════════════════════╕
│"productA" │
╞════════════════════════════════════════════╡
│{"name":"Bag of Organic Bananas","id":13176}│
└────────────────────────────────────────────┘
What can we learn about this product?
// What Aisle is it in?
MATCH (productA:Product {id: 13176})-[:IN_AISLE]->(aisle)
RETURN productA, aisle
// What other products are in that Aisle?
MATCH (productA:Product {id: 13176})-[:IN_AISLE]->(aisle)<-[:IN_AISLE]-(productB)
RETURN productA, aisle, productB
// How many other products are in that Aisle?
MATCH (productA:Product {id: 13176})-[:IN_AISLE]->(aisle:Aisle)
RETURN productA, aisle, SIZE((aisle)<-[:IN_AISLE]-()) AS number_of_products
// Frequently Ordered With a "Bag of Organic Bananas"
MATCH (productA:Product {id: 13176})<-[:HAS]-(order)-[:HAS]->(productB)
RETURN productB, COUNT(*)
ORDER BY COUNT(*) DESC
// Frequently Ordered With a "Bag of Organic Bananas" and what aisle?
MATCH (productA:Product {id: 13176})<-[:HAS]-(order)-[:HAS]->(productB)-[:IN_AISLE]->(aisle)
RETURN productB, aisle, COUNT(*)
ORDER BY COUNT(*) DESC
// Frequently Ordered With a "Bag of Organic Bananas" in the same aisle?
MATCH (aisle)<-[:IN_AISLE]-(productA:Product {id: 13176})<-[:HAS]-(order)-[:HAS]->(productB)-[:IN_AISLE]->(aisle)
RETURN productB, COUNT(*)
ORDER BY COUNT(*) DESC
// Counts of Frequently Ordered With Products to a "Bag of Organic Bananas"
MATCH (aisle)<-[:IN_AISLE]-(productA:Product {id: 13176})<-[:HAS]-(order)-[:HAS]->(productB)-[:IN_AISLE]->(aisle)
RETURN productA, SIZE((productA)-[:HAS]-()) AS productAOrders, productB, SIZE((productB)-[:HAS]-()) AS productBOrders, COUNT(*) AS both
ORDER BY both DESC
// Score of Frequently Ordered With Products to a "Bag of Organic Bananas"
MATCH (aisle)<-[:IN_AISLE]-(productA:Product {id: 13176})<-[:HAS]-(order)-[:HAS]->(productB)-[:IN_AISLE]->(aisle)
WITH productA, SIZE((productA)-[:HAS]-()) AS productAOrders, productB, SIZE((productB)-[:HAS]-()) AS productBOrders, COUNT(*) AS both
RETURN productA.name, productB.name, both/(1.0 * (productAOrders + productBOrders - both)) AS score
ORDER BY score DESC
This query takes a few seconds to run, but let’s go ahead and look at the results. People who buy a Bag of Organic Bananas tend to also purchase Organic Hass Avocados and Organic Strawberries. That kinda makes sense. What is the product in the same Aisle that people who buy a Bag of Organic Bananas purchases the least…
╒════════════════════════╤══════════════════════════════════════════════╤════════════════════════╕
│"productA.name" │"productB.name" │"score" │
╞════════════════════════╪══════════════════════════════════════════════╪════════════════════════╡
│"Bag of Organic Bananas"│"Organic Hass Avocado" │0.11752376389629904 │
├────────────────────────┼──────────────────────────────────────────────┼────────────────────────┤
│"Bag of Organic Bananas"│"Organic Strawberries" │0.10680334431046294 │
├────────────────────────┼──────────────────────────────────────────────┼────────────────────────┤
│"Bag of Organic Bananas"│"Organic Large Extra Fancy Fuji Apple" │0.05424507800084253 │
├────────────────────────┼──────────────────────────────────────────────┼────────────────────────┤
│"Bag of Organic Bananas"│"Apple Honeycrisp Organic" │0.05325903905255133 │
├────────────────────────┼──────────────────────────────────────────────┼────────────────────────┤
│"Bag of Organic Bananas"│"Organic Lemon" │0.05067739140018715 │
├────────────────────────┼──────────────────────────────────────────────┼────────────────────────┤
...
├────────────────────────┼──────────────────────────────────────────────┼────────────────────────┤
│"Bag of Organic Bananas"│"Bananas" │0.0000024973216225598046│
└────────────────────────┴──────────────────────────────────────────────┴────────────────────────┘
No surprise here. It was regular Bananas. Maybe if you needed lots of bananas and the store ran out of the organic ones, you would buy the regular ones, but typically you pick one or the other. So, people who tend to buy Organic Bananas tend to also buy other Organic Produce from the same Aisle. This makes sense. What else can we learn from this dataset? What about products that you tend to buy only one of from an aisle?
// Let's focus on one item in the pasta sauce
MATCH (productA:Product {name:"Tomato Basil Pasta Sauce"})
RETURN productA
// How many times have people purchased this Product?
MATCH (productA:Product {name:"Tomato Basil Pasta Sauce"})
RETURN productA, SIZE((productA)<-[:HAS]-()) AS ordered
The answer is 1,213. Keep this in mind.
// Are there Frequently Ordered With Products in the same Aisle as "Tomato Basil Pasta Sauce" ?
MATCH (aisle)<-[:IN_AISLE]-(productA:Product {id: 48341})<-[:HAS]-(order)-[:HAS]->(productB)-[:IN_AISLE]->(aisle)
WITH productA, SIZE((productA)-[:HAS]-()) AS productAOrders, productB, SIZE((productB)-[:HAS]-()) AS productBOrders, COUNT(*) AS both
RETURN productA.name, productB.name, both/(1.0 * (productAOrders + productBOrders - both)) AS score
ORDER BY score DESC
Let’s look at the results and comparing it with the Organic Bananas. Avocados and Strawberries had a score of 0.11 and 0.10. The best frequently ordered item was Tomato Basil Pasta Sauce and Garlic Marinara Pasta Sauce with scores of 0.01. Users tend to purchase one type of pasta sauce at a time.
╒══════════════════════════╤══════════════════════════════════════════════╤══════════════════════╕
│"productA.name" │"productB.name" │"score" │
╞══════════════════════════╪══════════════════════════════════════════════╪══════════════════════╡
│"Tomato Basil Pasta Sauce"│"Garlic Marinara Pasta Sauce" │0.012709710218607015 │
├──────────────────────────┼──────────────────────────────────────────────┼──────────────────────┤
│"Tomato Basil Pasta Sauce"│"Spicy Marinara Pasta Sauce" │0.006402048655569782 │
├──────────────────────────┼──────────────────────────────────────────────┼──────────────────────┤
│"Tomato Basil Pasta Sauce"│"Organico Bello Tomato Basil Pasta Sauce" │0.0036380172805820826 │
├──────────────────────────┼──────────────────────────────────────────────┼──────────────────────┤
│"Tomato Basil Pasta Sauce"│"Vodka Pasta Sauce" │0.003054989816700611 │
├──────────────────────────┼──────────────────────────────────────────────┼──────────────────────┤
...
Now, we get to take advantage of the previous order relationships we created earlier to trace the sequence of orders and see what our users bought after buying Tomato Basil Pasta Sauce.
// What Products did people who bought "Tomato Basil Pasta Sauce" purchase in their next order from the same aisle?
MATCH (aisle)<-[:IN_AISLE]-(productA:Product {id: 48341})<-[:HAS]-(order)<-[:PREV]-(nextOrder)-[:HAS]->(productB)-[:IN_AISLE]->(aisle)
RETURN productB, COUNT(*)
ORDER BY COUNT(*) DESC
Interesting. Garlic Marinara and Basil Pesto come out on top:
╒══════════════════════════════════════════════════════════════════════╤══════════╕
│"productB" │"COUNT(*)"│
╞══════════════════════════════════════════════════════════════════════╪══════════╡
│{"name":"Garlic Marinara Pasta Sauce","id":26695} │14 │
├──────────────────────────────────────────────────────────────────────┼──────────┤
│{"name":"Basil Pesto","id":12206} │14 │
├──────────────────────────────────────────────────────────────────────┼──────────┤
│{"name":"Marinara Pasta Sauce","id":38273} │11 │
├──────────────────────────────────────────────────────────────────────┼──────────┤
...
That was just the next order. What if they still had some Tomato Basil Pasta Sauce left and didn’t really need to buy more pasta sauce in general. Let's go to the next three orders by using the magical * in Cypher, which means “keep going” and limiting it to one to three:
// What about in the last 3 orders?
MATCH (aisle)<-[:IN_AISLE]-(productA:Product {id: 48341})<-[:HAS]-(order)<-[:PREV*1..3]-(nextOrder)-[:HAS]->(productB)-[:IN_AISLE]->(aisle)
RETURN productB, COUNT(*)
ORDER BY COUNT(*) DESC
// What about all of the previous orders?
MATCH (aisle)<-[:IN_AISLE]-(productA:Product {id: 48341})<-[:HAS]-(order)<-[:PREV*]-(nextOrder)-[:HAS]->(productB)-[:IN_AISLE]->(aisle)
RETURN productB, COUNT(*)
ORDER BY COUNT(*) DESC
Looks like it’s neck-and-neck between Marinara Pasta sauce and the Organic Tomato Basil Pasta sauce.
╒══════════════════════════════════════════════════════════════════════╤══════════╕
│"productB" │"COUNT(*)"│
╞══════════════════════════════════════════════════════════════════════╪══════════╡
│{"name":"Marinara Pasta Sauce","id":38273} │206 │
├──────────────────────────────────────────────────────────────────────┼──────────┤
│{"name":"Organic Tomato Basil Pasta Sauce","id":39619} │205 │
├──────────────────────────────────────────────────────────────────────┼──────────┤
│{"name":"Basil Pesto","id":12206} │205 │
├──────────────────────────────────────────────────────────────────────┼──────────┤
│{"name":"Garlic Marinara Pasta Sauce","id":26695} │142 │
├──────────────────────────────────────────────────────────────────────┼──────────┤
│{"name":"Organico Bello Tomato Basil Pasta Sauce","id":8088} │135 │
├──────────────────────────────────────────────────────────────────────┼──────────┤
That’s a bit weird isn’t it? Why doesn’t the regular Tomato Basil Pasta sauce show up in our result set? Ah… well you see in Cypher, we can only traverse a relationship once in a MATCH clause. This part:
MATCH (aisle)<-[:IN_AISLE]-(productA:Product {id: 48341})... (productB)-[:IN_AISLE]->(aisle)
is causing us to filter out the original product. So what can we do?
We can move it to the WHERE clause.
MATCH (aisle)<-[:IN_AISLE]-(productA:Product {id: 48341})<-[:HAS]-(order)<-[:PREV*]-(nextOrder)-[:HAS]->(productB)
WHERE (productB)-[:IN_AISLE]->(aisle)
RETURN productB, COUNT(*)
ORDER BY COUNT(*) DESC
Now, things make more sense, as people tend to repurchase the same item 15 times more than purchasing something else.
╒══════════════════════════════════════════════════════════════════════╤══════════╕
│"productB" │"COUNT(*)"│
╞══════════════════════════════════════════════════════════════════════╪══════════╡
│{"name":"Tomato Basil Pasta Sauce","id":48341} │3159 │
├──────────────────────────────────────────────────────────────────────┼──────────┤
│{"name":"Marinara Pasta Sauce","id":38273} │206 │
├──────────────────────────────────────────────────────────────────────┼──────────┤
│{"name":"Organic Tomato Basil Pasta Sauce","id":39619} │205 │
├──────────────────────────────────────────────────────────────────────┼──────────┤
│{"name":"Basil Pesto","id":12206} │205 │
├──────────────────────────────────────────────────────────────────────┼──────────┤
│{"name":"Garlic Marinara Pasta Sauce","id":26695} │142 │
├──────────────────────────────────────────────────────────────────────┼──────────┤
...
Wait a minute. How did we get 3,159? That’s more than the actual purchases (1,213) of the Tomato Basil Pasta Sauce. This happened because we are counting Paths of variable length. If a customer bought the same sauce multiple times, it would multiply the number of paths even though the orders only go up by one. Instead, let us count the number of distinct orders where this happened instead.
MATCH (aisle)<-[:IN_AISLE]-(productA:Product {id: 48341})<-[:HAS]-(order)<-[:PREV*]-(nextOrder)-[:HAS]->(productB)
WHERE (productB)-[:IN_AISLE]->(aisle)
WITH productA, productB, COUNT(DISTINCT order) AS orders
RETURN productA.name, productB.name, both
ORDER BY both DESC
╒══════════════════════════╤═════════════════════════════════════════════════════════════╤═══════╕
│"productA.name" │"productB.name" │"orers"│
╞══════════════════════════╪═════════════════════════════════════════════════════════════╪═══════╡
│"Tomato Basil Pasta Sauce"│"Tomato Basil Pasta Sauce" │611 │
├──────────────────────────┼─────────────────────────────────────────────────────────────┼───────┤
│"Tomato Basil Pasta Sauce"│"Basil Pesto" │95 │
├──────────────────────────┼─────────────────────────────────────────────────────────────┼───────┤
│"Tomato Basil Pasta Sauce"│"Organic Tomato Basil Pasta Sauce" │92 │
├──────────────────────────┼─────────────────────────────────────────────────────────────┼───────┤
│"Tomato Basil Pasta Sauce"│"Tomato Paste" │68 │
├──────────────────────────┼─────────────────────────────────────────────────────────────┼───────┤
...
Let’s try to get a score to make more sense of this:
// Score of Frequently Ordered NEXT Products to a "Tomato Basil Pasta Sauce"
MATCH (aisle)<-[:IN_AISLE]-(productA:Product {id: 48341})<-[:HAS]-(order)<-[:PREV*]-(nextOrder)-[:HAS]->(productB)
WHERE (productB)-[:IN_AISLE]->(aisle)
WITH productA, SIZE((productA)-[:HAS]-()) AS productAOrders, productB, SIZE((productB)-[:HAS]-()) AS productBOrders, COUNT(DISTINCT order) AS orders
RETURN productA.name, productB.name, orders/(1.0 * (productAOrders + productBOrders - orders)) AS score
ORDER BY score DESC
… and here we go:
╒══════════════════════════╤═════════════════════════════════════════════════════════════╤══════════════════════╕
│"productA.name" │"productB.name" │"score" │
╞══════════════════════════╪═════════════════════════════════════════════════════════════╪══════════════════════╡
│"Tomato Basil Pasta Sauce"│"Tomato Basil Pasta Sauce" │0.33663911845730027 │
├──────────────────────────┼─────────────────────────────────────────────────────────────┼──────────────────────┤
│"Tomato Basil Pasta Sauce"│"Garlic Marinara Pasta Sauce" │0.02945736434108527 │
├──────────────────────────┼─────────────────────────────────────────────────────────────┼──────────────────────┤
│"Tomato Basil Pasta Sauce"│"Organico Bello Tomato Basil Pasta Sauce" │0.018928901200369344 │
├──────────────────────────┼─────────────────────────────────────────────────────────────┼──────────────────────┤
│"Tomato Basil Pasta Sauce"│"Spicy Marinara Pasta Sauce" │0.015503875968992248 │
├──────────────────────────┼─────────────────────────────────────────────────────────────┼──────────────────────┤
...
As you can see, using the previous order relationships to build sequences of events make these kinds of queries easy to write. The Event model is a great pattern to know. You can learn more about it in a different use case.
Published at DZone with permission of Max De Marzi, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Trending
-
Harnessing the Power of Integration Testing
-
Guide To Selecting the Right GitOps Tool - Argo CD or Flux CD
-
Execution Type Models in Node.js
-
Security Challenges for Microservice Applications in Multi-Cloud Environments
Comments