
Online Payment Risk Management with Neo4j

I really like Corey Lanum's saying that the interesting problem is often finding the relationships that should not be there. That is a great use case for Neo4j, and today I want to highlight an example of why. When you purchase something online, the merchant hands your information off to a payment gateway, which processes the actual payment. Before accepting the transaction, the gateway runs it through a series of risk management tests to validate that it is a real transaction and to protect itself from fraud. One of the hardest things for SQL-based systems to do is cross-check the incoming payment information against existing data, looking for relationships that shouldn't be there.

For example, given a credit card number, a phone number, an email address, and an IP address, find:

1. How many unique phone numbers, emails and IP addresses are tied to the given credit card.
2. How many unique credit cards, emails, and IP addresses are tied to the given phone number.
3. How many unique credit cards, phone numbers and IP addresses are tied to the given email.
4. How many unique credit cards, phone numbers and emails are tied to the given IP address.

A high number of connections could mean a high potential for fraud. Given that the user is sitting in front of their computer waiting to see if the merchant accepted their credit card, these queries need to return as fast as possible and at high volume to handle peaks. So we're going to build an unmanaged extension to perform this query quickly over the REST API, a data generator to give us something to test against, and a performance test to see just how fast Neo4j can answer these types of queries.
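Before diving into the Neo4j extension, the idea can be sketched with plain Java collections: model each entity as a "type:value" string, store RELATED edges in an undirected adjacency map, and count neighbors by type. This toy model (class and method names are mine, not the article's, and it stands in for the real graph traversal) is only here to make the four queries above concrete:

```java
import java.util.*;

// Toy in-memory model of the cross-reference check (NOT the article's
// Neo4j code): entities are "type:value" strings, RELATED edges are
// stored in an undirected adjacency map.
public class CrossRefSketch {
    static Map<String, Set<String>> graph = new HashMap<>();

    // Record an undirected RELATED edge between two entities.
    static void relate(String a, String b) {
        graph.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        graph.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    // Count how many neighbors of each type are tied to the given entity.
    static Map<String, Integer> crossReference(String entity) {
        Map<String, Integer> counts = new HashMap<>();
        for (String neighbor : graph.getOrDefault(entity, Collections.emptySet())) {
            String type = neighbor.split(":")[0];
            counts.merge(type, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        relate("cc:1", "phone:1234567890");
        relate("cc:1", "email:a@example.com");
        relate("cc:1", "email:b@example.com");
        relate("cc:1", "ip:10.0.0.1");
        // The cc node is tied to 1 phone, 2 emails, and 1 ip.
        System.out.println(crossReference("cc:1"));
    }
}
```

In the real system these counts come from index lookups plus one-hop traversals, as shown below.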

We’ll start with a unit test, so let’s build some data:

Node cc1 = createNode(db, "1", "cc");
Node phone1 = createNode(db, "1234567890", "phone");
Node email1 = createNode(db, "email1@hotmail.com", "email");
Node ip1 = createNode(db, "", "ip");
Node cc2 = createNode(db, "2", "cc");

Our createNode method creates a node, sets a property of the given type to the value we passed in, and adds the newly created node to an index for its type.

private Node createNode(GraphDatabaseService db, String value, String type) {
    Index<Node> index = db.index().forNodes(type + "s");
    Node node = db.createNode();
    node.setProperty(type, value);
    index.add(node, type, value);
    return node;
}

We’ll also need to create some relationships to tie them together:

cc1.createRelationshipTo(phone1, RELATED);
cc1.createRelationshipTo(email1, RELATED);
cc1.createRelationshipTo(ip1, RELATED);

Since we'll be using this over the REST API, we'll prepare a request in JSON format, pass it to our crossReference method (which we'll write next), and check the actual response against our expected value:

@Test
public void crossReference1() throws IOException {
    String requestOne = "{\"cc\" : \"1\","
            + "\"phone\" : \"1234567890\", "
            + "\"email\" : \"email1@hotmail.com\", "
            + "\"ip\" : \"\"}";
    Response response = service.crossReference(requestOne, db);
    List<HashMap<String,Integer>> actual = objectMapper.readValue((String) response.getEntity(), List.class);
    // ... prepare expected value ...
    assertEquals(expected, actual);
}

We’ll expect a JSON POST request with a hash of the 4 attributes of our payment, and prepare a result list which will hold our answers:

public Response crossReference(String body, @Context GraphDatabaseService db) throws IOException {
    List<Map<String, AtomicInteger>> results = new ArrayList<Map<String, AtomicInteger>>();
    HashMap input = objectMapper.readValue( body, HashMap.class);

Then we’ll look up the credit card, phone number, email and ip in their respective index and add them to an array of nodes:

ArrayList<Node> nodes = new ArrayList<Node>();
IndexHits<Node> ccIndex = db.index().forNodes("ccs").get("cc", input.get("cc"));
IndexHits<Node> phoneIndex = db.index().forNodes("phones").get("phone", input.get("phone"));
IndexHits<Node> emailIndex = db.index().forNodes("emails").get("email", input.get("email"));
IndexHits<Node> ipIndex = db.index().forNodes("ips").get("ip", input.get("ip"));
nodes.add (ccIndex.getSingle());
nodes.add (phoneIndex.getSingle());
nodes.add (emailIndex.getSingle());
nodes.add (ipIndex.getSingle());

For each of the nodes, we'll start with a map of zeroed counters and traverse the RELATED relationships in both directions, incrementing the counter for the type of node we find on the other end:

for (Node node : nodes) {
    HashMap<String, AtomicInteger> crosses = new HashMap<String, AtomicInteger>();
    crosses.put("ccs", new AtomicInteger(0));
    crosses.put("phones", new AtomicInteger(0));
    crosses.put("emails", new AtomicInteger(0));
    crosses.put("ips", new AtomicInteger(0));
    if (node != null) {
        for (Relationship relationship : node.getRelationships(RELATED, Direction.BOTH)) {
            Node thing = relationship.getOtherNode(node);
            String type = thing.getPropertyKeys().iterator().next() + "s";
            crosses.get(type).incrementAndGet();
        }
    }
    results.add(crosses);
}

Finally we’ll return our results:

return Response.ok().entity(objectMapper.writeValueAsString(results)).build();

… and that's it. Seriously. Our results are very simple, since they are meant to be parsed and processed by another method that does the actual risk analysis. In the sample result below, the credit card used returned 4 ips, 7 emails, and 3 phone numbers, which increases the odds that it may be fraudulent.

[{"ips":4,"emails":7,"ccs":0,"phones":3}, -- cc returned 4 ips, 7 emails, and 3 phones.
{"ips":1,"emails":1,"ccs":1,"phones":0}, -- phone returned just 1 item for each cross reference check.
{"ips":2,"emails":0,"ccs":4,"phones":3}, -- email returned 2 ips, 4 credit cards, and 3 phones.
{"ips":0,"emails":1,"ccs":3,"phones":2}] -- ip returned 3 credit cards, 2 phones, and 1 email.
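The article leaves the actual risk analysis to a downstream method. As an illustrative sketch only (the class name, method name, and threshold are hypothetical, not from the original), such a consumer might flag a payment when any cross-reference count exceeds a limit:

```java
import java.util.*;

// Hypothetical downstream check: flag a payment when any attribute in
// the cross-reference results is linked to more entities than allowed.
public class RiskCheck {
    static boolean suspicious(List<Map<String, Integer>> results, int threshold) {
        for (Map<String, Integer> counts : results)
            for (int c : counts.values())
                if (c > threshold) return true;
        return false;
    }

    public static void main(String[] args) {
        // Mirror the first row of the sample result above.
        Map<String, Integer> cc = new HashMap<>();
        cc.put("ips", 4); cc.put("emails", 7); cc.put("ccs", 0); cc.put("phones", 3);
        System.out.println(suspicious(List.of(cc), 5)); // prints "true": 7 emails exceed 5
    }
}
```

A real scorer would weight the entity types differently; this only shows the shape of the hand-off.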

Now that we have our method and unit test passing, we need to generate some data. We'll start with the root of this data: processed transactions. We'll create 50k transactions, and every 100 transactions we'll generate some potentially fraudulent data by adding between 1 and 10 additional transactions that share some of the same fields. To make our life easier, we'll use a random number to represent the hashed credit card number and use the Faker gem to build realistic data for the other fields:

transactions = File.open("transactions.csv", "a")
50000.times do |t|
  values = [rand.to_s[2..8], Faker::PhoneNumber.short_phone_number, Faker::Internet.email, Faker::Internet.ip_v4_address]
  transactions.puts values.join(",")
  if t % 100 == 0
    rand(1..10).times do
      # Select 1, 2 or 3 fields to change
      change = [0,1,2,3].sample(rand(1..3))
      newvalues = [rand.to_s[2..8], Faker::PhoneNumber.short_phone_number, Faker::Internet.email, Faker::Internet.ip_v4_address]
      change.each do |c|
        values[c] = newvalues[c]
      end
      transactions.puts values.join(",")
    end
  end
end
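The fraud-seeding step above (copy a transaction, then overwrite a random subset of 1 to 3 of its four fields) can be mirrored in plain Java. This is an illustrative sketch with hypothetical names, not part of the original Ruby generator:

```java
import java.util.*;

// Sketch of the fraud-seeding step: clone a transaction's four fields
// and overwrite 1-3 randomly chosen fields with fresh values,
// mimicking the Ruby `[0,1,2,3].sample(rand(1..3))` logic.
public class Mutate {
    static String[] mutate(String[] values, String[] fresh, Random rnd) {
        String[] copy = values.clone();
        List<Integer> idx = new ArrayList<>(List.of(0, 1, 2, 3));
        Collections.shuffle(idx, rnd);          // random order of field indexes
        int howMany = 1 + rnd.nextInt(3);       // change 1, 2, or 3 fields
        for (int i = 0; i < howMany; i++)
            copy[idx.get(i)] = fresh[idx.get(i)];
        return copy;
    }

    public static void main(String[] args) {
        String[] base  = {"1234567", "555-0100", "a@x.com", "10.0.0.1"};
        String[] fresh = {"7654321", "555-0199", "b@x.com", "10.0.0.2"};
        System.out.println(Arrays.toString(mutate(base, fresh, new Random(42))));
    }
}
```

The unchanged fields are what create the suspicious shared relationships the cross-reference query later detects.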

With our transactions.csv file we’ll next extract the unique credit cards, phones, emails and ips into their own files:

ccs    = File.open("ccs.csv", "a")
phones = File.open("phones.csv", "a")
emails = File.open("emails.csv", "a")
ips    = File.open("ips.csv", "a")
CSV.foreach('transactions.csv', :headers => true) do |row|
  ccs.puts row[0]
  phones.puts row[1]
  emails.puts row[2]
  ips.puts row[3]
end
%x[awk ' !x[$0]++' ccs.csv > ccs_unique.csv]
%x[awk ' !x[$0]++' phones.csv > phones_unique.csv]
%x[awk ' !x[$0]++' emails.csv > emails_unique.csv]
%x[awk ' !x[$0]++' ips.csv > ips_unique.csv]
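The awk one-liner `!x[$0]++` keeps only the first occurrence of each line, preserving order. For readers unfamiliar with it, the same order-preserving dedup can be done with a LinkedHashSet; this is an illustrative equivalent, not part of the original pipeline:

```java
import java.util.*;

// Order-preserving dedup, equivalent to `awk '!x[$0]++'`:
// a LinkedHashSet drops duplicates while keeping first-seen order.
public class Dedup {
    static List<String> unique(List<String> lines) {
        return new ArrayList<>(new LinkedHashSet<>(lines));
    }

    public static void main(String[] args) {
        System.out.println(unique(List.of("a@x.com", "b@x.com", "a@x.com")));
        // prints [a@x.com, b@x.com]
    }
}
```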

…and we’ll do the same thing for the relationships:

ccs_to_phones    = File.open("ccs_to_phones.csv", "a")
ccs_to_emails    = File.open("ccs_to_emails.csv", "a")
ccs_to_ips       = File.open("ccs_to_ips.csv", "a")
phones_to_emails = File.open("phones_to_emails.csv", "a")
phones_to_ips    = File.open("phones_to_ips.csv", "a")
emails_to_ips    = File.open("emails_to_ips.csv", "a")
CSV.foreach('transactions.csv', :headers => true) do |row|
  ccs_to_phones.puts [row[0], row[1], "RELATED"].join("\t")
  ccs_to_emails.puts [row[0], row[2], "RELATED"].join("\t")
  ccs_to_ips.puts [row[0], row[3], "RELATED"].join("\t")
  phones_to_emails.puts [row[1], row[2], "RELATED"].join("\t")
  phones_to_ips.puts [row[1], row[3], "RELATED"].join("\t")
  emails_to_ips.puts [row[2], row[3], "RELATED"].join("\t")
end
%x[awk ' !x[$0]++' ccs_to_phones.csv > ccs_to_phones_unique.csv]
%x[awk ' !x[$0]++' ccs_to_emails.csv > ccs_to_emails_unique.csv]
%x[awk ' !x[$0]++' ccs_to_ips.csv > ccs_to_ips_unique.csv]
%x[awk ' !x[$0]++' phones_to_emails.csv > phones_to_emails_unique.csv]
%x[awk ' !x[$0]++' phones_to_ips.csv > phones_to_ips_unique.csv]
%x[awk ' !x[$0]++' emails_to_ips.csv > emails_to_ips_unique.csv]

With our data generated, we are now ready to import it into Neo4j using the Batch Importer. Much has changed since my last blog post about the batch importer. Michael Hunger has made our life easier by allowing us to specify a way to look up nodes by an indexed property instead of having to come up with their node ids directly. The emails_unique.csv now looks like this:
email:string:emails
...

Where the header tells us it's an "email" property of type "string", indexed in the "emails" index. We'll set up our batch.properties file to use all the unique CSV files we created and to configure our indexes as well.


Now we can run the batch importer to load our data:

java -server -Xmx4G -jar batch-import-jar-with-dependencies.jar neo4j/data/graph.db

After we configure our unmanaged extension and start the server, we can write our performance test using Gatling as we've done before. We'll use the transactions.csv file we created earlier as our test data and send a JSON string containing our values to the URL we set up earlier:

class TestCrossReference extends Simulation {
  val httpConf = httpConfig.baseURL("http://localhost:7474")
  val testfile = csv("transactions.csv").circular
  val scn = scenario("Cross Reference via Unmanaged Extension")
    .during(30) {
      feed(testfile)
        .exec(
          http("Post Cross Reference Request")
            .post("/example/service/crossreference") // path depends on how the extension is mounted
            .header("Content-Type", "application/json")
            .body("""{"cc": "${cc}", "phone": "${phone}", "email": "${email}", "ip": "${ip}" }"""))
        .pause(0 milliseconds, 1 milliseconds)
    }
  setUp(scn.users(16).protocolConfig(httpConf)) // user count is illustrative
}

…and drumroll please:


1,246 requests per second with a mean latency of 11 ms on my laptop. As long as your dataset fits in memory, Neo4j will maintain these numbers regardless of your overall database size, since performance is only affected by the number of relationships traversed in each query. I've already shown you how you can scale up; if you need more throughput, a cluster of Neo4j instances can deliver it by scaling out. The code for everything shown here is available on GitHub as always, so don't take my word for it; try it out yourself.



Published at DZone with permission of Max De Marzi, DZone MVB. See the original article here.
