
Simulating a Network Partition

Dive deeper into simulating a network partition in a Neo4j cluster running on Kubernetes, and see how putting the leader on the minority side forces an election.


A couple of weeks ago I wrote a post explaining how to create a Neo4j causal cluster using Kubernetes, and I wanted to work out how to simulate a network partition that would put the leader on the minority side and force an election.

We’ve done this on our internal tooling on AWS using the iptables command but, unfortunately, that isn’t available in my container, which only has the utilities provided by BusyBox.
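
For reference, the iptables approach looks roughly like this. This is only a sketch with a placeholder peer IP, run on the instance you want to cut off:

# Hypothetical example: 10.0.0.11 stands in for a cluster peer's IP
$ iptables -A INPUT  -s 10.0.0.11 -j DROP
$ iptables -A OUTPUT -d 10.0.0.11 -j DROP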

Luckily, one of the utilities BusyBox does provide is the route command, which will allow us to achieve the same thing.

To recap, I have three Neo4j pods up and running:

$ kubectl get pods
NAME      READY     STATUS    RESTARTS   AGE
neo4j-0   1/1       Running   0          6h
neo4j-1   1/1       Running   0          6h
neo4j-2   1/1       Running   0          6h

And we can check that the route command is available:

$ kubectl exec neo4j-0 -- ls -alh /sbin/route 
lrwxrwxrwx 1 root root 12 Oct 18 18:58 /sbin/route -> /bin/busybox

Let’s have a look at what role each server is currently playing:

$ kubectl exec neo4j-0 -- bin/cypher-shell "CALL dbms.cluster.role()"
role
"FOLLOWER"

Bye!
$ kubectl exec neo4j-1 -- bin/cypher-shell "CALL dbms.cluster.role()"
role
"FOLLOWER"

Bye!
$ kubectl exec neo4j-2 -- bin/cypher-shell "CALL dbms.cluster.role()"
role
"LEADER"

Bye!
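
Since we’ll be checking the roles a few more times, a small shell loop saves some typing. This is just a convenience sketch, assuming the pod names above:

$ for i in 0 1 2; do
    echo "neo4j-$i:"
    kubectl exec "neo4j-$i" -- bin/cypher-shell "CALL dbms.cluster.role()"
  done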

Slight aside: I’m able to call cypher-shell without a user and password because I’ve disabled authentication by putting the following in conf/neo4j.conf:

dbms.security.auth_enabled=false
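
If you’d rather leave authentication enabled, cypher-shell accepts credentials on the command line instead. A sketch, with neo4j/secret standing in for whatever credentials you’ve actually configured:

$ kubectl exec neo4j-0 -- bin/cypher-shell -u neo4j -p secret "CALL dbms.cluster.role()"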

Back to the network partitioning! We need to partition neo4j-2 away from the other two servers, which we can do by running the following commands:

$ kubectl exec neo4j-2 -- route add -host neo4j-0.neo4j.default.svc.cluster.local reject && \ 
  kubectl exec neo4j-2 -- route add -host neo4j-1.neo4j.default.svc.cluster.local reject && \ 
  kubectl exec neo4j-0 -- route add -host neo4j-2.neo4j.default.svc.cluster.local reject && \ 
  kubectl exec neo4j-1 -- route add -host neo4j-2.neo4j.default.svc.cluster.local reject
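
Before relying on the partition, it’s worth a sanity check that the routes took effect. With the reject routes in place, a ping across the partition should fail rather than get a reply (BusyBox also provides ping):

$ kubectl exec neo4j-2 -- ping -c 1 neo4j-0.neo4j.default.svc.cluster.local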

If we look at the logs of neo4j-2, we can see that it’s stepped down after being disconnected from the other two servers:

$ kubectl exec neo4j-2 -- cat logs/debug.log 
... 
2016-12-04 11:30:10.186+0000 INFO [o.n.c.c.c.RaftMachine] Moving to FOLLOWER state after not receiving heartbeat responses in this election timeout period. Heartbeats received: [] 
...
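
Rather than paging through the whole debug log, we can filter for the Raft state machine entries. A convenience sketch using BusyBox’s grep:

$ kubectl exec neo4j-2 -- grep RaftMachine logs/debug.log | tail -n 5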

Who’s taken over as leader?

$ kubectl exec neo4j-0 -- bin/cypher-shell "CALL dbms.cluster.role()"
role
"LEADER"

Bye!
$ kubectl exec neo4j-1 -- bin/cypher-shell "CALL dbms.cluster.role()"
role
"FOLLOWER"

Bye!
$ kubectl exec neo4j-2 -- bin/cypher-shell "CALL dbms.cluster.role()"
role
"FOLLOWER"

Bye!

Looks like neo4j-0! Let’s put some data into the database:

$ kubectl exec neo4j-0 -- bin/cypher-shell "CREATE (:Person {name: 'Mark'})" 
Added 1 nodes, Set 1 properties, Added 1 labels   

Bye!

Let’s check if that node made it to the other two servers. We’d expect it to be on neo4j-1 but not on neo4j-2:

$ kubectl exec neo4j-1 -- bin/cypher-shell "MATCH (p:Person) RETURN p"
p
(:Person {name: "Mark"})

Bye!
$ kubectl exec neo4j-2 -- bin/cypher-shell "MATCH (p:Person) RETURN p"

Bye!
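
To make the difference explicit, we can count the Person nodes on each member. While the partition is in place we’d expect 1, 1, and 0 respectively (again assuming the pod names above):

$ for i in 0 1 2; do
    echo "neo4j-$i:"
    kubectl exec "neo4j-$i" -- bin/cypher-shell "MATCH (p:Person) RETURN count(p)"
  done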

On neo4j-2, we’ll repeatedly see entries like the following in the log as its election timeout fires but it fails to get any responses to the vote requests it sends out:

$ kubectl exec neo4j-2 -- cat logs/debug.log 
... 
2016-12-04 11:32:56.735+0000 INFO [o.n.c.c.c.RaftMachine] Election timeout triggered
2016-12-04 11:32:56.736+0000 INFO [o.n.c.c.c.RaftMachine] Election started with vote request: Vote.Request from MemberId{ca9b954c} {term=11521, candidate=MemberId{ca9b954c}, lastAppended=68, lastLogTerm=11467} and members: [MemberId{484178c4}, MemberId{0acdb8dd}, MemberId{ca9b954c}]
...

We can see those vote requests by looking at raft-messages.log, which can be enabled by setting the following property in conf/neo4j.conf:

causal_clustering.raft_messages_log_enable=true

$ kubectl exec neo4j-2 -- cat logs/raft-messages.log
... 
11:33:42.101 -->MemberId{484178c4}: Request: Vote.Request from MemberId{ca9b954c} {term=11537, candidate=MemberId{ca9b954c}, lastAppended=68, lastLogTerm=11467} 
11:33:42.102 -->MemberId{0acdb8dd}: Request: Vote.Request from MemberId{ca9b954c} {term=11537, candidate=MemberId{ca9b954c}, lastAppended=68, lastLogTerm=11467}   
11:33:45.432 -->MemberId{484178c4}: Request: Vote.Request from MemberId{ca9b954c} {term=11538, candidate=MemberId{ca9b954c}, lastAppended=68, lastLogTerm=11467} 
11:33:45.433 -->MemberId{0acdb8dd}: Request: Vote.Request from MemberId{ca9b954c} {term=11538, candidate=MemberId{ca9b954c}, lastAppended=68, lastLogTerm=11467}   
11:33:48.362 -->MemberId{484178c4}: Request: Vote.Request from MemberId{ca9b954c} {term=11539, candidate=MemberId{ca9b954c}, lastAppended=68, lastLogTerm=11467} 
11:33:48.362 -->MemberId{0acdb8dd}: Request: Vote.Request from MemberId{ca9b954c} {term=11539, candidate=MemberId{ca9b954c}, lastAppended=68, lastLogTerm=11467} 
...
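
If you’d rather watch the election attempts as they happen, tailing the file works too:

$ kubectl exec neo4j-2 -- tail -f logs/raft-messages.log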

To ‘heal’ the network partition, we just need to delete the routes we added earlier:

$ kubectl exec neo4j-2 -- route delete neo4j-0.neo4j.default.svc.cluster.local reject && \ 
  kubectl exec neo4j-2 -- route delete neo4j-1.neo4j.default.svc.cluster.local reject && \ 
  kubectl exec neo4j-0 -- route delete neo4j-2.neo4j.default.svc.cluster.local reject && \ 
  kubectl exec neo4j-1 -- route delete neo4j-2.neo4j.default.svc.cluster.local reject
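
The ping check from earlier should now succeed again, and tailing neo4j-2’s debug log should show it re-establishing contact with the rest of the cluster:

$ kubectl exec neo4j-2 -- ping -c 1 neo4j-0.neo4j.default.svc.cluster.local
$ kubectl exec neo4j-2 -- tail -n 20 logs/debug.log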

Now let’s check that neo4j-2 has the node we created earlier:

$ kubectl exec neo4j-2 -- bin/cypher-shell "MATCH (p:Person) RETURN p"
p
(:Person {name: "Mark"})

Bye!
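
If you end up running this experiment repeatedly, the partition and heal steps wrap up neatly into a couple of shell functions. A minimal sketch, assuming the three-pod setup above:

#!/bin/sh
# partition.sh - partition neo4j-2 away from the rest of the cluster, or heal it
# Usage: ./partition.sh partition  |  ./partition.sh heal
PEERS="neo4j-0 neo4j-1"
TARGET="neo4j-2"
SUFFIX="neo4j.default.svc.cluster.local"

partition() {
  for p in $PEERS; do
    kubectl exec "$TARGET" -- route add -host "$p.$SUFFIX" reject
    kubectl exec "$p" -- route add -host "$TARGET.$SUFFIX" reject
  done
}

heal() {
  for p in $PEERS; do
    kubectl exec "$TARGET" -- route delete "$p.$SUFFIX" reject
    kubectl exec "$p" -- route delete "$TARGET.$SUFFIX" reject
  done
}

"$@"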

That’s all for now!



Published at DZone with permission of Mark Needham, DZone MVB.
