Rob Schoening: Lending Club is the world’s largest credit marketplace and we’re shaking up the financial services industry. We’re also trying to shake up the financial technology space and that’s what we’re going to talk about here.
We started this as a hobby, and now it’s running the infrastructure of a publicly traded company. It’s a good story.
Microservices: Everyone is Doing it
We’ve talked to nobody who’s not doing microservices.
Lending Club’s microservices exploded from five microservices (in 2013) to 139 microservices in 2015.
We had five production microservices in 2013, and right now in 2015, the last count I had was 139 microservices (out of Neo4j, of course).
Earlier today, someone asked me about my microservices transition. I told him we had moved from five to 139, and he responded, “I’m glad to hear that, because a lot of people get stuck along the way.”
This is what everybody thinks of when moving to microservices:
The container analogy is nice and neat, but this can be what happens:
I don’t think it’s unusual to have these issues either. One of the most common questions is “Is it the load balancer?” That question will be our common theme today since we’re doing a lot of load balancer management.
So how do we get from five to 139 microservices, continuously deliver agile financial technology and still meet all audit and compliance requirements? That’s what we’re going to talk about.
It’s not a joke. You probably remember MacGyver from the ’80s. (Ashley doesn’t).
What is it? It’s all the stuff that Jenkins can’t do. Jenkins is great at a whole bunch of things, and it’s fantastic at orchestrating continuous integration and continuous delivery. Jenkins is a Swiss Army Knife, but there are certain things it just doesn’t do at all:
- It can’t do real-time API access.
- There’s no database to speak of.
- It can’t do anything interactive in managing the state of infrastructure.
So we put MacGyver together based on Neo4j to fill that deficiency. With that, I’m going to hand it off to Ashley for the fun stuff.
Ashley Sun: MacGyver is hooked into a lot of different services at Lending Club.
Here are a few:
- and plenty more
There are a lot of things it’s hooked into, and it’s constantly talking to all these different services. From the load balancer, we’ll have service groups; from AWS, we’ll have EC2 instances; from the firewall rules or our SAN storage, we’ll have any number of arrays and volumes.
We needed a good way to keep track of these things. While we could use an operational data store, real-time access is costly.
With operational data stores, if you wait to get the data until you actually need it, then it’s too late, and you find yourself looking one by one through each of these silos. You also limit yourself to very simple queries, and you’re limited to a very small amount of data so it’s not scalable.
Instead, we use Neo4j. Neo4j has all of the characteristics of a database that we need:
- Really fast queries
- Really great join and traversal capabilities
- Really easy to use
- Really flexible
- Really scalable
- Really great ad hoc queries through Cypher
So, how do we use Neo4j at Lending Club?
App Check-In and Service Discovery
We have three use cases: First, we use it for app check-in and service discovery.
As Rob mentioned, just a couple years ago, we had five microservices and now we have 139. (Actually I think it’s closer to 150 now.) When we have over 1,000 servers, it’s hard to keep track of what’s out there, and what’s going on, so we configured all of our servers to phone home to MacGyver every minute, and we saved all those app instances into Neo4j.
Above, you’ll see the app instances, and every minute they phone home to MacGyver with information about their app ID, any revisions, and what environment they’re running in. We save all that information into Neo4j.
This seems super simple but just by implementing this use case, we immediately started to reap a lot of benefits. We had a real-time database with deployed infrastructure, so every minute the info is kept up to date.
Also, when we add new services they automatically report in to MacGyver. They report in, they get saved to Neo4j, and we start monitoring them right away so it’s super easy to scale.
Just from this improvement we were able to ask queries like, “Show me all the instances of app X in environment Y.”
This query was hard to look up before. We had to actually go into vCenter and look up all our different VMs, and now we can just use a simple query showing all the app instances with a given app ID and given environment.
Immediately we were able to get a lot more visibility into what app instances we had out there.
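The check-in idea can be sketched in a few lines of plain Python. This is only an illustration of the logic, not MacGyver's code and not the Cypher query from the talk: the `registry` dict stands in for the app-instance nodes in Neo4j, and the function names, field names, hosts and app IDs are all hypothetical.

```python
import time

# Hypothetical in-memory stand-in for the app-instance nodes saved in Neo4j.
registry = {}

def check_in(host, app_id, revision, environment, now=None):
    """Record (or refresh) one app instance, keyed by host and app ID."""
    registry[(host, app_id)] = {
        "host": host,
        "app_id": app_id,
        "revision": revision,
        "environment": environment,
        "last_seen": time.time() if now is None else now,
    }

def instances_of(app_id, environment):
    """Answer: show me all the instances of app X in environment Y."""
    return sorted(
        inst["host"]
        for inst in registry.values()
        if inst["app_id"] == app_id and inst["environment"] == environment
    )

# Made-up check-ins; in the talk these arrive from each server every minute.
check_in("host-01", "loan-api", "r101", "prod")
check_in("host-02", "loan-api", "r101", "prod")
check_in("host-03", "loan-api", "r102", "qa")
check_in("host-04", "payments", "r7", "prod")

print(instances_of("loan-api", "prod"))  # ['host-01', 'host-02']
```

Because each check-in overwrites the previous record for the same host and app, the registry always reflects the latest report, which is the property the talk relies on.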
Then, we decided to take MacGyver and Neo4j a step further with deployment and release automation.
Deployment and Release Automation
In the past, our deployment and release process was highly manual and very tedious.
We would use Excel spreadsheets or even handwritten notes to cross things off of our list. It was really hard to answer questions like:
- What pool should I deploy to?
- Is the most recent revision “live” right now?
- Are live pool revisions in sync in different environments?
Answering all of these questions would require a manual look-up by the DevOps team. We didn’t have a lot of visibility into our services.
Our solution to that problem was to take our app instances that we saved into Neo4j, combine it with some info from our load balancer, and then manipulate that data into Neo4j so we could expose information about our live and dark pools. This ultimately enabled us to automate our deployments in a release.
At Lending Club, we use blue-green deployments. In a blue-green deployment, we have a live pool and a dark pool at any given time, so half the servers are live and half of them are dark within a service group.
The live pool takes all the traffic, while the dark pool is inactive. During a deployment, we deploy new code to the dark pool, and then QA and Release run testing on it.
Once they say okay, the dark pool is good to go, we change the state on the dark pool so that it becomes active with a high priority, and we also lower the priority of the pool that’s on the left.
After that switch, all new connections are sent to the newly live pool on the right and we wait for all of the old connections on the previously live pool to drain out on the left. Once those old connections reach zero, we cut that pool over.
Essentially, we’ve just done a pool flip. Now the pool on the right is live, and the pool on the left is dark.
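The pool flip described above can be sketched as a small state change in Python. This is a simplified model under stated assumptions: the `Pool` class, the priority values 10 and 1, and the function names are hypothetical and do not reflect the load balancer's actual API.

```python
from dataclasses import dataclass

@dataclass
class Pool:
    name: str
    state: str          # "live", "dark", or "draining"
    priority: int       # assumed: higher priority receives new connections
    connections: int = 0

def start_flip(live, dark):
    """Promote the tested dark pool and begin draining the old live pool."""
    dark.state, dark.priority = "live", 10
    live.state, live.priority = "draining", 1

def finish_flip(draining):
    """Cut the old pool over once its connections have drained to zero."""
    if draining.state == "draining" and draining.connections == 0:
        draining.state = "dark"
        return True
    return False

pool_a = Pool("A", "live", 10, connections=3)
pool_b = Pool("B", "dark", 1)

start_flip(live=pool_a, dark=pool_b)
first_try = finish_flip(pool_a)   # refused: old connections are still open
pool_a.connections = 0            # the old connections finish
second_try = finish_flip(pool_a)  # now the cut-over goes through

print(pool_a.state, pool_b.state, first_try, second_try)  # dark live False True
```

Separating the flip into a start step and a finish step mirrors the wait for old connections to reach zero before the previously live pool is cut over.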
Although we had this concept of pools, the load balancer could associate servers with each service group, but it couldn’t tell which servers were in what pool.
Also, the app instances that were being reported to MacGyver knew their revision and app ID, but they didn’t know their state. They didn’t know if they were live or dark.
To solve that problem, we were able to use Neo4j to map these servers into pools and automate our deployments.
This builds on what we just talked about, where app instances are saved into Neo4j. We took the info from the app instances – such as app ID, revisions, environment – and then polled our load balancer for information on servers – such as state (whether it’s active, inactive or draining) and how many connections that server has. We combined that with our app instance nodes, and we were able to create virtual server nodes.
By collecting virtual servers and aggregating them according to their app ID and their state, we were able to create pools, which is where it gets a little more interesting.
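As a rough illustration of that join and aggregation, here is a plain Python sketch. The record shapes, host names and state values are hypothetical, and grouping by environment in addition to app ID and state is an assumption made so the example stays unambiguous.

```python
from collections import defaultdict

# Hypothetical check-in records (what the app instances report).
app_instances = [
    {"host": "host-01", "app_id": "loan-api", "revision": "r101", "environment": "prod"},
    {"host": "host-02", "app_id": "loan-api", "revision": "r101", "environment": "prod"},
    {"host": "host-03", "app_id": "loan-api", "revision": "r102", "environment": "prod"},
    {"host": "host-04", "app_id": "loan-api", "revision": "r102", "environment": "prod"},
]

# Hypothetical load balancer poll results, keyed by host.
lb_servers = {
    "host-01": {"state": "active", "connections": 12},
    "host-02": {"state": "active", "connections": 9},
    "host-03": {"state": "inactive", "connections": 0},
    "host-04": {"state": "inactive", "connections": 0},
}

def build_virtual_servers(instances, lb):
    """Merge each app instance with the load balancer's view of its host."""
    return [{**inst, **lb[inst["host"]]} for inst in instances if inst["host"] in lb]

def build_pools(virtual_servers):
    """Group virtual servers by (app ID, environment, state)."""
    pools = defaultdict(list)
    for vs in virtual_servers:
        pools[(vs["app_id"], vs["environment"], vs["state"])].append(vs["host"])
    return dict(pools)

virtual_servers = build_virtual_servers(app_instances, lb_servers)
pools = build_pools(virtual_servers)
for key, hosts in pools.items():
    print(key, hosts)
```

Neither data source knows about pools on its own; the pools only appear once the two views are combined, which is the point made in the talk.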
These (above) are application pools in Neo4j, and you’ll see the purple dot is a pool. Each pool contains many servers and each of those servers has a one-to-one mapping to our app instances.
Because it’s pulling info from its virtual servers, each pool knows whether it’s live, dark or draining. Each service has a live pool and a dark pool; pool A and pool B. So by mapping the pools into a virtual service, for every service we have a green dot and there are two pools (below).
As before, within a pool there are many servers which are then mapped to app instances.
Because of this data model, we gained a lot of visibility into apps that we didn’t have before, and questions that had once been difficult to answer became easy with Neo4j.
For example, let’s answer the question, “What pool should I deploy to?”
Usually, we would have to bounce into the A10 GUI and look up which servers were active or inactive, but that’s all solved with a simple Cypher query (below).
Right away, it’s easy to tell what pool we want to deploy to.
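The original Cypher query is not reproduced here. As a stand-in, this Python sketch shows the same lookup over a list of hypothetical pool records; the field names and the rule of returning nothing unless exactly one dark pool exists are assumptions.

```python
# Hypothetical pool records; in the talk these are pool nodes in Neo4j.
pools = [
    {"service": "loan-api", "environment": "prod", "name": "A", "state": "live"},
    {"service": "loan-api", "environment": "prod", "name": "B", "state": "dark"},
    {"service": "payments", "environment": "prod", "name": "A", "state": "dark"},
    {"service": "payments", "environment": "prod", "name": "B", "state": "live"},
]

def deploy_target(service, environment):
    """Return the name of the dark pool, or None unless exactly one exists."""
    dark = [
        p["name"]
        for p in pools
        if p["service"] == service
        and p["environment"] == environment
        and p["state"] == "dark"
    ]
    return dark[0] if len(dark) == 1 else None

print(deploy_target("loan-api", "prod"))  # B
print(deploy_target("payments", "prod"))  # A
```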
Another example: “Are live pool revisions in sync in different environments?”
This is important because, say, our main load balancer fails over to our backup load balancer. In this case, we want to make sure there’s not old code running and that it’s the same revision.
This also becomes a simple Cypher query (below) which before would have required manual look-up.
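A minimal Python sketch of that sync check follows. It assumes each live pool reports a single revision and uses made-up environment names ("primary" and "backup"); the real query and data model may differ.

```python
from collections import defaultdict

# Hypothetical live pools, one record per service and environment.
live_pools = [
    {"service": "loan-api", "environment": "primary", "revision": "r101"},
    {"service": "loan-api", "environment": "backup", "revision": "r101"},
    {"service": "payments", "environment": "primary", "revision": "r8"},
    {"service": "payments", "environment": "backup", "revision": "r7"},
]

def out_of_sync(pools):
    """Return services whose live revision differs between environments."""
    revisions = defaultdict(dict)
    for p in pools:
        revisions[p["service"]][p["environment"]] = p["revision"]
    return {
        service: envs
        for service, envs in revisions.items()
        if len(set(envs.values())) > 1
    }

print(out_of_sync(live_pools))  # {'payments': {'primary': 'r8', 'backup': 'r7'}}
```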
Here’s an important one: “Do multiple revisions exist within a single pool?”
We don’t want old code and new code running at the same time, and so we just query: “Show me all the servers within a pool” and if there’s more than one revision within a pool then we know that’s a problem.
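The mixed-revision check amounts to collecting the distinct revisions per pool and flagging any pool with more than one. A Python sketch, with hypothetical pool names, hosts and revisions:

```python
from collections import defaultdict

# Hypothetical virtual servers, each tagged with its pool and revision.
virtual_servers = [
    {"pool": "loan-api/A", "host": "host-01", "revision": "r101"},
    {"pool": "loan-api/A", "host": "host-02", "revision": "r101"},
    {"pool": "loan-api/B", "host": "host-03", "revision": "r102"},
    {"pool": "loan-api/B", "host": "host-04", "revision": "r101"},
]

def mixed_revision_pools(servers):
    """Return pools that contain more than one distinct revision."""
    revisions = defaultdict(set)
    for s in servers:
        revisions[s["pool"]].add(s["revision"])
    return {pool: sorted(revs) for pool, revs in revisions.items() if len(revs) > 1}

print(mixed_revision_pools(virtual_servers))  # {'loan-api/B': ['r101', 'r102']}
```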
If you think about the Knight Capital meltdown in 2012, they lost $440 million because they had old and new code deployed at the same time.
Rob: That was the end of that company, I believe, or they got sold off for nothing.
Ashley: Yeah, this happened in like 45 minutes. So if you do the math, it’s $163,000 that they were losing per second, which is crazy. They should have been using Neo4j.
Rob: It’s an easy mistake to make, so if it were to happen, we get paged.
Ashley: A lot of these queries we periodically run throughout the day, and they’ll page us. They’ll tell us if something is wrong.
So, a high-level overview of what we just covered: The app instances report in to MacGyver, they get saved to Neo4j, we take some info from the load balancer and combine that with our app instance info and from that we are able to create virtual servers. From there, we group those servers into pools and then we connect those pools into virtual services.
The reason we were able to do all this for deployment and release automation was because of these pools that we were able to create with Neo4j.
Because Neo4j is so good at mapping relationships, it was really perfect for this use case. Not only can we monitor all the virtual services, but at the same time, we can also send commands to the load balancer and this is how we’re able to automate our deployments.
At any time you can say, “I want to deploy this revision of this app into this environment.” MacGyver can then respond, “This pool is dark. We’re going to deploy to this pool.”
So you can deploy in MacGyver, and then you can start a drain in MacGyver. Then, because MacGyver knows the number of connections to all of its servers, it can automatically just cut the pool over.
The last use case: infrastructure mapping.
Similar to the problem with our services, we didn’t have a lot of visibility into what pools were live, what pools were dark, what servers were in a virtual service or even what app instances were out there. This problem extended to our entire infrastructure.
Once we got started with Neo4j, we played around with virtual servers and services and realized, “Hey, we can map out our entire infrastructure with this.”
So here you’ll see what we just talked about:
Every virtual service contains two pools, which contain a number of virtual servers, each of which maps to an app instance. Exposing this data allowed us to ask, “Are any servers in the live pool degraded?”
We don’t want an unhealthy server in our live pool, so basically the state and the priority are what determine whether the server is live, active and healthy. If those conditions aren’t met, then the query returns that server and we get a ping about it.
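A sketch of that health check in Python. The exact conditions are not spelled out in the talk, so the rule used here (state must be "active" and priority must equal an assumed live priority of 10) is a guess made for illustration, as are the host names.

```python
LIVE_PRIORITY = 10  # hypothetical priority expected of servers in a live pool

# Hypothetical virtual servers that all belong to one live pool.
live_pool_servers = [
    {"host": "host-01", "state": "active", "priority": 10},
    {"host": "host-02", "state": "inactive", "priority": 10},
    {"host": "host-03", "state": "active", "priority": 1},
]

def degraded(servers):
    """Return hosts in the live pool that fail the state or priority check."""
    return [
        s["host"]
        for s in servers
        if s["state"] != "active" or s["priority"] != LIVE_PRIORITY
    ]

print(degraded(live_pool_servers))  # ['host-02', 'host-03']
```

An empty result means the live pool is healthy; anything else is what would trigger the alert.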
From there, we decided to add stuff from vCenter, so all the app instances map to compute instances which are hosted by compute hosts and so we added a bunch more nodes from vCenter into Neo4j (below).
We were able to extend this so now we can ask the question, “Do we have a single point of failure among our services?”
If all the servers within a pool are hosted on one host in vCenter at any given time and that host goes down, then we definitely have a problem. Because we are mapping this in Neo4j, we are able to expose this data in a way that we weren’t able to before.
This query is kind of blurry, but it shows the traversal pretty well from pool to virtual server to app instance to compute instance to compute host.
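Since the original query is not legible, here is a Python sketch of the same traversal, following each hop from pool to virtual server to app instance to compute instance to compute host. Every identifier in the lookup tables is hypothetical.

```python
# Hypothetical edges, one dict per hop in the traversal.
pool_to_servers = {"loan-api/A": ["vs-1", "vs-2"], "loan-api/B": ["vs-3", "vs-4"]}
server_to_app_instance = {"vs-1": "ai-1", "vs-2": "ai-2", "vs-3": "ai-3", "vs-4": "ai-4"}
app_instance_to_vm = {"ai-1": "vm-1", "ai-2": "vm-2", "ai-3": "vm-3", "ai-4": "vm-4"}
vm_to_host = {"vm-1": "esx-1", "vm-2": "esx-2", "vm-3": "esx-3", "vm-4": "esx-3"}

def hosts_for_pool(pool):
    """Walk pool -> virtual server -> app instance -> VM -> compute host."""
    return {
        vm_to_host[app_instance_to_vm[server_to_app_instance[vs]]]
        for vs in pool_to_servers[pool]
    }

def single_points_of_failure():
    """Return pools whose servers all sit on one compute host."""
    result = {}
    for pool in pool_to_servers:
        hosts = hosts_for_pool(pool)
        if len(hosts) == 1:
            result[pool] = next(iter(hosts))
    return result

print(single_points_of_failure())  # {'loan-api/B': 'esx-3'}
```

In a graph database the four lookups collapse into one path pattern, which is why the talk describes this as a natural fit for traversal.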
Rob: I love this because vCenter has a habit with vMotion of deciding to move things around. If you don’t have affinity set up in vCenter, it’s real easy to set yourself up for these single points of failure.
Before Neo4j, we were constantly scanning the infrastructure, and I’d be asking staff, “Is it okay? Can you look again?”
But we want the infrastructure to tell us. We want to be reacting to problems, not having to go constantly look.
Ashley: This problem is also really hard if you’re just looking in vCenter at a host. It’s hard to know if there is a single point of failure.
So really, we weren’t just taking data but making sense of it within Neo4j and exposing it in a way that was useful to us.
So to continue with our infrastructure diagram:
The nimbles at the bottom are our storage arrays and our storage volumes. If we wanted, we can traverse this graph all the way from a virtual service to, for example, a storage array or a storage volume.
Now we want to ask a question: “If this storage volume goes down, what services are going to be impacted?”
Again, it’s a really simple query, it runs really quickly and it tells us things we never had visibility into before.
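The impact question is a reverse traversal: start at the storage volume and walk upward to every service that depends on it. A Python sketch using breadth-first search, with a hypothetical and deliberately shortened dependency chain (service, pool, VM, volume) and made-up node names:

```python
from collections import deque

# Hypothetical "depends on" edges, pointing from a node to what it runs on.
depends_on = {
    "service:loan-api": ["pool:loan-api/A", "pool:loan-api/B"],
    "service:payments": ["pool:payments/A"],
    "pool:loan-api/A": ["vm:vm-1"],
    "pool:loan-api/B": ["vm:vm-2"],
    "pool:payments/A": ["vm:vm-3"],
    "vm:vm-1": ["volume:vol-1"],
    "vm:vm-2": ["volume:vol-2"],
    "vm:vm-3": ["volume:vol-2"],
}

def impacted_services(volume):
    """Return the services that transitively depend on the given volume."""
    # Invert the edges so we can walk from the volume up to the services.
    dependents = {}
    for node, targets in depends_on.items():
        for target in targets:
            dependents.setdefault(target, []).append(node)

    seen, queue = {volume}, deque([volume])
    while queue:
        node = queue.popleft()
        for parent in dependents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(n.split(":", 1)[1] for n in seen if n.startswith("service:"))

print(impacted_services("volume:vol-2"))  # ['loan-api', 'payments']
```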
Other Use Cases
I think the cool thing here, as Rob was saying, is that it started out where we were using Neo4j as a hobby. We thought, “Oh, it would be cool if we put our app instances into Neo4j.” From there, we were like, “Oh, app instances, we can wrap those pretty easily into virtual services.”
Then, because that information was there, we were able to automate our deployments and from there we kept building and building on our dataset that we already had. That is one of the things I think is cool about Neo4j: that you can develop incrementally.
When we first started, we didn’t know we were going to end up with this full deployment. And we’re still progressing.
We’re adding firewall rules; we could add databases; as we move into Amazon, we can get EC2 instances and security groups. It’s pretty cool. It’s easy to build on your dataset, and make it more complex.
Also, our information security group has recently taken an interest in MacGyver for service onboarding. We now have a service registration in MacGyver, so when we get new services, we can register with MacGyver.
We can determine if a service is allowed to talk to another one, and we have a graph of relationships of services that depend on one another and talk to each other. We also use that graph for rezoning. Is this server in the correct security zone?
Using Neo4j and Microservices for Greater Agility
Rob: We want to keep this agile culture going in the company, and we don’t want to have meetings and change review boards and all that kind of stuff that nobody really likes.
The information security use cases are a great one for the continuation of DevOps. And when we’re thinking about how we are going to move from five services to 139 to 400 – at the same time as the company is getting bigger – we can’t have meetings and change review boards to get there.
Our information security group is on board with this because they want to hook into it, and they want to react to chaining workflows around new services created by developers.
Then, they want to dig into it, asking, “Now maybe I need to start running app scans on the new service. Maybe I need to have a conversation with that scrum team to understand what it is. Maybe I’m going to ask them to provide attributes about it so that we know how to do data classification.”
The opportunities for continuing this DevOps mindset with Neo4j are limitless.
We’re really excited, particularly with moving into AWS and the API control plane that has such a rich variety of information, even though it’s kind of hard to query. As we’ve started doing that automation, it’s slow – something that will take 15 seconds to make 20 different queries out to AWS – but we want it to come back in a snap so we can have fast, responsive APIs that allow us to deliver the services that we want.
Ashley: In the end, everything is awesome when you use Neo4j.
We have a lot of individual microservices (or “Lego blocks”) and we can switch them out or move them around. As Rob was saying, sometimes things can get messy, but using Neo4j to manage it has made it a lot easier, and we’ll continue to use Neo4j.
Inspired by Ashley and Rob’s talk? Register for GraphConnect Europe on April 26, 2016 for more industry-leading presentations and workshops on the evolving world of graph database technology.
By Ashley Sun & Rob Schoening, DevOps Team at Lending Club | December 30, 2015
Editor’s Note: Last October at GraphConnect San Francisco, Ashley Sun and Rob Schoening – from the DevOps team at Lending Club – delivered this in-depth presentation on their MacGyver platform for managing microservices with Neo4j.
For more videos from GraphConnect SF and to register for GraphConnect Europe, check out graphconnect.com.