A couple of years ago at the OpenFlow Symposium in San Jose, I talked to Jeremy Stretch as the event was warming up. I did not know much about tools like Puppet and Opscode's Chef at the time. He told me I needed to look at Puppet and integrate it with Junos. And so began my march towards these types of tools and DevOps more generally.
Since that time, I have come to appreciate a couple of things. First, the majority of network folks don't know much about these tools. While they are in use at a lot of companies, they tend to be used on the server side more than the networking side. We have a relatively small echo chamber within the networking blogosphere and Twitterverse, but none of us should fool ourselves into thinking that mainstream understanding is high. Second, applying these tools to networking in the same way they are applied to servers badly misses the point.
So, given the first thing I learned, I feel like I need to do a little bit of education on how tools like Opscode's Enterprise Chef or Puppet Labs' equivalent work. Since we have announced what our integration with Opscode looks like, I will start there.
The basic premise is that, when teams set up servers, they go through a bunch of steps that include putting an OS on the box, adding applications, doing initial setup and configuration, and so on. When you are adding servers by the hundreds, these tasks eat up time that could otherwise be spent on other work. So Opscode has built a product that lets teams automate those tasks in Ruby.
Opscode’s flagship product Enterprise Chef is designed with this problem in mind. Nodes – be they physical or virtual servers – are meaningless until they are assigned a role in the infrastructure. What function they serve determines how they need to be set up: operating system, provisioning, connectivity, security, and application installation. Chef operates by installing client software on each node. That client software is then directed to server software that runs either within the data center or in the cloud for hosted solutions.
When a new server is provisioned, users register it on the Chef Server through a Chef Workstation-initiated client install using the hostname and IP address. Then, users assign a role to the server. A role is user-defined: it might be a webserver, a Hadoop node, or any other useful functional entity.
The steps that would normally be taken to configure this node are captured as Ruby code in what Chef calls recipes and cookbooks. These recipes are included in role definitions, so that whenever a new node of a certain type is initiated, the configuration steps are automatically inherited and executed. For instance, all compute servers for a specific enterprise application would require the same setup.
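To make the role-and-recipe relationship concrete, here is a toy model in Python. It is not Chef's API (real recipes are written in Chef's Ruby DSL), and every name in it (`Role`, `Node`, `RECIPES`, `converge`) is hypothetical. It only illustrates the idea that a node inherits an ordered list of setup steps from its role.

```python
# Toy model of roles, recipes, and nodes.
# NOT Chef's API: all names here are hypothetical and for illustration only.
from dataclasses import dataclass, field


@dataclass
class Role:
    name: str
    run_list: list  # ordered recipe names attached to this role


@dataclass
class Node:
    hostname: str
    ip: str
    role: str = ""
    applied: list = field(default_factory=list)


# Each "recipe" is a function that reports the step it would perform.
RECIPES = {
    "base_os": lambda node: f"{node.hostname}: harden OS",
    "nginx": lambda node: f"{node.hostname}: install and start nginx",
}

ROLES = {"webserver": Role("webserver", ["base_os", "nginx"])}


def converge(node: Node) -> list:
    """Run every recipe in the node's role, in run-list order."""
    role = ROLES[node.role]
    node.applied = [RECIPES[name](node) for name in role.run_list]
    return node.applied


web01 = Node("web01", "10.0.0.11", role="webserver")
for step in converge(web01):
    print(step)
```

Adding a second web server is then just a matter of creating another `Node` with the same role. The steps come along for free, which is the whole appeal.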
Recipes, cookbooks, and attributes are stored and managed on the Chef server. When a new node is brought online, its Chef client downloads the relevant recipes from the Chef server and executes them locally. If the setup for a particular role type changes, the recipes on the Chef server are modified, and every configured client of that role type picks up the update on its next run. Through Chef, Opscode has essentially automated server setup in the datacenter. But can the same principles be applied to the network that supports these servers?
So, when networking companies started to see that these tools were valuable, the first thing they did was put the client software on the device and then create the abstractions required to handle provisioning. Essentially, they were making routers and switches look like servers. Users would set up recipes (or their other-product equivalents) and then initially provision networking gear the same way that they had set up new servers.
This is actually a useful thing to do. But the reality is that networking gear is not deployed in the same volumes that compute and storage servers are. The issue with networking is tied more to edge policy than to basic config. And to make things more difficult, edge policy needs to change whenever applications change. This is why people have been bemoaning the network's contribution to the time it takes to make changes in the data center.
So the real goal shouldn't be to treat switches like servers but rather to tie the network (all of the network, not just the new device) to the servers that are driving its traffic.
This requires more than just putting the client software on the switch. But I didn't get that nuance for a really long time because I was thinking more about "integrating the tool" than I was about doing the right thing. Customers were asking for integration, but what they really want is subtly different.
What customers really want is for the network to be provisioned correctly when they add (or change) something on the application side. When there is a new server added or a new application turned on, the network should just come along. That means that existing network devices need to have configuration changes whenever something new is added. This goes well beyond initial router or switch set-up.
When we did our integration with Opscode, the first thing we did was enable one-touch provisioning. This isn't the same thing as the "zero-touch provisioning" you see elsewhere. The one touch we are talking about is the server touch. You set up a new web server, and the network gets provisioned along with it. We can automatically establish relationships, optimize paths for that traffic, and push config to the switch the server is connected to.
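As a rough illustration of the "server touch" idea, the sketch below derives a switch-port configuration from nothing more than a server's hostname and role. This is an assumption-laden toy, not our implementation and not any vendor's API: the attachment table, the policy values, and the function name `on_server_provisioned` are all hypothetical.

```python
# Hypothetical sketch: a server-side event drives network configuration.
# None of these names come from Chef or from any switch vendor's API.

# Which switch port each server is cabled to (assumed to be known already).
ATTACHMENTS = {"web01": ("switch-a", "eth1/7")}

# Network policy per server role (illustrative values only).
ROLE_POLICY = {"webserver": {"vlan": 110, "allow_tcp": [80, 443]}}


def on_server_provisioned(hostname: str, role: str) -> dict:
    """Build the port config that should accompany a newly provisioned server."""
    switch, port = ATTACHMENTS[hostname]
    policy = ROLE_POLICY[role]
    return {
        "switch": switch,
        "port": port,
        "description": f"{role}:{hostname}",
        "vlan": policy["vlan"],
        "allow_tcp": list(policy["allow_tcp"]),
    }


config = on_server_provisioned("web01", "webserver")
print(config)
```

The point of the sketch is the direction of the dependency: the network config is computed from the server's role, so nobody has to touch the switch separately.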
But it doesn't end there. If there is a problem with the web server, you need to troubleshoot. Troubleshooting information might reside all over the network. If you know the roles of individual servers and you know what ports they are attached to, you can run troubleshooting commands that show all the web-server-related information in the network.
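A minimal sketch of what role-aware troubleshooting could look like, assuming you already have an inventory mapping servers to roles and ports. The inventory, the counters, and the `show_role` helper are all made up for illustration and do not correspond to a real command.

```python
# Hypothetical inventory: each server's role and where it attaches.
SERVERS = [
    {"host": "web01", "role": "webserver", "switch": "switch-a", "port": "eth1/7"},
    {"host": "web02", "role": "webserver", "switch": "switch-b", "port": "eth1/3"},
    {"host": "db01", "role": "database", "switch": "switch-a", "port": "eth1/9"},
]

# Hypothetical per-port counters, as a monitoring system might collect them.
PORT_STATS = {
    ("switch-a", "eth1/7"): {"rx_errors": 0, "link": "up"},
    ("switch-b", "eth1/3"): {"rx_errors": 412, "link": "up"},
    ("switch-a", "eth1/9"): {"rx_errors": 0, "link": "up"},
}


def show_role(role: str) -> list:
    """Collect port status for every server with the given role."""
    rows = []
    for server in SERVERS:
        if server["role"] != role:
            continue
        key = (server["switch"], server["port"])
        stats = PORT_STATS[key]
        rows.append((server["host"], *key, stats["link"], stats["rx_errors"]))
    return rows


for row in show_role("webserver"):
    print(*row)
```

One query pulls together ports on two different switches, because the lookup key is the role rather than the device.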
The point here is that integrating tools like Chef well should result in something way more useful than just using the same tool to do initial provisioning.
I know I have been vague here, but we are doing a show and tell with this stuff on Friday.
And there is a solution brief that provides a bit more detail here.