You’re a big company, and you want to do what Google do for DevOps in order to get to Continuous Delivery (CD)? Beware the price of admission.
In this article, I’m going to outline the importance of addressing your company’s source-control use before diving too far into CD. Specifically, I’m suggesting that you should decide whether your enterprise should do Trunk-Based Development (TBD) in one big trunk or not. I’m going to do that by first describing an entry-level DevOps that can facilitate an entry-level CD, and then Google’s gold-standard DevOps.
Maybe you have thousands of source-control repositories, with differing branching models. Understand that Google built a system that allows them to centrally analyze source and commits to make intelligent technical, funding, and resourcing decisions. I think of it as a layer cake, with each layer building on the one below it, like so:
Getting to CD nirvana is not easy, and you’re going to get fantastic value from reading the Continuous Delivery book ($38). But if you’ve not read that book yet, “gross misunderstandings and oversimplifications” of what the price of admission to truly being CD is can happen (to borrow from Ross Pettit).
Today’s entry-level DevOps
I’m going to outline a baseline for DevOps infrastructure and related habits in 2014. Many larger enterprise developer teams should:
- have a development-infrastructure foundation that’s very solid. The trunk in a single source-control tool, perhaps. Trunk as in TBD
- have rules that cover standards, policing of those standards through ‘continuous review’, which languages are approved for which targets, build technologies, good test automation, libraries, frameworks, techniques, methodologies, IDEs, and so forth
- have some reuse of code within the company, somehow
- also have a Continuous Integration (CI) daemon that can keep up with the commits made, maybe by batching commits
- do some form of Infrastructure as Code allowing a continually improving and repeatable development infrastructure
- be using a tracker for issues and back-log management (Agile preferably) instead of MS Project, Excel, etc
- have something like a wiki for documentation
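The “keep up with the commits, maybe by batching” idea in that list can be sketched simply. This is a minimal, hypothetical illustration (all names invented): when the builder is slower than the commit rate, the daemon tests the whole pending batch together, and a red result would then trigger a bisect to find the offending commit.

```python
# Sketch of a batching CI daemon's queue logic (names hypothetical).

def next_build_batch(pending_commits, builder_busy):
    """Return the commits to test in the next build run.

    If the builder is free and commits are queued, test them all together
    as one batch; an empty list means "do nothing this tick".
    """
    if builder_busy or not pending_commits:
        return []
    batch = list(pending_commits)
    pending_commits.clear()  # these commits are now "in flight"
    return batch

queue = ["c101", "c102", "c103"]
print(next_build_batch(queue, builder_busy=False))  # ['c101', 'c102', 'c103']
print(queue)                                        # []
```

The trade-off is visible even in this toy: batching keeps the daemon current, at the cost of coarser blame when a batch goes red.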
Even if we had nothing other than this, plus discipline/rigor amongst developers, we might be able to stay in control of all IT assets developed in-house. What I outline was starting to become common by 2005, and is normal now.
Despite the safety of the Continuous Integration daemon, trust comes into play here. You’re going to trust all participants to do everything right.
Pushing DevOps to Google’s level
Google have pressed the DevOps turbo button, going far beyond the entry level above. They are perhaps the high bar for the whole industry. Revisiting the ‘should’ statements in the section above, Google:
- uses Perforce in a big-ass trunk configuration for everything (except Android source, which makes it into open-source land after a delay)
- has extreme reuse of code, at the source level mostly
- made the ‘Mondrian’ continuous review system, to police commit activity (it is not built into Perforce)
- pushes QA automation to be a side function of regular development – no “over the fence” for them
Google are doing “Trust but verify”. They also have:
Scaled build services
Most companies would use Jenkins or similar, but Google’s Continuous Integration (CI) is self-built and predates Jenkins/Hudson. Google’s CI tests each commit (not batches of commits). It also tests changes while they are merely candidates for commit, donating results to the Mondrian code-review tool; this happens automatically as developers prepare a commit for review. It all runs on an elastic internal cloud.
The commits (or candidate commits) are first analyzed for impact on other modules. Imagine someone changing a common logging framework that every app uses – it would cause all modules to rebuild, which would cause all dependent test modules to rebuild and execute too. As a developer, you would hope that the directed graph drawn and fed into the compile & test sequencer is far smaller than that, with as much as possible not recompiled and not re-executed in a test cycle. If you want to read more, have a read about Buck’s directed graph specifically, and my article ‘Googlers subset their trunk’ generally.
As with their Perforce instance, this is scaled to allow N-thousand concurrent developers, and their read/commit activities.
QA Automation – via a Selenium “farm”
Applications being tested by CI may have a web UI. If so, a second elastic infrastructure is leveraged: their Selenium build farm. This allows parallel execution of functional tests. It is also available to developers at their desktops who may be building up to committing a change, or wanting to do a last validation before throwing the change into the Mondrian review system. This is a second elastic cloud for general Selenium-based functional testing, but available internally only. It is also elective – as a developer you can choose to lease browsers from it to run your being-developed tests. You would do that if you don’t want to run Firefox/Chrome on your Linux workstation, tying up graphical resources for the duration of the tests.
Out of phase quality tools
Google use FindBugs and other tools, including white-box penetration-testing technologies they’ve made in-house. IronWASP, my colleague Prasanna Kanagasabai reminds me, is in the same space. These are not triggered by a commit, nor are they part of the normal build pipeline. Instead they are run separately (but still frequently), and the reports/results are fed back to the team.
Big Data on commits
Having all the source in one trunk pays off for deep analysis activities. Google could compare the productivity of multiple teams, or some measure of the cost-effectiveness of certain technologies, especially if they can pull in runtime metrics and numbers from issue-tracking systems. They could even predict that a team part-way through building something is unlikely to finish on time (or at all), if the right metrics are feeding that decision-support system. The in-one-trunk aspect allows easy comparisons to other moments on a timeline. It’s really a Big Data thing, one that technical people very high in the organization could use to direct resources in many ways. I talked of a trading platform for a previous client (refer to my ‘like a used sofa’ article) and their hedged bet. The Big Data aspect of Google’s oversight of the trunk would (not saying does) allow them to take options and hedge bets on their ongoing development.
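To make the “Big Data on commits” idea concrete, here is a minimal sketch of the kind of rollup a single trunk makes trivial (the records, field names, and teams are all hypothetical): one pass over commit metadata yields per-team productivity and breakage figures.

```python
# Sketch: per-team rollup over commit metadata (all data hypothetical).
from collections import defaultdict

commits = [
    {"team": "search", "lines": 120, "broke_build": False},
    {"team": "search", "lines": 40,  "broke_build": True},
    {"team": "ads",    "lines": 300, "broke_build": False},
]

def rollup(commits):
    """Aggregate commit count, lines changed, and build breakages per team."""
    stats = defaultdict(lambda: {"commits": 0, "lines": 0, "breakages": 0})
    for c in commits:
        s = stats[c["team"]]
        s["commits"] += 1
        s["lines"] += c["lines"]
        s["breakages"] += int(c["broke_build"])
    return dict(stats)

print(rollup(commits)["search"])  # {'commits': 2, 'lines': 160, 'breakages': 1}
```

With thousands of repositories and differing branching models, assembling even this toy table means N different extractions and reconciliations; with one trunk, it’s one query.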
Extra VMs for every dev if they need them, on a forgiveness-rather-than-permission basis. Similarly, workstation upgrades that don’t require fifty signatures or weeks of delay before delivery. That’s non-live test environments (plural) too, if needed.
The Test Mercenaries programme was part of the “let’s all get better at testing” initiative inside Google (2007–2009), albeit at a smaller scale. Google constantly refines the roles and responsibilities of Software Engineers, Software Quality Engineers, Release Engineers, Test Engineers, and other groups in the DevOps landscape. They never rest on prior definitions and successes. There’s always something coming that’s an improvement in the DevOps space over something that preceded it. Given the “20% time” system, improvements can come from anywhere and are considered on merit. Build technologies, languages, frameworks, and libraries are all to be expected in this space inside Google. These are much less famous than applications like Gmail and AdSense that started in 20% time, for sure, but still hugely valuable to Google.
Institutionalizing this gold-standard DevOps
Google had to make funding decisions for these. Most are in operational-expenditure territory, and yearly budgets are reviewed. Some, like the Selenium farm and VM environment allocation, could be charged back to projects. “Could be” as in I’m unaware of the actual cost-justification and applicable finances, but I trust that extensive analysis was done before funding significant infrastructural investments.
There’s a people factor though. How do you get people to go along with the technical directions you want set at the top level? I think it involves picking appropriate staff (including self-selection) as well as cultivating the appropriate groupthink. So how, precisely, does Google prevent all of its DevOps excellence regressing as they grow?
It does so in two ways, I think:
Start as you mean to carry on
I’ve not seen Google’s recruitment or on-boarding workflow, but I expect that many of the norms for developers are reinforced appropriately and progressively.
Imagine the worst-case scenario was developers arriving who hate source-control. Google would want to dissuade those candidates at the earliest appropriate moment. In addition to apply-for-job channels, Google uses LinkedIn to find developers. There is then a moment when recruiters could deselect individuals who are forward in their hatred of source-control. Candidates with that strong feeling could still get through. Phone screens can further subset people, as can the face-to-face series of interviews, and so on. ThoughtWorks is the same, of course. The trick is to not give away too much in each of those gates, but still give an increasingly refined message about corporate culture.
Finally, for candidates who have the resolve to run through the whole interview series and receive a job offer, there’s a strong possibility that the individual will actually be hired.
The problem changes quite a bit at this stage. Now you’ve made an offer, and it’s been accepted, you have to groom the developer towards being an advocate for the things you want valued. That may include a shift in their thinking. You really only have one more planned moment for this: the developer’s on-boarding. If you’ve got it right, that on-boarding is not only solid enough to leave the Noogler clear about the core rules, but leaves them able to go on and elaborate on those rules to others when needed. This is the best moment, and it is a “start as you mean to carry on” thing. That adage applies to many aspects of life, of course.
Redefining core (Dev) values
So what if core values change, after an institutional decision to do so?
Let us suppose that you wanted your developers to be industrious with internal documentation too, but only AFTER the first 1000 developers were hired. You could install a wiki (Google did – from open-source land), and you can remove permission restrictions so pages are open to contribution, but what if you determine that not enough people are writing new content, or updating old?
The answer is that you need to make a big deal about changing culture, with someone of sufficient authority mandating the decision. Maybe they should elaborate on the rationale too. Maybe all-hands meetings to socialize the applicable changes would be fruitful, but only if led by someone with sufficient technical authority. You could flip a core of developers from being allergic to documentation to being in favor of it – but perhaps only if contributing is not too onerous (“use Lotus Notes”, for example, would sink it).
Pity the poor companies that don’t have strong technical leadership, though. Companies with CTOs who are not respected, or that are missing committees that would push for excellence (with authority), are going to have IT assets that decay over time.
Cost Of Change
Defects still happen, but Google have moved them leftwards on the cost-of-change curve. That leftward progression even begins to recategorize things as “could have been a defect if we had not caught it here”.
Their one big trunk setup allows them to maximize reuse. In their configuration it also forces them to do lock-step upgrades. Those upgrades could be binaries from outside Google (say Log4J), but also their own internal shared components and services. Applications can still become legacy of course, but the source-code is moved forward as lock-step upgrades happen, and the unit/integration/functional tests for it guard it despite the lack of an active development team. Many larger enterprises desperately need the same.
1000ft decision making
I mentioned this above: one big trunk and all that build/commit intelligence allow the higher-ups to pivot quickly regarding all their source-based assets:
- What apps are BEAST vulnerable?
- How big is the effort to upgrade Hibernate for all?
- How far along is the Spring Framework to Guice migration?
- What is the test coverage ranking of commits by team / developer / university / language?
- Do build breakages vary based on precipitation at the locale of the committer in question?
Centralized everything allows this auditability.
Google prevent developers from sharing branches that are not the single anointed trunk. That much should be obvious.
Speaking to that, an Emerson quote applies and should make us feel a little guilty:
“A foolish consistency is the hobgoblin of little minds…”
Concrete advice for the reader
In order, perhaps:
- Putting in a CI daemon is never bad. But it will quickly show you what your underlying problems are
- Review your source-control choices (all apps). Sunset all of ClearCase, PVCS, CVS, StarTeam, Synergy as soon as you can
- Review your range of branching models (all apps)
- Work out if one big trunk makes sense (facilitates common code / reuse)
- If yes to the last, you might want to enable checking out of a subset on a per-developer basis
- If your dev teams can boost their throughput, think about the capacity of CI infrastructure needed to do one build per commit
- Design some elastic infrastructure for transient environments for the CI processes to deploy applications into for automated functional testing
- Worry about the elapsed time to take a build all the way through the pipeline, including those automated functional tests
- Collect metrics about builds into some system for later analysis
- Write tools to aggregate build metrics, commits and allow analysis
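As a starter for those last two bullets, here is a minimal, hypothetical sketch of aggregating per-stage pipeline timings (stage names and figures invented): even a simple mean per stage tells you where the elapsed-time budget is going, and which stage to attack first.

```python
# Sketch: rolling up per-stage build timings (seconds) collected from
# pipeline runs (stage names and numbers hypothetical).

builds = [
    {"compile": 210, "unit": 340, "functional": 1500},
    {"compile": 200, "unit": 360, "functional": 1750},
]

def stage_means(builds):
    """Mean duration per pipeline stage across recorded builds."""
    stages = builds[0].keys()
    return {s: sum(b[s] for b in builds) / len(builds) for s in stages}

means = stage_means(builds)
print(means["functional"])  # 1625.0 -- the dominant stage, so parallelize it
```

Once the numbers are flowing into a store, the same data feeds the later, bigger analysis of commits and breakages.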
Then there is also the question of which of those applications require legacy rejuvenation concurrently, and what that means for a larger CD-style reorganization. Rejuvenation strategies should really include improving or introducing tests. That’d be test coverage at the unit level, integration tests in smaller number, and at least happy-path functional tests.
ThoughtWorks spends a lot of time moving clients to more sophisticated DevOps places, often as part of a migration towards CD. There’s a lot to it, and I’ve only outlined a checklist of mostly mechanical achievements here. To get to true CD, it’s not just the developer ‘workspace’ that needs to be boosted; it’s management and stakeholder participation/expectations, shared team responsibility, and the flow of feature requests into working code. In short, many more ‘social’ and methodology-driven things than anything I’ve outlined in this article. Some of that the Agile/Lean industry speaks to too. Anyway, you can’t get to true CD just by installing Jenkins on a machine with a high-end CPU and lots of RAM.