Policy as (Versioned) Code
Let's look at how versioning your policies streamlines the developer experience, helping you deliver features and minimize downtime while still meeting compliance requirements.
This is a continuation of the PodSecurityPolicy Is Dead, Long Live...? article, which looks at how to construct the most effective policy for your Kubernetes infrastructure. Haven't read that? Check it out first.
“Policy as code” is one of the more recent "as-code" buzzwords to enter the discourse since "infrastructure-as-code" paved the way for the *-as-code term. Its fundamental principles sound great: everything in version control, auditable, repeatable, and so on. In practice, however, it can often fall apart when it comes to the day-2 operational challenges, which are only exacerbated by adopting "GitOps."
We'll look at a common scenario and present a working example of versioned policy running through the entire process to address the issue.
The Status Quo
Let’s start with a likely (simplified) scenario:
- Person (a) writes a change to a deployment YAML file locally; the YAML appears valid, so...
- Person (a) pushes it to a branch and raises a pull request to the main/master branch requesting a review.
- Person (b) looks at the diff, agrees with the change, and approves it.
- Person (a/b) merges the change, causing the change to now be in the main/master branch.
- CI/CD picks up the change and successfully applies the changed deployment YAML to the Kubernetes cluster.
- The Deployment controller creates a ReplicaSet and submits it to the Kubernetes API, where it is accepted by the API server.
- The ReplicaSet controller creates pods and submits them to the API, and the API server rejects those pods because they violate a PodSecurityPolicy (or similar) admission controller rule.
- Unless you’re polling the API server for events on your deployment rollout, you won’t know it’s failed.
- Your main/master branch is now broken; you'll need to figure out how to either roll back the change or roll forward a fix, either by administering the cluster directly or by repeating the entire process from step 1.
That's just your "business as usual" flow for all your devs.
What Happens When You Want to Update the Policy Itself?
Your policy engine might allow you to "dry run" before you "enforce" a new policy rule by putting it in a "warning" or "audit" or "background" mode where a warning response is returned in the event log when something breaks the new rule.
But that will only happen if the API server re-evaluates the resources, which usually only occurs when the pod reschedules. Again, someone needs to be monitoring the event logs and acting on them, which can introduce its own challenges in exposing those logs to your teams.
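For illustration, here's a minimal sketch of what that "warn first, enforce later" rollout might look like as a Kyverno ClusterPolicy (Kyverno is the engine used in the reference model later in this article; the privileged-container rule here is just a stand-in example, not the article's actual policy):

```yaml
# Illustrative only: a new rule introduced in audit mode so violations are
# reported (as policy reports/events) rather than blocking pods outright.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged-containers
spec:
  validationFailureAction: Audit   # flip to Enforce once teams have had time to comply
  background: true                 # also report on existing resources, not just newly created ones
  rules:
    - name: privileged-containers
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Privileged containers are not allowed."
        pattern:
          spec:
            # initContainers/ephemeralContainers are omitted for brevity
            containers:
              - =(securityContext):
                  =(privileged): "false"
```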
All of that activity is happening a long way from the developers that are going to do something about it.
Furthermore, communicating that policy update between the well-intentioned security team and developers is fraught with common bureaucratic concerns frequently found in organizations at scale. The security policy itself might be considered somehow sensitive as it may reveal potential weaknesses.
Consequently, reproducing that policy configuration in a local development environment may also prove impracticable. This is all made much, much worse with multiple clusters for development, staging, production, and multi-tenancies with multiple teams and applications co-existing in the same cluster space all with their own varying needs.
So What Can You Do About All of This?
First and foremost, sharing the policy is imperative. Your organization absolutely has to accept that the advantages of exposing the policy and communicating it effectively to developers far outweigh any potential security advantage gained through obscurity.
Along with sharing the policy, you need to articulate the benefit of each and every rule. After all, you’ve hopefully hired some smart people, and smart people will try to find workarounds when they don’t see value in an obstruction.
Explaining the policy should hopefully help you justify it to yourselves, too. Rules should naturally become less based on emotion and anecdote, and instead become grounded in informed threat modeling that's easier to maintain as your threat landscape changes.
The next step is collecting the policy, codifying it, and making sure it is kept in version control. Once it's in version control, you can adopt the same semantic versioning strategy seen elsewhere that your developers will already be used to.
Quick Recap: Semantic Versioning
- Version numbers look like 1.20.30.
- The number of digits in each segment is up to you (1.2.3 is fine, as is 1.20.300); just avoid leading zeroes.
- Don’t be fooled by the decimal points; they’re not real (1.20.0 is greater than 1.3.0).
- The first number is the major version, bumped when you make wholly incompatible, breaking changes (this will probably be the case for almost all of your policy changes).
- The second number is the minor version, for added functionality that remains backwards compatible (less likely for policy changes).
- The third is the patch version, for backwards-compatible bug fixes (likely quite rare for your policy changes).
- For more details, see the Semantic Versioning website.
Great! So you’ve got your policy definitions in version control, tagged with semantic versioning. The next step is consuming that within your applications so your developers can test their applications against it — locally to start with, then later in continuous integration.
Hopefully, your developers will be used to this, at least. They can treat your policy like they treat versioned dependencies.
Now that they’re testing locally, implementing the same check in CI should be straightforward. This will ensure that peer reviews are only ever carried out on code that is known to pass your policy.
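As a rough sketch of what that CI gate could look like with GitHub Actions (the CI used in the reference model below), assuming the policy-checker container image described later in this article; the workflow name and trigger are illustrative:

```yaml
# .github/workflows/policy-check.yml (illustrative)
name: policy-check
on:
  pull_request:
    branches: [main]
jobs:
  policy:
    runs-on: ubuntu-latest
    steps:
      # Fetch the application repository
      - uses: actions/checkout@v4
      # Run the same container developers run locally, so a change is only
      # reviewed once it is already known to pass the current policy version
      - name: Check app against company policy
        run: docker run --rm -v "$PWD:/apps" ghcr.io/example-policy-org/policy-checker
```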
Given it’s now a dependency, you can use tools like Snyk/Dependabot/Renovate and so on to automate making pull requests to keep it updated and highlight to your developers when the policy update is not compatible with their app.
Awesome. Now for the really tricky bit...
Your Runtime Needs to Support Multiple Policy Versions
From a risk perspective, your organization needs to be comfortable accepting a transition period in which old policy versions are gradually retired. That comes down to communication between those setting the policy and those consuming (and subject to) it. Supporting one version forward and one version backward means your runtime needs to support at least three versions of the policy at any given time.
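One way to make that coexistence work (a hypothetical wiring, not the only option) is for each workload to declare which policy version it targets via a label, such as the mycompany.com/policy-version label discussed in the caveats later, so the cluster-side rules can be applied selectively per version:

```yaml
# Hypothetical: a workload declares which policy version it targets as a label,
# so the policy engine can select which version of the rules to apply to it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: some-app            # illustrative name
  labels:
    mycompany.com/policy-version: "1.0.0"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: some-app
  template:
    metadata:
      labels:
        app: some-app
        mycompany.com/policy-version: "1.0.0"
    spec:
      containers:
        - name: some-app
          image: nginx:1.25   # stand-in image
```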
Show Me the Code
I’ve put together a reference model of this in a dedicated GitHub organization with a bunch of repositories. Renovate was used to make automated pull requests on policy updates; you can see examples of those pull requests in the repositories.
Other tools:
- kyverno as the policy engine, but any policy engine that allows you to be selective with labels on the resources should work.
- GitHub Actions as the CI/CD, but anything similar that integrates version control with pull/merge requests should work.
- GitHub for version control, but any similar Git service with a pull request capability and linked tests should work.
- KiND for the Kubernetes cluster, but any Kubernetes cluster should work; KiND just let me do all the testing quickly.
- Renovate automatically maintains the policy dependencies by raising pull requests for us.
Please allow me to introduce you to Example Policy Org.
Enter Example Policy Org
app1
This app is compliant with version 1.0.0 of the company policy only; the pull request to update the policy to 2.0.1 currently can’t merge.
app2
This app is compliant with version 2.0.1 of the company policy.
app3
This app is compliant with version 2.0.1 of the company policy, but it's only using 1.0.0 and can be updated with a pull request.
policy
1.0.0
Only one simple policy here: it requires that every resource has a mycompany.com/department label. As long as the label is set, the value doesn’t matter.
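Expressed with Kyverno, version 1.0.0 of that rule might look roughly like the sketch below (scoped to pods here for brevity; the real definition lives in the policy repository):

```yaml
# Sketch of policy 1.0.0: every pod must carry a mycompany.com/department label,
# but any non-empty value is accepted.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-department-label
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-department
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Every workload must have a mycompany.com/department label."
        pattern:
          metadata:
            labels:
              mycompany.com/department: "?*"   # any non-empty value passes
```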
2.0.0
Following on from 1.0.0, we found that the lack of consistency isn’t helping; some people are setting the label to "HR," others to "human resources." So a breaking policy change has been introduced (hence the major version bump) to require the value to come from a known, pre-determined list. The mycompany.com/department label must now be one of these: tech|accounts|servicedesk|hr.
2.0.1
The policy team forgot a department! So now the mycompany.com/department label must be one of these: tech|accounts|servicedesk|hr|sales. This was a non-breaking, very minor change, so we’re going to consider it a patch update and only increment the last segment of the version number.
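The corresponding 2.0.1 rule might simply tighten the pattern to the approved values, something like the fragment below, which would slot into a ClusterPolicy like the 1.0.0 sketch above:

```yaml
# Sketch of the 2.0.1 rule: same shape as 1.0.0, but the value must now come from
# the approved list ("sales" being the addition made in this patch release).
- name: require-approved-department
  match:
    any:
      - resources:
          kinds:
            - Pod
  validate:
    message: "mycompany.com/department must be one of tech, accounts, servicedesk, hr, or sales."
    pattern:
      metadata:
        labels:
          # '|' acts as a logical OR in Kyverno pattern values
          mycompany.com/department: "tech | accounts | servicedesk | hr | sales"
```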
e2e
This is an example of everything coexisting on a single cluster. For simplicity, and to keep this free to run, I stand up the cluster each time using KiND, but this could just as well be one or more real clusters.
policy-checker
This is a simple tool to help our developers test their apps. They can simply run docker run --rm -ti -v $(pwd):/apps ghcr.io/example-policy-org/policy-checker from within the app directory, and it’ll test whether the app passes.
The location of the policy is intentionally hard-coded; making this reusable outside of our example organization would take some significant thought and is out of scope here.
Caveats
What I haven’t done is require the mycompany.com/policy-version label. That’s probably a job for the policy-checker and the CI process, and it’s also up to your cluster administrators to decide what they do with things that don’t have the label. You might, for example, exclude anything in kube-system and otherwise require mycompany.com/policy-version >= 1.0.0, updating that minimum version as required. In reality, it's just another rule, but one kept separate from the rest of the policy codebase.
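If you did want such a rule, a rough Kyverno sketch might look like the following (the minimum-version comparison mentioned above is deliberately left out for brevity):

```yaml
# Sketch: require every pod outside kube-system to declare a policy version.
# Enforcing a minimum version on top of this is left to your cluster admins.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-policy-version-label
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-policy-version
      match:
        any:
          - resources:
              kinds:
                - Pod
      exclude:
        any:
          - resources:
              namespaces:
                - kube-system
      validate:
        message: "Workloads must declare a mycompany.com/policy-version label."
        pattern:
          metadata:
            labels:
              mycompany.com/policy-version: "?*"
```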
Now, It's Your Turn
You should be able to reuse the principles covered in this article to go forth, version your organization’s policy, and make the developer experience a well-informed, compliant breeze.