Container Checkpointing in Kubernetes With a Custom API
This article shows how to build a Kubernetes sidecar for container checkpointing: build and push the sidecar image, deploy it to Kubernetes, and trigger checkpoints through a REST API for state management.
Problem Statement
Challenge
Organizations running containerized applications in Kubernetes often need to capture and preserve the state of running containers for:
- Disaster recovery
- Application migration
- Debug/troubleshooting
- State preservation
- Environment reproduction
However, there's no straightforward, automated way to:
- Create container checkpoints on-demand
- Store these checkpoints in a standardized format
- Make them easily accessible across clusters
- Trigger checkpointing through a standard interface
Current Limitations
- Manual checkpoint creation requires direct cluster access
- No standardized storage format for checkpoints
- Limited integration with container registries
- Lack of programmatic access for automation
- Complex coordination between containerd and storage systems
Solution
A Kubernetes sidecar service that:
- Exposes checkpoint functionality via REST API
- Automatically converts checkpoints to OCI-compliant images
- Stores images in ECR for easy distribution
- Integrates with existing Kubernetes infrastructure
- Provides a standardized interface for automation
This solves the core problems by:
- Automating the checkpoint process
- Standardizing checkpoint storage
- Making checkpoints portable
- Enabling programmatic access
- Simplifying integration with existing workflows
Target users:
- DevOps teams
- Platform engineers
- Application developers
- Site Reliability Engineers (SREs)
Forensic container checkpointing is based on Checkpoint/Restore In Userspace (CRIU) and allows the creation of stateful copies of a running container without the container knowing that it is being checkpointed. The copy of the container can be analyzed and restored in a sandbox environment multiple times without the original container being aware of it. Forensic container checkpointing was introduced as an alpha feature in Kubernetes v1.25.
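For reference, upstream Kubernetes exposes this feature through the kubelet API rather than through kubectl. On a cluster with the ContainerCheckpoint feature gate enabled, a checkpoint can be requested directly from a node, assuming you can authenticate against the kubelet's secure port (authentication flags omitted for brevity):
curl -sk -X POST "https://<node-ip>:10250/checkpoint/<namespace>/<pod-name>/<container-name>"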
This article will guide you through deploying Go code that takes a container checkpoint via an API. The code accepts a pod identifier as input, retrieves the container ID from containerd, and then uses the ctr command to checkpoint that container in containerd's k8s.io namespace.
Prerequisites
- A Kubernetes cluster
- The ctr command-line tool available on the kubelet/worker nodes (if you cannot run ctr commands there, install it or adjust the AMI to include it)
- kubectl configured to communicate with your cluster
- Docker installed locally
- Access to a container registry (e.g., Docker Hub, ECR)
- Helm (for installing the Nginx Ingress Controller)
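Before going further, you can confirm on a worker node that ctr can reach containerd's k8s.io namespace, which is the same namespace the code below operates in:
ctr -n k8s.io containers list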
Step 0: Code to Create a Container Checkpoint Using Go
Create a file named checkpoint_container.go with the following content:
package main
import (
	"context"
	"encoding/base64"
	"fmt"
	"log"
	"os"
	"os/exec"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ecr"
	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)
func init() {
log.SetOutput(os.Stdout)
log.SetFlags(log.Ldate | log.Ltime | log.Lmicroseconds | log.Lshortfile)
}
func main() {
if len(os.Args) < 4 {
log.Fatal("Usage: checkpoint_container <pod_identifier> <ecr_repo> <aws_region>")
}
podID := os.Args[1]
ecrRepo := os.Args[2]
awsRegion := os.Args[3]
log.Printf("Starting checkpoint process for pod %s", podID)
containerID, err := getContainerIDFromPod(podID)
if err != nil {
log.Fatalf("Error getting container ID: %v", err)
}
err = processContainerCheckpoint(containerID, ecrRepo, awsRegion)
if err != nil {
log.Fatalf("Error processing container checkpoint: %v", err)
}
log.Printf("Successfully checkpointed container %s and pushed to ECR", containerID)
}
func getContainerIDFromPod(podID string) (string, error) {
log.Printf("Searching for container ID for pod %s", podID)
client, err := containerd.New("/run/containerd/containerd.sock")
if err != nil {
return "", fmt.Errorf("failed to connect to containerd: %v", err)
}
defer client.Close()
ctx := namespaces.WithNamespace(context.Background(), "k8s.io")
containers, err := client.Containers(ctx)
if err != nil {
return "", fmt.Errorf("failed to list containers: %v", err)
}
for _, container := range containers {
info, err := container.Info(ctx)
if err != nil {
continue
}
if strings.Contains(info.Labels["io.kubernetes.pod.uid"], podID) {
log.Printf("Found container ID %s for pod %s", container.ID(), podID)
return container.ID(), nil
}
}
return "", fmt.Errorf("container not found for pod %s", podID)
}
func processContainerCheckpoint(containerID, ecrRepo, region string) error {
log.Printf("Processing checkpoint for container %s", containerID)
checkpointPath, err := createCheckpoint(containerID)
if err != nil {
return err
}
defer os.RemoveAll(checkpointPath)
imageName, err := convertCheckpointToImage(checkpointPath, ecrRepo, containerID)
if err != nil {
return err
}
err = pushImageToECR(imageName, region)
if err != nil {
return err
}
return nil
}
func createCheckpoint(containerID string) (string, error) {
log.Printf("Creating checkpoint for container %s", containerID)
checkpointPath := "/tmp/checkpoint-" + containerID
cmd := exec.Command("ctr", "-n", "k8s.io", "tasks", "checkpoint", containerID, "--checkpoint-path", checkpointPath)
output, err := cmd.CombinedOutput()
if err != nil {
return "", fmt.Errorf("checkpoint command failed: %v, output: %s", err, output)
}
log.Printf("Checkpoint created at: %s", checkpointPath)
return checkpointPath, nil
}
func convertCheckpointToImage(checkpointPath, ecrRepo, containerID string) (string, error) {
	log.Printf("Converting checkpoint to image for container %s", containerID)
	imageName := ecrRepo + ":checkpoint-" + containerID
	// "buildah from scratch" prints the working-container name followed by a
	// newline, which must be trimmed before the name is reused.
	cmd := exec.Command("buildah", "from", "scratch")
	output, err := cmd.Output()
	if err != nil {
		return "", fmt.Errorf("failed to create container: %v", err)
	}
	workingContainer := strings.TrimSpace(string(output))
	cmd = exec.Command("buildah", "copy", workingContainer, checkpointPath, "/")
	if err := cmd.Run(); err != nil {
		return "", fmt.Errorf("failed to copy checkpoint: %v", err)
	}
	cmd = exec.Command("buildah", "commit", workingContainer, imageName)
	if err := cmd.Run(); err != nil {
		return "", fmt.Errorf("failed to commit image: %v", err)
	}
	log.Printf("Created image: %s", imageName)
	return imageName, nil
}
func pushImageToECR(imageName, region string) error {
log.Printf("Pushing image %s to ECR in region %s", imageName, region)
sess, err := session.NewSession(&aws.Config{
Region: aws.String(region),
})
if err != nil {
return fmt.Errorf("failed to create AWS session: %v", err)
}
svc := ecr.New(sess)
authToken, registryURL, err := getECRAuthorizationToken(svc)
if err != nil {
return err
}
err = loginToECR(authToken, registryURL)
if err != nil {
return err
}
cmd := exec.Command("podman", "push", imageName)
err = cmd.Run()
if err != nil {
return fmt.Errorf("failed to push image to ECR: %v", err)
}
log.Printf("Successfully pushed checkpoint image to ECR: %s", imageName)
return nil
}
func getECRAuthorizationToken(svc *ecr.ECR) (string, string, error) {
log.Print("Getting ECR authorization token")
output, err := svc.GetAuthorizationToken(&ecr.GetAuthorizationTokenInput{})
if err != nil {
return "", "", fmt.Errorf("failed to get ECR authorization token: %v", err)
}
authData := output.AuthorizationData[0]
log.Print("Successfully retrieved ECR authorization token")
return *authData.AuthorizationToken, *authData.ProxyEndpoint, nil
}
func loginToECR(authToken, registryURL string) error {
	log.Printf("Logging in to ECR at %s", registryURL)
	// The ECR authorization token is base64-encoded "AWS:<password>";
	// decode and split it before handing it to podman login.
	decoded, err := base64.StdEncoding.DecodeString(authToken)
	if err != nil {
		return fmt.Errorf("failed to decode ECR auth token: %v", err)
	}
	parts := strings.SplitN(string(decoded), ":", 2)
	if len(parts) != 2 {
		return fmt.Errorf("unexpected ECR auth token format")
	}
	cmd := exec.Command("podman", "login", "--username", parts[0], "--password", parts[1], registryURL)
	if err := cmd.Run(); err != nil {
		return fmt.Errorf("failed to login to ECR: %v", err)
	}
	log.Print("Successfully logged in to ECR")
	return nil
}
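Note that the listing above is a command-line program, while the deployment in Step 4 exposes port 8080 and Step 8 tests a POST /checkpoint endpoint. The sketch below is one minimal way to bridge that gap, assuming it replaces the CLI main function in checkpoint_container.go and reuses getContainerIDFromPod and processContainerCheckpoint from the listing; the JSON field names (podId, ecrRepo, awsRegion) mirror the curl test in Step 8, and "encoding/json" and "net/http" must be added to the imports:
// checkpointRequest mirrors the JSON body used by the curl test in Step 8.
type checkpointRequest struct {
	PodID     string `json:"podId"`
	ECRRepo   string `json:"ecrRepo"`
	AWSRegion string `json:"awsRegion"`
}

func checkpointHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	var req checkpointRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, "invalid JSON body", http.StatusBadRequest)
		return
	}
	// Reuse the checkpoint pipeline from the CLI listing above.
	containerID, err := getContainerIDFromPod(req.PodID)
	if err != nil {
		http.Error(w, err.Error(), http.StatusNotFound)
		return
	}
	if err := processContainerCheckpoint(containerID, req.ECRRepo, req.AWSRegion); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	fmt.Fprintf(w, "checkpoint pushed for container %s\n", containerID)
}

// main replaces the CLI entrypoint when the binary runs as a sidecar.
func main() {
	http.HandleFunc("/checkpoint", checkpointHandler)
	log.Println("checkpoint API listening on :8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}
Keeping the checkpoint logic in the existing functions means the same binary can still be exercised from the command line for debugging.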
Step 1: Initialize the Go Module
go mod init checkpoint_container
Modify the go.mod file:
module checkpoint_container
go 1.23
require (
github.com/aws/aws-sdk-go v1.44.298
github.com/containerd/containerd v1.7.2
)
require (
github.com/jmespath/go-jmespath v0.4.0 // indirect
github.com/opencontainers/go-digest v1.0.0 // indirect
github.com/opencontainers/image-spec v1.1.0-rc2.0.20221005185240-3a7f492d3f1b // indirect
github.com/pkg/errors v0.9.1 // indirect
google.golang.org/genproto v0.0.0-20230306155012-7f2fa6fef1f4 // indirect
google.golang.org/grpc v1.53.0 // indirect
google.golang.org/protobuf v1.30.0 // indirect
)
Run the following command:
go mod tidy
Step 2: Build and Publish Docker Image
Create a Dockerfile in the same directory:
# Build stage
FROM golang:1.23 AS builder
WORKDIR /app
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o checkpoint_container
# Final stage
FROM amazonlinux:2
# Install the tools the checkpoint binary shells out to (ctr, buildah, podman)
# NOTE: buildah and podman may require an additional repository on Amazon Linux 2
RUN yum update -y && \
    amazon-linux-extras install -y docker && \
    yum install -y awscli containerd buildah podman && \
    yum clean all
# Copy the built Go binary
COPY --from=builder /app/checkpoint_container /usr/local/bin/checkpoint_container
EXPOSE 8080
ENTRYPOINT ["checkpoint_container"]
This Dockerfile does the following:
- Uses golang:1.23 as the build stage to compile your Go application.
- Uses amazonlinux:2 as the final base image.
- Installs the AWS CLI, Docker (via amazon-linux-extras), containerd, and the buildah and podman tools that the Go binary invokes.
- Copies the compiled Go binary from the build stage.
docker build -t <your-docker-repo>/checkpoint-container:v1 .
docker push <your-docker-repo>/checkpoint-container:v1
Replace <your-docker-repo> with your actual Docker repository.
Step 3: Apply the RBAC Resources
Create a file named rbac.yaml:
apiVersion: v1
kind: ServiceAccount
metadata:
name: checkpoint-sa
namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: checkpoint-role
namespace: default
rules:
- apiGroups: [""]
resources: ["pods"]
verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: checkpoint-rolebinding
namespace: default
subjects:
- kind: ServiceAccount
name: checkpoint-sa
namespace: default
roleRef:
kind: Role
name: checkpoint-role
apiGroup: rbac.authorization.k8s.io
Apply the RBAC resources:
kubectl apply -f rbac.yaml
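You can verify the permissions before wiring the service account into the deployment:
kubectl auth can-i list pods --as=system:serviceaccount:default:checkpoint-sa -n default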
Step 4: Create a Kubernetes Deployment
Create a file named deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
name: main-app
namespace: default
spec:
replicas: 1
selector:
matchLabels:
app: main-app
template:
metadata:
labels:
app: main-app
spec:
serviceAccountName: checkpoint-sa
containers:
- name: main-app
image: nginx:latest # Replace with your main application image
- name: checkpoint-sidecar
image: <your-docker-repo>/checkpoint-container:v1
ports:
- containerPort: 8080
securityContext:
privileged: true
volumeMounts:
- name: containerd-socket
mountPath: /run/containerd/containerd.sock
volumes:
- name: containerd-socket
hostPath:
path: /run/containerd/containerd.sock
type: Socket
In deployment.yaml, update the sidecar image to the one you pushed in Step 2:
image: <your-docker-repo>/checkpoint-container:v1
Then apply the deployment:
kubectl apply -f deployment.yaml
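Once the pod is running, confirm that both containers started and check the sidecar's logs:
kubectl get pods -l app=main-app
kubectl logs deployment/main-app -c checkpoint-sidecar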
Step 5: Kubernetes Service
Create a file named service.yaml:
apiVersion: v1
kind: Service
metadata:
name: checkpoint-service
namespace: default
spec:
selector:
app: main-app
ports:
- protocol: TCP
port: 80
targetPort: 8080
Apply the service:
kubectl apply -f service.yaml
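With the service in place, you can smoke-test the API from your workstation before adding an Ingress (this assumes the sidecar serves the HTTP endpoint sketched at the end of Step 0). Run the port-forward, then issue the request from a second terminal:
kubectl port-forward svc/checkpoint-service 8080:80
curl -X POST http://localhost:8080/checkpoint \
-H "Content-Type: application/json" \
-d '{"podId": "your-pod-id", "ecrRepo": "your-ecr-repo", "awsRegion": "your-aws-region"}'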
Step 6: Install the Nginx Ingress Controller
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
helm install ingress-nginx ingress-nginx/ingress-nginx
Step 7: Create Ingress Resource
Create a file named ingress.yaml:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: checkpoint-ingress
annotations:
kubernetes.io/ingress.class: nginx
nginx.ingress.kubernetes.io/ssl-redirect: "false"
spec:
rules:
- http:
paths:
- path: /checkpoint
pathType: Prefix
backend:
service:
name: checkpoint-service
port:
number: 80
Apply the Ingress:
kubectl apply -f ingress.yaml
Step 8: Test the API
Get the external IP of the Nginx Ingress Controller:
kubectl get services ingress-nginx-controller -n ingress-nginx
Then send a test request:
curl -X POST http://<EXTERNAL-IP>/checkpoint \
-H "Content-Type: application/json" \
-d '{"podId": "your-pod-id", "ecrRepo": "your-ecr-repo", "awsRegion": "your-aws-region"}'
Replace <EXTERNAL-IP> with the actual external IP.
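Note that getContainerIDFromPod matches podId against the io.kubernetes.pod.uid label, so pass the pod's UID rather than its name. You can look it up with:
kubectl get pod <your-pod-name> -o jsonpath='{.metadata.uid}'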
Additional Considerations
- Security. Implement HTTPS by setting up TLS certificates, and add authentication to the API.
- Monitoring. Set up logging and monitoring for the API and checkpoint process.
- Resource management. Configure resource requests and limits for the sidecar container.
- Error handling. Implement robust error handling in the Go application.
- Testing. Thoroughly test the setup in a non-production environment before deploying it to production.
- Documentation. Maintain clear documentation on how to use the checkpoint API.
Conclusion
This setup deploys the checkpoint container as a sidecar in Kubernetes and exposes its functionality through an API accessible from outside the cluster. It provides a flexible solution for managing container checkpoints in a Kubernetes environment.
AWS/EKS Specific
Step 7: Install the AWS Load Balancer Controller
Instead of using the Nginx Ingress Controller, this variant uses the AWS Load Balancer Controller, which creates and manages ALBs for your Ingress resources. It replaces Steps 6-8 above.
1. Add the EKS chart repo to Helm:
helm repo add eks https://aws.github.io/eks-charts
2. Install the AWS Load Balancer Controller:
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
-n kube-system \
--set clusterName=<your-cluster-name> \
--set serviceAccount.create=false \
--set serviceAccount.name=aws-load-balancer-controller
Replace <your-cluster-name> with your EKS cluster name.
Note: Ensure that you have the necessary IAM permissions set up for the AWS Load Balancer Controller. You can find the detailed IAM policy in the AWS documentation.
Step 8: Create Ingress Resource
Create a file named ingress.yaml:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: checkpoint-ingress
annotations:
kubernetes.io/ingress.class: alb
alb.ingress.kubernetes.io/scheme: internet-facing
alb.ingress.kubernetes.io/target-type: ip
spec:
rules:
- http:
paths:
- path: /checkpoint
pathType: Prefix
backend:
service:
name: checkpoint-service
port:
number: 80
Apply the Ingress:
kubectl apply -f ingress.yaml
Step 9: Test the API
1. Get the ALB DNS name:
kubectl get ingress checkpoint-ingress
Look for the ADDRESS field, which will be the ALB's DNS name.
2. Send a test request:
curl -X POST http://<ALB-DNS-NAME>/checkpoint \
-H "Content-Type: application/json" \
-d '{"podId": "your-pod-id", "ecrRepo": "your-ecr-repo", "awsRegion": "your-aws-region"}'
Replace <ALB-DNS-NAME> with the actual DNS name of your ALB from step 1.
Additional Considerations for AWS ALB
1. Security groups. The ALB will have a security group automatically created. Ensure it allows inbound traffic on port 80 (and 443 if you set up HTTPS).
2. SSL/TLS. To enable HTTPS, you can add the following annotations to your Ingress:
alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:region:account-id:certificate/certificate-id
3. Access logs. Enable access logs for your ALB by adding the following:
alb.ingress.kubernetes.io/load-balancer-attributes: access_logs.s3.enabled=true,access_logs.s3.bucket=your-log-bucket,access_logs.s3.prefix=your-log-prefix
4. WAF integration. If you want to use AWS WAF with your ALB, you can add:
alb.ingress.kubernetes.io/waf-acl-id: your-waf-web-acl-id
5. Authentication. You can set up authentication using Amazon Cognito or OIDC by using the appropriate ALB Ingress Controller annotations.
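For example, Cognito authentication uses annotations like the following (field names per the AWS Load Balancer Controller documentation; all values here are placeholders):
alb.ingress.kubernetes.io/auth-type: cognito
alb.ingress.kubernetes.io/auth-idp-cognito: '{"userPoolARN":"your-user-pool-arn","userPoolClientID":"your-client-id","userPoolDomain":"your-domain"}'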
These changes set up your Ingress using an AWS Application Load Balancer instead of Nginx. The AWS Load Balancer Controller automatically provisions and configures the ALB based on your Ingress resource.
Conclusion
This setup uses AWS's native load-balancing solution, which integrates well with other AWS services and can be more cost-effective in an AWS environment. Remember that your EKS cluster needs the necessary IAM permissions to create and manage ALBs; this typically involves creating an IAM policy and a service account with the appropriate permissions.