Open-Source GitOps at the Edge: Deploying to Thousands of Clusters With Rancher Fleet

Establish GitOps-driven CI/CD pipelines to create zero-downtime deployments across thousands of edge locations with automated rollbacks.

Prakash Velusamy

Mar. 03, 26 · Analysis


The Edge Deployment Challenge

Modern microservice applications are moving beyond central data centers and the cloud to the edge to provide ultra-low latency and real-time processing. This enables real-time responsiveness for applications powering autonomous vehicles, remote healthcare, and IoT solutions.

Deploying code to distributed edge environments poses a fundamental operational challenge. Every rollout of containerized workloads to thousands of edge locations must be coordinated across unreliable networks, heterogeneous hardware, and sites with no technical staff available to correct failed deployments.

Edge sites typically have intermittent connectivity, limited bandwidth that is shared with other critical business operations, and no on-site engineers to resolve deployment failures.

Traditional CI/CD pipelines are push-based: a centralized server connects to each target environment and pushes changes directly. This model assumes that the deployment target is always reachable and that a failed deployment can be recovered immediately. Edge computing violates both assumptions.

For example, in a retail deployment spanning 2,500 store locations, a push-based pipeline that deploys to all stores simultaneously will hit connection timeouts at stores with poor connectivity, leave partially deployed code where the network drops mid-deployment, and provide no visibility into deployment status for stores that are offline (see Figure 1).

Figure 1: Push-based vs. pull-based deployment models for edge environments

GitOps With Open-Source Rancher Fleet

To solve this problem, the direction of deployment needs to be inverted. Instead of a central server pushing changes to edge locations, pull-based GitOps has each edge cluster pull its desired state from a central Git repository. Rancher Fleet is designed to provide GitOps-based deployment for large numbers of clusters. One Fleet Controller can manage well over one million resources across thousands of clusters, which makes Rancher Fleet well suited to edge deployments.

Edge clusters managed by Fleet take part in a continuous reconciliation cycle. In each cycle, the desired state is read from the Git repository and compared against the actual state of the cluster; any difference, or "drift," is identified and corrected by applying changes to the cluster. This model has several key advantages for edge locations: clusters that lose their connection automatically resynchronize when they reconnect, no persistent connection to an edge cluster is required, and failed deployments are retried (see Figure 2).

Figure 2: Fleet architecture with upstream controller and downstream edge clusters
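The reconciliation cycle described above can be illustrated with a small sketch. This is not Fleet's implementation; it assumes a deliberately simplified state model (a hypothetical mapping of resource name to version) and hypothetical function names, and only shows the observe, compare, and apply steps.

```python
from typing import Dict, List, Tuple

# Hypothetical, simplified state model: resource name -> version string.
State = Dict[str, str]


def compute_drift(desired: State, actual: State) -> List[Tuple[str, str]]:
    """Return the (action, resource) pairs needed to make actual match desired."""
    actions = []
    for name, version in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != version:
            actions.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))
    return actions


def reconcile(desired: State, actual: State) -> State:
    """Apply the computed drift in one pass and return the new actual state."""
    new_state = dict(actual)
    for action, name in compute_drift(desired, actual):
        if action == "delete":
            del new_state[name]
        else:  # "create" or "update"
            new_state[name] = desired[name]
    return new_state


# Illustrative data only; resource names other than pos-app are made up.
desired = {"pos-app": "v2.4.1", "inventory": "v1.0.0"}
actual = {"pos-app": "v2.4.0", "legacy-agent": "v0.9.0"}

drift = compute_drift(desired, actual)
synced = reconcile(desired, actual)
```

Because the cluster only needs the desired state and its own actual state, the same pass can run whenever connectivity returns; a second pass over an already synced cluster produces no actions.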

Fleet's cluster targeting lets a single GitRepo resource drive deployment to thousands of locations:


apiVersion: fleet.cattle.io/v1alpha1 
kind: GitRepo 
metadata: 
  name: edge-retail-app 
  namespace: fleet-default 
spec: 
  repo: https://github.com/org/edge-manifests.git 
  branch: main 
  paths: 
    - apps/retail-app 
  pollingInterval: 5m 
  imageScanInterval: 30m 
  targets: 
    - name: canary-stores 
      clusterSelector: 
        matchLabels: 
          environment: edge 
          rollout-wave: canary 
    - name: retail-edge-all 
      clusterSelector: 
        matchLabels: 
          environment: edge 
          tier: retail 
  helmSecretName: jfrog-registry-credentials
  

When an additional edge cluster registers with matching labels, Fleet will automatically include it in all deployments; this eliminates configuration drift throughout the fleet.
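The label matching behind this can be sketched as follows. The function below mirrors the standard Kubernetes `matchLabels` semantics (every key/value pair in the selector must be present on the cluster); the function name and the `region` label are illustrative assumptions, not part of Fleet.

```python
from typing import Dict


def matches(selector: Dict[str, str], labels: Dict[str, str]) -> bool:
    """True when every key/value pair in the selector is present in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


# Selectors copied from the GitRepo targets above.
canary_selector = {"environment": "edge", "rollout-wave": "canary"}
retail_selector = {"environment": "edge", "tier": "retail"}

# A newly registered store cluster; the "region" label is hypothetical.
new_store = {"environment": "edge", "tier": "retail", "region": "us-west"}

in_retail_target = matches(retail_selector, new_store)
in_canary_target = matches(canary_selector, new_store)
```

Extra labels on the cluster do not prevent a match, which is why a new store joins the `retail-edge-all` target as soon as it carries the two required labels.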

The rollout behavior and customizations for each target are defined by the fleet.yaml within the application path:


# fleet.yaml - Controls deployment behavior 
defaultNamespace: retail-apps 
helm: 
  releaseName: edge-pos-app 
  values: 
    image: 
      repository: artifactory.internal.com/edge/pos-app 
      tag: v2.4.1 
    resources: 
      limits: 
        memory: 256Mi 
        cpu: 200m 
    replicaCount: 1 
 
rolloutStrategy:
  autoPartitionSize: 25 
  partitions: 
    - name: canary 
      maxUnavailable: 1 
      clusterSelector: 
        matchLabels: 
          rollout-wave: canary 
    - name: production 
      maxUnavailable: 10% 
      clusterGroup: retail-fleet 
 
targetCustomizations: 
  - name: high-traffic-stores 
    clusterSelector: 
      matchLabels: 
        traffic-tier: high 
    helm: 
      values: 
        replicaCount: 3 
        resources: 
          limits: 
            memory: 512Mi
  

This configuration enables partition-based rollout: canary clusters are updated first, followed by the production clusters in batches of a predetermined size. The targetCustomizations section gives high-traffic stores additional replicas and memory without requiring separate manifests.
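To get a feel for what a partition size of 25 means at this scale, the sketch below splits a list of 2,500 hypothetical store names into fixed-size batches. It is a generic chunking illustration only; Fleet's own partitioning logic may order or group clusters differently.

```python
from typing import List


def partition(clusters: List[str], size: int) -> List[List[str]]:
    """Split clusters into consecutive batches of at most `size` members."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [clusters[i:i + size] for i in range(0, len(clusters), size)]


# Hypothetical cluster names for the 2,500-store example.
stores = [f"store-{n:04d}" for n in range(1, 2501)]
batches = partition(stores, 25)

print(f"{len(batches)} batches, first batch starts at {batches[0][0]}")
```

With these numbers the rollout proceeds through 100 batches, so a failure detected in an early batch leaves most of the fleet untouched.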

Wave-Based Deployment Strategy

Rolling out a change to every edge location at once is too risky. For example, if a payment-processing bug reaches all 2,500 stores simultaneously, it immediately disrupts business across the entire fleet. Staged rollouts reduce the blast radius by deploying to a small subset of stores first and validating success before expanding (see Figure 3).

Figure 3: Four-wave deployment strategy with health validation gates

Wave-Based Rollout Schedule

Wave	Coverage	Duration	Purpose
Wave 1: Canary	1% of fleet	30 minutes	Identify obvious failures with minimal impact
Wave 2: Early Adopter	10% of fleet	2 hours	Validate against a variety of conditions
Wave 3: Regional	50% of fleet	4 hours	Confirm scalability and regional variations
Wave 4: Full	100% of fleet	Continuous	Complete rollout
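Applied to the 2,500-store example, the schedule works out as sketched below. The sketch assumes the coverage column is cumulative (each wave includes the stores from earlier waves), which the table does not state explicitly; the function and variable names are illustrative.

```python
from typing import List, Tuple

FLEET_SIZE = 2500  # store count from the retail example

# (wave name, cumulative percent of fleet), taken from the schedule table.
WAVES = [("canary", 1), ("early-adopter", 10), ("regional", 50), ("full", 100)]


def wave_plan(fleet_size: int, waves: List[Tuple[str, int]]) -> List[Tuple[str, int, int]]:
    """Return (name, cumulative stores, stores newly added) for each wave."""
    plan = []
    covered = 0
    for name, percent in waves:
        # Integer ceiling, so a small fleet never gets a zero-store wave.
        target = (fleet_size * percent + 99) // 100
        plan.append((name, target, target - covered))
        covered = target
    return plan


plan = wave_plan(FLEET_SIZE, WAVES)
for name, cumulative, newly_added in plan:
    print(f"{name}: {cumulative} stores total, {newly_added} new in this wave")
```

Under that assumption the canary wave touches only 25 stores, and half the fleet is still on the previous version after the third wave completes.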

Each wave transition is gated by an automated health assessment. The deployment controller collects performance metrics from the deployments at all locations and compares them against pre-established thresholds for success rate, error rate, and latency. The rollout proceeds to the next wave only when every metric is within its threshold: success rate at or above its minimum, and error rate and latency at or below their maximums.

Health Check Thresholds for Wave Promotion

Metric	Threshold	Rationale
Success Rate	≥ 99%	Determine if application is functioning as intended
Error Rate	≤ 1%	Catch error spikes in the system rapidly
P99 Latency	≤ 500ms	Detect performance degradation
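A promotion gate built on these thresholds might look like the following sketch. The threshold values come from the table; the `WaveMetrics` type and function names are hypothetical, and how the metrics are aggregated across locations is left out.

```python
from dataclasses import dataclass
from typing import List

# Thresholds from the table above.
MIN_SUCCESS_RATE = 99.0      # percent
MAX_ERROR_RATE = 1.0         # percent
MAX_P99_LATENCY_MS = 500.0   # milliseconds


@dataclass(frozen=True)
class WaveMetrics:
    """Aggregated metrics for one wave (hypothetical shape)."""
    success_rate: float
    error_rate: float
    p99_latency_ms: float


def failed_checks(metrics: WaveMetrics) -> List[str]:
    """Return the names of the metrics that violate their threshold."""
    failures = []
    if metrics.success_rate < MIN_SUCCESS_RATE:
        failures.append("success_rate")
    if metrics.error_rate > MAX_ERROR_RATE:
        failures.append("error_rate")
    if metrics.p99_latency_ms > MAX_P99_LATENCY_MS:
        failures.append("p99_latency_ms")
    return failures


def can_promote(metrics: WaveMetrics) -> bool:
    """A wave is promoted only when no check fails."""
    return not failed_checks(metrics)


healthy = WaveMetrics(success_rate=99.6, error_rate=0.4, p99_latency_ms=310.0)
degraded = WaveMetrics(success_rate=98.2, error_rate=1.8, p99_latency_ms=640.0)
```

Returning the list of failed checks, rather than a bare boolean, keeps the reason for a blocked promotion available for alerts and audit logs.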


Handling Disconnected Edge Locations

Network reliability varies greatly across edge deployments. An urban retail location generally has reliable connectivity, while a remote site may be offline for hours. The pipeline must handle both cases without manual intervention.

Fleet's agent-based architecture tolerates disconnection by design. The Fleet Agent on each edge cluster maintains a connection to the upstream controller. If that connection fails, the cluster keeps running its last known desired state, and the agent reconciles any differences once connectivity is restored. Because container images are cached locally, applications keep running while the cluster is out of sync.

Operating while disconnected requires hierarchical caching of container images. JFrog Artifactory is the authoritative central repository, JFrog Edge nodes provide caching in each region, and the edge clusters cache images locally. Pods can then restart from locally cached images even without network connectivity.
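The lookup order through such a hierarchy can be sketched as below. This models the three tiers as in-memory sets and is not how Artifactory or a container runtime resolves images; the `Tier` type, tier names, and image references are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Tier:
    """One cache level, ordered nearest-first in the lookup list."""
    name: str
    images: Set[str]
    reachable: bool = True


def resolve_image(image: str, tiers: List[Tier]) -> Optional[str]:
    """Return the nearest reachable tier holding the image, backfilling nearer tiers."""
    for index, tier in enumerate(tiers):
        if not tier.reachable:
            continue
        if image in tier.images:
            # Pull-through behavior: copy the image into every nearer reachable tier.
            for nearer in tiers[:index]:
                if nearer.reachable:
                    nearer.images.add(image)
            return tier.name
    return None


OLD = "edge/pos-app:v2.4.0"  # hypothetical previous version
NEW = "edge/pos-app:v2.4.1"

local = Tier("local-cluster-cache", {OLD})
regional = Tier("regional-edge-node", {OLD, NEW})
central = Tier("central-artifactory", {OLD, NEW})
tiers = [local, regional, central]

# Store loses its uplink: only the local cache is reachable.
regional.reachable = central.reachable = False
offline_old = resolve_image(OLD, tiers)
offline_new = resolve_image(NEW, tiers)

# Uplink returns: the new image is pulled from the region and cached locally.
regional.reachable = central.reachable = True
online_new = resolve_image(NEW, tiers)

# Uplink drops again: the new image is now served locally.
regional.reachable = central.reachable = False
offline_new_again = resolve_image(NEW, tiers)
```

The sketch also shows the limit of the approach: a version that never reached the local cache cannot be started while the site is offline, so images should be pre-pulled before a rollout depends on them.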

Automated Rollback

When an incident occurs, it is most important to minimize recovery time. Automated rollback removes human decision-making latency from the recovery path (see Figure 4).

Figure 4: Automated rollback flow from detection to recovery

The same metrics that drive wave promotion also trigger rollback. If the success rate falls below its threshold, Fleet halts the remaining waves and rolls back to the most recent known-good version of the application. Operations teams are notified immediately through Slack and PagerDuty, and the full audit history is available for post-incident review.
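The rollback decision itself reduces to two questions: has the threshold been breached, and which version was last known to be good? The sketch below answers both under simplifying assumptions; the history format, function names, and the versions other than v2.4.1 are hypothetical, and the notification step is omitted.

```python
from typing import List, Optional, Tuple

MIN_SUCCESS_RATE = 99.0  # percent, from the health check thresholds

# (version, passed health checks) in deployment order, oldest first.
History = List[Tuple[str, bool]]


def should_rollback(success_rate: float) -> bool:
    """Trigger a rollback when the success rate drops below the threshold."""
    return success_rate < MIN_SUCCESS_RATE


def rollback_target(history: History) -> Optional[str]:
    """Return the most recent healthy version before the current (last) entry."""
    for version, healthy in reversed(history[:-1]):
        if healthy:
            return version
    return None


history = [("v2.3.9", True), ("v2.4.0", True), ("v2.4.1", False)]

if should_rollback(success_rate=97.5):
    target = rollback_target(history)
    print(f"halting rollout, rolling back to {target}")
```

Skipping over earlier unhealthy entries matters when two bad releases ship back to back: the target should be the last version that actually passed its checks, not simply the previous one.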

Key Outcomes

Using this framework at distributed edge locations yields quantifiable benefits:

Deployment Framework Results

Metric	Result
Deployment Success Rate	99.7% across 2,500+ locations
Mean Time to Deploy	45 minutes (full fleet, staged)
Automatic Rollback Time	Under 5 minutes
Disconnected Recovery	Automatic sync upon reconnection

Conclusion

CI/CD pipelines for distributed edge computing cannot simply copy the patterns of traditional cloud pipelines. By implementing GitOps-based synchronization through Rancher Fleet, wave-based rollouts with automated analysis, and disconnected operation capabilities, organizations achieve reliable deployments across thousands of edge locations. The pull-based model turns network unreliability from a blocker into expected behavior with automatic recovery.


Opinions expressed by DZone contributors are their own.
