Cloud Agnostic MLOps: How to Build and Deploy AI Models Across Azure, AWS, and Open Source
Avoid cloud lock-in when building AI. Learn how to use open-source MLOps tools like Airflow, Kubeflow, and MLflow to build, deploy, and monitor models anywhere.
Artificial intelligence has become the centerpiece of every digital strategy. What began as isolated proof-of-concepts running on data scientists’ laptops is now expected to scale across clouds, business units, and continents.
Enterprises quickly discover that the challenge is not building AI models. It’s operationalizing them sustainably.
Both Azure and AWS promise an end-to-end MLOps experience. Yet many leaders reach a moment of realization: the more managed services you adopt, the less control you retain over your operations. The alternative is emerging quietly but powerfully: a cloud-agnostic, open-source MLOps stack that provides the same capabilities without the invisible handcuffs. This is not an anti-cloud movement; it’s pro-freedom architecture.
The Cloud Convenience Dilemma
Azure Machine Learning and AWS SageMaker simplify the early stages of AI adoption.
Their integrated environments, data pipelines, registries, and endpoints can enable a model to transition from notebook to production in just weeks.
But convenience hides complexity:
- Each service introduces proprietary APIs and metadata formats.
- Costs grow with every experiment, even before business value appears.
- Porting workloads between regions or clouds becomes a migration project.
CTOs soon face a question larger than cost optimization:
“How do we ensure our AI remains portable, auditable, and sustainable over the next decade?”
The Tri-Stack Landscape
Here’s how Azure, AWS, and the open-source ecosystem map against one another:
| Capability | Azure | AWS | Open Source / Cloud-Agnostic |
|---|---|---|---|
| Data Orchestration | Azure Data Factory | AWS Glue / Step Functions | Apache Airflow / Prefect |
| Data Lake / Storage | ADLS Gen2 | S3 + Lake Formation | Apache Iceberg / Delta Lake + MinIO |
| Feature Store | Azure ML Feature Store | SageMaker Feature Store | Feast / Hopsworks |
| Experiment Tracking | Azure ML Workspaces | SageMaker Experiments | MLflow + DVC |
| Model Registry | Azure Model Registry | SageMaker Model Registry | MLflow Registry / OpenModelDB |
| Training Compute | AML Compute Clusters | SageMaker Training Jobs | Kubeflow / Argo Workflows / Ray |
| Inference Serving | Managed Endpoints (AKS) | SageMaker Endpoints | KServe / Seldon Core |
| Pipeline CI/CD | Azure Pipelines | CodePipeline + Step Functions | GitHub Actions + Argo CD / Flux |
| Monitoring & Drift | Azure Monitor + Insights | CloudWatch + SageMaker Monitor | Prometheus + Grafana + Evidently AI |
| Security & Policy | Defender for Cloud + Policy | GuardDuty + Config Rules | OPA + Vault + Trivy |
| Cost & FinOps | Azure Cost Management | Cost Explorer + Budgets | Kubecost / OpenCost |
Each column offers the same function. Only the third one, the open stack, lets you run it anywhere.
Data Pipelines and Feature Engineering
The Managed Way
- Azure Data Factory and AWS Glue provide GUI-based ETL with strong integration into their ecosystems.
- They’re excellent for quick starts, but orchestration logic remains locked inside each portal.
The Open Way
- Apache Airflow or Prefect express pipelines as Python code, versioned in Git.
- MinIO acts as an S3-compatible object store deployable on Kubernetes or bare metal.
- Apache Iceberg or Delta Lake adds table-level versioning and schema evolution.
Example Airflow snippet:
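from airflow import DAG
from airflow.operators.python import PythonOperator
with DAG("daily_etl", schedule="@daily") as dag:  # extract/transform are your own callables
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    extract_task >> transform_task  # run transform only after extract succeeds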
Result: identical reproducibility whether you run it on Azure Kubernetes Service, AWS EKS, or your own cluster.
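The DAG above refers to `extract` and `transform` without defining them. A minimal sketch of what such callables might look like follows, assuming a hypothetical list of order records held in memory; in a real pipeline they would read from and write to object storage such as MinIO, and Airflow would pass data between tasks through XCom or shared storage.

```python
from typing import Dict, List, Optional

# Hypothetical sample rows standing in for a real source system.
RAW_ORDERS: List[Dict[str, str]] = [
    {"order_id": "1", "amount": "19.99", "country": "de"},
    {"order_id": "2", "amount": "5.00", "country": "US"},
    {"order_id": "3", "amount": "", "country": "fr"},  # missing amount
]


def extract() -> List[Dict[str, str]]:
    """Return raw rows; a real task would query a database or an API."""
    return [dict(row) for row in RAW_ORDERS]


def transform(rows: Optional[List[Dict[str, str]]] = None) -> List[Dict[str, object]]:
    """Drop rows without an amount, then normalise types and casing."""
    source = extract() if rows is None else rows
    cleaned: List[Dict[str, object]] = []
    for row in source:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append(
            {
                "order_id": int(row["order_id"]),
                "amount": float(row["amount"]),
                "country": row["country"].upper(),
            }
        )
    return cleaned
```

Because both functions are ordinary Python with no cloud SDK calls, they can be unit-tested locally before they are ever scheduled.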
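For features, Feast can stand in for Azure ML Feature Store and SageMaker Feature Store. Entities and feature views are defined once in a feature repository (Python definitions plus a feature_store.yaml configuration file) and travel with the code to any environment.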
Experimentation and Reproducibility
Data scientists love notebooks; compliance teams don’t. Both Azure ML and SageMaker track experiments, but the metadata lives inside each platform.
Open alternatives such as MLflow and DVC record experiments as files under version control. Each run logs parameters, metrics, and artifacts that anyone can reproduce, regardless of their location.
import mlflow
with mlflow.start_run():
    mlflow.log_param("lr", 0.001)   # hyperparameter for this run
    mlflow.log_metric("f1", 0.89)   # evaluation result for this run
This approach transforms model tracking from cloud metadata into auditable evidence.
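One way to picture "auditable evidence" is a run record that can be committed to Git next to the code. The sketch below is not MLflow's storage format; it is a hypothetical record that pins the parameters, the metrics, and a hash of the training data, serialised deterministically so that identical runs produce identical text.

```python
import hashlib
import json
from typing import Any, Dict


def fingerprint(data: bytes) -> str:
    """SHA-256 digest used to pin the exact training data."""
    return hashlib.sha256(data).hexdigest()


def run_record(params: Dict[str, Any], metrics: Dict[str, float], data: bytes) -> str:
    """Serialise a run with sorted keys so the output is stable across runs."""
    record = {
        "params": params,
        "metrics": metrics,
        "data_sha256": fingerprint(data),
    }
    return json.dumps(record, sort_keys=True, indent=2)


# The CSV bytes are a stand-in for a real training file.
record = run_record({"lr": 0.001}, {"f1": 0.89}, b"id,label\n1,0\n2,1\n")
```

Any change to the data or the parameters changes the record, which makes silent drift between "the model we approved" and "the model we shipped" visible in a diff.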
Training at Scale
Cloud Approach
- Azure ML compute clusters and SageMaker training jobs manage autoscaling and GPUs, but abstract the underlying scheduler.
- You pay for each instance for as long as it is provisioned, so idle capacity still costs money unless clusters scale down.
Cloud-Agnostic Approach
- Kubeflow Pipelines or Argo Workflows run directly on Kubernetes, using your own scaling rules.
- Ray or Horovod distribute training efficiently across GPUs.
Example portable training job:
apiVersion: batch/v1
kind: Job
metadata:
  name: train-model
spec:
  template:
    spec:
      containers:
        - name: trainer
          image: myrepo/trainer:latest
          command: ["python", "train.py"]
      restartPolicy: Never
Move this YAML from AKS to EKS or to an on-prem cluster, and it behaves identically.
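The Job above runs `python train.py`, which the article does not show. One way to keep that script portable is to read every environment-specific value from environment variables, so the same image runs unchanged on any cluster. The sketch below assumes hypothetical variable names (`MODEL_DIR`, `LEARNING_RATE`, `EPOCHS`) and stubs out the actual training step.

```python
import json
import os
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Mapping


@dataclass(frozen=True)
class TrainConfig:
    model_dir: str
    learning_rate: float
    epochs: int


def load_config(env: Mapping[str, str] = os.environ) -> TrainConfig:
    """Build the config from environment variables (names are hypothetical)."""
    return TrainConfig(
        model_dir=env.get("MODEL_DIR", "/tmp/model"),
        learning_rate=float(env.get("LEARNING_RATE", "0.001")),
        epochs=int(env.get("EPOCHS", "3")),
    )


def train(config: TrainConfig) -> Path:
    """Placeholder training step: records the config where a model would go."""
    out_dir = Path(config.model_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    manifest = out_dir / "run_config.json"
    manifest.write_text(json.dumps(asdict(config), indent=2))
    return manifest


if __name__ == "__main__":
    print(f"Wrote {train(load_config())}")
```

In the Job manifest, these values would be supplied through the container's `env` section, so switching clusters means changing configuration rather than code.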
Model Packaging and Registry
Both cloud vendors offer internal registries. An open approach uses the MLflow Model Registry, which stores models as versioned artifacts (Pickle, ONNX, TorchScript) in MinIO or Nexus.
mlflow models serve -m models:/Churn/1 --port 5000
Your models now travel with you. No console migration needed.
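Once `mlflow models serve` is running, the model answers HTTP requests on its `/invocations` route. The sketch below builds such a request with the standard library, assuming MLflow 2.x's `dataframe_split` JSON input format and hypothetical feature names for the Churn model; the host and port mirror the command above.

```python
import json
import urllib.request
from typing import List, Sequence


def build_scoring_request(
    columns: Sequence[str],
    rows: List[Sequence[float]],
    base_url: str = "http://127.0.0.1:5000",
) -> urllib.request.Request:
    """Build (but do not send) a POST request for an MLflow scoring server."""
    payload = {
        "dataframe_split": {
            "columns": list(columns),
            "data": [list(row) for row in rows],
        }
    }
    return urllib.request.Request(
        url=f"{base_url}/invocations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Feature names are hypothetical; use the columns your model was trained on.
request = build_scoring_request(["tenure_months", "monthly_charges"], [[12, 70.5]])

# To call a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

The client only needs a URL, so the same code works whether the model is served on a laptop, in AKS, or in EKS.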
Deployment and Inference
Vendor Path
- Azure ML endpoints and SageMaker endpoints deploy models as managed APIs.
- Excellent uptime, but the serving layer is proprietary.
Open Path
- KServe and Seldon Core expose models as Kubernetes services.
- Both support REST/gRPC, A/B testing, canary rollout, and autoscaling.
Example Seldon manifest:
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: sentiment
spec:
  predictors:
    - name: default            # predictor name
      graph:
        name: classifier       # graph node name
        implementation: SKLEARN_SERVER
        modelUri: "s3://minio/models/sentiment"
      replicas: 2
Inference now becomes infrastructure code: portable and observable.
CI/CD Pipelines
Azure DevOps and CodePipeline work best inside their own ecosystems. Open pipelines unite DevOps and MLOps:
- GitHub Actions → build and test
- DVC → reproduce training
- Argo CD → GitOps deployment
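# GitLab CI-style syntax; the same two stages map onto GitHub Actions jobs
stages:
  - train
  - deploy
train_model:
  stage: train
  script:
    - dvc repro
    - mlflow run .
deploy_model:
  stage: deploy
  script:
    - kubectl apply -f seldon.yaml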
This pattern turns every model into a versioned, traceable release.
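A traceable release usually also needs a gate between training and deployment. The sketch below is one hypothetical way to add it: a small script that the pipeline could run after training, comparing the new model's metrics with minimum thresholds and returning a non-zero exit code when any check fails. The metric names and thresholds are assumptions, not values from the article.

```python
import sys
from typing import Dict, List

# Hypothetical minimum acceptable value for each metric.
THRESHOLDS: Dict[str, float] = {"f1": 0.85, "precision": 0.80}


def failed_checks(metrics: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return one message per metric that is missing or below its threshold."""
    failures: List[str] = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif value < minimum:
            failures.append(f"{name}: {value:.3f} < {minimum:.3f}")
    return failures


def gate(metrics: Dict[str, float]) -> int:
    """Return a process exit code: 0 allows deployment, 1 blocks it."""
    failures = failed_checks(metrics, THRESHOLDS)
    for failure in failures:
        print(f"BLOCKED {failure}", file=sys.stderr)
    return 1 if failures else 0
```

A CI job would call `sys.exit(gate(metrics))` after loading the metrics from the training run, so a failing model stops the pipeline before the deploy stage.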
Observability and Drift Detection
Cloud services offer comprehensive dashboards but often siloed metrics. An open stack unifies everything.
| Layer | Tool | Purpose |
|---|---|---|
| Metrics | Prometheus | Scrape model and infra metrics |
| Visualization | Grafana | Unified dashboards |
| Drift | Evidently AI | Statistical drift reports |
| Data Quality | Great Expectations | Schema and validation checks |
Example drift detection:
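# Evidently 0.4.x-style API; import paths may differ in newer releases
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset
# ref_df: training-time DataFrame; prod_df: recent production DataFrame
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=ref_df, current_data=prod_df)
report.save_html("drift_report.html")  # shareable HTML report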
Attach these reports to Grafana or Slack for visibility without subscriptions.
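To make "statistical drift" concrete, the sketch below computes the Population Stability Index (PSI) for one numeric feature using only the standard library. PSI is a common drift statistic chosen here for illustration; it is not necessarily what Evidently's `DataDriftPreset` computes for a given column. The bin count and the sample values are assumptions.

```python
import math
from typing import List, Sequence


def psi(reference: Sequence[float], current: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between two samples of one feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant data

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clamp so out-of-range production values land in the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor keeps log() defined for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


# Synthetic data: a uniform reference sample and a copy shifted by 3 units.
reference = [float(x % 10) for x in range(1000)]
shifted = [v + 3.0 for v in reference]
```

A widely quoted rule of thumb treats a PSI above roughly 0.2 as a shift worth investigating, though the right threshold depends on the feature and the business context.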
Governance, Security, and Explainability
Azure Defender and AWS GuardDuty protect their own clouds; they don’t cover workloads you run elsewhere.
An open model uses:
- OPA (Open Policy Agent) for policy-as-code (“no model deploys without approval”).
- Trivy for container scanning.
- Vault + Sealed Secrets for credentials.
- AI Fairness 360 and Alibi for bias detection and explainability.
Example OPA policy:
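package ml.deployment
# Pre-1.0 Rego syntax; OPA 1.0+ writes the rule head as `deny contains msg if`
# Deny when approval is false or missing entirely
deny[msg] {
    not input.model.approved
    msg := "Model lacks approval metadata"
}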
Governance becomes code, not a dashboard toggle.
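A deployment script can consult such a policy over OPA's REST Data API, where the package `ml.deployment` and the rule `deny` map to the path `/v1/data/ml/deployment/deny`. The sketch below builds that query and interprets the reply; the OPA address and the shape of the model metadata are assumptions.

```python
import json
import urllib.request
from typing import Any, Dict, List


def build_opa_query(
    model: Dict[str, Any],
    opa_url: str = "http://localhost:8181",
) -> urllib.request.Request:
    """Build (but do not send) a POST to OPA's Data API for the deny rule."""
    body = json.dumps({"input": {"model": model}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{opa_url}/v1/data/ml/deployment/deny",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def denial_messages(response_body: Dict[str, Any]) -> List[str]:
    """Extract deny messages; 'result' is absent when the rule is undefined."""
    return list(response_body.get("result", []))


# Example input for a model that has not been approved yet.
query = build_opa_query({"name": "sentiment", "approved": False})
```

The deploy step would send `query` with `urllib.request.urlopen`, parse the JSON reply, and refuse to continue if `denial_messages` returns anything.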
Continuous Retraining and Automation
Azure ML Pipelines and SageMaker Pipelines automate retraining. You can mirror that behavior with Airflow or Kubeflow Pipelines listening to drift metrics.
# drift_score, accuracy_drop, and trigger_retrain() come from your monitoring job
if drift_score > 0.3 or accuracy_drop > 0.05:
    trigger_retrain()
Pipeline:
Monitor → Retrain → Validate → Register → Canary Deploy
Argo Rollouts handles canary steps just like Azure’s blue-green or AWS’s weighted deployments.
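The trigger and the stages above can be sketched as plain Python. The thresholds are the ones from the snippet; the stage functions are hypothetical placeholders that a real setup would replace with Airflow tasks or Kubeflow components.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class MonitoringSnapshot:
    drift_score: float
    accuracy_drop: float


def should_retrain(
    snapshot: MonitoringSnapshot,
    drift_limit: float = 0.3,
    accuracy_limit: float = 0.05,
) -> bool:
    """Same rule as the snippet above: either signal alone triggers retraining."""
    return snapshot.drift_score > drift_limit or snapshot.accuracy_drop > accuracy_limit


Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(snapshot: MonitoringSnapshot, stages: Sequence[Stage]) -> List[str]:
    """Run stages in order, stopping at the first failure.

    Returns the names of the stages that completed successfully.
    """
    if not should_retrain(snapshot):
        return []
    completed: List[str] = []
    for name, stage in stages:
        if not stage():
            break  # e.g. validation failed, so nothing is registered or deployed
        completed.append(name)
    return completed


# Placeholder stages that always succeed; replace with real tasks.
STAGES: List[Stage] = [
    ("retrain", lambda: True),
    ("validate", lambda: True),
    ("register", lambda: True),
    ("canary_deploy", lambda: True),
]
```

Stopping at the first failed stage is the important design choice: a model that fails validation never reaches the registry or the canary step.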
FinOps and Cost Visibility
Both clouds expose rich billing APIs, but only for their own usage. Kubecost and OpenCost aggregate spend across clusters and can even compare on-prem and cloud costs.
- See GPU utilization, pod cost, and namespace efficiency.
- Feed data to Prometheus for real-time dashboards.
- Integrate with Slack or Jira for anomaly alerts.
Suddenly, AI cost management becomes transparent; no billing console is required.
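The kind of roll-up these tools provide can be illustrated with a few lines of Python. The sketch below aggregates hypothetical per-pod cost records by namespace and flags namespaces that exceed a budget; the record fields and figures are invented for illustration and do not reflect Kubecost's or OpenCost's actual export format.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class PodCost(NamedTuple):
    cluster: str
    namespace: str
    pod: str
    cost_usd: float


def cost_by_namespace(records: Iterable[PodCost]) -> Dict[str, float]:
    """Sum pod costs per namespace across all clusters."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        totals[record.namespace] += record.cost_usd
    return {namespace: round(total, 2) for namespace, total in totals.items()}


def over_budget(totals: Dict[str, float], budgets: Dict[str, float]) -> List[str]:
    """Return namespaces whose spend exceeds their budget, sorted by name."""
    return sorted(
        namespace
        for namespace, total in totals.items()
        if total > budgets.get(namespace, float("inf"))
    )


# Invented figures spanning two clouds and an on-prem cluster.
RECORDS = [
    PodCost("aks-prod", "training", "train-model-abc", 42.50),
    PodCost("eks-prod", "training", "train-model-def", 37.25),
    PodCost("on-prem", "serving", "sentiment-0", 8.10),
]
```

Grouping by namespace rather than by cluster is what lets a single view cover AKS, EKS, and on-prem capacity at once.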
Security and Compliance Across Clouds
The modern enterprise operates under SOC 2, ISO 27001, and now AI-ethics mandates. Open tools help close compliance gaps:
| Concern | Open-source remedy |
|---|---|
| Container Vulnerabilities | Trivy / Clair |
| Secrets | Vault / Sealed Secrets |
| Policy Enforcement | OPA |
| Network Segmentation | Kubernetes NetworkPolicies |
| Audit Trail | MLflow + Git Commit Metadata |
Security shifts left: it is baked into pipelines rather than added after the fact.
Architecture View
Vendor architecture:
Azure Data Factory → Azure ML Studio → AKS Endpoints → Azure Monitor
AWS Glue → SageMaker Train/Deploy → CloudWatch
Open architecture:
Airflow / Prefect → Iceberg + Feast → Kubeflow Train → MLflow Registry →
KServe Deploy → Prometheus + Grafana + Evidently → OPA Governance
One runs on a cloud; the other runs across clouds.
The Business Case for Cloud-Agnostic AI
- Freedom to move: Avoiding lock-in means negotiating leverage and compliance flexibility.
- Unified skillset: Engineers learn Kubernetes, not five different proprietary portals.
- Transparent costs: FinOps is simpler when every byte and pod is observable.
- Auditability: Regulatory traceability improves when every artifact lives in Git and open databases.
- Innovation velocity: Open ecosystems often evolve faster than managed ones.
What CTOs Should Ask Before Committing to Any Platform
- Can we rebuild this pipeline in another region tomorrow without making any code changes?
- Who owns the feature store metadata, us or the vendor?
- Are retraining triggers visible to auditors?
- Can our FinOps dashboard combine on-prem and cloud costs?
- If the cloud were unreachable for 48 hours, could we still deploy locally?
If any answer is “no,” lock-in already exists.
The Path Forward
The practical approach isn’t abandoning Azure or AWS; it’s decoupling from them.
- Keep data in open formats (Parquet, Iceberg).
- Standardize on MLflow for tracking.
- Use Kubernetes as the substrate everywhere.
- Implement GitOps (Argo CD) for reproducibility.
- Integrate Evidently AI, Kubecost, and OPA early.
Managed services become optional accelerators, not dependencies.
The Future of Cloud-Agnostic AI
Next-generation trends are reinforcing this philosophy:
- BentoML + OpenLLM serving open-weight models on any cluster.
- Federated frameworks like Flower enabling distributed learning across data silos.
- Composable MLOps stacks (Polyaxon, Metaflow) integrating with any storage or orchestrator.
- Policy-aware pipelines that self-validate bias and compliance before deployment.
AI’s evolution is moving toward autonomy and accountability, and openness enables both.
Conclusion: Freedom Is the Ultimate Optimization
Building AI models is no longer the competitive edge; delivering them anywhere, securely and sustainably, is.
Azure and AWS provide powerful managed experiences, but their strength is also their cage. An open, cloud-agnostic architecture built on Airflow, Kubeflow, MLflow, KServe, Prometheus, and OPA offers the same intelligence without the constraints.
Enterprises that master this model can:
- Train on Azure today
- Deploy on AWS tomorrow
- Retrain on-prem next quarter without rewriting a single line
In a world where infrastructure changes every few years, portability is the new productivity. The most innovative organizations won’t just own their data. They’ll own their destiny.