Microservices Resources

DZone's Featured Microservices Resources

Data Engineering for AI-Native Architectures: Designing Scalable, Cost-Optimized Data Pipelines to Power GenAI, Agentic AI, and Real-Time Insights

By Abhishek Gupta

CORE

Editor's Note: The following is an article written for and published in DZone's 2025 Trend Report, Data Engineering: Scaling Intelligence With the Modern Data Stack. The data engineering landscape has undergone a fundamental transformation with a complete reimagining of how data flows through organizations. Traditional business intelligence (BI) pipelines were built for looking backward, answering questions like "How did we perform last quarter?" Today's AI-native architectures demand systems that can feed real-time insights to recommendation engines, provide context to large language models, and maintain the massive vector stores that power retrieval-augmented generation (RAG). For data engineers, this evolution has created both opportunity and complexity. The core principles of data quality, observability, and scalable architecture remain as important as ever, but the technical requirements have expanded dramatically. What makes this particularly challenging is that most organizations can't simply rip out their existing data infrastructure and start fresh. A successful approach involves building AI-native capabilities on top of traditional analytics systems. Organizations that successfully build scalable, cost-optimized data platforms capable of supporting both traditional analytics and cutting-edge AI applications will find themselves with a significant competitive advantage. Those that don't risk being left behind as AI capabilities become table stakes across every industry vertical. This article explores how data engineering teams can navigate this transformation, from redesigning core architecture patterns to implementing the new toolsets and practices required for AI-native data platforms. The Transformation of Data Architecture for AI Traditional data architectures were built around the assumption that data would be cleaned, structured, and processed in predictable batches. Your nightly ETL job could take six hours because the BI dashboard only refreshes once a day. AI systems, particularly modern generative and agentic AI applications, operate under completely different assumptions. If you're trying to feed a transformer model with terabytes of training data or build a RAG system that needs to retrieve semantically similar documents in under 100ms, your traditional ETL pipeline and star schema aren't going to cut it. The companies getting AI right aren't just bolting machine learning onto their existing data warehouse. Rather, they're building AI-native architectures from the ground up. Aspect Traditional Data Architecture AI-Native Data Architecture Data types Structured data only (CSV, JSON, SQL tables) Multimodal data (text, images, audio, video, sensor data) Processing model Batch processing (ETL pipelines) Real-time streaming + batch hybrid Latency requirements Hours to days acceptable Milliseconds to seconds required Processing schedule Scheduled jobs (nightly, weekly) Event driven, continuous processing Data storage Data warehouses with star schema Data lakes + vector databases + feature stores Storage format Relational tables, normalized data Object storage (Parquet, Delta Lake, Iceberg) Query patterns SQL-based analytics queries Vector similarity search + traditional queries Infrastructure Cloud native, serverless, GPU-accelerated Cloud native, serverless, GPU accelerated Consumption BI dashboards, reports AI applications, recommendation engines, chatbots Table 1. Traditional vs. AI-native data architectures From Batch to Streaming: Real-Time Data for AI Systems Modern AI applications, especially those involving recommendation systems, fraud detection, and conversational AI, require continuous learning from fresh data streams. When a user interacts with a ChatGPT-style application, the system needs to understand context, not just from the current conversation but also from real-time signals about user behavior, system performance, and external data sources. This creates a fundamentally different set of requirements than traditional analytics workloads. Streaming platforms like Apache Kafka and Apache Pulsar provide distributed event streaming capabilities for real-time data processing: Kafka uses a log-based architecture with topics partitioned across brokers, supporting configurable consistency levels from at-least-once to exactly-once delivery semantics.Pulsar implements a multi-layered design, separating serving from storage through Apache BookKeeper. BookKeeper enables infinite retention without performance degradation and native schema registry support for enforcing data contracts across different services and versions. These are key parts of streaming architectures for generative AI that implement AI-specific patterns. Real-time streams capture user interactions, model outputs, and feedback signals, feeding them immediately to online learning systems and safety monitoring. Meanwhile, batch processes handle compute-intensive operations like embedding generation for RAG systems and large-scale model fine-tuning on accumulated conversation data. Managing Unstructured Data Workflows Unstructured data ranging from text and images to audio and video is a first-class citizen in the world of data engineering. Unlike structured data, which fits neatly into rows and columns, unstructured data is inherently messy and diverse. Yet it is precisely this richness that makes it indispensable for training large language models, building multimodal AI systems, and powering next-gen user experiences. Organizations are increasingly investing in data pipelines that can ingest, store, transform, and serve unstructured data reliably. Modern orchestration tools like Apache Airflow, Dagster, and Prefect are evolving to support these needs, often in combination with cloud-native object storage solutions such as Azure Blob Storage. Processing unstructured data at scale typically involves distributed compute frameworks like Apache Spark or Dask, containerized workloads running on Kubernetes, or GPU-accelerated tasks for specialized transformations. Pipelines are increasingly modular, chaining together tasks such as format normalization, metadata extraction, vectorization, and enrichment. These stages must be resilient, parallelizable, and auditable, thereby capable of handling large volumes while preserving the fidelity and lineage of the original content. Also bear in mind that the challenge isn't just processing unstructured data. It's also about maintaining quality and provenance throughout the pipeline. When the AI model starts producing biased or incorrect outputs, you need to be able to trace those problems back to specific data sources, processing steps, and transformation logic. This requires a level of observability and metadata management that goes far beyond traditional data engineering practices. Building Scalable AI Data Infrastructure The shift toward AI-native architectures demands infrastructure that can simultaneously serve a business analyst running quarterly reports and a research team fine-tuning a foundation model on terabytes of unstructured data. What's particularly challenging is that AI workloads are inherently unpredictable. A traditional ETL pipeline processes the same data volumes at scheduled intervals, but AI training runs can consume massive compute resources for days before completion, while inference workloads might spike suddenly when a new product feature launches. This variability requires infrastructure that can scale both horizontally and vertically without manual intervention, often leveraging cloud-native services that can provision resources on demand. High-Volume Dataset Strategies for AI Training Training foundation models and agentic AI systems requires infrastructure that can handle massive, heterogeneous datasets. Object storage systems are favored for their scalability and cost efficiency, supporting large-scale ingestion through batch or streaming pipelines. Columnar formats such as Apache Parquet, combined with partitioning, reduce I/O load and improve access speed for downstream processing. Data lakes serve as the standard architecture for managing heterogeneous AI datasets, accommodating raw, unstructured data while maintaining essential metadata for governance. Apache Iceberg and Delta Lake provide atomicity, consistency, isolation, and durability (ACID) transactions and time-travel capabilities, enabling dataset versioning and rollback functionality when model performance degrades. Integration with distributed frameworks like Spark and Dask creates sophisticated ETL pipelines that transform raw data into training-ready formats while maintaining provenance tracking for enterprise compliance requirements. Distributed computing frameworks form the computational backbone for large-scale AI operations, with Ray and Spark dominating based on workload characteristics. Ray excels in fine-grained parallelism and stateful computations for reinforcement learning and agentic systems, while Spark remains preferred for batch processing and supervised learning pipelines. GPU-aware schedulers and CUDA-enabled processing libraries accelerate throughput but require careful memory management and data locality optimization to minimize movement between storage and compute layers. Vector Databases and Embedding Management Vector embeddings are numerical representations of data (e.g., text, images, audio) that capture semantic meaning in high-dimensional space, enabling generative AI apps to understand context, perform similarity searches, and retrieve relevant information for tasks like RAG and semantic search. Vector databases like Pinecone, Weaviate, and Chroma solve a fundamental problem: how to efficiently store, index, and query high-dimensional embeddings that represent the semantic meaning of text, images, audio, or any other data types. These databases use specialized indexing algorithms like hierarchical navigable small world (HNSW) or inverted file (IVF) that enable approximate nearest neighbor searches across millions of vectors with low latency. Figure 1. RAG with vector databases Managing embeddings across diverse models and use cases presents significant complexity. A single organization might use OpenAI's text-embedding-ada-002 for document search, custom-trained image models for visual similarity, and specialized models for code search, each producing vectors with different dimensionalities and semantic characteristics. This needs careful namespace management, version control, and compatibility planning. During model upgrades or migrations, new or fine-tuned models generate embeddings in different semantic spaces, making existing vectors non-interchangeable. To address this, teams typically version their embedding collections and implement migration strategies such as dual-serving or fallback mechanisms, allowing multiple versions to coexist during phased transitions. Operationally, embedding management extends beyond storage into the design of efficient generation pipelines. For large datasets, especially with multimodal inputs like images or video, embedding computation is resource intensive. Successful systems often combine batch generation for historical data with real-time streaming for new content, enabling both large-scale reprocessing during model changes and incremental updates as data evolves. Implementation and Optimization Strategies The gap between AI ambitions and production reality often comes down to implementation choices that seem minor but compound into major operational headaches. You can architect the most elegant AI-native data platform on paper, but if you can't deploy it reliably, scale it cost effectively, or maintain it without a team of specialists, you've built an expensive science project. Smart organizations are learning that the key isn't choosing between open-source and commercial solutions. It's understanding where each fits in your specific context. The companies succeeding with AI at scale aren't necessarily the ones with the biggest budgets; they're the ones that have figured out how to balance flexibility, cost, and operational complexity in ways that actually work for their teams. The most effective AI data platforms share a common trait: They're designed for evolution, not perfection. They anticipate that your AI workloads will change, your data volumes will grow unpredictably, and your team will need to adapt to technologies that don't exist yet. Building for this reality requires a fundamentally different approach to implementation and optimization than traditional data infrastructure. DataOps for AI: Automation and Orchestration When your data pipeline breaks at 2 AM because someone changed a schema upstream, you quickly realize that manual processes don't scale with AI workloads. DataOps for AI isn't just about applying DevOps principles to data; it's about recognizing that AI systems have fundamentally different reliability requirements than traditional analytics. Unlike traditional ETL jobs that follow predictable patterns, AI pipelines often involve retraining models based on data quality thresholds, triggering inference updates when embedding models change, or coordinating between multiple agentic systems that need to share state. The most successful implementations treat AI pipeline orchestration as a first-class engineering discipline. This means version controlling not just your code but also your pipeline definitions, model artifacts, and even the configurations that determine when models get retrained. Companies like Netflix and Uber have shown that you can achieve remarkable reliability by building orchestration systems that understand the relationships between data, models, and business outcomes. Testing becomes particularly crucial when your pipelines feed AI systems that make autonomous decisions. In addition to unit testing for individual transformations, you need: Integration tests that verify end-to-end behaviorSchema evolution tests that catch breaking changes before they propagateMonitoring that can detect subtle data drift that might degrade model performance over time Your team should be able to deploy changes confidently, understand why failures happen, and recover quickly when things go wrong. This requires building orchestration systems that provide clear visibility into dependencies, explicit control over rollback scenarios, and detailed logging that helps you debug complex interactions between multiple AI systems. Cross-Functional Collaboration and Governance The biggest technical challenge in AI data engineering often isn't the technology; it's the organizational complexity that emerges when data engineers, data scientists, ML engineers, and business stakeholders all need different things from the same underlying infrastructure: Data scientists want flexibility to experiment with new models and datasets.Business stakeholders want reliable and auditable systems that meet compliance requirements.Data engineers want maintainable architectures that don't require constant firefighting. Effective governance starts with establishing clear ownership boundaries through technical mechanisms rather than relying solely on process documentation. This includes implementing automated data quality checks, feature stores that enforce consistent transformations, and metadata systems that track lineage from raw data through model training to business outcomes. Model governance becomes particularly complex in agentic AI systems where multiple models interact autonomously, requiring frameworks for understanding how changes in one model might affect dependent systems and monitoring that can detect emergent behaviors. Privacy and compliance add another layer of complexity as GDPR right-to-deletion requests become nightmarish when you need to remove specific data points from trained models, embedding stores, and vector databases. The goal is creating collaboration patterns that scale with your AI initiatives by treating governance as an engineering problem that can be solved with the right combination of tooling, automation, and organizational design, enabling teams to move quickly without creating systemic risks. Conclusion We're witnessing the emergence of an architectural paradigm where the distinction between "analytics data" and "AI data" is dissolving. Systems like recommendation engines, fraud detection, and chatbots increasingly rely on shared data infrastructure but with vastly different performance and consistency requirements. The path forward is to design data platforms with AI as the primary consumer, retrofitting for traditional workloads rather than the other way around. This shift enables support for evolving needs like large context windows, vector-based retrieval, and complex agentic workflows. The shift also introduces specific technical demands: Cost management must include storage format efficiency, cross-region data movement, and duplication across batch and real-time paths.Performance monitoring needs to capture embedding drift, vector index quality, and inference latency under load.Orchestration has to coordinate interdependent steps across ingestion, transformation, training, and serving. These aren't speculative problems; they already surface in production environments that support retrieval-augmented generation, fine-tuned models, and agent-based systems. The organizations that get this right will move beyond efficiency gains to build data architectures that unlock durable AI capabilities. Additional resources: Getting Started With Large Language Models by Tuhin Chattopadhyay, DZone RefcardGetting Started With Agentic AI by Lahiru Fernando, DZone RefcardGetting Started With Vector Databases by Miguel Garcia Lorenzo, DZone RefcardGetting Started With Data Quality by Miguel Garcia Lorenzo, DZone RefcardApache Kafka Patterns and Anti-Patterns by Abhishek Gupta, DZone Refcard This is an excerpt from DZone's 2025 Trend Report, Data Engineering: Scaling Intelligence With the Modern Data Stack.Read the Free Report More

Expert Techniques to Trim Your Docker Images and Speed Up Build Times

By Chirag Agrawal

Key Takeaways Pick your base image like you're choosing a foundation for your house. Going with a minimal variant like python-slim or a runtime-specific CUDA image, is hands down the quickest way to slash your image size and reduce security risks.Multi-stage builds are your new best friend for keeping things organized. Think of it like having a messy workshop (your "builder" stage) where you do all the heavy lifting with compilers and testing tools, then only moving the finished product to your clean showroom (the "runtime" stage).Layer your Dockerfile with caching in mind, always. Put the stuff that rarely changes (like dependency installation) before the stuff that changes all the time (like your app code). This simple trick can cut your build times from minutes to mere seconds.Remember that every RUN command creates a permanent layer. You've got to chain your installation and cleanup commands together with && to make sure temporary files actually disappear within the same layer. Otherwise, you're just hiding a mess under the rug while still paying for the storage.Stop treating .dockerignore like an afterthought. Make it your first line of defense to keep huge datasets, model checkpoints, and (yikes!) credentials from ever getting near your build context. So you've built your AI model, containerized everything, and hit docker build. The build finishes, and there it is: a multi-gigabyte monster staring back at you. If you've worked with AI containers, you know this pain. Docker's convenience comes at a price, and that price is bloated, sluggish images that slow down everything from developer workflows to CI/CD pipelines while burning through your cloud budget. This guide isn't just another collection of Docker tips. We're going deep into the fundamental principles that make containers efficient. We'll tackle both sides of the optimization coin: The Architecture: Making smart choices about base images and how you structure your builds.The Mechanics: Getting your hands dirty with layers, caching, and cleanup techniques. To keep things real, we'll work through an actual example: a text classification app using BERT. We'll take this beast from a 2.37GB container that takes forever to build down to a slim 720MB image that rebuilds in 25 seconds. Let's dive in. The Starting Point: Diagnosing a 2.37GB AI Image Our starting project, naive_image, works fine, but it's definitely not winning any optimization awards. A quick build tells the whole story: we're looking at a 2.37GB image that takes 56 seconds to build on an Apple M1 Max. Ouch. Dockerfile (from naive_image): Dockerfile # naive_image/Dockerfile # This is the initial, naive Dockerfile. # It aims to be simple and functional, but NOT optimized for size or speed. # Use a standard, general-purpose Python image. FROM python:3.10 RUN apt-get update && apt-get install -y curl # Set the working directory inside the container # All subsequent commands will run from this directory WORKDIR /app # Copy requirements first for better layer caching COPY naive_image/requirements.txt ./requirements.txt # Install all dependencies listed in requirements.txt. RUN pip install --no-cache-dir -r requirements.txt # Copy application code and data COPY naive_image/app/ ./app/ COPY naive_image/sample_data/ ./sample_data/ RUN echo "Build complete" > /app/build_status.txt # Command to run the application when the container starts. # This runs the predictor script with the sample text file. CMD ["python", "app/predictor.py", "sample_data/sample_text.txt"] Running docker history immediately shows us what's going wrong. We've got two main offenders here: that chunky python:3.10 base image and a massive pip install layer that's adding about 1.5GB all by itself. It's installing everything and the kitchen sink. Now that we know what we're dealing with, let's fix it. docker history bert-classifier-naive You'll see something like this: Shell IMAGE CREATED CREATED BY SIZE COMMENT b0693be54230 2 minutes ago CMD ["python" "app/predictor.py" "sample_dat… 0B buildkit.dockerfile.v0 <missing> 2 minutes ago RUN /bin/sh -c echo "Build complete" > /app/… 15B buildkit.dockerfile.v0 <missing> 2 minutes ago COPY naive_image/sample_data/ ./sample_data/… 376B buildkit.dockerfile.v0 <missing> 2 minutes ago COPY naive_image/app/ ./app/ # buildkit 12.2kB buildkit.dockerfile.v0 <missing> 2 minutes ago RUN /bin/sh -c pip install --no-cache-dir -r… 1.51GB buildkit.dockerfile.v0 <missing> 3 minutes ago COPY naive_image/requirements.txt ./requirem… 362B buildkit.dockerfile.v0 <missing> 3 minutes ago WORKDIR /app 0B buildkit.dockerfile.v0 <missing> 3 minutes ago RUN /bin/sh -c apt-get update && apt-get ins… 19.4MB buildkit.dockerfile.v0 <missing> 3 weeks ago CMD ["python3"] 0B buildkit.dockerfile.v0 <missing> 3 weeks ago RUN /bin/sh -c set -eux; for src in idle3 p… 36B buildkit.dockerfile.v0 <missing> 3 weeks ago RUN /bin/sh -c set -eux; wget -O python.ta… 58.2MB buildkit.dockerfile.v0 <missing> 3 weeks ago ENV PYTHON_SHA256=4c68050f049d1b4ac5aadd0df5… 0B buildkit.dockerfile.v0 <missing> 3 weeks ago ENV PYTHON_VERSION=3.10.17 0B buildkit.dockerfile.v0 <missing> 3 weeks ago ENV GPG_KEY=A035C8C19219BA821ECEA86B64E628F8… 0B buildkit.dockerfile.v0 <missing> 3 weeks ago RUN /bin/sh -c set -eux; apt-get update; a… 18.2MB buildkit.dockerfile.v0 <missing> 3 weeks ago ENV LANG=C.UTF-8 0B buildkit.dockerfile.v0 <missing> 3 weeks ago ENV PATH=/usr/local/bin:/usr/local/sbin:/usr… 0B buildkit.dockerfile.v0 <missing> 16 months ago RUN /bin/sh -c set -ex; apt-get update; ap… 560MB buildkit.dockerfile.v0 <missing> 16 months ago RUN /bin/sh -c set -eux; apt-get update; a… 183MB buildkit.dockerfile.v0 <missing> 2 years ago RUN /bin/sh -c set -eux; apt-get update; a… 48.5MB buildkit.dockerfile.v0 Part 1: Blueprint for Efficiency: Base Images and Multi-Stage Builds Before we start tweaking the small stuff, let's fix the big architectural issues. A. The Quick Win: Slim Base Images Think of your FROM instruction as choosing the foundation for a house. Pick a heavy foundation, and you're stuck with a heavy house. The standard python:3.10 image comes with the full Debian experience, complete with development tools and libraries you'll never need in production. Our first move? Switch to something leaner. Check out what happens when we make this one simple change in our slim_image project's Dockerfile: Dockerfile # slim_image/Dockerfile FROM python:3.10-slim WORKDIR /app COPY ./requirements.txt ./ COPY ./app/ ./app/ COPY ./sample_data/ ./sample_data/ RUN pip install --no-cache-dir -r requirements.txt CMD ["python", "app/predictor.py", "sample_data/sample_text.txt"] Just this one line change and boom: we go from 2.54GB to 1.66GB, and build time drops from 56s to 51s. Running docker history shows exactly why this works. The python:3.10-slim base is hundreds of megabytes smaller. The same principle applies if you're doing GPU work: always go for the lean nvidia/cuda:<version>-runtime image instead of the bloated nvidia/cuda:<version>-devel for production. docker history bert-classifier-slim Here's what you'll see: Shell IMAGE CREATED CREATED BY SIZE COMMENT 4633330c13b5 9 seconds ago CMD ["python" "app/predictor.py" "sample_dat… 0B buildkit.dockerfile.v0 <missing> 9 seconds ago RUN /bin/sh -c pip install --no-cache-dir -r… 1.34GB buildkit.dockerfile.v0 <missing> 58 seconds ago COPY slim_image/sample_data/ ./sample_data/ … 376B buildkit.dockerfile.v0 <missing> 58 seconds ago COPY slim_image/app/ ./app/ # buildkit 5.51kB buildkit.dockerfile.v0 <missing> 58 seconds ago COPY slim_image/requirements.txt ./requireme… 334B buildkit.dockerfile.v0 <missing> 58 seconds ago WORKDIR /app 0B buildkit.dockerfile.v0 <missing> 5 weeks ago CMD ["python3"] 0B buildkit.dockerfile.v0 <missing> 5 weeks ago RUN /bin/sh -c set -eux; for src in idle3 p… 36B buildkit.dockerfile.v0 <missing> 5 weeks ago RUN /bin/sh -c set -eux; savedAptMark="$(a… 46.4MB buildkit.dockerfile.v0 <missing> 5 weeks ago ENV PYTHON_SHA256=ae665bc678abd9ab6a6e1573d2… 0B buildkit.dockerfile.v0 <missing> 5 weeks ago ENV PYTHON_VERSION=3.10.18 0B buildkit.dockerfile.v0 <missing> 5 weeks ago ENV GPG_KEY=A035C8C19219BA821ECEA86B64E628F8… 0B buildkit.dockerfile.v0 <missing> 5 weeks ago RUN /bin/sh -c set -eux; apt-get update; a… 9.18MB buildkit.dockerfile.v0 <missing> 5 weeks ago ENV LANG=C.UTF-8 0B buildkit.dockerfile.v0 <missing> 5 weeks ago ENV PATH=/usr/local/bin:/usr/local/sbin:/usr… 0B buildkit.dockerfile.v0 <missing> 5 weeks ago # debian.sh --arch 'arm64' out/ 'bookworm' '… 97.2MB debuerreotype 0.15 B. The Architectural Leap: Isolating Environments With Multi-Stage Builds Okay, our image is smaller, but let's be honest: is it really production-ready? Take a peek at our dependencies. We've got the essentials like transformers and torch, but also a bunch of dev tools like pytest, black, and jupyter. These are great for development, but in production? They're just dead weight and potential security holes. Shipping dev tools to production is like bringing your entire toolbox when you only need a screwdriver. This is where multi-stage builds come to the rescue. Here's the mental model: imagine building a piece of furniture. You do all the messy work in your garage workshop, which is full of saws, drills, and sawdust. Once you're done, you move only the finished furniture to your living room. The garage stays messy, but who cares? Your living room is pristine. In Docker terms, your "builder" stage is that messy garage. You can install compilers, testing frameworks, whatever you need. Then, in your "runtime" stage, you start fresh and cherry-pick only the finished pieces you actually need. When the build completes, Docker throws away the entire garage. Here's how it actually works: FROM python:3.10 AS builder: This creates your workshop and gives it a name.Inside this stage, go wild. Install everything, run tests with RUN pytest, whatever you need.FROM python:3.10-slim AS runtime: Start fresh with a clean stage.COPY --from=builder <source> <destination>: This is where the magic happens. You can selectively grab stuff from your builder stage. Let's see this in action with our multistage_image project's Dockerfile: Dockerfile # multistage_image/Dockerfile # ====== BUILD STAGE ====== FROM python:3.10 AS builder WORKDIR /app COPY multistage_image/requirements.txt runtime_requirements.txt ./ RUN pip install --no-cache-dir -r requirements.txt # Installs ALL deps, including pytest # You could run tests here: RUN pytest # ====== RUNTIME STAGE ====== FROM python:3.10-slim AS runtime WORKDIR /app COPY multistage_image/runtime_requirements.txt ./ RUN pip install --no-cache-dir -r runtime_requirements.txt # Installs ONLY runtime deps COPY multistage_image/app/ ./app/ COPY multistage_image/sample_data/ ./sample_data/ CMD ["python", "app/predictor.py", "sample_data/sample_text.txt"] The results speak for themselves: we're down to 827MB and builds take just 24 seconds. We've eliminated hundreds of megabytes of dev-only Python packages. This pattern really shines when you're dealing with compiled dependencies too, since compilers like gcc or nvcc can stay in the builder stage where they belong. Now, you might be wondering: "Wait, why aren't we using COPY --from=builder for the Python packages? Why run pip install twice?" Great question! This Dockerfile uses what I call the "Re-install from Lockfile" pattern: Why do this? We want to guarantee that our final image contains only the packages in runtime_requirements.txt. No chance of a random dev dependency sneaking in.The good: It's crystal clear what's happening, easy to audit, and your production dependencies are pristine. The builder stage acts like a CI check, making sure everything plays nicely together.The not-so-good: You need network access in the final stage to hit PyPI again. The Alternative: The "Build and Copy Artifacts" Pattern But what if you've got a custom C extension that needs compiling? The "Re-install from Lockfile" pattern won't work because pip install in the runtime stage won't have your compiled goodies from the builder. That's when you'd use a more direct approach: Dockerfile # Alternative "Build and Copy Artifacts" Pattern # ====== BUILD STAGE ====== FROM python:3.10 AS builder WORKDIR /build-env # Install build tools like gcc, cmake, etc. RUN apt-get update && apt-get install -y build-essential # Prepare a clean directory of only runtime packages WORKDIR /runtime_packages COPY runtime_requirements.txt . # --target installs packages to a specified directory instead of the global site-packages RUN pip install --no-cache-dir --target=. -r runtime_requirements.txt # If you had a C extension: # COPY ../my_c_extension ./src/my_c_extension # RUN pip install --no-cache-dir --target=. ./src/my_c_extension # ====== RUNTIME STAGE ====== FROM python:3.10-slim AS runtime WORKDIR /app # This is the "magic link" in action for packages. # It copies the fully prepared dependencies directly from the builder. COPY --from=builder /runtime_packages /usr/local/lib/python3.10/site-packages/ COPY ./app ./app # ... Why do this? You get a completely self-contained build with zero network dependencies in the final stage, plus it handles compiled code.The good: It's hermetic and works for any compiled dependencies. The runtime stage builds super fast since it's just copying files.The tricky part: You need to be careful in the builder stage to keep that /runtime_packages directory clean. Picking the Right Approach For our multistage_image project, where we mainly wanted to keep pytest away from transformers, the "Re-install from Lockfile" pattern works perfectly. If we had compiled dependencies, we'd have no choice but to use the "Build and Copy Artifacts" pattern. Knowing both patterns gives you options. And speaking of results, our multi-stage build brings us down to 827MB. Not bad! Part 2: The Mechanical Tuning — Cache, Layers, and Context Now that we've got the architecture sorted, let's get into the nitty-gritty of making builds blazing fast. Our final layered_image project shows how it's done. A. Mastering the Build Cache (The Key to Fast Iteration) Experiment 1: Experience the Cache Magic First, let's build our properly structured layered_image. The first build takes about 23 seconds. time docker build -t bert-classifier:layered -f layered_image/Dockerfile layered_image/ Now here's where it gets fun. Open layered_image/app/predictor.py and make a tiny change, like adding a comment. Run the exact same build command again. Watch this: it finishes in less than a second. You'll see Docker saying "Using cache" for that slow pip install step because its input (runtime_requirements.txt) didn't change. Only the final COPY runs again. That's the power of proper caching! Experiment 2: How to Destroy Your Cache (And Your Productivity) Let's break things on purpose to see why layer order matters. Edit your layered_image/Dockerfile and move the line COPY layered_image/app/ ./app/ to before the RUN pip install ...line. Make another small change to app/predictor.py and rebuild. What happens? The build takes the full 23 seconds again! Your innocent code change busted the cache at the (now earlier) COPY step. Since pip install comes after this busted cache, it has to run from scratch too. This is why getting your layer order right isn't just a nice-to-have; it's essential for your sanity. B. The Art of the RUN Command (The Key to Microscopic Layers) Here's something that trips up a lot of people: every RUN command creates a new, permanent layer. Once a file exists in a layer, you can't truly delete it from your image size, even if you remove it in a later layer. It's like trying to erase something written in permanent marker by writing over it. The original is still there underneath, taking up space. Our testing shows that a basic RUN pip install ... creates a layer weighing in at 679MB. But watch what happens when we chain everything into one command: Dockerfile # LAYER OPTIMIZATION: Install runtime dependencies and clean up in a single layer RUN pip install --no-cache-dir -r runtime_requirements.txt && \ pip cache purge && \ rm -rf /tmp/* /var/tmp/* && \ find /usr/local/lib/python*/site-packages/ -name "*.pyc" -delete && \ find /usr/local/lib/python*/site-packages/ -name "__pycache__" -type d -exec rm -rf {} + || true This single command creates a layer of just 572MB. That's 107MB saved just by doing our cleanup in the same breath as the install. If you check docker history, you'll see one lean 572MB layer instead of a bloated 679MB layer followed by useless cleanup attempts. Pro tip for BuildKit users: If you're running a recent Docker version, throw # syntax=docker/dockerfile:1.7 at the top of your Dockerfile to unlock some sweet BuildKit features: Dockerfile # syntax=docker/dockerfile:1.7 # ... RUN --mount=type=cache,target=/root/.cache/pip pip install -r requirements.txt This tells BuildKit to maintain a persistent pip cache that lives outside your image layers. You get faster builds without the bloat. Win-win! C. The Gatekeeper: Mastering .dockerignore When you run docker build ., that innocent-looking. tells Docker to package up everything in the current directory and ship it to the build daemon. For AI projects, "everything" might include multi-gigabyte datasets, model checkpoints, Jupyter notebooks, and (heaven forbid) your virtual environment. Sending all this junk isn't just slow; it's a security nightmare waiting to happen. Enter .dockerignore, your bouncer at the door. It stops files from even getting into the build context. For AI projects, a solid .dockerignore isn't optional. Use this comprehensive .dockerignore file suitable for AI projects. Plain Text # .dockerignore # Python virtual environments .venv/ # Python caches __pycache__/ # IDE and OS specific .vscode/ .idea/ # ... etc. Quick note about .git: Usually, you want to exclude it. But if you're doing something fancy like embedding commit info in your container, you might need it. For 99% of cases, though? Ignore it. The Final Result: A Production-Ready Blueprint By combining smart architecture with mechanical precision, we end up with our final layered_image Dockerfile. Here's how far we've come: IMAGeSIzetime to build naive_image 2.37 GB 56 seconds slim_image 1.5 GB 51 seconds multistage_image 827 MB 24 seconds layered_image 720 MB 25 seconds Conclusion Your Dockerfile isn't just some build script; it's the blueprint for your app's entire runtime environment. Once you understand how it really works, you can craft containers like a pro. We've seen how smart architecture (like multi-stage builds) combined with attention to detail (like cache optimization and layer management) can transform a bloated mess into a lean, professional container. These aren't just tricks; they're fundamental engineering principles for building efficient AI systems. Now it's your turn. Go look at your own Dockerfiles. Can you reorder those layers for better caching? Could you chain those RUN commands to actually delete temporary files? Is your .dockerignore actually doing its job? More

*You* Can Shape Trend Reports: Join DZone's Kubernetes Research

By DZone Editorial

What We Learned Migrating to a Pub/Sub Architecture: Real-World Case Studies from High-Traffic Systems

By Ravi Teja Thutari

CORE

Consumer Ecosystem Design for Efficient Configuration Based Product Rollouts

By Ashish Shetti

Beyond Netflix: Why Fintech Recommendations Need a Completely Different Playbook

Let’s dive into how to create a recommendation system for fintech—hearing for the first time? But don’t worry, I’ll break it down into bite-sized pieces. The Unique Nature of Financial Recommendations First off, financial recommendations are a whole different ballgame compared to those you’d get from Netflix or an online store. If Netflix suggests a bad movie, it’s just 90 minutes wasted. But if a fintech app makes a bad investment suggestion, folks could lose their hard-earned savings. That’s why we’ve gotta be super careful in how we build these systems. Now, here are some big challenges we face: First up is trust. Users need to believe that the recommendations are actually in their best interest—not just a way for the company to make a buck. It’s a lot more serious than recommending a song or a product. Then there’s regulatory stuff. Financial services are under strict rules—think FINRA and SEC guidelines. Every recommendation we make has to be defendable and follow the law. Managing risk is also huge. A risky investment shouldn’t pop up for someone who’s more of a conservative investor—ever. Also, financial decisions can stick around for a long time. What we suggest today could really affect someone’s ability to retire nicely or buy a house years down the line. And let’s not forget our diverse crowd. Financial apps hit all sorts of users, from those just starting out with money to seasoned investors—so we can’t use a one-size-fits-all approach. Key Success Metrics Now, Let’s discuss success metrics. Unlike typical recommendation systems that chase clicks, fintech needs to find a balance between user engagement and actual outcomes: Here’s what to look out for: Primary metrics include how many new products users adopt, improvements in customer lifetime value, user retention rates, and fewer questions about the recommended products. For secondary metrics, we can check click-through rates, time spent checking out recommended products, cross-selling success, and user satisfaction scores. Compliance is crucial too—how complete our audit trails are, if we recommend the right risks, how clear our explanations are, and how many regulatory complaints we get. System Architecture and Design Moving on to system design, we need a solid architecture. A good fintech recommendation system should have different layers to keep things organized: Start with the data layer—this is where all the info gets stored and processed. We’re talking about transaction data, user profiles, and market data. It should handle both batch processing for training and real-time processing for live recommendations.Then we’ve got the feature engineering layer, turning raw data into cool insights about user behavior, their financial health, and how they handle risk. In finance, we need to consider timing trends and economic situations.The model layer comes next, considering all our recommendation models—like collaborative filtering, content-based filtering, and the like. This is where we also support testing and rollouts of new algorithms.The API layer is crucial for serving up real-time recommendations super fast. It’s all about keeping users happy and reducing wait times.Don’t forget the compliance layer—it ensures everything meets regulations and keeps audit trails intact.Lastly, we’ve got the monitoring layer, keeping tabs on performance and making sure everything stays in tip-top shape. On the design front, security comes first. All user data, especially financial info, needs to be encrypted and access controlled. We’ve also got to ensure our recommendations are clear and understandable for users; this builds trust. If something goes wrong, our system should still work even if one part fails. And we need to keep a complete log of all decisions for audits. Data Strategy and Feature Engineering As for data collection, let’s talk about sources. We need transaction data to understand spending habits and transaction history. User profile data helps shape understanding, but we’ve got to be careful with privacy and biases. Market data—like interest rates and economic indicators—plays a part too. Behavioral data tells us how people use the app. And product performance data gives us insight into what really works. Now, feature engineering is key. Rather than looking at how much money someone spends, we should analyze patterns—like spending a percentage on housing. We can create scores that indicate financial stability by looking at savings rates and debt-to-income ratios. For risk, we need to match what users say about their risk tolerance with their actual behaviors. And we need to recognize life stages and adjust recommendations accordingly. In terms of privacy, we should collect only what we need and anonymize data where possible. Consent management is important too, allowing users to have control over their info. Recommendation Algorithm Approaches When it comes to algorithms, collaborative filtering can help us find users who act similarly with their finances for recommendations. But we need to make sure we're focusing on their actual behaviors and not just superficial data. Content-based filtering is useful for matching user profiles to product traits. It’s a lifesaver for new users or new products without much data. We can also use hybrid approaches, combining different methods to get the best results. And as we get fancy with deep learning and reinforcement learning, we should keep track of how these affect recommendations. When we implement this in real-time, we can't afford delays as sub-second response times are a must. We’ll need to cache things smartly to keep user engagement high and performance smooth. Scaling the architecture means using microservices so different parts can grow independently. We’ll distribute requests smartly and keep an eye on our databases to ensure they’re running as they should. Once we’re up and running, we need to monitor everything to catch issues early, both in terms of performance and compliance. We’re building trust with users—so clear explanations for why we recommend what we do are vital. We also have to be on top of bias detection and make sure all demographics have fair access. So there we have it, a wrap-up on building a fintech recommendation system. It’s all about keeping the end-users in mind, because we’re really making a difference in their financial lives. Taking it step by step, being transparent, and ensuring compliance is key to success.

By Swati Tyagi

Virtualized Containers vs. Bare Metal: The Winner Is…

The blanket statement that bare metal is superior to containers in VMs for running containerized infrastructure, such as Kubernetes, no longer holds true. Each has pros and cons, so the right choice depends heavily on specific workload requirements and operational context. Bare metal was long touted as the obvious choice for organizations seeking both the best compute performance and even superior security when hosting containers compared to VMs. But this disparity in performance has slowly eroded. For security, it is now hard to make the case for bare metal’s benefits over those of VMs, except for very niche use cases. Today, it can be validated that a VM infrastructure will offer a long list of superior operational capabilities and almost on-par performance for managing containers in cloud-native environments as well as in the private cloud. The actual compute and performance gap continues to narrow and, in many cases, has almost closed completely. VM platform providers can also offer abstraction layers for private clouds and edge deployments that combine the advantages of both VMs and containers in a unified configuration. That said, the current state of adoption and applicability shows that containers running on VMs are the better choice for the vast majority of use cases. The Consensus The analysts and HyperScalers consensus across ReveCom, 451 Research, and Gartner advocates for containers running on a VM layer in at least many, if not the vast majority of, use cases. That choice favors operational efficiency over the very slight and usually negligible performance advantage measured largely by latency metrics of containerized infrastructure running on bare metal. While bare metal may still provide marginal latency benefits in specific scenarios, the performance gap has become negligible for many, if not the majority, of workloads. In fact, industry benchmarks like MLPerf have demonstrated that VM-based container environments can match — or even outperform — bare metal in latency-sensitive tasks such as machine learning inference. Beyond performance, VMs continue to offer a wider range of operational control, strong isolation (think "hard" resource limits), and better security than bare metal. Whether for managing numerous workloads at scale or maintaining stability in multi-tenant setups, VMs have the edge. While the performance gap has largely decreased to become negligible according to analyst consensus and hyperscalers, bare metal — just by pure definition or configuration — does still offer a very small performance advantage depending on the application. After all, the containerized infrastructure is running directly on the server. The VMs provide an additional layer between the application and the server. So on that very rudimentary level, just by definition alone, the performance is superior when running on containers, but reality is even bare metal containers run on underlying Linux. Meanwhile, the advantages of VMs extend to traditional and more recent workloads, such as for Kubernetes infrastructure. VMs continue to offer superior elasticity for scaling, rollback, snapshot capabilities, and provisioning in both cases. These are reasons why an organization operating a hypervisor at scale, extending across numerous clusters, would likely be unenthusiastic about manually managing individual bare-metal server operating systems and infrastructure. The hypervisor can offer control at scale over a vast number of Kubernetes deployments on VMs, which maintain consistently superior operational control, as described above. Hyperscalers offer containers primarily on top of VMs instead of bare metal. This approach provides stronger security and isolation, evidenced by better isolation between different workloads as each VM has its own kernel. An attacker is still confined within the VM if there is a container escape vulnerability. Additionally, this enables efficient resource management, highlighted by the ability to enforce hard resource limits (CPU, memory, disk) at the VM level. Guaranteed performance characteristics are achieved due to dedicated vCPU, memory, and I/O queues for VMs. These aspects directly support the operational control, security, and isolation wins of virtualized containers. In multi-tenant environments, running containers directly on bare metal can lead to unpredictable performance and stability issues. A single misbehaving or resource-hungry container can impact other workloads sharing the same physical resources. This is particularly problematic in industries where reliability is critical. For example, in financial services, a resource spike in a risk analysis job could disrupt real-time transaction processing. In healthcare, intensive medical imaging analysis could slow down patient record systems. Online education platforms might see grading automation or video transcoding containers affect the entire student experience during peak times. Similarly, SaaS providers risk one customer’s workload impacting others, and media companies could see a single streaming job degrade the quality of other live broadcasts. VMs address these challenges by providing strong isolation and hard resource boundaries for each workload. The hypervisor ensures that each VM receives dedicated CPU or GPU, memory, and I/O resources, preventing one workload from interfering with another. This results in more predictable performance, greater stability, and improved security. Come Together The advantages of VMs for containerized infrastructure are certainly an easier sell than in the past. When determining whether to opt for, or when it's appropriate to offer, VMs, bare metal, or a hybrid alternative for running containerized workloads, especially at scale, careful consideration is needed. Elasticity, isolation, and other operational advantages that VMs offer can arguably become buzzwords when taken out of context. But a proper assessment will reveal how, again, the number of workloads or microservices — or workloads per node — makes the case for bare metal especially hard. When a single microservice can cause a container to malfunction and the entire server’s kernel to crash. Conversely, VMs support one of the key functionalities that Kubernetes is meant to do: offering the ability to scale up and down resources as needed, i.e., elasticity and isolation to the extent that separate microservices’ damage radius remain limited to the container. On a more rudimentary level, in reference to elasticity, an organization running Kubernetes and containerized workloads in a private cloud data center — or even in a traditional data center — will face difficulties in determining its future needs. Because of that, redundancy and over-provisioning must be added to bare metal servers in anticipation of future needs. Those scenarios include when saturation is reached or when business operations can no longer function once full capacity has been achieved. Therefore, the required capacity for bare metal — and the associated cost — will often outstrip the more elastic options that VMs offer, whether used on private clouds or not. The elasticity that VMs provide for running containers at scale aligns with the ability to manage and extend Kubernetes in a unique way. With a platform that allows VMs to be managed in a flexible and elastic manner, Custom Resource Definitions (CRDs) can be applied by separate teams for various K8s versions on the same host as their needs fluctuate and evolve more dynamically. Bare metal is limited to one K8s version per server, and VMs eliminate that challenge. This flexibility extends across environments, whether the actual data or applications are housed on a single server or across Kubernetes environments in a multi-cluster structure. Such CRD flexibility, while a core Kubernetes feature, can pose significant challenges when attempting to achieve the same level of isolation, elastic management, and support of multiple K8s versions on a bare metal server as is readily available with virtualized infrastructure. Again, whether it makes sense to run containers bare metal or to run containers on the abstraction of VM comes down, of course, to the individual needs of the organization. Technology leaders and DevOps teams should assess the environment accordingly and determine what makes the most sense, which likely still will be VMs for the vast majority of their containerized workloads.

By Chris Ward

CORE

Architecting Compound AI Systems for Scalable Enterprise Workflows

The convergence of generative AI, large language models (LLMs), and multi-agent orchestration has given rise to a transformative concept: compound AI systems. These architectures extend beyond individual models or assistants, representing ecosystems of intelligent agents that collaborate to deliver business outcomes at scale. As enterprises pursue hyperautomation, continuous optimization, and personalized engagement, designing agentic workflows becomes a critical differentiator. This article examines the design of compound AI systems with an emphasis on modular AI agents, secure orchestration, real-time data integration, and enterprise governance. The aim is to provide solution architects, engineering leaders, and digital transformation executives with a practical blueprint for building and scaling intelligent agent ecosystems across various domains, including customer service, IT operations, marketing, and field automation. Image Source: Arvix The Rise of Compound AI Traditional AI applications were often isolated, with one bot dedicated to service, another focused on analytics, and yet another for marketing. However, real-world workflows are interconnected, requiring the sharing of context, handoff of intent, and adaptive collaboration. Compound AI systems address this by: Enabling autonomous, yet cooperative agents (e.g., Planner, Retriever, Executor)Facilitating multi-modal interactions (text, voice, events)Supporting enterprise-level guidelines for explainability, privacy, and control This reflects how complex systems operate in human organizations: each unit (agent) has a role, but together they create a value chain. Design Principles for Enterprise-Grade Agentic Workflows Designing effective compound AI systems requires a thoughtful approach to ensure modularity, scalability, and alignment with enterprise goals. Below are key principles to guide the development of agentic workflows: 1. Modular Agent Design Each AI agent should be designed with a specific, well-defined responsibility, following the single responsibility principle. This modularity makes maintenance, testing, and scalability easier. For instance: Planner Agent: Breaks down overarching goals into manageable sub-tasks.Retriever Agent: Retrieves and collects pertinent data from diverse sources.Executor Agent: Executes actions according to the planner's directives.Evaluator Agent: Evaluates outcomes and offers feedback for ongoing improvement. By clearly defining responsibilities, agents can operate independently while working together cohesively within the system. 2. Event-Driven and Intent-Centric Architecture Transitioning from static, synchronous workflows to dynamic, event-driven architectures enhances responsiveness and adaptability. Implementing intent-centric designs enables the system to effectively interpret and act on user or system intents. Key components include: Intent Routers: Classify and direct intents to the appropriate agents.Event Brokers facilitate communication among agents via event messaging.Memory Modules: Preserve context over time, allowing agents to make informed decisions based on historical data. This architecture enables scalability and resilience, which are essential for enterprise environments. 3. Enterprise Data Integration and Retrieval-Augmented Generation (RAG) Integrating both structured and unstructured data sources ensures AI agents operate with a comprehensive context. Utilizing Retrieval-Augmented Generation techniques enables agents to access external knowledge bases, improving their decision-making abilities. Strategies include: Data Connectors: Create secure connections to enterprise databases and APIs.Vector Databases: Enhance semantic search and retrieval of pertinent information.Knowledge Graphs: Offer structured representations of relationships among data entities. This integration ensures that agents are informed, context-aware, and able to deliver accurate outcomes. 4. Security and Governance Framework Ensuring the security and compliance of agentic systems is crucial. Implementing robust governance frameworks helps maintain trust and accountability. Key practices include: Access Controls: Establish and enforce permissions for data and agent interactions.Audit Trails: Keep records of agent activities for transparency and compliance.Compliance Checks: Regularly evaluate systems against regulatory standards such as GDPR and HIPAA. A well-structured governance model protects against risks and ensures the ethical deployment of AI. 5. Observability and Continuous Monitoring Implementing observability practices enables real-time monitoring and diagnostics of agent behaviors and system performance. Key components include: Logging: Record comprehensive logs of agent actions and decisions.Metrics Collection: Collect performance indicators such as response times and error rates.Alerting Systems: Promptly notify stakeholders of anomalies or system failures. Continuous monitoring allows for proactive maintenance and ongoing improvements. 6. Human-in-the-Loop (HITL) Mechanisms Incorporating human oversight ensures that AI agents operate within acceptable boundaries and adapt to nuanced scenarios. HITL approaches consist of: Approval Workflows: Ensure human validation for critical decisions or actions.Feedback Loops: Enable users to give input on agent performance, guiding future behavior.Intervention Protocols: Allow humans to modify or adjust agent actions when necessary. Balancing automation and human judgment enhances system reliability and builds user trust. 7. Scalability and Performance Optimization Designing systems that can scale effectively to manage growing workloads is essential. Strategies to achieve this include: Load Balancing: Distribute workloads uniformly among agents and resources.Asynchronous Processing: Enable agents to function independently, minimizing bottlenecks.Resource Management: Effectively monitor and allocate computational resources to maintain performance. Optimizing for scalability guarantees that the system stays responsive and effective as demand increases. By following these design principles, businesses can create robust, efficient, and reliable agentic workflows that align with their organizational objectives and adapt to evolving challenges. Real-World Use Case: Field Service Agent Mesh Scenario: A utilities organization can enhance field response operations using a trio of specialized AI agents: Planner Agent: Assesses incoming user complaints and defines a resolution plan.Retriever Agent: Fetches asset location, historical ticket data, and compliance checklists.Executor Agent: Schedules technicians and sends alerts to mobile service teams. Impact: Improves task assignment efficiency, faster resolution cycles, and higher technician productivity. Conclusion Compound AI systems are transforming enterprise architecture by facilitating intelligent, adaptable, and scalable workflows. Designing modular, orchestrated agentic systems assists organizations: Accelerate AI-driven transformationEnhance operational resilience and flexibilityDeliver improved results for both customer and employee experiences The future lies in transitioning from isolated AI tasks to compound ecosystems of agents, a strategy that combines innovation with strong governance and domain relevance.

By Srinivas Sandiri

The Myth of In-Place Patching: Unpacking Protocol Buffers, FieldMasks, and the "Last Field Wins" Conundrum

Data serialization frameworks like Google Protocol Buffers (Protobuf) have become indispensable. They offer compact binary formats and efficient parsing, making them ideal for everything from inter-service communication to persistent data storage. But when it comes to updating just a small part of an already serialized data blob, a common question arises: can we "patch" it directly, avoiding the overhead of reading, modifying, and rewriting the entire thing? The short answer, for most practical purposes, is no. While Protobuf provides clever mechanisms that seem to offer direct patching, the reality is more nuanced. Let's dive into why the full "read-modify-write" cycle remains largely unavoidable and where the true efficiencies lie. The Core Challenge: Binary Data's Unfixed Nature Imagine a book where every word's length can change, and there are no fixed page numbers for individual words. If you change a single word, all subsequent words on that page (and potentially the entire book) would shift, requiring a complete re-layout. This is akin to the challenge of patching a binary serialized blob. Protobuf, like Apache Thrift, uses a compact, variable-length binary encoding. Fields are identified by unique numeric tags, and their values are encoded efficiently, often with variable-length integers or length-prefixed strings. This design is fantastic for minimizing data size and maximizing parsing speed. However, it means that the exact byte offset and length of any given field are not fixed. Changing the value of a field, especially a string, can alter its byte length, which would then shift the positions of all subsequent fields in the binary stream. Attempting an "in-place" modification without recalculating and shifting all subsequent bytes would lead to data corruption. Misconception 1: The "Last Field Wins" Magic Trick One intriguing feature of protocol buffers is its "last field wins" merge behavior for non-repeated fields. This means if you have two serialized Protobuf messages for the same type, and you concatenate their binary forms, when the combined stream is deserialized, the value of a non-repeated field from the last occurrence in the stream will be used. For repeated fields, new values are appended, not overwritten. How it seems to work (and why it's misleading for patching): Let's say you have an original Person object serialized into a blob: Plain Text Original Blob: [name="Alice", age=30, phone_number=["111", "222"]] You want to update only the name to "Alicia." You could create a new, small Protobuf message containing just the updated name: Plain Text Patch Blob: [name="Alicia"] Then, you could concatenate this Patch Blob to the Original Blob: Plain Text Combined Blob: [name="Alice", age=30, phone_number=["111", "222"]] + [name="Alicia"] When a Protobuf parser reads this Combined Blob, due to "last field wins," the name will indeed resolve to "Alicia," while age and phone_number will retain their original values. The catch: While this appears to be a patch, it's a deserialization rule, not a binary patching mechanism. The parser still has to read and process the entire concatenated stream to determine the final state of the message. You haven't avoided the deserialization cost; you've just changed how the parser resolves conflicts during deserialization. Furthermore, this approach has severe limitations: Only for root objects and non-repeated fields: It "only works well for the root object" and "doesn't work for repeated" fields. If you tried to update a specific phone number, or a field within a nested message, this concatenation trick would fail or lead to unintended appends.Increased storage/transmission size: You're now storing or transmitting more data (original + patch) than if you had simply re-serialized the whole object. Misconception 2: FieldMask Saves Re-serialization Cost Google's official Protobuf best practices recommend using FieldMask for supporting partial updates in APIs. This is an excellent pattern, but it's crucial to understand where its efficiency gains truly lie. How FieldMask Works A FieldMask is a separate Protobuf message that explicitly lists the paths of the fields a client intends to modify (e.g., name, address.street). When a client wants to update a resource, it sends a small request containing: The FieldMask itself.Only the partial data for the fields specified in the mask. Example of a network payload using FieldMask: Instead of sending: Plain Text { "name": "Alicia", "age": 30, "phone_number": ["111", "222"] } // (full object) A client might send: Plain Text { "update_mask": { "paths": ["name"] }, "person": { "name": "Alicia" } } // (much smaller payload) Where FieldMask truly shines (and why re-serialization is still needed): FieldMask significantly improves efficiency, but not by avoiding the deserialization/re-serialization cycle on the server's persistent data. Its benefits are primarily at the network communication and application logic layers: Bandwidth optimization: By sending only the FieldMask and the partial data, the request payload size is drastically reduced. This saves network bandwidth, especially critical for mobile clients or high-volume APIs.Reduced server-side processing: The server receives explicit instructions on which fields to update. This streamlines the application logic, preventing the server from having to infer changes or process a large object where most fields are unchanged. However, once the server receives this partial update request, to apply it to the stored, serialized data, it still performs the following steps: Retrieve existing data: The server fetches the full, existing serialized blob from its storage.Deserialize: The entire blob is deserialized into a complete in-memory Protobuf object.Apply patch: The application logic uses the FieldMask to update only the specified fields on this in-memory object.Re-serialize: The entire modified in-memory object is then re-serialized into a new binary blob.Persist: This new blob replaces the old one in storage. The Unavoidable Truth: Read-Modify-Write For any robust and reliable modification of a protocol buffer serialized data blob, the read-modify-write cycle is the standard and necessary approach. This is because: Data integrity: It ensures that the entire object remains consistent and correctly encoded after the modification.Schema evolution: It gracefully handles schema changes (adding/removing fields) by allowing the parser to correctly interpret the full data structure.Binary format constraints: The variable-length nature of Protobuf's encoding makes direct byte-level manipulation impractical and prone to corruption. Conclusion Protocol buffers are incredibly powerful for efficient data serialization and schema evolution. Features like "last field wins" and FieldMask are valuable tools, but their utility for "patching" existing serialized blobs is often misunderstood. The "last field wins" behavior is a deserialization rule that can be leveraged for simple, non-repeated field updates via concatenation, but it still requires full deserialization and is not a general-purpose binary patching solution. The FieldMask is an excellent API design pattern that optimizes network bandwidth and simplifies application logic for partial updates, but the server still performs a full read-modify-write cycle on the underlying data. Ultimately, if you need to modify a Protobuf serialized blob, prepare for the full read-modify-write dance. The true efficiencies come from optimizing the communication of the patch (e.g., with FieldMask) and the in-memory processing, rather than magically altering bytes on disk. Further Reading Protocol Buffers DocumentationProtobuf Language Guide (proto3)Google Protobuf FieldMaskGoogle Cloud API Design Guide - Partial Updates

By Prithviraj Kumar Dasari

Software Security Treat or Threat? Leveraging SBOMs to Control Your Supply Chain Chaos [Infographic]

Editor's Note: The following is an article written for and published in DZone's 2025 Trend Report, Software Supply Chain Security: Enhancing Trust and Resilience Across the Software Development Lifecycle. Software supply chain security is on the rise as systems advance and hackers level up their tactics. Gone are the days of fragmented security checkpoints and analyzing small pieces of the larger software security puzzle. Now, software bills of materials (SBOMs) are becoming the required norm instead of an afterthought. So the question is: Are supply chains and SBOMs a sweet pairing or a sticky solution? Below are your playing cards, featuring key data from DZone audience's responses to our Software Supply Chain Security survey, to help guide you along your journey. So dodge the sour code, unwrap the SBOM mystery flavors, and follow the sweet trail toward a strengthened security posture. SBOM Savories 63% generate and use SBOMs in their development and security processes.53% use SBOM generation and validation as their primary strategy to minimize attack surfaces.51% update their SBOMs on a scheduled basis. Sticky Vulnerabilities 63% name inconsistent or duplicate security controls as their top challenge in complex toolchains.26% feel fully prepared to meet evolving regulatory compliance standards.47% cite container images as their top supply chain threat. Sweet Dependencies 63% note zero-trust architecture as the most critical strategy to secure hybrid or multi-cloud environments.61% say using AI/ML for threat detection led to better threat prioritization.68% use data masking or tokenization to protect data across CI/CD workflows. This is an excerpt from DZone's 2025 Trend Report, Software Supply Chain Security: Enhancing Trust and Resilience Across the Software Development Lifecycle.Read the Free Report

By DZone Editorial

Scheduler-Agent-Supervisor Pattern: Reliable Task Orchestration in Distributed Systems

The Scheduler-Agent-Supervisor (SAS) pattern is a powerful architectural approach for managing distributed, asynchronous, and long-running tasks in a reliable and scalable way. It is particularly well-suited for systems where work needs to be orchestrated across many independent units—each capable of failing and retrying—while maintaining observability and idempotency. This pattern divides responsibilities into three well-defined roles: Scheduler: Initiates workflows and tracks high-level progressAgent: Executes individual task unitsSupervisor: Monitors and manages task execution Key Components With C# Implementation 1. Scheduler Component The scheduler triggers the workflow. Here's a C# example using a timer: C# public class DataExportScheduler : BackgroundService { private readonly ILogger<DataExportScheduler> _logger; private readonly ISupervisorClient _supervisorClient; private readonly Timer _timer; public DataExportScheduler(ILogger<DataExportScheduler> logger, ISupervisorClient supervisorClient) { _logger = logger; _supervisorClient = supervisorClient; _timer = new Timer(ExecuteScheduledJob, null, Timeout.Infinite, Timeout.Infinite); } protected override Task ExecuteAsync(CancellationToken stoppingToken) { // Run every day at 2 AM _timer.Change(GetNextRunTime(), Timeout.Infinite); return Task.CompletedTask; } private TimeSpan GetNextRunTime() { var now = DateTime.Now; var nextRun = now.Date.AddDays(1).AddHours(2); // Tomorrow at 2 AM return nextRun - now; } private async void ExecuteScheduledJob(object state) { _logger.LogInformation("Initiating data export workflow"); try { var fileList = await GetFileListToProcess(); await _supervisorClient.StartWorkflowAsync(fileList); } catch (Exception ex) { _logger.LogError(ex, "Failed to initiate workflow"); } finally { // Reset the timer for next run _timer.Change(GetNextRunTime(), Timeout.Infinite); } } private async Task<List<string>> GetFileListToProcess() { // Implementation to fetch files from storage return new List<string> { "file1.csv", "file2.csv" /* ... */ }; } } 2. Agent Component Agents perform the actual work. Here's an idempotent file processor: C# public class FileProcessingAgent { private readonly ILogger<FileProcessingAgent> _logger; private readonly IBlobStorageService _storageService; private readonly IDatabaseRepository _repository; public FileProcessingAgent( ILogger<FileProcessingAgent> logger, IBlobStorageService storageService, IDatabaseRepository repository) { _logger = logger; _storageService = storageService; _repository = repository; } [FunctionName("ProcessFile")] public async Task ProcessFile( [ActivityTrigger] string fileName, ExecutionContext context) { // Check if already processed (idempotency check) if (await _repository.IsFileProcessed(fileName)) { _logger.LogInformation($"File {fileName} already processed. Skipping."); return; } try { _logger.LogInformation($"Processing file: {fileName}"); // 1. Download file var fileContent = await _storageService.DownloadFileAsync(fileName); // 2. Parse content var records = CsvParser.Parse(fileContent); // 3. Transform data var transformedData = DataTransformer.Transform(records); // 4. Upload to database await _repository.BulkInsertAsync(transformedData); // 5. Mark as completed await _repository.MarkFileAsProcessed(fileName); _logger.LogInformation($"Successfully processed {fileName}"); } catch (Exception ex) { _logger.LogError(ex, $"Failed to process file {fileName}"); // Clean up any partial state await _repository.RollbackFileProcessing(fileName); throw; // Let supervisor handle retry } } } 3. Supervisor Component The supervisor orchestrates and monitors the workflow: C# public class FileProcessingSupervisor { private readonly ILogger<FileProcessingSupervisor> _logger; private readonly IAgentClient _agentClient; private readonly INotificationService _notificationService; public FileProcessingSupervisor( ILogger<FileProcessingSupervisor> logger, IAgentClient agentClient, INotificationService notificationService) { _logger = logger; _agentClient = agentClient; _notificationService = notificationService; } [FunctionName("FileProcessingOrchestrator")] public async Task RunOrchestrator( [OrchestrationTrigger] IDurableOrchestrationContext context) { var files = context.GetInput<List<string>>(); var retryOptions = new RetryOptions( firstRetryInterval: TimeSpan.FromSeconds(30), maxNumberOfAttempts: 3); _logger.LogInformation($"Starting processing of {files.Count} files"); // Parallel processing with retry logic var processingTasks = new List<Task>(); foreach (var file in files) { var task = context.CallActivityWithRetryAsync( "ProcessFile", retryOptions, file); processingTasks.Add(task); } try { await Task.WhenAll(processingTasks); _logger.LogInformation("All files processed successfully"); } catch (Exception ex) { _logger.LogError(ex, "Some files failed processing after retries"); // Get failed files var failedFiles = processingTasks .Where(t => t.IsFaulted) .Select(t => (string)t.AsyncState) .ToList(); await _notificationService.SendAlert( "File Processing Failure", $"Failed to process {failedFiles.Count} files: {string.Join(", ", failedFiles)}"); // Persist failed files for manual intervention await context.CallActivityAsync("PersistFailedFiles", failedFiles); // Continue workflow with remaining files throw; } } } Complete System Integration Here's how to wire up the components in a .NET application: C# var builder = Host.CreateDefaultBuilder(args) .ConfigureServices((context, services) => { // Register components services.AddHostedService<DataExportScheduler>(); services.AddSingleton<ISupervisorClient, DurableFunctionsSupervisorClient>(); services.AddSingleton<IAgentClient, AzureFunctionsAgentClient>(); // Register dependencies services.AddSingleton<IBlobStorageService, AzureBlobStorageService>(); services.AddSingleton<IDatabaseRepository, SqlDatabaseRepository>(); services.AddSingleton<INotificationService, EmailNotificationService>(); // Configure Durable Functions services.AddDurableTask(options => { options.HubName = "FileProcessingHub"; options.StorageProvider["maxQueuePollingInterval"] = "00:00:10"; }); }) .ConfigureLogging(logging => { logging.AddApplicationInsights(); logging.AddConsole(); }); await builder.Build().RunAsync(); When to Use the SAS Pattern Ideal Use Cases: ETL pipelines: Processing large volumes of data with reliability requirementsOrder fulfillment systems: Where each step must be tracked and retriedDistributed computations: Breaking large problems into smaller, parallel tasks Anti-Patterns: Simple CRUD operations: Where the overhead isn't justifiedReal-time processing: Consider event streaming patterns insteadSynchronous workflows: Where immediate response is required Best Practices Idempotency: C# // Example idempotent operation public async Task ProcessOrder(Order order) { // Check if already processed if (await _repository.OrderExists(order.Id)) return; // Process with transaction using var transaction = await _repository.BeginTransactionAsync(); try { await _inventoryService.ReserveItems(order.Items); await _paymentService.ProcessPayment(order.Payment); await _repository.SaveOrder(order); await transaction.CommitAsync(); } catch { await transaction.RollbackAsync(); throw; } } Observability: C# // Enhanced logging with correlation IDs public async Task ProcessItem(string itemId) { using var scope = _logger.BeginScope(new Dictionary<string, object> { ["CorrelationId"] = Guid.NewGuid(), ["ItemId"] = itemId }); _logger.LogInformation("Starting processing"); var stopwatch = Stopwatch.StartNew(); try { // Processing logic... _logger.LogInformation("Processing completed in {ElapsedMs}ms", stopwatch.ElapsedMilliseconds); } catch (Exception ex) { _logger.LogError(ex, "Processing failed after {ElapsedMs}ms", stopwatch.ElapsedMilliseconds); throw; } } Circuit Breakers: C# // Using Polly for resilient HTTP calls var circuitBreaker = Policy .Handle<HttpRequestException>() .Or<TimeoutException>() .CircuitBreakerAsync( exceptionsAllowedBeforeBreaking: 3, durationOfBreak: TimeSpan.FromMinutes(1)); public async Task CallExternalService() { await circuitBreaker.ExecuteAsync(async () => { var response = await _httpClient.GetAsync("https://api.example.com/data"); response.EnsureSuccessStatusCode(); return await response.Content.ReadAsStringAsync(); }); } Conclusion The Scheduler-Agent-Supervisor pattern provides a robust framework for building distributed systems that require: Resilience: Automatic retries and failure handlingScalability: Parallel processing of independent tasksMaintainability: Clear separation of concernsAuditability: Comprehensive tracking of task states

By Arun Kumar Rajamandrapu

Secure Private Connectivity Between VMware and Object Storage: An Enterprise Architecture Guide

As an architect, security is the first thing that comes to mind when defining an architecture for a customer. One of the key things that you need to keep in mind is minimizing the network traffic routed through the public internet. This article discusses how to bring private connectivity to cloud services, working with compute platforms like VMware on Cloud. Modern cloud architecture follows a "defense-in-depth" philosophy where network isolation forms the foundational security layer. Public internet exposure creates unacceptable risks for enterprise workloads handling sensitive data, financial transactions, or regulated content. Private connectivity addresses this by implementing a critical architectural principle: Zero Trust Network Access (ZTNA). Unlike perimeter-based security models, ZTNA assumes all external networks are hostile and requires verification at every access point. By routing traffic through private backbones rather than the public internet, organizations eliminate the most common attack vectors. DNS poisoning, SSL stripping, and credential sniffing, while gaining: Intrinsic security through network isolationReduced attack surface by removing public IP exposureCompliance enforcement via architecture rather than configurationData sovereignty assurance by keeping traffic within provider-controlled networks This architectural approach transforms connectivity from a vulnerability to a security control, making private links non-negotiable for production workloads. Introduction to VMware Cloud Foundation VMware Cloud Foundation (VCF) is an integrated software-defined data center platform that combines compute virtualization (vSphere), software-defined storage (vSAN), advanced networking (NSX), and cloud management into a unified stack. Enterprises adopt VCF to maintain operational consistency across hybrid environments using familiar VMware tools, automate lifecycle management for reduced administrative overhead, enable workload portability for seamless cloud migrations, and implement granular security through micro-segmentation. VMware Cloud Foundation as a Service (VCFaaS) on IBM Cloud delivers these capabilities as a fully managed offering, eliminating infrastructure management burdens while preserving VMware's operational model and enterprise-grade features. Understanding Cloud Object Storage Cloud Object Storage (COS) is a scalable, durable cloud storage service designed for modern unstructured data workloads like backups, media files, and AI datasets. Unlike traditional block or file storage, COS organizes data as discrete objects containing the file content, customizable metadata (retention policies, security tags), and a globally unique identifier. Key advantages include massive scalability to exabyte levels, cost-efficient pay-as-you-grow pricing, industry-leading data durability through geographic replication, and S3-compatible REST APIs for seamless integration. COS is ideally suited for data lakes, backup repositories, media distribution, and IoT data streams, with immutability features like S3 Object Lock providing critical ransomware protection. Cloud Networking: Private CIDR Ranges IBM Cloud reserves specific RFC1918-compliant private IP ranges exclusively for its internal infrastructure and services. These ranges prevent overlap with customer subnets and enable secure service connectivity: CIDR RangePurposeAccessibility10.0.0.0/14IBM internal management planesCustomer workloads blocked10.198.0.0/15Core service orchestrationFiltered by IBM backbone10.200.0.0/14Hypervisor and storage infrastructureRestricted to IBM systems166.9.0.0/16Cloud Service Endpoints (e.g., COS private access)Customer workloads via CSE161.26.0.0/16IBM DNS and internal service resolutionAutomatically routed Architectural Benefits Infrastructure isolation: Dedicated addressing for IBM’s control planes.Overlap prevention: Ensures no conflict with customer VPC subnets.Private service endpoints: COS private endpoints resolve to 166.9.x.x addresses.Traffic segregation: Backbone filters block customer access to IBM-reserved ranges. The Critical Need for Private Connectivity Public internet access to cloud storage introduces three critical risks: Security vulnerabilities: Exposure to eavesdropping, MITM attacks, and malicious scanning.Performance volatility: Unpredictable latency from internet congestion and bandwidth throttling.Cost inflation: Data egress fees that accumulate exponentially with large transfers. Private connectivity via IBM’s backbone solves these by: Keeping traffic within IBM’s controlled network (166.9.0.0/16).Eliminating public internet exposure.Providing consistent sub-10ms latency.Waiving data transfer fees entirely. Architecture: VCFaaS to COS via Cloud Service Endpoints Private connectivity flow Connectivity Workflow Initiation: VMs in VCFaaS target COS private endpoints (e.g., s3.private.us-south.cloud-object-storage.appdomain.cloud).Routing: VCFaaS Provider Gateway directs traffic through pre-configured Cloud Service Endpoints (CSE).Transport: Data traverses IBM’s private backbone via RFC1918 addresses.Termination: Secure delivery to COS without public internet exposure. COS Endpoint Comparison Endpoint TypeURL PatternSecurity LevelPerformancePublics3.[region].cloud-object-storage.appdomain.cloudStandard TLSInternet-dependentPrivates3.private.[region]...No public exposureConsistent low latencyDirects3.direct.[region]...Regional isolationOptimal throughput Implementation Guide Phase 1: Network Configuration Gateway attachment: Ensure VCFaaS networks are attached to the Provider Gateway.Verify firewall rules permit IBM service networks (166.9.0.0/16).Subnet planning: Use non-conflicting RFC1918 ranges (e.g., 192.168.0.0/16) for customer workloads . Phase 2: COS Private Endpoint Setup Python # Python SDK configuration for private access import ibm_boto3 from ibm_botocore.client import Config cos = ibm_boto3.client( 's3', endpoint_url='https://s3.private.us-south.cloud-object-storage.appdomain.cloud', config=Config(signature_version='oauth') Phase 3: DNS Validation Configure VMs to use IBM DNS resolvers (161.26.0.10, 161.26.0.11).Confirm private resolution: Shell nslookup s3.private.us-south.cloud-object-storage.appdomain.cloud # Expected: 166.9.x.x (never public IP) Security and Operational Benefits Security Comparison LayerPublic EndpointPrivate via CSENetwork ExposureInternet-facingIBM private backbone onlyAttack SurfaceScannable by malicious actorsInvisible to the internetCompliance SupportLimited certificationsHIPAA/FINRA/GxP compliant Operational Advantages Cost elimination: No data transfer fees for private connectivityCompliance acceleration: Pre-built controls for regulated industriesIncident reduction: Significantly fewer security eventsArchitecture simplicity: No VPNs or complex firewall rules required Real-World Applications Financial data pipeline Security Achievements Regulatory compliance with S3 Object LockCryptographic proof of data integrityZero public exposure of sensitive data Healthcare Diagnostics Platform HIPAA-compliant medical image storagePrivate connectivity for AI diagnostic toolsCertified audit trails for data accessPatient data never traverses public networks Media Content Supply Chain End-to-end encrypted media transfersContent watermarking via metadataRegional content sovereignty enforcementTamper-proof archival with Object Lock Conclusion: Security-First Cloud Architecture The integration of VMware Cloud Foundation with IBM Cloud Object Storage through private connectivity establishes a robust enterprise architecture that transforms security from a compliance requirement to a strategic advantage. This pattern delivers critical benefits: 1. Architectural Security Private connectivity implements security at the network layer - the foundation of cloud architecture. By eliminating public internet exposure, organizations gain inherent protection against external threats through IBM's private backbone. This "secure-by-design" approach provides more reliable protection than bolt-on security solutions. 2. Compliance by Design The architecture validates controls for strict regulations through guaranteed data residency, immutable audit trails, and cryptographic proof of data handling integrity. This significantly reduces compliance validation efforts for regulated industries. 3. Enterprise-Grade Performance Private backbone connectivity delivers consistent low-latency with high availability, ensuring business-critical operations operate with deterministic performance, unaffected by internet congestion. 4. Economic Efficiency Beyond eliminating egress fees, the architecture reduces security incident response costs and compliance audit preparation time while optimizing operational efficiency. 5. Future-Ready Foundation This security-first approach enables next-generation workloads like confidential AI, zero-trust hybrid cloud operations, and quantum-safe cryptography readiness without architectural changes. For enterprises navigating digital transformation, this pattern demonstrates how security can become a competitive advantage. By implementing private VCFaaS-COS connectivity, organizations achieve the gold standard of cloud architecture, where security, performance, and efficiency converge to enable business innovation without compromise. If you are looking for additional information on VCFaaS and how it supports private connectivity, check this article.

By Vidyasagar (Sarath Chandra) Machupalli FBCS

CORE

Architecture Lessons from Two Digital Transformations

I have been fortunate to lead not just one, but two digital transformation projects as an Architect. And I would say I got lucky under many different counts. First piece of luck – one of the projects was a failure! How can that be lucky you ask? Read on. There were other strokes of luck too, and each shaped how I think about architecture today. The fact that I’m here, reflecting on both, is proof that I learnt a lot. My goal is to share those insights, so they serve anyone navigating their own transformations. One project was in healthcare; the other was in finance. One was with a large enterprise, the other a mid-sized firm. One offered flexibility, thanks to being in the Research department. The other was rigidly premeditated, with the ‘climax’ already decided by the executives. Let me break each down, what was the project about, what I did and what I learnt. You’ll see why one failed while the other succeeded and where architecture made all the difference when leveraged right, and was a liability when ignored. The One That Went Right A major hospital chain set out to digitally transform, looking to move to cloud, leverage AI/ML for advanced analytics. But they were also understandably cautious, concerned about security, regulations, effectiveness and cost. At the time, I was with a large sized and well-established organization known for its healthcare expertise. The project was being done by the Research wing of the organization, which meant, we were free to experiment, work in a “startup-like mode” and most importantly have the “fail-fast” approach. Interestingly, the organization had an already established Scaled Agile Path, which meant we were a Value Stream, with Value Stream Owner, Architect and Release Train Engineer lead management team driving nearly ten scrum teams, each with cross functional skills, supported by a Scrum Master and a Product Owner. Our first step was understanding the geographical spread of the client, here primarily across North America, and addressing their data security. We are talking HIPAA compliances, data privacy, de-identification of personal and medical data etc. They were very open to the cloud concept, but factoring the size of data generated was an important design pattern. A single ICU patient, monitored every 5 minutes for 2 days generated 576 data points just for heart rate! Now multiply that across vitals, across patients, and we are working with massive volumes! Our design strategy then tuned a hybrid approach – data on-premises, application on-cloud. Use of AI for pattern detection in disease progression, patient response to treatment, ML for trend detection, forecasting ICU demand, predicting readmission of patients were the value propositions we, as a research-based value stream, were offering to our customers. Designing these models to run across the hybrid infrastructure, with distributed data centers, was both a challenge and a very key architectural decision. The approach was always, “Lets work in a start-up mode”. Try a concept, show it to the customer using an extremely simple working model, even a jupyter notebook for that matter to demonstrate a model, analyze if it can be expanded to meet their requirements and if not, just scrap it and begin the next iteration. This discipline ensured that weren’t wasting time and effort in the wrong activity, and at the same time getting closer to building what our client was looking for. What was interesting was the extra innovation week we got each PI (Program Increment) cycle, to actually record our findings that were unique, into Intellectual Property building material. This led to several patent submissions in the 1 year that we spent on the project. The outcome was strong; the client accepted the solution, and we transitioned implementation to the engineering division. The Transformation That Taught Me More I next joined a digital transformation effort as an Enterprise Architect for a niche financial institution. The company was operating in a much smaller market, for a much smaller customer base, on a much smaller portfolio of products and services. Having been around for over a century, it is no surprise that much of their infrastructure was outdated and applications were deeply legacy. The prevailing mindset was that of “Don’t touch what isn’t broken”. This approach, though seemingly practical, reflected a deeper inertia, rooted in a cash-strapped culture and leadership priorities that often leaned towards prestige over progress. Over the years, the organization had acquired others in an attempt to grow its customer base. These mergers and acquisitions lead to inheritance of a lot more legacy estate. The mess burgeoned to an extent that they needed a transformation, not now, but yesterday! That is exactly where the Enterprise Architecture practice comes into picture. Strategically, a green field approach was suggested. A brand-new system from scratch, that has modern data centers for the infrastructure, cloud platforms for the applications, plug and play architecture or composable architecture as it is better known, for technology, unified yet diversified multi-branding under one umbrella and the whole works. Where things slowly started taking a downhill turn is when they decided to “outsource” the entire development of this new and shiny platform to a vendor. The reasoning was that the organization did not want to diversify from being a banking institution and turn into an IT heavy organization. They sought experienced engineering teams who could hit the ground running and deliver in 2 years flat. A secret we all know is that IT service providers survive on long term contracts with their clients. When this vendor then chanced upon the cash-strapped client, they drew out a long 7-year transformation plan. Not only did the organization commit at a steep price but also hired senior leaders from the vendor for fancy roles like “transformation director”, “head of cloud”, and similar. For the vendor, this was a well-paying engineering contract, not a mission. The new hires, still loyal to their previous employer, focused more on maintaining the engagement than on building internal capability. It was not that passionate “this is the IT strategy that is going to ensure robust and secure solutions and products that makes the organization stand out to customers, investors and regulatory authorities” or “this Architectural North Star is going to increase sales because it addresses these technology aspects”. And thus, it is fair to say: maybe it will be profitable in the long run. But as a digital transformation, it failed. Lessons From an Architect Looking back, here is what stood out, having a mission and vision, understanding the “why”, aligning architectural choices to the client’s true needs – that’s what makes a transformation robust. Money may buy “projects”, but it’s principles and thoughtful technology choices that pay off in the long run. Comparison Table (Success vs Failure) Category Project that went right Project that taught me more Clarity of Vision and Mission Clear R&D-driven purpose, patient-focused Vague strategic intent, reactive urgency Architectural Considerations Hybrid setup – cloud + on-premCompliances – HIPAAData volume GreenfieldComposableModern data centersMulti-brand Architectural Patterns Used MicroservicesFederated architectureML model staging Composable patternBrand orchestration Governance and Alignment Strong Value Stream Leadership (VSO, RTE, Architect) Outsourced Leadership Governance structure (Architecture Review Board, Technical Design Authority) existed, but lacked real influence or decision-making power. Delivery Approach Scaled Agile with 10 cross-functional teams Vendor-led waterfall (agile on paper) with 7-year contract Team Autonomy and Ownership Startup-mode, IP-focused innovation Low ownership, vendor-dependent execution Use of Cloud Used for better implementation of models and more value to the applications Primarily adopted to shift maintenance to 3rd party Use of AI/ML Trend prediction, pattern detection. Tactical use cases like code and unit test generation. Outcome Accepted, transferred to engineering, multiple patents Outgoing vendor dependency, transformation progress stalled What I Ask Before Starting Any New Digital Transformation What is the “why” behind this transformation? Do business and tech teams share the same understanding of it?How measurable are the architecture visions? Are they aligned and clearly tied to the outcomes? Who owns the transformation? How much space do the tech team get to experiment?Will this be really scalable or are we just trying to get through a short-term gain? If I then summarize this experience in one sentence, I’d say “success or failure, an architect is always taking away tech lessons from everything”. That’s the value they bring to the next project; that’s the experience that makes an architect an architect!

By Syamanthaka B

Building a Scalable GenAI Architecture for FinTech Workflows

Generative AI (GenAI) is rapidly transforming the financial services landscape. According to McKinsey, GenAI could unlock up to $340 billion in annual cost savings and productivity gains across the global banking sector. With this momentum, forward looking fintech leaders are embedding GenAI into critical workflows ranging from customer onboarding and credit decisioning to fraud detection and compliance. This article provides a practical architecture guide to help technology leaders adopt GenAI safely, effectively, and at scale. Why GenAI Matters for Financial Services Financial institutions are under constant pressure to operate faster, smarter, and leaner. GenAI provides a strategic edge by: Accelerating decisions in credit, fraud resolution, and customer service.Lowering costs by reducing manual effort through automation.Improving compliance with explainable AI outputs.Enhancing customer experience via intelligent, real time interactions. How GenAI Fits into FinTech Architecture Here are foundational components of a modern GenAI stack tailored for FinTech domain. The overall architecture of the proposed GenAI system is illustrated in Figure 1. To stay ahead in a fast evolving GenAI landscape, this architecture is designed with flexibility in mind featuring modular, model agnostic components and plug-and-play layers that can easily adapt to emerging tools, models and regulatory environments. Figure 1. Foundational components of a modern GenAI stack tailored for FinTech domain 1. Fintech Specific Input Layer: Receives Financial Data The Input Layer is the first point of entry for data into the GenAI system. As shown in Figure 1, this layer captures raw inputs from digital channels used by customers, employees, vendors or internal systems such as core banking, investment, regulation platforms, and backend services. It initiates the GenAI workflow by ingesting both structured and unstructured data ranging from finance documents and chat interactions to transactional logs. Additionally, it brings in external signals including third party APIs, compliance data feeds, financial market events, regulatory bulletins, news articles, and social media. 2. Fintech Pre-processing Layer: Prepares Financial Data for the GenAI Core This layer converts raw financial inputs into structured, enriched, and privacy compliant formats tailored for GenAI workflows. It addresses the unique challenges of fintech data ranging from banking statements, payment instructions, transaction logs, and credit bureau reports to regulatory disclosures and compliance forms by applying specialized techniques for validation, anonymization, and contextual structuring. Given the precision required in financial operations, the pre-processing layer plays a critical role in minimizing downstream errors and enabling accurate AI driven insights. Data Cleaning & Validation: Resolves inconsistencies, fills missing values, and verifies document accuracy especially for sensitive forms like KYC or income proofs.Data Masking & Anonymization: Strips or redacts personally identifiable information (PII) to align with data privacy regulations such as GDPR, PCI DSS, and FFIEC.Text Chunking & Embedding: Breaks down financial disclosures, T&Cs, or risk reports into vectorized representations for contextual retrieval.Entity Recognition & Linking: Identifies financial entities like account numbers, legal entities, transactions, and ties them to internal databases or external registries.Feature Engineering & Extraction: Derives structured indicators such as risk flags, spending patterns, or loan eligibility scores from raw statements and documents. Tools like AWS Textract, Google Document AI, and spaCy are used for OCR, entity recognition, PII redaction, and the parsing of both structured and unstructured financial documents. 3. Fintech GenAI Core Layer: Central AI Reasoning and Domain Specific Intelligence This layer is the heart of the system, performing deep financial reasoning and decision support using specialized LLMs, RAG, vector databases, and knowledge graphs tailored to core fintech tasks. Fintech Specific Fine Tuned and Custom LLMs: Tailored models trained on proprietary financial datasets to enhance task accuracy across fintech use cases. KYC/Onboarding LLM: Verifies identity documents, evaluates onboarding risks, and parses KYC forms using domain aligned prompts and policies. Lending Operation LLM: Automates credit underwriting, suggests personalized loan offers, and evaluates business loan viability based on financial health. Fraud Detection LLM: Detects anomalies in transactions, analyzes behavioral risk, and delivers real time fraud alerts across digital channels. Personal Financial Advisor LLM: Provides goal based investment recommendations, portfolio optimization, and personalized savings plans using user data and market trends. Regulatory Compliance LLM: Interprets compliance policies, audits financial activity, and assesses impact of regulatory changes on business operations. Trading Platform LLM: Analyses market sentiment, evaluates portfolio risk, and delivers company level health insights for trading strategies.Fintech Prompt Engineering: Designs task specific prompt templates and context injection for automated customer dispute resolution, product recommendation or fraud explanation.Fintech RAG (Retrieval Augmented Generation): Augments LLM responses with contextual data such as past decisions, product policies, and regulatory texts.Vector Database: Stores semantically indexed financial data like knowledge snippets, prior queries, and FAQs for fast and relevant retrieval.Fintech Product/Regulation/Terms Knowledge Graphs: Encodes relationships between financial products, regulatory rules, and policy terms to enable grounded reasoning, traceable outputs, and contextual understanding across use cases. This layer leverages fine tuned or custom Large Language Models (LLMs) including open source models like LLaMA and BERT adapted for specific FinTech tasks. These models are typically trained using frameworks like PyTorch or TensorFlow in combination with the Hugging Face Transformers library. This layer also incorporates Retrieval Augmented Generation (RAG) using vector databases such as FAISS and Pinecone and applies FinTech specific knowledge graphs to deliver contextualised, auditable, and regulation compliant AI reasoning. 4. LLM Orchestration Layer: Coordinates GenAI Tasks for Financial Precision The LLM Orchestration Layer serves as the control plane for GenAI workflows within the fintech ecosystem, coordinating how prompts, models, and policies are applied across critical operations. It manages complex decision flows such as real time fraud prevention, portfolio optimization, risk assessment and automated loan document processing by ensuring accurate prompt engineering, dynamic model routing based on sensitivity and SLA, and traceable, policy aligned execution. This layer provides centralized oversight for prompt strategies, fallback logic, and audit logging, making it essential to achieving scalable, secure, and regulation compliant GenAI adoption. Prompt Engineering & Context Injection: Builds prompts tailored for regulatory summaries, dispute resolutions, or lending rationales with embedded policy context.Model Routing & SLA based Selection: Routes queries to appropriate LLMs (e.g., private for KYC processing, public APIs for customer Q&A) based on task, risk, sensitivity, and latency requirements.Multi Model Coordination: Dynamically leverages various models (e.g., Claude for summarization, GPT-4 for document generation, fine tuned in house model for dispute policies).Session & Interaction Management: Maintains dialogue state for use cases like credit advisor copilots or onboarding chatbots.Fallback Handling & Versioning: Switches models or prompt variants if a response fails to meet financial domain accuracy or compliance thresholds.Guardrails & Compliance Filtering: Applies tone, domain, and response boundaries based on financial services regulations (e.g., FINRA, PCI-DSS).Telemetry, Auditing, and Usage Analytics: Tracks token level performance, latency, and anomaly alerts critical for audit trails and continuous improvement in regulated environments. Popular orchestration frameworks such as LangChain and LlamaIndex help coordinate prompt flows, multi model routing, and human in the loop validation in compliance sensitive GenAI deployments. 5. LLM Inference Service Layer: Serving Domain Aware Models for Financial Intelligence This layer enables secure, real time access to Fintech custom LLMs used in production grade financial applications. It ensures SLA compliance, low latency execution, and seamless integration with GenAI workflows. Model Access & Hosting: Provides scalable access to models hosted on platforms like Amazon SageMaker, Google Vertex AI, Azure Machine Learning, and Nvidia Triton with options for isolated or hybrid deployment for regulated workloads.Inference Modes: Supports real time (e.g., for customer queries or fraud alerts) and batch processing (e.g., nightly audit reporting or bulk document summarization).Security & Isolation: Enforces encryption, API rate limits, workload isolation, and role based access key for PCI-DSS and FFIEC compliance.Performance & Scalability: Delivers high throughput, low latency endpoints for use cases like real time customer service chatbots, market data analysis and instant loan eligibility checksTraffic & Resource Management: Manages traffic routing, caching, throttling and compute resources to ensure low latency inference, cost efficiency, and service reliability under dynamic financial workloadsMonitoring & Compliance Logging: Tracks usage metrics and inference reliability while maintaining auditable trails for governance and explainability. 6. Fintech System Integration Layer This layer embeds GenAI into the operational fabric of fintech systems by enabling seamless integration with mission critical platforms. It ensures that AI generated insights are contextually applied to core banking, compliance, customer servicing, and risk workflows and drives real time automation, regulatory alignment, and data exchange at scale. Core Banking & Transaction Systems: Facilitates account analysis, loan automation, and payment risk evaluations through direct integration.CRM & Personal Finance Apps: Synchronizes customer insights, intent based messaging, and AI driven recommendations.Payment & Settlement Gateways: Aligns GenAI generated decisions (e.g., dispute resolution) with live transaction routing or exception handling.Compliance and Risk Platforms: Sends GenAI outputs for review or escalation in fraud detection, transaction monitoring, and KYC operations.Investment Trading Systems: Integrates with portfolio management, risk assessment, and execution platforms to support real time decisioning, trade recommendations, and market analytics powered by GenAI insights.API & Event Management: Uses platforms like Kafka to enable secure, scalable, real time streaming and bidirectional data integration across fintech systems. 7. Human Feedback and Oversight Layer In highly regulated financial environments, this layer plays a pivotal role in ensuring that GenAI outputs remain ethical, traceable, compliant and aligned with business policies. It combines human in the loop validation with technical monitoring to uphold accuracy, transparency, and fairness across all AI driven financial workflows. Human in the loop Review: Facilitates structured review and approval flows for high risk outputs like loan decisions, compliance summaries, or customer disputes.Drift and Behavior Monitoring: Continuously tracks GenAI predictions to detect data drift and maintain relevance over time.Bias Detection & Fairness Checks: Identifies and mitigates algorithmic bias, ensuring fair treatment across different demographics.Audit Trails & Regulatory Alerts: Logs model behavior, input/output history, and triggers real time alerts for anomalies or policy violations.Business Policy Adherence Checks: Checks if GenAI outputs follow internal business rules and financial policies, helping ensure decisions are accurate, compliant, and easy to review during audits. Strategic Takeaways for Tech Leaders GenAI is no longer a futuristic addition, it's a foundational force reshaping financial services. This architecture enables organizations to reduce operational costs by 20-40% through intelligent automation, accelerate customer service from hours to seconds, maintain regulatory compliance while scaling AI capabilities, and future proof their technology stack. The question is no longer whether to adopt GenAI, but how quickly and effectively you can integrate it into your core operations. Those who lead with clear strategy and strong technical foundations will steer their institutions toward a faster, smarter, and more resilient future. Appendix A: Use Case: Credit Risk Assessment – Enabling Smarter Loan Approvals 1. Input Data Customer Data: Income, employment status, credit history, assetsCredit Bureau Reports: Scores, delinquencies, repayment historyApplication Inputs: Loan amount, term, purpose 2. Preprocessing Steps OCR & Parsing: Extracts data from uploaded income proofs or bank PDFsFeature Engineering: Derives debt to income ratio, credit utilization, risk indicatorsPII Redaction: Protects sensitive data before GenAI processing 3. GenAI Core Processing Fine Tuned Lending LLM: Evaluates repayment capacity, financial behavior patternsClassifies applicant risk (e.g., low, moderate, high)RAG Integration: Brings in internal underwriting policies or regulatory lending thresholdsCites prior approval patterns for similar profiles 4. Orchestration Logic Triggered upon loan application submission and LLM assesses credit risk categoryAcceptable risk → offer generation, High risk/edge case → manual underwriter reviewMay trigger secondary API checks (e.g., fraud, income validation) 5. Inference Output Credit Risk Score, Loan term and amount suggestionsApproval Recommendation (Approve / Review / Reject)Justification summary citing key financial indicators and policies used 6. Integration with Systems Output passed to loan origination platformApproved → rate/term offer stage, Rejected → rationale logged for audit or appealCRM integration for customer notification and follow up 7. Business Benefits Faster Decisioning: Cut assessment time from days to minutesReduced Defaults: Improved accuracy in identifying high risk borrowersExplainable AI: Ensures underwriters and auditors understand model logicOperational Efficiency: Handles more applications with fewer manual resources Appendix B: Use Case: Customer Onboarding & KYC Streamlining Identity Verification 1. Input Data Customer Inputs: Name, address, date of birth, contact detailsScanned Documents: Government issued IDs, utility bills, bank statementsAPI Feeds: Credit bureaus, identity verification providers 2. Preprocessing Steps OCR: Extracts text from scanned documentsValidation: Checks for format consistency (e.g., DOB, address fields)Normalization: Standardizes names, addresses 3. GenAI Core Processing Fine Tuned KYC LLM: Understands document types and extracts key dataMatches input vs. extracted vs. API dataFlags red flags (e.g., mismatches, expired IDs)RAG Integration: Pulls KYC policy documents or region specific regulations into prompt context 4. Orchestration Logic Workflow starts on application submissionLLM outputs confidence score for identity verificationHigh score → auto approvalLow score or conflict → routed to manual reviewExternal APIs may be triggered for deeper verification 5. Inference Output Identity verification confidence score, List of flagged inconsistenciesFinal recommendation: Approve, Review, or Reject 6. Integration with Systems Outputs passed to onboarding systemApproved customers proceed to account setupRejected or flagged cases enter manual resolution workflow 7. Business Benefits Faster Onboarding: Reduces processing time from days to minutesHigher Accuracy: Minimizes human error and fraud riskRegulatory Compliance: KYC rules are always enforcedBetter Experience: Customers enjoy a seamless digital journey

By Elakkiya Daivam

Development of System Configuration Management: Introduction

Series Overview This article is part 1 of a multi-part series: "Development of system configuration management." The complete series: IntroductionMigration end evolution Working with secrets, IaC, and deserializing data in GoBuilding the CLI and APIHandling exclusive configurations and associated templatesPerformance considerationSummary and reflections Introduction SCM is software that facilitates the widespread deployment of configurations across infrastructure. It is a tool that can orchestrate the parameters of computers to prepare them for the desired environment. The necessity of SCM is recognized across a large number of computer systems. A well-organized SCM can improve the productivity of the SRE team. The larger the number of hosts, the greater the productivity it introduces. Conversely, poorly organized SCM in small infrastructures can lead to decreased productivity. Typically, bare metal and VM-based infrastructure are suitable for deployment via SCM. While the deployment of applications using the SCM API is possible, it is not very convenient. Orchestrators like Kubernetes and Nomad are not designed to work with SCM. Infrastructure as Code (IaC) is more effective for provisioning. As a result, on average, we have at least three different tools for configuration deployment. While this isn't necessarily detrimental, it is common practice. On the other hand, custom infrastructure providers introduce their own challenges. Consequently, any additional tools incur overhead costs related to adjustment, development, and maintenance. My colleagues and I decided to develop our own SCM to address this issue. I authored the initial code, and then my colleagues joined me. Unfortunately, it is not an open-source system, but in this article, we will discuss the challenges we encountered during development, the solutions we found, the approaches we took, and the common principles for developing your own SCM. This information may be useful for those who face the same choice. What I Dislike about Popular SCM The most popular SCM in open source: Ansible, Saltstack, Puppet, Chef, CFEngine. These are good engines to use as SCM for most cases. Our case was the other. First of all, we used Ansible and SaltStack. The primary issue connected with them is the requirement for an up-to-date version of Python and its modules on the server. This necessity can lead to increased maintenance costs. The second issue is that most integration modules do not adapt to our specific use cases. This leads to a situation where the most commonly used features will be a file deployer, service runner, and package installer. Overall, the user will need to describe the integration with services in all cases. If it is a straightforward case, the process will be simple. However, if it is more complex, the difficulty will increase, and users may not notice a significant difference between developing their own SCM and using an open-source SCM. For instance, if we want to bootstrap ACL in Consul, we should perform the following steps: Run query to /v1/acl/bootstrap locally on the server.Store the obtained AccessorID and SecretID in Vault.Make this secret available on all servers in the cluster to enable management of ACLs through SCM. The second example is using internal TLS certificates for mTLS: Generate a new certificate and key.Store them in the vault.Deploy from the vault to a group of servers. The password deployment process is similar. These operations are bidirectional. At the start, we must generate the secret, store it in the secret storage, and then deploy it where necessary. In traditional SCM, such cases lead to the impossibility of deploying the system with just one push to Git and a single SCM call. Motivation to Develop a New SCM There are many pros and cons to changing the SCM to find a better solution for us. However, based on the experience of other colleagues, each SCM has its own advantages and disadvantages. This is acceptable, as even if we develop a new one, it will also have some drawbacks. Nevertheless, we can focus our efforts on increasing the benefits of our development. Ultimately, we identified several reasons why we believe developing a new SCM will lead us to success: Dissatisfaction with the old SCM: It may sound strange, but when many engineers struggle with a particular tool, they are often motivated to participate in developing and pushing for a new tool.Complaints about requirements and conditions: For instance, our new SCM would need to closely integrate with our private cloud and our own CMDB, taking into account the roles and host group semantics (in the future I will refer to it as hostgroup) we use in our live processes, as well as the specific open-source tools we integrate with.Development of new functionality: The new SCM can offer features relevant to our SRE needs that are not available in the current open-source SCM options. However, developing this codebase will require time. For instance, it can include: Automatic restoration of servicesIaC functionalityAutomation of cluster assembly and node joiningIndependence from irrelevant features: A new SCM, developed as ordinary software, will mitigate issues relating to security updates, unnecessary feature overload, and potential backward compatibility breaks. Specifically, the new SCM will: Have updates implemented only when we need them.Include only relevant functions (avoiding unnecessary functionality and bugs).Maintain backward compatibility in many cases where the open-source SCM cannot do so due to its universality and irrelevant features for us.Improved configurations of services: Even if we miss our goals, moving the configuration from one SCM to another allows us to eliminate irrelevant elements, remove unnecessary workarounds in service configuration, and ultimately create a cleaner configuration.Interest from other teams: Numerous teams are keen to learn from our experiences and may be interested in making similar decisions. How We Envisioned an Effective SCM For Us In our opinion, the SCM can prepare the empty host to be production-ready without the participation of engineers. SCM should create directories, manipulate files, run services, initialize and join nodes to the clusters, add users to the software, set permissions, and so on. This approach will improve the productivity of the SRE team and ensure the reproducibility of infrastructure. Earlier, we envisioned a system that could build new services in infrastructure, from creating and pushing a single file to Git. Moreover, we wanted to unite SCM and IaC. In our company, we use a self-developed private cloud to provision VMs. At that time, we did not have Terraform integration, and creating it from scratch was similarly labor-intensive. In our vision, a new SCM must create VMs, connect to them, and provision them to production in just 10 minutes. We wanted to write this in Go to minimize the number of dependencies installed on each machine. The main manifest for host groups is a simple YAML file and can be parsed by yamllint. This opens up opportunities for pre-commit checks that highlight syntax-level issues. The next critical integration is with a persistent database to store dynamically configured parameters. We integrated it with Consul, which allows us to deploy applications dynamically by setting new versions of applications — such as changing Docker images on the fly with an API — rather than hardcoding them into files in the Git repository. Another important aspect for us is integration with Vault, which enables the creation and retrieval of secrets and certificates for deployment on hosts. This would allow for bidirectional schemas, where our developed SCM generates secrets, automatically stores them in the Vault, and deploys them to the hosts. From What We Started, the Development Overview Both IaC and SCM functionalities operate as follows: According to the scheme, the SR Engineer pushes the configuration file to Git that describes the hostgroups, including resources and the number of replicas. This file contains all configuration details for the hostgroups: resources, software, and settings. The API periodically checks the API of the inventory manager for key fields: Do we have such host groups? If no, create thisDo we have enough hosts in this host group? If no, loop to create more hosts Once a host starts, the initial scripts (which can be the old SCM running in automatic mode, kickstarts in RHEL-based environment, or cloud-init) initiate the installation of the SCM agent. After the SCM agent starts, it registers with the SCM API and periodically retrieves its configuration. From that point onward, all hosts will be managed by the SCM agent for new configurations and deployments. There are many sources of data. The SCM API retrieves the data from all sources and merges it. We have a default.yaml file that contains the configuration relevant for all hosts. The target configuration for a hostgroup is stored in '{hostgroup name}.yaml'. Additionally, Consul and Vault provide extra information, including dynamic configuration and secrets. Consul is an important part of SCM as it allows for dynamic configurations to be stored without requiring a push to Git, which is useful for deployments. The SCM API includes a reverse proxy feature to route requests to Consul. This approach provides a unified access model and a single entry point for interactions with the configuration. As a result, SCM provides an interface to store dynamic configurations in Consul while keeping static configurations on the filesystem under Git control. There was a small service developed in Golang, consisting of three parts: the API, the agent, and the CD client. The first functionality that was developed was package installation. On CentOS 7, it uses YUM for this purpose. Our hosts were united in the host group in our resource manager (CMDB). The main idea was that the hosts in the host group must be similar or the same. The API then returned the configuration based on the determined host group of each host. Each host group had its own unique configuration, as follows: YAML packages: lsof: name: lsof-4.87-4 There are two main repositories: The SCM source codeConfiguration files that describe the hostgroup manifests for deployment The first repository contains the source code for the SCM, which operates according to the declarative configuration specified in the second repository. It checks various files, packages, and services for compliance with specified conditions. This code also provides for complex cases developed in Golang. In SaltStack or Ansible terminology, it is referred to as roles or formulas. The second repository contains declarative configurations in YAML. In Saltstack or Ansible, it is referred to as facts or pillars/grains. For convenience, a file containing the default settings for all hosts was introduced. This allows the option to avoid using SaltStack to deploy packages widely in the early stages of development, providing deployment opportunities for a base configuration as extensive as possible. The first modules introduced in the SCM included: Directory managerFile managerCommand run managerService managerPackage managerUser manager Code Explanation Configuration files open up opportunities for us to configure most resources on the system. These common managers had the following components: Declarative Config Handler It operates only with user-defined host groups via YAML files. For example, below is a piece of code that implements the file state checking logic: YAML func FilesDeclarativeHandler(ApiResponse map[string]interface{}, parsed map[string][]resources.File) { for _, file := range parsed[key] { if file.State == "absent" { FileAbsent(file.Path) continue } err := FileMkdir(file.Path, file.DirMode) if err != nil { logger.FilesLog.Println("Cannot create directory", err) } if file.Template == "go" { TemplateFileGo(file.Path, file.Data, file.FileMode, ApiResponse) } else if file.Symlink != "" { CreateSymLink(file.Path, file.Symlink, file.DirMode) } else if file.Data != "" { temp_file := GenTmpFileName(file.Path) ioutil.WriteFile(GenFile, data, file.FileMode) TemplateFileGo(file.Path, file.Data, file.FileMode) if CompareAndMoveFile(file.Path, GenFile, file.FileMode, file.FileUser, file.FileGroup) { CompareAndMoveFile(temporaryPath, file) FileServiceAction(file) } } ... The API part responsible for retrieving files from the filesystem and sharing them with specific hosts is as follows: YAML func FilesMergeLoader() { ... if Fdata.From != "" { filesPath := conf.LConf.FilesDir + "/data/" + Fdata.From loadedBuffer, err := ioutil.ReadFile(filesPath) if err != nil { logger.FilesLog.Println("hg", hostgroup, "err:", err) continue } Fdata.Data = base64.StdEncoding.EncodeToString(loadedBuffer) } ... If the 'from' field is defined, the API loads this file as a base64-encoded string into a new JSON field called data, allowing binary files to be transferred within JSON. An agent with a pull model periodically checks the API, retrieves these fields, and stores them on the hosts' destination filesystems. This is just a small part of the functionality that allows for file configurations with parameters such as: YAML files: /path/to/destination/filesystem: from: /path/from/source/filesystem template: go /etc/localtime: symlink: /usr/share/zoneinfo/UTC /etc/yum.repos.d/os.repo: state: absent These two parts of the code function in a similar manner: The Git repository consists of a set of declarative configuration files and the files that must be transferred to the agents. Two other modules operate similarly, although with slight differences. They contain significantly more logic related to their area. Almost all managers have flags for restarting or reloading services after making changes, which necessitates identifying differences before changes are made. As a result, if the SCM agent wants to create a directory, it must first check for its existence. The "running" state of service indicates that the service must be running and enabled, while the "dead" state signifies that the service must be disabled and stopped. In our infrastructure, such an operation doesn't need to be separated, and we have not implemented functionality to distinguish between the functions that enable and run services. The package handler workflow is illustrated in the following flowchart: On the other hand, the package handler is similar but with its own specific requirements: Macros for Calling Managers at the API Level In many cases, the merger part creates declarative configurations with common elements from certain macros. The API merely enriches the main YAML declaration for each host group. In the future mentions I will refer to them as Mergers. YAML // ApiResponse is a JSON that contains all fields declared by the user in group.yaml func HTTPdMerger(ApiResponse map[string]interface{}) { if ApiResponse == nil { return } // Check for the existence of the 'httpd' field. If not specified, skip since it's not a group for the httpd service. if ApiResponse["httpd"] == nil { return } // Get the statically typed struct from the main YAML var httpd resources.Httpd err := mapstructure.WeakDecode(ApiResponse["httpd"], &httpd) // Define the service name that should be run on the destination hosts httpdService := "httpd.service" if httpd.ServiceName != "" { httpdService = httpd.ServiceName } // Define the expected service state state := httpd.State if state != "" { // Add the service state to the response JSON common.APISvcSetState(ApiResponse, httpdService, state) } else { // Default to "running" if no state is specified common.APISvcSetState(ApiResponse, httpdService, "running") } // Define the httpd package name httpPackage := "httpd" if httpd.PackageName != "" { httpPackage = httpd.PackageName } // Add the package name to the response JSON common.APIPackagesAdd(ApiResponse, httpPackage, "", "", []string{}, []string{httpdService}, []string{}) Envs := map[string]interface{}{ "LANG": "C", } // Add user for the httpd service common.UsersAdd(ApiResponse, "httpd", Envs, "", "", "", "", 0, []string{}, "", false) // Add an empty directory for logs common.DirectoryAdd(ApiResponse, "/var/log/httpd/", "0755", "httpd", "nobody") // Add the Go templated file httpd.conf, which should be obtained from httpd/httpd.conf on the SCM API host from the GIT directory, and passed to /etc/httpd/httpd.conf on the destination server. common.FileAdd(ApiResponse, "/etc/httpd/httpd.conf", "httpd/httpd.conf", "go", "present", "root", "root", "", []string{}, []string{}, []string{httpdService}, []string{}) Url := "http://localhost/server-status" // We utilize the Alligator monitoring agent to collect metrics from HTTPd. Add configuration context with httpd. AlligatorAddAggregate(ApiResponse, "httpd", Url, []string{}) } As a result, the user can work in two ways: Declare the resources themselves.Declare a macro like httpd, and everything relevant to this service is automatically enriched in the resulting response. To create your own macros, you need to write Go code. To support, there are many functions, like common.DirectoryAdd and common.FileAdd only enriches the JSON. For example, here is an example of the FileAdd function: Go func FileAdd(ApiResponse map[string]interface{}, path, from, template, state, file_user, file_group, file_mode string, restart, reload, flags, cmdrun []string) { if ApiResponse == nil { return } if ApiResponse["files"] == nil { ApiResponse["files"] = map[string]interface{}{} } Files := ApiResponse["files"].(map[string]interface{}) NewFile := map[string]interface{}{ "from": from, "state": state, "template": template, "services_restart": restart, "services_reload": reload, "flags": flags, "cmd_run": cmdrun, "file_user": file_user, "file_group": file_group, "file_mode": file_mode, } Files[path] = NewFile } func APIPackagesAdd(ApiResponse map[string]interface{}, pkg string, Name string, Name9 string, Restart []string, Reload []string, CmdRun []string) { if ApiResponse["packages"] == nil { ApiResponse["packages"] = map[string]interface{}{} } Packages := ApiResponse["packages"].(map[string]interface{}) if Packages[pkg] == nil { NewPkg := map[string]interface{}{} if Name != "" { NewPkg["name"] = Name } if Name9 != "" { NewPkg["el9"] = Name9 } if Restart != nil { NewPkg["services_restart"] = Restart } if Reload != nil { NewPkg["services_reload"] = Reload } if CmdRun != nil { NewPkg["cmd_run"] = CmdRun } Packages[pkg] = NewPkg } } func APISvcSetState(ApiResponse map[string]interface{}, svcname string, state string) { if ApiResponse["services"] == nil { ApiResponse["services"] = map[string]interface{}{} } service := ApiResponse["services"].(map[string]interface{}) _, serviceDefined := service[svcname] if !serviceDefined { service[svcname] = map[string]interface{}{"state": state} } service[svcname].(map[string]interface{})["state"] = state } func DirectoryAdd(ApiResponse map[string]interface{}, Path string, Mode string, User string, Group string) { if ApiResponse["directory"] == nil { ApiResponse["directory"] = map[string]interface{}{} } Directory := ApiResponse["directory"].(map[string]interface{}) NewFile := map[string]interface{}{ "dir_mode": Mode, "user": User, "group": Group, } Directory[Path] = NewFile } func UsersAdd(ApiResponse map[string]interface{}, UserName string, Envs map[string]interface{}, Home string, Shell string, Group, Groups string, Uid int, Keys []string, Password string, CreateHomeDir bool) { if ApiResponse == nil { return } if ApiResponse["users"] == nil { ApiResponse["users"] = map[string]interface{}{} } Users := ApiResponse["users"].(map[string]interface{}) NewUser := map[string]interface{}{ "envs": Envs, "home": Home, "shell": Shell, "groups": Groups, "uid": Uid, "keys": Keys, "genpasswd": Password, "group": Group, "create_home_dir": CreateHomeDir, } Users[UserName] = NewUser } func AlligatorAddAggregate(ApiResponse map[string]interface{}, Parser string, Url string, Params []string) { if ApiResponse["alligator"] == nil { return } AlligatorMap := ApiResponse["alligator"].(map[string]interface{}) if AlligatorMap["aggregate"] == nil { var Aggregate []interface{} AlligatorMap["aggregate"] = Aggregate } AggregateMap := AlligatorMap["aggregate"].([]interface{}) AggregateNode := map[string]interface{}{ "parser": Parser, "url": Url, "params": Params, } AggregateMap = append(AggregateMap, AggregateNode) AlligatorMap["aggregate"] = AggregateMap } However, the file manager has additional logic due to the necessity to load the file body into the JSON. It works well by adding the file loader at the end of scanning other mergers. Other cases function similarly but are simpler. For the end user, the definition: YAML httpd: state: running Will be transformed into: YAML httpd: state: running packages: httpd: name: httpd flags: - httpd.service service: httpd.service: state: running files: /etc/httpd/httpd.conf: template: go from: httpd/httpd.conf services_reload: - httpd.service user: root group: root directory: /var/log/httpd/: dir_mode: "0755" user: httpd group: nobody users: httpd: envs: LANG: C alligator: aggregate: - url: http://localhost/server-status parser: httpd This opens up all the opportunities of modern SCM and remains flexible enough to change parameters. The Codebase That Operates at the Agent Level SCM allows for custom resource definitions. This is part of the role that must be performed on destination servers within the general JSON pulled from the SCM API. The SCM agent provides an interface with many functions to synchronize or template files, create symlinks, install packages on the operating system, and start or stop services. These functions act as wrappers that check for differences between the state declared by the SCM and the state at the host level. For example, before changing a file, the agent should check for its existence, identify differences, and synchronize that file from the SCM. This process is necessary to trigger actions related to state changes, such as running commands, restarting or reloading services, or performing other tasks. Many configuration parameters on Linux can be transferred via files, services, packages, and so on, and in most cases, there is no need for additional custom logic. However, sometimes there are cases where certain services cannot be restarted simultaneously across multiple servers. In such instances, we can describe the logic using locks, as shown in the code below: Go func HTTPdParser(ApiResponse map[string]interface{}) { if ApiResponse == nil { return } if ApiResponse["httpd"] == nil { return } var httpd resources.Httpd err := mapstructure.WeakDecode(ApiResponse["httpd"], &httpd) HttpdFlagName := "httpd.service" var Group string if ApiResponse["group"] != nil { Gropu = ApiResponse["group"].(string) } if common.GetFlag(HttpdFlagName) { LockKey := Hostgroup + "/" + HttpdFlagName LockRestartKey := "restart-" + HttpdFlagName if common.SharedLock(LockKey, "0", ApiResponse["IP"].(string)) { if !common.GetFlag(LockRestartKey) { common.SetFlag(LockRestartKey) common.DaemonReload() common.ServiceRestart(HttpdFlagName) } } if common.GetFlag(LockRestartKey) { if WaitHealthcheck(httpd, ApiResponse) { common.SharedUnlock(LockKey) common.DelFlag(LockRestartKey) common.DelFlag(HttpdFlagName) } } } } Visually, it works like this: This is just one example of such a case, but there can be many more. For instance, as I mentioned earlier, bootstrapping Consul ACLs must also be performed on the local node. Parsers only process JSON and perform actions to bring the configuration into compliance. Author Contributions Primary author: Kashintsev Georgii Developed the concept, outlined the structure, and authored diagrams as well as the majority of the content.Co-author: Alexander Agrytskov wrote key sections in Evolution, Unsatisfied Expectations, and Incidents. Also contributed to editing and review across all other sections to improve clarity and technical consistency. Reviewed the final draft.

By Georgii Kashintsev

Microservices

DZone's Featured Microservices Resources

Top Microservices Experts

The Latest Microservices Topics