DZone Spotlight

Sunday, November 9
Tactical Domain-Driven Design: Bringing Strategy to Code


By Otavio Santana
In the previous article, I discussed the most often overlooked aspect of Domain-Driven Design: the strategic side. When it comes to software development, teams tend to rush toward code, believing that implementation will clarify the domain. History shows the opposite — building without understanding the underlying reason or direction often leads to systems that are technically correct but conceptually wrong. As the old Latin root of strategy (strategos, “the art of the general”) suggests, the plan must precede the movement. Now that we’ve explored the “why” and “what,” it’s time to turn to the “how.” Tactical DDD represents this next step — the process of transforming a well-understood domain into expressive, maintainable code. While strategic design defines boundaries and fosters a shared understanding, tactical design brings those ideas to life within each bounded context. Tactical DDD focuses on implementing the domain model. It provides a rich vocabulary of design patterns — entities, value objects, aggregates, repositories, and domain services — each serving a precise purpose in expressing business logic. These patterns were not invented from scratch by Eric Evans; instead, they emerged from decades of object-oriented design thinking, later refined to fit complex business domains. The term “entity,” for instance, descends from the Latin “entitas” — “being” — emphasizing identity and continuity through change, while “value objects” recall the algebraic notion of equality by content rather than identity. What makes tactical DDD essential is its ability to create models that are not only accurate but also resilient to change in an era of a vast amount of tools and architecture patterns, such as distributed systems, microservices, and cloud-native architectures. Without a good direction, we can mislead and generate unnecessary complexity. This layer bridges the conceptual clarity of the strategic model with the practical demands of implementation. Tactical design ensures that business rules are captured in code rather than scattered across services, controllers, or database scripts. It’s about writing software that behaves like the business, not merely one that stores its data. As the strategic part defines direction, the tactical part defines execution. It consists of seven essential patterns that turn concepts into code. Entities – Objects with identity that persist and evolve.Value Objects – Immutable objects defined only by their attributes.Aggregates – Groups of related entities ensuring consistent boundaries.Repositories – Interfaces that abstract persistence of aggregates.Factories – Responsible for creating complex domain objects.Domain Services – Hold domain logic that doesn’t fit an entity or value object.Domain Events – Capture and communicate significant occurrences in the domain. Each plays a specific role in expressing business logic faithfully within a bounded context. Together, they bring the domain model to life, ensuring that design decisions remain aligned with the business, even as they are implemented deep within the code. Entities Entities represent domain objects with a unique identity, or ID, that persists over time, even as their attributes might change. They model continuity — something that remains the same even when its data evolves. They capture real-world concepts like Order, Customer, or Invoice, where identity defines existence. In e-commerce, an Order remains the same object whether it’s created, updated, or completed. 
Java public class Order { private final UUID orderId; private List<OrderItem> items = new ArrayList<>(); private OrderStatus status = OrderStatus.NEW; public void addItem(OrderItem item) { items.add(item); } } Value Objects Value Objects describe elements of the domain that are defined entirely by their values, not by their identity. They are immutable, replaceable, and ensure equality through content. In practice, value objects like Money, Address, or DateRange make models safer and more precise. For example, a Money object adds two amounts of the same currency, ensuring correctness and immutability. Java public record Money(BigDecimal amount, String currency) { public Money add(Money other) { if (!currency.equals(other.currency())){ throw new IllegalArgumentException("Currencies must match"); } return new Money(amount.add(other.amount()), currency); } } Aggregates Aggregates organize related entities and value objects under a single consistency boundary, ensuring that business rules remain valid. The aggregate root acts as the guardian of its internal state. A typical example is an Order controlling its OrderItems. All modifications are routed through the root, preserving invariants such as total price and item limits. Java public class Order { private final UUID orderId; private final List<OrderItem> items = new ArrayList<>(); public void addItem(Product product, int quantity) { items.add(new OrderItem(product, quantity)); } public BigDecimal total() { return items.stream() .map(OrderItem::subtotal).reduce(BigDecimal.ZERO, BigDecimal::add); } } Repositories Repositories abstract the way aggregates are stored and retrieved, allowing the domain to stay independent of database concerns. They act as in-memory collections that handle persistence transparently. A repository enables the domain to operate at a higher level, focusing on business logic rather than SQL or API calls. For example, an OrderRepository manages how Order objects are saved or found, without exposing infrastructure details. Java public interface OrderRepository { Optional<Order> findById(UUID id); void save(Order order); void delete(Order order); } Factories Factories are responsible for creating complex domain objects while ensuring that all invariants are satisfied. They centralize creation logic, keeping entities free from construction complexity. When creating an Order, for example, a factory ensures the object starts in a valid state and respects business rules — avoiding scattered creation logic throughout the code. Java public class OrderFactory { public Order create(Customer customer, List<Product> products) { Order order = new Order(UUID.randomUUID(), customer); products.forEach(p -> order.addItem(p, 1)); return order; } } Domain Services Domain Services hold domain logic that doesn’t naturally belong to an entity or value object. They express behaviors that involve multiple aggregates or cross-cutting business rules. For instance, a PaymentService could coordinate payment processing for an order. It operates at the domain level, preserving the model’s purity while integrating with external systems when necessary. Java public class PaymentService { private final PaymentGateway gateway; public PaymentService(PaymentGateway gateway) { this.gateway = gateway; } public PaymentReceipt processPayment(Order order, Money amount) { return gateway.charge(order.getOrderId(), amount); } } Domain Events Domain Events capture meaningful occurrences within the business domain. 
They represent something that happened — not an external trigger, but a fact that the domain itself wants to share. This makes the model more expressive, reactive, and aligned with real business language. For example, when an Order is placed, it can publish an OrderPlacedEvent. Other parts of the system, such as billing, shipping, or notification services, can then react independently, promoting decoupling and scalability. Java public record OrderPlacedEvent(UUID orderId, Instant occurredAt) { public static OrderPlacedEvent from(Order order) { return new OrderPlacedEvent(order.getOrderId(), Instant.now()); } } Application Services — Orchestrating Use Cases Although Application Services are not part of the original seven tactical DDD patterns, they deserve mention for their role in modern architectures. They act as use-case orchestrators, coordinating domain operations without containing business logic themselves. Application services sit above the domain layer, ensuring that controllers, APIs, or message handlers remain thin and focused on their primary purpose: communication. For example, when placing an order, an application service coordinates the creation of the Order, its persistence, and the payment process. The domain remains responsible for what happens, while the application service decides when and in which sequence those actions occur. Java public class OrderApplicationService { private final OrderRepository repository; private final PaymentService paymentService; private final OrderFactory factory; public OrderApplicationService(OrderRepository repository, PaymentService paymentService, OrderFactory factory) { this.repository = repository; this.paymentService = paymentService; this.factory = factory; } @Transactional public void placeOrder(Customer customer, List<Product> products) { Order order = factory.create(customer, products); repository.save(order); paymentService.processPayment(order, order.total()); } } In practice, application services serve as the entry points for use cases, managing transactions, invoking domain logic, and triggering external integrations as needed. They maintain the model’s purity while enabling the system to execute coherent business flows from end to end. Conclusion Tactical Domain-Driven Design brings strategy to life. While the strategic side defines boundaries and shared understanding, the tactical patterns — entities, value objects, aggregates, repositories, factories, domain services, and domain events — translate that vision into expressive, maintainable code. Even the application service, although not part of the original seven, plays a vital role in orchestrating use cases and maintaining the model's purity. More
Optimizing Write-Heavy Database Workloads for Low Latency


By Felipe Mendes
Write-heavy database workloads bring a distinctly different set of challenges than read-heavy ones. For example:

• Scaling writes can be costly, especially if you pay per operation, and writes are 5X more costly than reads
• Locking can add delays and reduce throughput
• I/O bottlenecks can lead to write amplification and complicate crash recovery
• Database backpressure can throttle the incoming load

While cost matters quite a lot, in many cases it’s not a topic we want to cover here. Rather, let’s focus on the performance-related complexities that teams commonly face and discuss your options for tackling them.

What Do We Mean by “a Real-Time Write-Heavy Workload”?

First, let’s clarify what we mean by a “real-time write-heavy” workload. We’re talking about workloads that:

• Ingest a large amount of data (e.g., over 50K OPS)
• Involve more writes than reads
• Are bound by strict latency SLAs (e.g., single-digit millisecond P99 latency)

In the wild, they occur across everything from online gaming to real-time stock exchanges. A few specific examples:

• Internet of Things (IoT) workloads tend to involve small but frequent append-only writes of time series data. Here, the ingestion rate is primarily determined by the number of endpoints collecting data. Think of smart home sensors or industrial monitoring equipment constantly sending data streams to be processed and stored.
• Logging and monitoring systems also deal with frequent data ingestion, but they don't have a fixed ingestion rate. They may not necessarily be append-only, and may be prone to hotspots, such as when one endpoint misbehaves.
• Online gaming platforms need to process real-time user interactions, including game state changes, player actions, and messaging. The workload tends to be spiky, with sudden surges in activity. They’re extremely latency sensitive since even small delays can impact the gaming experience.
• E-commerce and retail workloads are typically update-heavy and often involve batch processing. These systems must maintain accurate inventory levels, process customer reviews, track order status, and manage shopping cart operations, which usually require reading existing data before making updates.
• Ad tech and real-time bidding systems require split-second decisions. These systems handle complex bid processing, including impression tracking and auction results, while simultaneously monitoring user interactions such as clicks and conversions. They must also detect fraud in real time and manage sophisticated audience segmentation for targeted advertising.
• Real-time stock exchange systems must support high-frequency trading operations, constant stock price updates, and complex order-matching processes — all while maintaining absolute data consistency and minimal latency.

Next, let’s look at key architectural and configuration considerations that impact write performance.

Storage Engine Architecture

The choice of storage engine architecture fundamentally impacts write performance in databases. Two primary approaches exist: LSM trees and B-trees. Databases known to handle writes efficiently, such as ScyllaDB, Apache Cassandra, HBase, and Google BigTable, use Log-Structured Merge Trees (LSM). This architecture is ideal for handling large volumes of writes. Writes are immediately appended to memory, which allows for very fast initial storage. Once the “memtable” in memory fills up, the recent writes are flushed to disk in sorted order. That reduces the need for random I/O.
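To make the memtable-and-flush idea concrete, here is a minimal, hedged Java sketch of an LSM-style write path. It is a toy illustration under simplified assumptions (no commit log, no compaction, string keys and values), not ScyllaDB's or Cassandra's actual implementation.

Java
import java.io.IOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

// Toy LSM-style writer: puts land in a sorted in-memory memtable,
// which is flushed sequentially to an immutable "SSTable" file when full.
public class ToyLsmStore {
    private final TreeMap<String, String> memtable = new TreeMap<>();
    private final int flushThreshold;
    private int flushCount = 0;

    public ToyLsmStore(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    public void put(String key, String value) throws IOException {
        memtable.put(key, value);          // in-memory append: fast, no random I/O
        if (memtable.size() >= flushThreshold) {
            flush();                       // sorted, sequential write to disk
        }
    }

    private void flush() throws IOException {
        Path sstable = Path.of("sstable-" + (flushCount++) + ".txt");
        try (Writer out = Files.newBufferedWriter(sstable)) {
            for (Map.Entry<String, String> e : memtable.entrySet()) {
                out.write(e.getKey() + "=" + e.getValue() + System.lineSeparator());
            }
        }
        memtable.clear();                  // real engines also keep a commit log for crash recovery
    }
}

Because every flush produces another immutable file, reads must consult the memtable plus several SSTables, which is exactly why the compaction strategies discussed below matter.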
For example, here’s what the ScyllaDB write path looks like:

(Figure: the ScyllaDB write path.)

With B-tree structures, each write operation requires locating and modifying a node in the tree — and that involves both sequential and random I/O. As the dataset grows, the tree can require additional nodes and rebalancing, leading to more disk I/O, which can impact performance. B-trees are generally better suited for workloads involving joins and ad-hoc queries.

Payload Size

Payload size also impacts performance. With small payloads, throughput is good, but CPU processing is the primary bottleneck. As the payload size increases, you get lower overall throughput, and disk utilization also increases. Ultimately, a small write usually fits in all the buffers, and everything can be processed quite quickly. That’s why it’s easy to get high throughput. For larger payloads, you need to allocate larger buffers or multiple buffers, and the larger the payloads, the more resources (network and disk) are required to service them.

Compression

Disk utilization is something to watch closely with a write-heavy workload. Although storage is continuously becoming cheaper, it’s still not free. Compression can help keep things in check — so choose your compression strategy wisely. Faster compression speeds are important for write-heavy workloads, but also consider your available CPU and memory resources. Be sure to look at the compression chunk size parameter. Compression splits your data into smaller blocks (or chunks) and then compresses each block separately. When tuning this setting, keep in mind that larger chunks are better for reads while smaller ones are better for writes, and take your payload size into consideration.

Compaction

For LSM-based databases, the compaction strategy you select also influences write performance. Compaction involves merging multiple SSTables into fewer, more organized files to optimize read performance, reclaim disk space, reduce data fragmentation, and maintain overall system efficiency. When selecting a compaction strategy, you could aim for low read amplification, which makes reads as efficient as possible. Or you could aim for low write amplification by keeping compaction from being too aggressive. Or you could prioritize low space amplification and have compaction purge data as efficiently as possible. For example, ScyllaDB offers several compaction strategies (and Cassandra offers similar ones):

• Size-tiered compaction strategy (STCS): Triggered when the system has enough (four by default) similarly sized SSTables.
• Leveled compaction strategy (LCS): The system uses small, fixed-size (by default 160 MB) SSTables distributed across different levels.
• Incremental compaction strategy (ICS): Shares the same read and write amplification factors as STCS, but fixes STCS's 2x temporary space amplification issue by breaking huge SSTables into SSTable runs, which are composed of a sorted set of smaller (1 GB by default), non-overlapping SSTables.
• Time-window compaction strategy (TWCS): Designed for time series data.

For write-heavy workloads, we warn users to avoid leveled compaction at all costs. That strategy is designed for read-heavy use cases; using it can result in a regrettable 40x write amplification.

Batching

In databases like ScyllaDB and Cassandra, batching can actually be a bit of a trap – especially for write-heavy workloads. If you're used to relational databases, batching might seem like a good option for handling a high volume of writes.
But it can actually slow things down if it’s not done carefully. Mainly, that’s because large or unstructured batches end up creating a lot of coordination and network overhead between nodes. And that’s really not what you want in a distributed database like ScyllaDB. Here’s how to think about batching when you’re dealing with heavy writes:

• Batch by the partition key: Group your writes by the partition key so the batch goes to a coordinator node that also owns the data. That way, the coordinator doesn’t have to reach out to other nodes for extra data. Instead, it just handles its own, which cuts down on unnecessary network traffic.
• Keep batches small and targeted: Breaking up large batches into smaller, per-partition ones keeps things efficient. It avoids overloading the network and lets each node work on only the data it owns. You still get the benefits of batching, but without the overhead that can bog things down.
• Stick to unlogged batches: Provided you follow the earlier points, it’s best to use unlogged batches. Logged batches add extra consistency checks, which can really slow down the write.

So, if you’re in a write-heavy situation, structure your batches carefully to avoid the delays that big, cross-node batches can introduce. A short code sketch of such a single-partition, unlogged batch appears at the end of this article.

Wrapping Up

We offered quite a few warnings, but don’t worry. It was easy to compile a list of lessons learned because so many teams are extremely successful working with real-time write-heavy workloads. Now you know many of their secrets, without having to experience their mistakes. :-) If you want to learn more, here are some firsthand perspectives from teams who tackled quite interesting write-heavy challenges:

• Zillow: Consuming records from multiple data producers, which resulted in out-of-order writes that could lead to incorrect updates
• Tractian: Preparing for 10X growth in high-frequency data writes from IoT devices
• Fanatics: Heavy write operations like handling orders, shopping carts, and product updates for this online sports retailer

Also, take a look at the following video, where we go even deeper into these write-heavy challenges and walk you through what these workloads look like on ScyllaDB.
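As the companion sketch to the batching guidance above, here is a hedged example of a small, single-partition unlogged batch using the DataStax Java driver 4.x. The keyspace, table, and column names are hypothetical; adjust contact points and schema to your own cluster.

Java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.BatchStatement;
import com.datastax.oss.driver.api.core.cql.BatchStatementBuilder;
import com.datastax.oss.driver.api.core.cql.DefaultBatchType;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;

public class SensorBatchWriter {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            PreparedStatement insert = session.prepare(
                "INSERT INTO metrics.readings (sensor_id, ts, value) VALUES (?, ?, ?)");

            // Unlogged batch targeting a single partition (one sensor_id),
            // so the coordinator that receives it also owns every row.
            BatchStatementBuilder batch = BatchStatement.builder(DefaultBatchType.UNLOGGED);
            String sensorId = "sensor-42";
            for (int i = 0; i < 20; i++) {   // keep batches small and targeted
                batch.addStatement(insert.bind(sensorId, java.time.Instant.now(), (double) i));
            }
            session.execute(batch.build());
        }
    }
}

Because every row shares the same partition key, no cross-node fan-out is required, which is the whole point of partition-keyed, unlogged batching.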

Trend Report

Kubernetes in the Enterprise

Over a decade in, Kubernetes is the central force in modern application delivery. However, as its adoption has matured, so have its challenges: sprawling toolchains, complex cluster architectures, escalating costs, and the balancing act between developer agility and operational control. Beyond running Kubernetes at scale, organizations must also tackle the cultural and strategic shifts needed to make it work for their teams.As the industry pushes toward more intelligent and integrated operations, platform engineering and internal developer platforms are helping teams address issues like Kubernetes tool sprawl, while AI continues cementing its usefulness for optimizing cluster management, observability, and release pipelines.DZone’s 2025 Kubernetes in the Enterprise Trend Report examines the realities of building and running Kubernetes in production today. Our research and expert-written articles explore how teams are streamlining workflows, modernizing legacy systems, and using Kubernetes as the foundation for the next wave of intelligent, scalable applications. Whether you’re on your first prod cluster or refining a globally distributed platform, this report delivers the data, perspectives, and practical takeaways you need to meet Kubernetes’ demands head-on.


Refcard #387

Getting Started With CI/CD Pipeline Security

By Sudip Sengupta

Refcard #216

Java Caching Essentials

By Granville Barnett

More Articles

Unlocking Modernization: SUSE Virtualization on Arm64 With Harvester

As the number of data centers and their size grow worldwide, requiring increased efficiency, scalability, and agility from IT infrastructure, the convergence of virtual machines (VMs) and cloud-native technologies is crucial for success. A recent conversation between Dave Neary of Ampere Computing and Alexandra Settle, Product Manager for SUSE Virtualization, highlights a significant step forward: the general availability of SUSE Virtualization for Arm64 architecture, and Harvester’s pivotal role within SUSE’s ecosystem. This white paper summarizes their discussion, highlighting how SUSE is empowering organizations to modernize infrastructure with energy-efficient, high-performance solutions. SUSE Virtualization and Harvester: A Hyperconverged Foundation SUSE Virtualization, powered by its open-source upstream project Harvester, emerges as a robust hyperconverged infrastructure (HCI) solution designed to simplify IT operations. It unifies the management of virtual machines, containers, distributed storage, and comprehensive observability under a single, intuitive platform. Built on a Linux base, Harvester leverages Kubernetes (specifically RKE2) as its orchestration engine. This foundation integrates key cloud-native technologies, such as KubeVirt for virtual machine management, Longhorn for persistent block storage, and Prometheus/Grafana for monitoring and analytics. The core value proposition of SUSE Virtualization is its ability to bridge the gap between legacy VM workloads and modern containerized applications. It enables organizations to run both side-by-side, facilitating a smoother, incremental transition towards cloud-native architectures without the need to replace existing investments. This approach addresses the prevalent challenge of modernization, where virtual machines, despite the rise of containers, continue to play a critical role. SUSE further supports this journey through partnerships with migration specialists, assisting customers in transitioning VMs to Harvester and ultimately to container-based architectures as appropriate. The Strategic Advantage of Arm64 With Ampere A critical enabler of this modernization strategy is the integration of Arm64 architecture support. Ampere Computing is at the forefront of designing energy-efficient, many-core processors specifically tailored for cloud workloads. The advantages of Arm64 processors, particularly Ampere’s offerings, are compelling: they deliver single-threaded cores for predictable vCPU performance, enabling higher core density per rack. This translates into significantly lower power consumption and substantial cost benefits, positioning Arm as an attractive, greener alternative for cloud operators and enterprises alike. These attributes directly align with customer demands for enhanced performance, reduced operational costs, and environmentally sustainable infrastructure. A Phased Rollout and Community Engagement The path to general availability for Arm64 support has been a collaborative effort. Harvester first introduced Arm64 as a technical preview in its 1.3 release. Through extensive testing and QA with partners like Ampere, and invaluable community input, Arm64 achieved general availability and full support with the SUSE Virtualization 1.5 release (community release 1.5.0 in late May '25), with the enterprise “Prime” release following shortly thereafter. 
This commitment extends across the SUSE ecosystem, with components like Longhorn already supporting Arm64, and future plans targeting Rancher Manager for multi-cluster management on Arm64 in FY25. SUSE actively encourages users to test Harvester on Arm64, provide feedback, and engage with the active community channels on the Rancher Users Slack. This collaborative approach recognizes that real-world testing across diverse environments is crucial for platform refinement and robustness. Conclusion: A Path to Efficient Cloud Modernization The partnership between SUSE and Ampere, underscored by the general availability of SUSE Virtualization on Arm64, represents a significant leap forward in cloud modernization. It offers organizations a powerful, energy-efficient, and cost-effective path to integrate their existing VM infrastructure with the agility of cloud-native containers. This strategic alignment empowers businesses to build resilient, scalable, and sustainable IT environments for the future. Key points: SUSE Virtualization (Harvester) is an HCI solution for VMs, containers, storage, and observability, built on Kubernetes.It bridges legacy VMs with modern cloud-native applications, enabling incremental modernization.Arm64 support is now generally available, offering significant benefits like lower power consumption, higher core density, and reduced costs.Ampere Computing provides energy-efficient, high-performance Arm64 processors tailored for cloud workloads.The platform leverages KubeVirt, Longhorn, Prometheus/Grafana, all integrated with Kubernetes (RKE2). To gain deeper insights and see the full discussion, we invite you to watch the complete video: Check out the full Ampere article collection here.

By Craig Hardy
Master Production-Ready Big Data, Apache Spark Jobs in Databricks and Beyond: An Expert Guide

This article builds on existing experience scaling big data workloads with Apache Spark. It preserves the eight most important strategies while moving high-value but less central ones — preferring narrow transformations, applying code-level best practices, leveraging Databricks Runtime features, and optimizing cluster configuration — to a Miscellaneous section, keeping the focus on the most impactful areas, such as shuffles and memory, while still addressing the rest thoroughly. Diagrams provide phase-by-phase insight, the example code can be executed in Databricks or vanilla Spark sessions, and applying these strategies typically yields substantial performance benefits, often in the range of 5–20x in real-world pipelines.

Optimization Strategies

1. Partitioning and Parallelism

Strategy: Use repartition() to enhance parallelism before shuffle-intensive operations like joins, and coalesce() to minimize partitions pre-write to prevent small-file issues that hammer storage metadata.

Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PartitionExample").getOrCreate()

# Sample DataFrame creation
data = [(i, f"val_{i}") for i in range(1000000)]
df = spark.createDataFrame(data, ["id", "value"])

# Repartition for parallelism before a join or aggregation
df_repartitioned = df.repartition(200, "id")  # Shuffle to 200 even partitions

# Perform a sample operation (e.g., groupBy)
aggregated = df_repartitioned.groupBy("id").count()

# Coalesce before writing to reduce output files
aggregated_coalesced = aggregated.coalesce(10)
aggregated_coalesced.write.mode("overwrite").parquet("/tmp/output")

print(f"Partitions after repartition: {df_repartitioned.rdd.getNumPartitions()}")
print(f"Partitions after coalesce: {aggregated_coalesced.rdd.getNumPartitions()}")

Explanation: Partitioning is foundational for task parallelism and load balancing in Spark's distributed model. repartition(n) ensures even data spread via a full shuffle, ideal pre-join to avoid executor overload. coalesce(m) (where m < current partitions) merges locally for efficient writes, cutting I/O costs in Databricks' Delta or S3. Risks: Over-repartitioning increases shuffle overhead; monitor via Spark UI's "Input Size" metrics. Benefits: Scalable for TB-scale data; universal across Spark environments.

2. Caching and Persistence

Strategy: Cache or persist reusable DataFrames to skip recomputation in iterative or multi-use scenarios.

Python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.storagelevel import StorageLevel

spark = SparkSession.builder.appName("CachingExample").getOrCreate()

# Create a sample DataFrame (derive the new column from df's own "id" column)
df = spark.range(1000000).withColumn("squared", col("id") * col("id"))

# Cache for memory-only (default)
df.cache()
print("First computation (populates the cache):", df.count())

# Reuse: Faster second time
print("Second computation (from cache):", df.count())

# Switch to a custom level (e.g., memory and disk); unpersist first, since Spark
# does not allow changing the storage level of an already-cached DataFrame
df.unpersist()
df.persist(StorageLevel.MEMORY_AND_DISK)
print("Persisted count:", df.count())

# Clean up
df.unpersist()

Explanation: Recomputation kills performance in loops or DAG branches. cache() uses MEMORY_ONLY; persist() allows levels like MEMORY_AND_DISK for spill resilience. In Databricks, this leverages fast NVMe; watch memory usage to avoid evictions. Benefits: Up to 10x speedup in ML training. Risks: Memory exhaustion – use the Spark UI to track.

3.
Predicate Pushdown Strategy: Filter early to leverage storage-level pruning, especially with Parquet/Delta. Python from pyspark.sql import SparkSession from pyspark.sql.functions import col spark = SparkSession.builder.appName("PushdownExample").getOrCreate() # Read from Parquet (supports pushdown) df = spark.read.parquet("/tmp/large_dataset.parquet") # Assume pre-written large file # Early filter: Pushed down to storage filtered_df = df.filter(col("value") > 100).filter(col("category") == "A") # Further ops: Less data shuffled result = filtered_df.groupBy("category").sum("value") result.show() # Compare explain plans df.explain() # Without filter filtered_df.explain() # With pushdown visible Explanation: Pushdown skips irrelevant data at the source, slashing reads. Delta Lake enhances with stats; universal but format-dependent (Parquet, yes; JSON, no). Benefits: Network savings. Risks: Over-filtering hides data issues. Diagram: 4. Skew Handling Strategy: Salt keys or custom-partition to even out distributions. Python from pyspark.sql import SparkSession from pyspark.sql.functions import col, concat, lit, rand, floor spark = SparkSession.builder.appName("SkewExample").getOrCreate() # Skewed DataFrame skewed_df = spark.createDataFrame([(i % 10, i) for i in range(1000000)], ["key", "value"]) # Many duplicates on low keys # Salt keys: Append random suffix (0-9) salted_df = skewed_df.withColumn("salted_key", concat(col("key"), lit("_"), floor(rand() * 10).cast("string"))) # Group on salted key, then aggregate temp_agg = salted_df.groupBy("salted_key").sum("value") # Remove salt for final result final_agg = temp_agg.withColumn("original_key", col("salted_key").substr(1, 1)).groupBy("original_key").sum("sum(value)") final_agg.show() Explanation: Skew starves executors; salting disperses hot keys temporarily. Custom partitioners (via RDDs) offer precision. Check UI task times. Benefits: Balanced execution. Risks: Extra compute for salting. Diagram: 5. Optimize Write Operations Strategy: Bucket/partition wisely, coalesce files, use Delta's Optimize/Z-Order. Python from pyspark.sql import SparkSession spark = SparkSession.builder.appName("WriteOptExample").getOrCreate() # Sample DataFrame df = spark.range(1000000).withColumn("category", (spark.range(1000000).id % 10).cast("string")) # Partition by column for query efficiency df.write.mode("overwrite").partitionBy("category").parquet("/tmp/partitioned") # For Delta: Write, then optimize df.write.format("delta").mode("overwrite").save("/tmp/delta_table") spark.sql("OPTIMIZE delta.`/tmp/delta_table` ZORDER BY (id)") # Coalesce before write df.coalesce(5).write.mode("overwrite").parquet("/tmp/coalesced") Explanation: Writes create file explosions; coalescing consolidates. Delta's Z-Order clusters for scans; Benefits: Faster reads; Databricks-specific but portable via Hive. Diagram: 6. Leverage Adaptive Query Execution (AQE) Strategy: Enable AQE for runtime tweaks like auto-skew handling. 
Python from pyspark.sql import SparkSession spark = SparkSession.builder.appName("AQEExample").getOrCreate() # Enable AQE spark.conf.set("spark.sql.adaptive.enabled", "true") spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true") # Sample join that benefits from AQE (auto-broadcast if small) large_df = spark.range(1000000) small_df = spark.range(100) result = large_df.join(small_df, large_df.id == small_df.id) result.explain() # Shows adaptive plans result.show() Explanation: AQE adjusts post-stats (e.g., reduces partitions); benefits: Hands-off optimization; Spark 3+ universal. Diagram: 7. Job and Stage Optimization Strategy: Tune via Spark UI insights, adjusting memory/parallelism. Python from pyspark.sql import SparkSession spark = SparkSession.builder.appName("TuneExample") \ .config("spark.executor.memory", "4g") \ .config("spark.sql.shuffle.partitions", "100") \ .getOrCreate() # Sample job df = spark.range(10000000).groupBy("id").count() df.write.mode("overwrite").parquet("/tmp/tuned") # After run, check UI for GC/stages; adjust configs iteratively Explanation: UI flags GC (>10% bad); tune shuffle.partitions to match cores. Benefits: Resource efficiency; universal. Diagram: 8. Optimize Joins With Broadcast Hash Join (BHJ) Strategy: Broadcast small sides to eliminate shuffles. Python from pyspark.sql import SparkSession from pyspark.sql.functions import broadcast spark = SparkSession.builder.appName("BHJExample").getOrCreate() # Large and small DataFrames large_df = spark.range(1000000).toDF("key") small_df = spark.range(100).toDF("key") # Broadcast small for BHJ result = large_df.join(broadcast(small_df), "key") result.explain() # Shows BroadcastHashJoin result.show() Explanation: BHJ copies small DF to nodes; tune spark.sql.autoBroadcastJoinThreshold. Benefits: Shuffle-free. Risks: Memory for broadcast. Diagram: Miscellaneous Strategies These additional techniques complement the core set, offering targeted enhancements for specific scenarios. While not always foundational, they can provide significant boosts in code efficiency, platform-specific acceleration, and infrastructure tuning. Prefer Narrow Transformations Strategy: Favor narrow transformations like filter() and select() over wide ones like groupBy() or join(). Python from pyspark.sql import SparkSession from pyspark.sql.functions import col spark = SparkSession.builder.appName("NarrowExample").getOrCreate() # Sample large DataFrame df = spark.range(1000000).withColumn("value", spark.range(1000000).id * 2) # Narrow: Filter and select first (no shuffle) narrow_df = df.filter(col("value") > 500000).select("id") # Then wide: GroupBy (shuffle only on reduced data) result = narrow_df.groupBy("id").count() result.show() Explanation: Narrow ops process per-partition, avoiding shuffles; chain them early to prune. Benefits: Lower overhead Risks: Over-chaining increases complexity in code. Diagram: Code-Level Best Practices Strategy: Use select() to specify columns explicitly, avoiding *. 
Python from pyspark.sql import SparkSession spark = SparkSession.builder.appName("CodeBestExample").getOrCreate() # Sample wide table df = spark.createDataFrame([(1, "A", 100, "extra1"), (2, "B", 200, "extra2")], ["id", "category", "value", "unused"]) # Bad: Select all (*) all_df = df.select("*") # Loads unnecessary columns # Good: Select specific slim_df = df.select("id", "category", "value") # Process: Less memory used result = slim_df.filter(col("value") > 150) result.show() Explanation: * loads extras, increasing memory; select() trims. Benefits: Leaner pipelines; risks: Missing columns in evolving schemas. Diagram: Utilize Databricks Runtime Features Strategy: Harness Delta Cache and Photon for I/O and compute acceleration. Code Python from pyspark.sql import SparkSession spark = SparkSession.builder.appName("RuntimeFeaturesExample").getOrCreate() # Assume Databricks Runtime with Photon enabled spark.conf.set("spark.databricks.delta.cache.enabled", "true") # Delta Cache # Read Delta (caches automatically) df = spark.read.format("delta").load("/tmp/delta_table") # Query: Benefits from cache/Photon vectorization result = df.filter(col("value") > 100).groupBy("category").sum("value") result.show() Explanation: Delta Cache preloads locally; Photon vectorizes. Benefits: Latency drops; Databricks-only, emulate with manual caching elsewhere. Diagram Optimize Cluster Configuration for Big Data Strategy: Select instance types and enable autoscaling. For example, AWS EMR, etc. Python # This is configured via Databricks UI/CLI, not code, but example job config: # In Databricks notebook or job setup: # Cluster: Autoscaling enabled, min 2-max 10 workers # Instance: i3.xlarge (storage-optimized) or r5.2xlarge (memory-optimized) from pyspark.sql import SparkSession spark = SparkSession.builder.appName("ClusterOptExample").getOrCreate() # Run heavy job: Autoscaling handles load df = spark.range(100000000).groupBy("id").count() # Scales up automatically df.show() Explanation: Match instances to workload (e.g., memory for joins); autoscaling adapts. Benefits: Cost savings; Databricks-specific, but can be applied to AWS EMR, etc., with auto- and managed-scaling of instance configuration JSON during cluster bootstrap. Diagram Applicability to Databricks and Other Spark Environments Universal: Some of these methods apply to EMR, Synapse, and other Spark platforms, like Partitioning, caching, predicate pushdown, skew handling techniques, narrow transformations, coding practices, AQE, job optimization, and BHJ.Databricks-specific: Write operations with Delta, features in the Runtime, cluster configuration (and configuration changes) are all native to Databricks (but can be leveraged with alternatives like Iceberg or some manual tuning). Conclusion In this article, I tried to demonstrate eight core strategies that underpin addressing shuffle, memory, and I/O bottlenecks, and improving efficiency. The miscellaneous section describes some subtle refinement approaches, platform-specific improvements, and infrastructure tuning. You now have flexibility and variability in workloads, including ad hoc queries and production ETL pipelines. Collectively, these 12 strategies (core and misc.) promote a way of thinking holistically about optimization. Start by profiling in Spark UI, adaptively implement incremental improvements using the snippets provided here, and benchmark exhaustively to demonstrate the improvements (using metrics for each). 
By applying these techniques in Databricks, you will not only reduce costs and latency but also build scalable, resilient big data engineering solutions. As Spark development (2025 trends) continues to expand, please revisit this reference and new tools, such as MLflow, for experimentation capabilities, moving bottlenecks into breakthroughs.

By Ram Ghadiyaram
This Compiler Bottleneck Took 16 Hours Off Our Training Time

A 60-hour training job had become the new normal. GPUs were saturated, data pipelines looked healthy, and infra monitoring didn’t flag any issues. But something was off. The model wasn't large, nor was the data complex enough to justify that duration. What we eventually discovered wasn't in the Python code or the model definition. It was buried deep in the compiler stack. Identifying the Invisible Bottleneck Figure: Model pipeline showing the expected flow toward fused kernel optimization and the alternate fallback path leading to GPU under-utilization. The figure highlights critical decision points in the compiler stack affecting performance. We were deploying a quantized neural network using TensorFlow with TensorRT acceleration, routed through a custom TVM stack for inference optimization. On paper, this stack was airtight: optimized kernels, precompiled operators, and GPU targeting enabled. But profiling told another story. Despite all optimizations turned on, GPU utilization plateaued. Memory spikes were inconsistent. Logs showed fallback ops triggering sporadically. We drilled into the Relay IR emitted by TVM and discovered a subtle but costly regression: certain activation patterns (like leaky_relu fused after layer_norm) weren’t being lowered correctly for quantized inference. Instead of a fused kernel, we were getting segmented ops that killed parallelism and introduced memory stalls. APL # Expected fused pattern (not observed due to mispass) %0 = nn.batch_norm(%input, %gamma, %beta, epsilon=0.001) %1 = nn.leaky_relu(%0) # Compiled form should have been a single fused op # Actual IR observed during fallback %0 = nn.batch_norm(%input, %gamma, %beta, epsilon=0.001) %1 = nn.op.identity(%0) %2 = nn.leaky_relu(%1) # Segmentation introduced identity pass, breaking the fuse pattern The root cause? A compiler pass treating a narrow edge case as a generic transform. The Dirty Work of Debugging Acceleration In debugging TVM's Relay IR stack, one overlooked ally was the relay.analysis module. We scripted out a pattern matcher to scan through blocks and detect unintended op separation, especially when quantization annotations were injected. The IR was instrumented with logs to trace op-to-op transitions. Python from tvm.relay.dataflow_pattern import wildcard, is_op, rewrite, DFPatternCallback class FuseChecker(DFPatternCallback): def __init__(self): super().__init__() self.pattern = is_op("nn.batch_norm")(wildcard()) >> is_op("nn.leaky_relu") def callback(self, pre, post, node_map): print("Unfused pattern detected in block:", pre) rewrite(FuseChecker(), mod['main']) This gave us visibility into the transformation path and showed that, despite high-level optimizations, certain common patterns weren’t being caught. Worse, the IR graph diverged depending on how the quantization pre-pass handled calibration annotations. Fixing this wasn’t elegant. First, we had to isolate the affected patterns with debug passes in Relay. We then created a custom lowering path that preserved fused execution for our target GPU architecture. Meanwhile, TensorRT logs revealed that calibration ranges had silently defaulted to asymmetric scaling for certain ops, leading to poor quantization fidelity; something no benchmarking script had caught. 
APL
[TensorRT] WARNING: Detected uncalibrated layer: layer_norm_23
[TensorRT] INFO: Tactic #12 for conv1_3x3 failed due to insufficient workspace
[TensorRT] INFO: Fallback triggered for leaky_relu_5
[TensorRT] WARNING: Calibration range fallback defaulted to (min=-6.0, max=6.0)

We re-quantized using percentile calibration and disabled selective TensorRT fusions that were behaving unpredictably. The changes weren’t just in code; they were in judgment calls around what to optimize and when. Performance engineering is part systems diagnosis, part gut instinct.

Infrastructure Rewrites We Didn’t Want To Do

To maintain reproducibility, we added hashing logic to model IR checkpoints. This allowed us to fingerprint model graphs before and after every compiler optimization pass. Any IR delta triggered a pipeline rerun and a deployment alert. We also introduced an internal version control mechanism for compiled artifacts, stored in S3 buckets with hash-tagged lineage references. This way, deployment failures could be traced back to a specific commit in the compiler configuration and not just the source model. None of these fixes was isolated. Once our quantization flow changed, our SageMaker Edge deployment containers broke due to a package mismatch and model signature incompatibility. We had to revalidate across device classes, update Docker images, and reprofile edge deployment times.

Python
# TVM Lowering Configuration
import tvm
from tvm import relay

target = tvm.target.Target("cuda")
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

What made this harder was the legacy cost-tracking infrastructure built during my time at Amazon Go. Any model versioning tweak disrupted our resource billing granularity. So we also had to rewire metering hooks, regenerate EC2 cost estimates, and rewrite tagging policies. Tooling complexity cascades: a single op tweak at the compiler level turned into a week-long, infra-wide dependency resolution.

Performance Tradeoff

Yes, we got a 5x speedup. But it came with tradeoffs. Some quantized models lost accuracy in edge deployment. Others were too fragile to generalize across hardware classes. We had to A/B test between 8-bit and mixed-precision models. In one case, we rolled back to a non-quantized deployment just to preserve prediction confidence. Quantization also impacted explainability in downstream model audits. We noticed inconsistent behavior in post-deployment trace logs, particularly in user-facing applications where timing-sensitive predictions created drift across device tiers. Optimizing the calibration configuration for precision often meant sacrificing consistency — a trade-off that's hard to communicate outside the infra team. The hardest part? Convincing teams that ‘faster’ didn’t always mean ‘better.’

Closing Reflection

Most performance wins don’t come from new tools. They come from understanding how existing ones fail. TVM, TensorRT, SageMaker... they all offer acceleration, but none of them account for context. We learned to build visibility into our compilers, not just our models. We now inspect every IR block before deployment. We trace every fallback path. We don’t just benchmark for speed; we benchmark for behavior. We also built internal dashboards to track compiler-side regressions over time. Having that historical visibility has helped us preemptively catch fallback patterns that would have otherwise crept into production. That 16-hour drop wasn’t just a speed win. It was a visibility win.
And in ML infrastructure, that’s the metric that really matters.

By Srinidhi Goud Myadaboyina
Modular Monoliths Explained: Structure, Strategy, and Scalability

Choosing the right architectural style is crucial for building robust systems. It’s a decision that must balance short, medium, and long-term needs — and the trade-offs involved deserve thoughtful consideration. In software engineering, we’re spoiled for choice. But with buzzwords flying around and shiny new paradigms tempting us at every turn (did someone say FOMO?), it’s easy to feel overwhelmed. Some styles get dismissed as “legacy” simply because they didn’t work well for a few organizations — even if they still hold merit. The classic debate often boils down to monolith and microservices. Both have their strengths and weaknesses, but the trade-offs may not age well or meet evolving system requirements. Re-architecting later isn’t always feasible or cost-effective, and the resulting technical debt can become a bottleneck for business growth. But what if there’s a middle ground? An architectural style that combines the simplicity of monoliths with the flexibility of microservices — and allows you to pivot as your system matures. Introducing the modular monolith — or modulith. Modular Monolith Definition The term modular monolith consists of two parts, viz., module and monolith. In general, a module can be defined as a standalone logical application. This logical application may consist of several features, domain model, APIs (for external and internal consumptions), DB tables, and microfrontend. A typical module A monolith, as we know, is an application built as a single unit with all the features, DB tables, front end intertwined and tightly coupled with each other. A typical monolith Thus, by combining these two, a modular monolith can be defined as an application with multiple independent but cohesive modules coupled together as a software unit. A typical modular monolith Module Structure or Boundary A key characteristic of a modular monolith is the clear definition of modules and their respective boundaries — i.e., the business capabilities each module is responsible for. These boundaries can be derived similarly to bounded contexts in domain-driven design (DDD). Initially, boundaries don’t need to be rigid. This flexibility allows teams to evaluate and refine the module structure over time, making it easier to pivot if needed. For example, if a module is found to be too lightweight (anemic) or overly complex (bloated), it can be merged with another module or split into smaller ones. While modules may reside in separate code repositories for organizational purposes, they are deployed together as a single unit. Strategy Let's discuss available strategies for key design considerations. These strategies are crucial for reaping the maximum benefits from the modular monolith architecture. Depending upon the domain complexity and consistency requirements, either one or more or a mix of patterns can be incorporated for respective considerations. Inter-Module Communication Although a module is principled as self-contained and independent unit of processing it still need to communicate with other module(s) to function as a system. To achieve loose coupling and tight cohesion, the communication pattern choice is crucial. Below are a few available choices for effective inter-module communication: In-memory events. With events, the modules are decoupled and communicate asynchronously. Modules publish events for other modules to react to the domain changes. With this pattern, the extraction of the module as a separate deployable unit (microservice) is eased to a great extent. 
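To ground the in-memory events option, here is a minimal sketch using Spring's ApplicationEventPublisher. The event and module names are hypothetical, and Spring Modulith layers additional tooling on top of this same mechanism; treat it as an illustration of the pattern, not a prescribed implementation.

Java
import org.springframework.context.ApplicationEventPublisher;
import org.springframework.context.event.EventListener;
import org.springframework.stereotype.Component;
import org.springframework.stereotype.Service;

// Published by the order module; consumed in-process by other modules.
record OrderPlacedEvent(String orderId) {}

@Service
class OrderService {
    private final ApplicationEventPublisher events;

    OrderService(ApplicationEventPublisher events) {
        this.events = events;
    }

    void placeOrder(String orderId) {
        // ... persist the order inside the order module ...
        events.publishEvent(new OrderPlacedEvent(orderId)); // no compile-time dependency on other modules
    }
}

@Component
class NotificationModuleListener {
    // Runs synchronously by default; add @Async (or Spring Modulith's
    // @ApplicationModuleListener) for asynchronous, decoupled handling.
    @EventListener
    void on(OrderPlacedEvent event) {
        System.out.println("Sending confirmation for order " + event.orderId());
    }
}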
This pattern, however, is suitable only if an eventual or weaker form of consistency is required.Dependency inversion (DI). Modules invert dependency between themselves by allowing other modules to define the core logic. The dependent module defines a contract (viz. an interface) which is implemented by the provider module. E.g., for a payment module (dependent) to send a notification, the dependency is inverted by allowing the notification module to implement & determine which and how the notification is being sent (email, SMS, etc.).API. Participating modules define APIs and communicate via them only. These APIs could simply be a direct function call or REST, gRPC, etc. In the case of a direct function call, there is a risk of cyclic dependencies between module code, and thus it should be carefully implemented.Orchestrator. An additional module, viz., orchestrator, can act as the aggregator by communicating between modules and delegating the individual core processing to them. The communication between the orchestrator and modules could be event-based or API calls. The coarse-grained external APIs are available through this orchestrator. This pattern is suitable where strong consistency is required. E.g., An order processing orchestrator would communicate with inventory, payment, notification, shipping, etc. modules to process the order. This orchestrator is also responsible for handling failures. Note: Inter module communication choices are independent of the external API communication pattern. As an example, external APIs could still be synchronous while modules communicate internally asynchronously via events or DI or both. Neither of them should influence or enforce the other. Data Isolation Data isolation strategy influences the robustness, fault tolerance, throughput and scalability of system. Although the choice is strongly influenced by the consistency requirement of system, it can be adapted feature/module wise as well. E.g., while payment requires strong consistency, notification could work for eventual or weaker consistency. Thus, payment module would have different data isolation as compared to notification module. Below are a few available choices for data isolation: No isolation. Every module is allowed to write or read from all the DB tables. This simplifies the systems’ transaction management and ensures strong consistency. However, the chances of data corruption by any rogue module are high. Moreover, any schema changes have a corresponding ripple effect on almost every module. Thus, this strategy should be employed only as the last choice unless absolutely necessary.Write isolation. Only owning module writes to the table but all modules can read from the table directly. This ensures data integrity and simplified transaction management. However, any schema update breaks the reader modules.Command Query Responsibility Segregation (CQRS). Similar to write isolation, only the owning module writes to the table. For other modules, read-only views or shadow tables (with read-only access) are provided. This strategy is the most preferred one as it ensures data integrity, simplified transaction management, flexible schema changes without breaking reader modules, and high read and write throughput. The readers, however, could experience some delays in data freshness.Schema per module. Every module owns a DB schema/tables to which it can read and write freely. Access of data for other modules are restricted via respective APIs only. 
While data integrity and flexible schema changes are ensured, this strategy suffers from providing eventual consistency and complex transaction management. Thus, this choice should be carefully evaluated before being incorporated. Development Velocity Initially, if high development velocity is required, start with a monolith first. A typical monolith doesn’t age well, and they eventually end up as a big ball of mud with low development velocity. Thus, at a certain point in the development journey, the monolith should be first broken into modules as a modular monolith and evaluated on various parameters. However, if an existing monolith is to be broken, then evaluate Modular Monolith first. Thus, a typical development journey could look like below: Development journey: Start with monolith, proceed towards modular monolith, and reach micro services. Note: Spring modulith promises to assist developers and enforce the modular monolith principles, thus increasing the development velocity. Advantages and Disadvantages Below are the key advantages and disadvantages of modular monolith that we should be aware of. The majority of which are similar to that of a monolith. Advantages Simplified deployment — Entire application is deployed as a single unit, reducing orchestration complexity.Enhanced modularity — Codebase is organized into well-defined modules, improving maintainability and clarity.No network overhead — In-process communication between modules avoids the latency and complexity of inter-service calls.Easier debugging — Centralized logging and stack traces make troubleshooting more straightforward than in distributed systems.Supports domain-driven design (DDD) — Modules can align with bounded contexts, promoting clean domain separation.Lower operational cost — Fewer infrastructure components mean lower DevOps overhead than with microservices.Smooth refactoring path — Modular monolith helps to test the module boundaries early (e.g., if the domain split is incorrect it can be merged back quickly and easily). Reverse migration can be done easily from micro services to a modular monolith. Disadvantages Limited scalability — Cannot scale modules independently, which may impact overall system throughput.Risk of tight coupling — Poor boundaries or shared state can lead to hidden dependencies between modules.Single point of failure — A rogue module can potentially crash the entire system.Harder to enforce isolation — Unlike micro services, module boundaries rely on discipline instead.Deployment bottlenecks — Any change, even in a single module, requires redeploying the entire application. Case Studies Below are a few real-world case studies of Modular Monolith being used or evaluated for the benefits it offers: Shopify successfully migrated from the existing monolith to a modular monolith (reference).GitLab is evaluating the movement from legacy monolith to modular monolith (reference). Conclusion Modular Monoliths strike a thoughtful balance between traditional monoliths and micro services. By organizing code into cohesive, loosely coupled modules, teams gain clarity, maintainability, and a smoother path to future architectural evolution. This approach reduces operational overhead, supports domain-driven design, and enables faster development cycles — especially in early stages. While it doesn’t offer independent scaling like microservices, it avoids their complexity and fragmentation. 
For many teams, starting with a modular monolith provides a solid foundation that can adapt as the system grows, making it a strategic choice for building robust, flexible, and maintainable software systems.

References

Atlassian — Microservices vs. Monolith
Martin Fowler — Monolith First

Monolith vs. Modular Monolith vs. Microservices
A quick primer on the comparison of monolith vs. modular monolith vs. microservices.

By Ammar Husain
Bridging the Divide: Tactical Security Approaches for Vendor Integration in Hybrid Architectures

Security architecture in hybrid environments has traditionally focused on well-known concepts such as OWASP vulnerabilities, identity and access management, role-based access control, network security, and the principle of least privilege. Best practices like secure coding and incorporating SAST/DAST testing into CI/CD pipelines are also widely discussed. However, when organizations operate in a hybrid model — running workloads both on-premises and in the cloud — while also integrating with vendor-managed cloud solutions, a different set of security design considerations comes into play. These scenarios are not uncommon, yet they are rarely highlighted in the context of secure solution implementation involving vendor software in hybrid environments. This article highlights three real-world use cases and outlines practical architectural strategies organizations can adopt to ensure secure integration in hybrid settings.

Acronyms

OWASP – Open Web Application Security Project
SAST – Static Application Security Testing
DAST – Dynamic Application Security Testing
CI/CD – Continuous Integration / Continuous Delivery
SaaS – Software as a Service
UX – User Experience
ETL – Extract, Transform, and Load

Use Cases

This article covers three use cases, as listed below.

Automated software update by the vendor in the organization's managed data center
Webhook – mismatch in verification methodology
JavaScript embedding – monitoring mandate

Tactical Solutions

Automated Software Update by Vendor in Organization-Managed Data Center

Problem Statement

In some vendor software integrations, organizations are required to install an agent within their own data center. This agent typically acts as a bridge between the vendor's cloud-hosted application and the organization's on-premises systems. For example, it may facilitate data transfer between the vendor software and the organization's on-premises database. In many cases, the vendor's operational architecture requires that this agent be updated automatically. While convenient, this approach introduces a significant security risk. If the vendor's software is compromised or contains malware, the update process could infect the virtual machine or container hosting the agent. From there, the threat could propagate into other parts of the organization's infrastructure, potentially leading to a major security incident. Figure 1 showcases the scenario.

Figure 1: Vendor software agent running in the organization's data center

Solution

A tactical way to solve this problem is to install the upcoming version of the agent software in a separate virtual machine or container and scan both the software and the machine for vulnerabilities. If the software and the platform it runs on pass all security checks, the vendor can be approved to install the new version of the agent automatically. This ensures that an unverified version of the vendor software is never pushed automatically into the organization's data center. Figure 2 demonstrates the solution, and a small sketch of such an approval gate follows.

Figure 2: Pre-release version of vendor software and scan process
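To illustrate the "stage it, scan it, then approve it" gate described above, here is a minimal, hypothetical Python sketch. The staged image name, the scanner invocation (shown with Trivy-style options), and the approval step are all placeholders for whatever tooling your organization actually uses, not a prescribed implementation.

Python

# Hypothetical pre-release gate: scan a staged copy of the vendor agent before
# allowing the automatic update into the production environment.
import subprocess
import sys

STAGED_IMAGE = "registry.internal/vendor-agent:candidate"  # illustrative name

# Illustrative scanner command; configure your scanner of choice so it exits
# non-zero when findings exceed your severity threshold.
SCAN_COMMAND = ["trivy", "image", "--exit-code", "1", "--severity", "HIGH,CRITICAL"]


def scan_image(image: str) -> bool:
    """Return True only if the staged image passes the vulnerability scan."""
    result = subprocess.run(SCAN_COMMAND + [image], capture_output=True, text=True)
    return result.returncode == 0


def approve_vendor_update(image: str) -> None:
    """Placeholder: flip whatever flag or API tells the vendor the rollout may proceed."""
    print(f"Approved automatic rollout of {image}")


if __name__ == "__main__":
    if scan_image(STAGED_IMAGE):
        approve_vendor_update(STAGED_IMAGE)
    else:
        print("Scan failed; blocking the automatic agent update", file=sys.stderr)
        sys.exit(1)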
Webhook: Mismatch in Verification Methodology

Problem Statement

This is an interesting security scenario that teams often stumble over. For a webhook implementation, the organization has to open inbound connectivity from the vendor software over the internet. Because this is inbound traffic to the organization's data center (on-prem or cloud), it needs to be verified against every aspect of software security, such as DDoS attacks, malicious payloads, etc. Organizations generally have a well-defined common security policy for verifying all incoming traffic from external vendors. On the other hand, the vendor may also publish its own policy that serves as a guideline for customers on how to verify inbound traffic from the vendor's webhook. It is highly unlikely that the security policies of the organization and the vendor will match, especially when both are major players in the industry. Because the policies rarely match, implementing such a webhook integration becomes a challenge.

Solution

A tactical way to solve the issue is to let the incoming traffic hit the organization's reverse proxy layer. The reverse proxy layer, which receives traffic from the internet, is generally protected by a DDoS protection layer. The reverse proxy can then forward the incoming traffic to the backend service layer, which holds the business logic to process the webhook request. The backend service layer implements payload verification and any other checks on the vendor's webhook traffic, based on the policy set up for the vendor's specification. Figure 3 demonstrates the tactical solution.

Figure 3: Webhook traffic verification

JavaScript Embedding: Monitoring Mandate

Problem Statement

Some vendor solutions these days are JavaScript toolkits. These are typically Digital Adoption Platform (DAP) products used to guide users through the UX of a web platform and familiarize them with newly released features. The integration process often requires embedding the vendor's JavaScript toolkit within the organization's codebase. This is deemed risky due to script injection and other types of JavaScript vulnerabilities. In addition, vendor software generally has a feature to send information from the web browser back to the vendor's system to capture data for analytical purposes. This analytical data capture adds further risk, since the vendor could capture unauthorized data elements about customers and applications. The organization therefore prefers analytics traffic to flow from the browser to the vendor platform through its own infrastructure. If the data flows through the organization's infrastructure, the data sent to the vendor platform can be monitored and acted upon as necessary.

Solution

There are two problems to solve in this use case:

Safely integrate the vendor's JavaScript package into the organization's codebase
Implement a solution to send analytics traffic from the browser to the vendor through the organization's infrastructure

To implement a secure integration with the vendor's JavaScript tool, the script needs to be packaged as part of the CI/CD pipeline so it can be scanned and put through SAST/DAST testing before deployment. To route the analytics traffic to the vendor platform through the organization's infrastructure, create a proxy to the target vendor endpoint and customize the vendor JavaScript to point to the proxy. This arrangement routes analytics traffic from the browser to the vendor through the organization's infrastructure; a small sketch of such a proxy follows.
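As a rough illustration of the analytics pass-through proxy, here is a minimal, hypothetical Flask sketch. The route name, vendor URL, and blocked-field list are assumptions for illustration only, not part of any vendor's actual API.

Python

# Hypothetical analytics pass-through proxy: the browser is pointed at this
# endpoint instead of the vendor's, so the payload can be inspected first.
from flask import Flask, request, jsonify
import requests

app = Flask(__name__)

VENDOR_ANALYTICS_URL = "https://analytics.vendor.example/collect"  # illustrative
BLOCKED_FIELDS = {"ssn", "account_number"}  # fields the organization never forwards


@app.post("/vendor-analytics")
def forward_analytics():
    payload = request.get_json(silent=True) or {}

    # Monitor and redact before anything leaves the organization's boundary.
    redacted = {k: v for k, v in payload.items() if k.lower() not in BLOCKED_FIELDS}
    app.logger.info("Forwarding analytics event with keys: %s", sorted(redacted))

    resp = requests.post(VENDOR_ANALYTICS_URL, json=redacted, timeout=5)
    return jsonify({"forwarded": True, "vendor_status": resp.status_code})


if __name__ == "__main__":
    app.run(port=8080)

In practice, the same role is often played by an existing reverse proxy or API gateway; the point is simply that the organization, not the vendor script, decides what leaves the browser-facing boundary.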
Figure 4: JavaScript embedding and analytics traffic flow

Conclusion

This article explored three real-world scenarios that highlight the security challenges organizations face when integrating vendor software into hybrid environments. Each use case demonstrates how seemingly routine technical decisions — such as software updates, webhook validation, or JavaScript embedding — can introduce vulnerabilities if not carefully addressed. The solutions presented are not just theoretical best practices but tactical architectural choices that organizations can adopt to secure these less-talked-about but common integration challenges.

By Dipankar Saha
AI-Assisted Software Engineering With OOPS, SOLID Principles, and Documentation

Top-down and bottom-up are two problem-solving approaches for dividing and conquering a problem.

What Is the Top-Down Approach?

Take any problem and break it down until you are in a position to schedule it on a machine, OS, SDK, or software system.

What Is the Bottom-Up Approach?

You have pluggable solutions or building blocks that you already know — machines, operating systems, SDKs — and, given a problem, you put those building blocks together to form a complete solution.

How Did OOPS Promote the Bottom-Up Approach in Software?

With OOPS, we started building more and more software building blocks that could be reused and combined. Many libraries providing reusable components emerged from OOPS languages like Java, C++, etc.

Did OOPS and the Bottom-Up Approach Solve the Problem?

Yes, partially. Most requirements still come to us from the top (i.e., from a business perspective), and we have to solve them using a top-down approach, stitching components together to meet the requirements. The business landscape keeps changing. We need reusable components, but we also need them plugged into our system to solve business problems. The business identifies "needs," which are converted into "user stories," and these use cases are converted again into "requirement specifications." As the saying goes, one need turns into 100 user stories, and 100 user stories turn into 1,000 requirements. The needs also keep changing over time, thereby changing the user stories and requirement specifications as well. This breakdown is still top-down. Here's a practical example:

Business Need

"Improve customer retention by enabling a personalized shopping experience."

User Stories

As a returning customer, I want to see product recommendations based on my previous purchases so I can find relevant products faster.
As a logged-in user, I want to see my recently viewed products on the homepage so I can resume shopping easily.
As a customer, I want to save items to a wishlist so I can revisit them later.

Technical Specification (for Story #1)

API: GET /api/recommendations/{userId}
Data source: Purchase history, user behavior tracking
Algorithm: Collaborative filtering or external ML service
Security: GDPR-compliant data usage and opt-out capabilities
Front-end integration: Carousel component in homepage layout

This breakdown shows how business strategy leads to concrete technical outcomes. While reusable components help implement these specifications, the requirements flow remains top-down. But what should you do if you want to slow down these landscape changes while still adapting to fast-changing business needs? To adapt to such a changing landscape, we are supposed to make use of the SOLID principles.

Single Responsibility Principle (SRP): Each module or class should do one thing well.
Open/Closed Principle (OCP): Components should be open for extension but closed for modification.
Liskov Substitution Principle (LSP): Derived types must be substitutable for their base types.
Interface Segregation Principle (ISP): Favor many small, specific interfaces over large, general-purpose ones.
Dependency Inversion Principle (DIP): High-level modules should not depend on low-level modules; both should depend on abstractions.

The Inversion of Control principle, applied when wiring components together, is often simply referred to by other names, such as dependency injection. A minimal sketch of dependency inversion for the recommendation story above follows.
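To ground the DIP point in the recommendation example above, here is a minimal, hypothetical Python sketch; the class and function names are illustrative, not from any real codebase.

Python

# Hypothetical sketch: the high-level recommendation service depends on an
# abstraction, so the concrete algorithm (collaborative filtering, an external
# ML service, etc.) can be swapped without touching the business logic.
from abc import ABC, abstractmethod


class RecommendationStrategy(ABC):
    """Abstraction that both high-level and low-level modules depend on."""

    @abstractmethod
    def recommend(self, user_id: str) -> list[str]: ...


class CollaborativeFiltering(RecommendationStrategy):
    def recommend(self, user_id: str) -> list[str]:
        # Placeholder for a real lookup based on purchase history.
        return ["product-42", "product-7", "product-3"]


class RecommendationService:
    """High-level module: knows the business rule, not the algorithm."""

    def __init__(self, strategy: RecommendationStrategy) -> None:
        self._strategy = strategy

    def homepage_carousel(self, user_id: str) -> list[str]:
        return self._strategy.recommend(user_id)[:5]  # e.g., top five items


# Dependency injection at the composition root; swapping in an external ML
# service later only changes this wiring, not RecommendationService.
service = RecommendationService(CollaborativeFiltering())
print(service.homepage_carousel("user-123"))

Slowing down churn works exactly this way: when the business changes the recommendation approach, only the injected strategy changes, which is the effect the article is after.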
While we may aspire to the highest maturity level, such as switching dependencies at runtime, we should at least be in a position to build reusable components, each having a single responsibility, each open for extension and closed for modification, and each ready to be substituted. While every SOLID principle is helpful, start with dependency injection. It is the missing piece that can help us slow down the fast-changing landscape of needs, improve our turnaround time, and introduce us to the other principles.

How to Apply AI Assistance While Doing It?

Even if an AI OS arrives tomorrow, the art of building software from reusable components won't change. AI assistance, however, should be aware of all the reusable components and libraries on the market that are well maintained and secure to use. It should also know how the components already present in the system under development work, what their responsibilities are, and how they can be extended and substituted. While addressing business needs, we should be clear about what the AI assistant requires from us. To reiterate:

A catalog of well-maintained reusable components and libraries is required.
Needs, user stories, and requirement specification documents are required.
Documentation of the current software's components, with their responsibilities, extendability, and substitutability, is required.

Are current AI assistants not already doing this? Partially. While LLMs do their best to understand the entire workspace, it is the instructions from developers that ultimately guide the AI assistant. In this post, we have simply laid out a systematic approach to leveraging an AI assistant effectively.

Conclusion

In this article, we highlighted the bottom-up and top-down approaches and how top-down changes cascade through needs, user stories, and requirements; why we adopt OOPS principles; why we rely on SOLID principles — especially dependency injection and depending on abstractions — to build software bottom-up; and how to extend these methods with AI by preparing the documents that align an assistant with developer intent.

By Narendran Solai Sridharan
What Is Agent Observability? Key Lessons Learned

Agents are proliferating like wildfire, yet there is a ton of confusion surrounding foundational concepts such as agent observability. Is it the same as AI observability? What problem does it solve, and how does it work? Fear not, we'll dive into these questions and more. Along the way, we will cite specific user examples as well as our own experience in pushing a customer-facing AI agent into production. By the end of this article, you will understand:

How the agent observability category is defined
The benefits of agent observability
The critical capabilities required for achieving those benefits
Best practices from real data + AI teams

What Is an Agent?

Anthropic defines an agent as "LLMs autonomously using tools in a loop." I'll expand on that definition a bit. An agent is an AI equipped with a set of guiding principles and resources, capable of a multi-step decision and action chain to produce a desired outcome. These resources often consist of access to databases, communication tools, or even other sub-agents (if you are using a multi-agent architecture).

What is an agent? A visual guide to the agent lifecycle. Image courtesy of the author.

For example, a customer support agent may:

Receive a user inquiry regarding a refund on their last purchase
Create and escalate a ticket
Access the relevant transaction history in the data warehouse
Access the relevant refund policy chunk in a vector database
Use the provided context and instructional prompt to formulate a response
Reply to the user

And that would just be step one in the process! The user would reply, creating another unique response and series of actions.

What Is Observability?

Observability is the ability to have visibility into a system's inputs and outputs, as well as the performance of its component parts. An analogy I like to use is a factory that produces widgets. You can test the widgets to make sure they are within spec, but to understand why any deficiencies occurred, you also need to monitor the gears that make up the assembly line (and have a process for fixing broken parts).

The broken boxes represent data products, and the gears are the components in a data landscape that introduce reliability issues (data, systems, code). Image courtesy of the author.

There are multiple observability categories. The term was first introduced by platforms designed to help software engineers or site reliability engineers reduce the time their applications are offline. These solutions are categorized by Gartner in their Magic Quadrant for Observability Platforms. Barr Moses introduced the data observability category in 2019. These platforms are designed to reduce data downtime and increase adoption of reliable data and AI. Gartner has produced a Data Observability Market Guide and given the category a benefit rating of HIGH. Gartner also projects 70% of organizations will adopt data observability platforms by 2027, an increase from 50% in 2025. And amidst these categories, you also have agent observability. Let's define it.

What Is Agent Observability?

If we combine the two definitions — what is an agent and what is observability — we get the following: Agent observability is the ability to have visibility into the performance of the inputs, outputs, and component parts of an LLM system that uses tools in a loop. It's a critical, fast-growing category — Gartner projects that 90% of companies with LLMs in production will adopt these solutions.
Agent observability provides visibility into the agent lifecycle. Image courtesy of the author.

Let's revisit our customer success agent example to further flesh out this definition. What was previously an opaque process with a user question, "Can I get a refund?" and agent response, "Yes, you are within the 30-day return window. Would you like me to email you a return label?" now might look like this:

Sample trace visualized. Image courtesy of the author.

The above image is a visualized trace, or a record of each span (unit of work) the agent took as part of its session with a user. Many of these spans involve LLM calls. As you can see in the image below, agent observability provides visibility into the telemetry of each span, including the prompt (input), completion (output), and operational metrics such as token count (cost), latency, and more. As valuable as this visibility is, what is even more valuable is the ability to set proactive monitors on this telemetry. For example, getting alerted when the relevance of the agent output drops or when the number of tokens used during a specific span starts to spike. We'll dive into more details on common features, how it works, and best practices in subsequent sections, but first, let's make sure we understand the benefits and goals of agent observability.

A Quick Note on Synonymous Categories

Terms like GenAI observability, AI observability, or LLM observability are often used interchangeably, although technically, the LLM is just one component of an agent. RAG (retrieval-augmented generation) observability refers to a similar but less narrow pattern involving AI retrieving context to inform its response. I've also seen teams reference LLMOps, AgentOps, or evaluation platforms. The labels and technologies have evolved rapidly over a short period of time, but these categorical terms can be considered roughly synonymous. For example, Gartner has produced an "Innovation Insight: LLM Observability" report with essentially the same definition. Honestly, there is no need to sweat the semantics. Whatever you or your team decide to call it, what's truly important is that you have the technology and processes in place to monitor and improve the quality and reliability of your agent's outputs.

Do You Need Agent Observability If You Use Guardrails?

The short answer is yes. Many AI development platforms, such as AWS Bedrock, include real-time safeguards, called guardrails, to prevent toxic responses. However, guardrails aren't designed to catch regressions in agent responses over time across dimensions such as accuracy, helpfulness, or relevance. In practice, you need both working together. Guardrails protect you from acute risks in real time, while observability protects you from chronic risks that appear gradually. It's similar to the relationship between data testing and anomaly detection for monitoring data quality.

Problem to Be Solved and Business Benefits

Ultimately, the goal of any observability solution is to reduce and minimize downtime. This concept for software applications was popularized by the Google Site Reliability Engineering Handbook, which defined downtime as the portion of unsuccessful requests divided by the total number of requests. Like everything in the AI space, defining a successful request is more difficult than it seems. After all, these are non-deterministic systems, meaning you can provide the same input many times and get many different outputs. Is a request only unsuccessful if it technically fails?
What about if it hallucinates and provides inaccurate information? What if the information is technically correct, but it's in another language or surrounded by toxic language? Again, it's best to avoid getting lost in semantics and pedantry. Ultimately, the goal of reducing downtime is to ensure features are adopted and provide the intended value to users. This means agent downtime should be measured based on the underlying use case. For example, clarity and tone of voice might be paramount for our customer success chatbot, but they might not be a large factor for a revenue operations agent providing summarized insights from sales calls. This also means your downtime metric should correspond to user adoption. If those numbers don't track, you haven't captured the key metrics that make your agent valuable. Most data + AI teams I talk to today are using adoption as the main proxy for agent reliability. As the space begins to mature, teams are gradually moving toward more forward-looking leading indicators such as downtime and the metrics that roll up to it, such as relevancy, latency, recall (F1), and more.

Dropbox, for example, measures agent downtime as:

Responses without a citation
More than 95% of responses having a latency greater than 5 seconds
The agent not referencing the right source at least 85% of the time (F1 > 85%)

Factual accuracy, clarity, and formatting are other dimensions, but a failure threshold isn't provided.

At Monte Carlo, our development team considers our Troubleshooting Agent to be experiencing downtime based on the metrics of semantic distance, groundedness, and proper tool usage. These are evaluated on a 0-1 scale using an LLM-as-judge methodology. Downtime in staging is defined as:

Any score under 0.5
More than 33% of LLM-as-judge evaluations, or more than two total evaluations, scoring between 0.5 and 0.8, even after an automatic retry
Groundedness tests showing the agent invents information or answers out of scope (hallucination or missing context)
The agent misusing or failing to use the required tools

Outside of adoption, agents can be evaluated across the classic business values of reducing cost, increasing revenue, or decreasing risk. In these scenarios, the cost of downtime can be quantified easily by taking the frequency and duration of downtime and multiplying them by the ROI being driven by the agent. This formula remains mostly academic at the moment since, as we've noted previously, most teams are not as focused on measuring immediate ROI. However, I have spoken to a few. One of the clearest examples in this regard is a pharmaceutical company using an agent to enrich customer records in a master data management match-merge process. They originally built their business case on reducing cost, specifically the number of records that need to be enriched by human stewards. However, while they did increase the number of records that could be automatically enriched, they also improved a large number of poor records that would have been automatically discarded as well! So the human steward workload actually increased! Ultimately, this was a good result as record quality improved; however, it does underscore how fluid and unpredictable this space remains.

How Agent Observability Works

Agent observability can be built internally by engineering teams or purchased from several vendors.
We'll save the build-vs-buy analysis for another time, but, as with data testing, some smaller teams will choose to start with an internal build until they reach a scale where a more systemic approach is required. Whether you use an internal build or a vendor platform, when you boil it down to the essentials, there are really two core components to an agent observability platform: trace visualization and evaluation monitors.

Trace Visualization

Traces, or telemetry data that describes each step taken by an agent, can be captured using an open-source SDK that leverages the OpenTelemetry (OTel) framework. Teams label key steps — such as skills, workflows, or tool calls — as spans. When a session starts, the agent calls the SDK, which captures all the associated telemetry for each span, such as model version, duration, tokens, etc. A collector then sends that data to the intended destination (we think the best practice is to consolidate within your warehouse or lakehouse source of truth), where an application can help visualize the information, making it easier to explore. One benefit of observing agent architectures is that this telemetry is relatively consolidated and easy to access via LLM orchestration frameworks, as compared to observing data architectures, where critical metadata may be spread across a half dozen systems.

Evaluation Monitors

Once you have all of this rich telemetry in place, you can monitor or evaluate it. This can be done using an agent observability platform, or sometimes the native capabilities within data + AI platforms. Teams will typically refer to the process of using AI to monitor AI (LLM-as-judge) as an evaluation. This type of monitor is well suited to evaluating the helpfulness, validity, and accuracy of the agent. This is because the outputs are typically larger text fields and non-deterministic, making traditional SQL-based monitors less effective across these dimensions. Where SQL code-based monitors really shine, however, is in detecting issues across operational metrics (system failures, latency, cost, throughput) as well as situations in which the agent's output must conform to a very specific format or rule. For example, if the output must be in the format of a US postal address, or if it must always have a citation. Most teams will require both types of monitors. In cases where either approach will produce a valid result, teams should favor code-based monitors, as they are more deterministic, explainable, and cost-effective. However, it's important to ensure your heuristic or code-based monitor is achieving the intended result. Simple code-based monitors focused on use case-specific criteria — say, output length must be under 350 characters — are typically more effective than complex formulas designed to broadly capture semantic accuracy or validity, such as ROUGE, BLEU, cosine similarity, and others. While these traditional metrics benefit from being explainable, they struggle when the same idea is expressed in different terms. Almost every data science team starts with these familiar monitors, only to quickly abandon them after a rash of false positives. A minimal sketch of combining the two monitor types follows.
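As a rough illustration of pairing deterministic, code-based checks with an LLM-as-judge score, here is a minimal, hypothetical Python sketch. The thresholds, field names, and the idea of an externally supplied judge score are assumptions for illustration, not any specific platform's API.

Python

# Hypothetical composite monitor: cheap deterministic checks run first,
# and an LLM-as-judge score (produced elsewhere) is folded into one verdict.
from dataclasses import dataclass


@dataclass
class SpanResult:
    output_text: str
    latency_ms: float
    judge_relevance: float  # 0-1 score from a separate LLM-as-judge evaluation


def code_based_checks(span: SpanResult) -> list[str]:
    """Deterministic, explainable rules tied to the use case."""
    failures = []
    if "[source:" not in span.output_text:      # output must always carry a citation
        failures.append("missing citation")
    if len(span.output_text) > 350:             # use-case-specific length rule
        failures.append("output too long")
    if span.latency_ms > 5000:                  # operational threshold
        failures.append("latency over 5s")
    return failures


def evaluate(span: SpanResult) -> dict:
    failures = code_based_checks(span)
    if span.judge_relevance < 0.5:              # softer signal from the judge
        failures.append("low relevance score")
    return {"passed": not failures, "failures": failures}


sample = SpanResult(
    output_text="Yes, you are within the 30-day window. [source: refund-policy]",
    latency_ms=1200,
    judge_relevance=0.82,
)
print(evaluate(sample))

The deterministic checks are cheap enough to run on every span, while the judge score can be sampled, which is the cost-control point discussed later in the article.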
What About Context Engineering and Reference Data?

This is arguably the third component of agent observability. It can be a bit tricky to draw a firm line between data observability and agent observability — it's probably best not to even try. This is because agent behavior is driven by the data it retrieves, summarizes, or reasons over. In many cases, the "inputs" that shape an agent's responses — things like vector embeddings, retrieval pipelines, and structured lookup tables — sit somewhere between the two worlds. Or perhaps it may be more accurate to say they all live in one world, and that agent observability MUST include data observability. This argument is pretty sound. After all, an agent can't get the right answer if it's fed wrong or incomplete context — and in these scenarios, agent observability evaluations will still pass with flying colors.

Challenges and Best Practices

It would be easy enough to generate a list of agent observability challenges teams could struggle with, but let's take a look at the most common problems teams are actually encountering. And remember, these are challenges specifically related to observing agents.

Challenge #1: Evaluation Cost

LLM workloads aren't cheap, and a single agent session can involve hundreds of LLM calls. Now imagine that for each of those calls you are also calling another LLM multiple times to judge different quality dimensions. It can add up quickly. One data + AI leader confessed to us that their evaluation cost was 10 times as expensive as the baseline agent workload. Monte Carlo's agent development team strives to maintain roughly a one-to-one workload-to-evaluation ratio.

Best Practices to Contain Evaluation Cost

Most teams will sample a percentage or an aggregate number of spans per trace to manage costs while still retaining the ability to detect performance degradations. Stratified sampling, or sampling a representative portion of the data, can be helpful in this regard. Conversely, it can also be helpful to filter for specific spans, such as those with a longer-than-average duration.

Challenge #2: Defining Failure and Alert Conditions

Even when teams have all the right telemetry and evaluation infrastructure in place, deciding what actually constitutes "failure" can be surprisingly difficult. To start, defining failure requires a deep understanding of the agent's use case and user expectations. A customer support bot, a sales assistant, and a research summarizer all have different standards for what counts as "good enough." What's more, the relationship between a bad response and its real-world impact on adoption isn't always linear or obvious. For example, if an evaluation model judges a response to be a 0.75 for clarity, is that a failure?

Best Practices for Defining Failure and Alert Conditions

Aggregate multiple evaluation dimensions. Rather than declaring a failure based on a single score, combine several key metrics — such as helpfulness, accuracy, faithfulness, and clarity — and treat them as a composite pass/fail test. This is the approach Monte Carlo takes in our agent evaluation framework for our internal agents. Most teams will also leverage anomaly detection to identify a consistent drop in scores over a period of time rather than a single (possibly hallucinated) evaluation. Dropbox, for example, leverages dashboards that track their evaluation score trends over hourly, six-hour, and daily intervals. Finally, know which monitors are "soft" and which are "hard." Some monitors should immediately trigger an alert when their threshold is breached; typically, these are more deterministic monitors evaluating an operational metric such as latency or a system failure.

Challenge #3: Flaky Evaluations

Who evaluates the evaluators? Using a system that can hallucinate to monitor a system that can hallucinate has obvious drawbacks.
The other challenge for creating valid evaluations is that, as every single person who has put an agent into production has bemoaned to me, small changes to the prompt have a large impact on the outcome. This means creating customized evaluations or experimenting with evaluations can be difficult.

Best Practices for Avoiding Flaky Evaluations

Most teams avoid flaky tests or evaluations by testing extensively in staging on golden datasets with known input-output pairs. This will typically include representative queries that have proved problematic in the past. It is also a common practice to test evaluations in production on a small sample of real-world traces with a human in the loop. Of course, LLM judges will still occasionally hallucinate. Or as one data scientist put it to me, "one in every ten tests spits out absolute garbage." He will automatically rerun evaluations for low scores to confirm issues.

Challenge #4: Visibility Across the Data + AI Lifecycle

Of course, once a monitor sends an alert, the immediate next question is always: "Why did that fail?" Getting the answer isn't easy! Agents are highly complex, interdependent systems. Finding the root cause requires end-to-end visibility across the four components that introduce reliability issues into a data + AI system: data, systems, code, and model. Here are some examples:

Data

Real-world changes and input drift. For example, if a company enters a new market and there are now more users speaking Spanish than English, this could impact the language the model was trained in.
Unavailable context. We recently wrote about an issue where the model was working as intended but the context on the root cause (in this case a list of recent pull requests made on table queries) was missing.

System

Pipeline or job failures
Any change to what tools are provided to the agent, or changes in the tools themselves
Changes to how the agents are orchestrated

Code

Data transformation issues (changing queries, transformation models)
Updates to prompts
Changes impacting how the output is formatted

Model

Platform updates its model version
Changes to which model is used for a specific call

Best Practices for Visibility Across the Data + AI Lifecycle

It is critical to consolidate telemetry from your data + AI systems into a single source of truth, and many teams are choosing the warehouse or lakehouse as their central platform. This unified view lets teams correlate failures across domains — for example, seeing that a model's relevancy drop coincided with a schema change in an upstream dataset or an updated model.

Deep Dive: Example Architecture

The image above shows the technical architecture that Monte Carlo's Troubleshooting Agent leverages to build a scalable, secure, and decoupled system that connects its existing monolithic platform to its new AI Agent stack. On the AI side, the AI Agent Service runs on Amazon ECS Fargate, which enables containerized microservices to scale automatically without managing underlying infrastructure. Incoming traffic to the AI Agent Service is distributed through a network load balancer (NLB), providing high-performance, low-latency routing across Fargate tasks. The image below is an abstracted interpretation of the Troubleshooting Agent's workflow, which leverages several specialized sub-agents. These sub-agents investigate different signals to determine the root cause of a data quality incident and report back to the managing agent, which presents the findings to the user.
Deliver Production-Ready Agents

The core takeaway I hope you walk away with is that when your agents enter production and become integral to business operations, the ability to assess their reliability becomes a necessity. Production-grade agents must be observed.

This article was co-written with Michael Segner.

By Lior Gavish
Advanced Patterns in Salesforce LWC: Reusable Components and Performance Optimization

If you’ve built Lightning Web Components (LWC) at scale, you’ve probably hit the same walls I did: duplicated logic, bloated bundles, rerenders that come out of nowhere, and components that were never meant to talk to each other but somehow ended up coupled. When I first transitioned from Aura and Visualforce to LWC, the basics felt easy: reactive properties, lifecycle hooks, and clean templates. But as our team started building enterprise-grade Salesforce apps (dozens of screens, hundreds of components), the cracks started showing. Performance dipped. Reusability turned into a myth. New devs struggled to onboard without breaking something. This article shares what helped us break that cycle: reusable component patterns, scoped events, smart caching, and render-aware design.

Why Reusability and Performance Are Critical in LWC

Salesforce isn't just a CRM anymore; it's an app platform. You're often dealing with:

Complex UIs with dynamic layouts
API-heavy backends and Apex controller logic
Strict governor limits
Teams of developers contributing across multiple sandboxes

In this kind of environment, the usual "build fast and clean later" approach doesn't work. Reusable patterns and performance principles aren't just nice to have; they're essential, especially when each render or Apex trip costs you time, limits, and UX points.

Pattern 1: Composition Over Inheritance (and Over Nesting)

We started with the common mistake of creating huge parent components that owned every little UI detail: dropdowns, modals, tables, loaders. Changes became brittle fast. Instead, we now follow strict composition rules. If a piece of UI can stand on its own (e.g., lookup-picker, pagination-control, inline-toast), it becomes its own component. No logic leaks out. Inputs are exposed via @api, outputs via CustomEvent. Example:

HTML

<!-- parent.html -->
<c-pagination-control
    current-page={page}
    total-pages={totalPages}
    onpagechange={handlePageChange}>
</c-pagination-control>

This way, our parent component never touches DOM methods or layout tricks. It just delegates and listens.

Pattern 2: Stateless Presentational Components

We borrowed a page from React and introduced what we call stateless presentational components. These components render only what they're told: no Apex calls, no wire service, no @track. They just take inputs and return markup. This helped us test faster (no mocking wire adapters), reuse components in record pages, and reduce the side effects that used to cause reactivity bugs.

Pattern 3: Event Contracts and Pub/Sub Boundaries

The LWC CustomEvent model is clean until your app grows. We started seeing cascading rerenders because a modal deep in the DOM fired an event that the app shell listened to (via window.dispatchEvent and pubsub). Messy. We introduced event contracts: each component has a known set of events it can emit or consume. No rogue dispatchEvent calls. Pub/sub boundaries are scoped to app sections, not global. We even versioned events using string names like product:updated:v2. This small process change reduced production event bugs by 40%.

Pattern 4: Conditional Rendering vs. DOM Fragmentation

If you're using conditional rendering directives (if:true or the newer lwc:if) in LWC, be careful: when the condition flips, the entire subtree is detached and destroyed. We had a dashboard that rerendered every chart from scratch when toggling a filter. CPU spikes, layout shifts, and ugly flickers. The fix? Use hidden or style.display = "none" if you just need to hide, not destroy. Reserve conditional directives for cases where the data changes significantly and you want full control.
Also, beware of uncontrolled DOM growth. One report page of ours had over 12,000 nodes due to lazy filtering logic and nested <template for:each> blocks inside loops. A quick audit and refactor brought render time down from 1.8s to under 300ms.

Pattern 5: Local Storage and Caching Wisely

Don't refetch everything on every load. For components that rely on config data (e.g., picklist values, role maps, branding info), we use a cache strategy:

SessionStorage for session-scoped values
LocalStorage for persistent feature flags or read-only config
Lightning Data Service for record-backed state

We also memoize Apex calls using a keyed map inside @wire or connectedCallback. Result: our homepage boot time dropped by 20%.

Pattern 6: Lazy Loading and Dynamic Imports

This one's still underused in the LWC world. If your component loads third-party libraries or expensive JS modules (like Chart.js or D3), use loadScript() or dynamic import() to defer loading until truly needed.

JavaScript

// loadScript comes from 'lightning/platformResourceLoader';
// CHART_JS is imported from '@salesforce/resourceUrl/<your static resource>'
connectedCallback() {
    if (!this.chartLoaded) {
        loadScript(this, CHART_JS)
            .then(() => (this.chartLoaded = true));
    }
}

We applied this to our analytics tab and shaved 600KB from the initial bundle.

Testing and Linting for Reusability

We enforce these rules in CI using:

ESLint with the LWC plugin
Jest tests for logic-heavy components
Storybook for visual regression and documentation

Our component PRs require usage examples and at least one story. That change alone made it easier for QA and business analysts to validate features early.

Final Thoughts

We didn't arrive at these patterns overnight. Each one came from a specific failure: a broken layout, an unresponsive tab, a hard-to-maintain legacy page. The thing with Salesforce LWC is that it works well for small teams and simple UIs, but as complexity grows, you need rules and patterns that scale with it. By treating components as atomic, stateless, and independently testable, and by drawing clear performance boundaries, we turned a slow, tangled UI into a platform others could build on. If you're struggling with slow loads, buggy renders, or a component mess that keeps growing, try some of these patterns. And share your own. We're all still learning what "scalable" means in Salesforce LWC.

By Lakshman Pradeep Reddy Vangala
Delta Lake 4.0 and Delta Kernel: What's New in the Future of Data Lakehouses

In data storage, the idea of the data lakehouse has transformed how organizations store and analyze data. Data lakehouses combine the low-cost, scalable storage of data lakes with the reliability and performance of data warehouses. In this space, players such as Delta Lake have emerged as strong open-source frameworks for implementing robust, ACID-compliant data lakes. Now, with the introduction of Delta Lake 4.0 and the development of Delta Kernel, the future of the lakehouse architecture is in a revolutionary transition. Brimming with features that drive performance, scaling, and interoperability, these updates aim to keep up with the increasingly dynamic data workloads of 2025 and beyond.

Evolving Data Flexibility: Variant Types and Schema Widening

Probably one of the most significant changes in Delta Lake 4.0 is the introduction of the VARIANT data type, which can store semi-structured data without a rigid schema. This is a dramatic change for developers and data engineers who deal with telemetry, clickstream, or JSON-based marketing data. Previously, semi-structured data had to be flattened or stored as strings, both of which added complexity and performance limitations. The data can now be stored in raw form as VARIANT, enabling more flexible querying and ingestion pipelines.

Going along with this is type widening, which makes evolving table schemas over time more straightforward. Field types usually need to change as data applications grow. For instance, a column may start as an integer but later need to hold larger values, meaning it has to be widened to a long type. Delta Lake 4.0 handles such changes gracefully, without rewriting entire datasets. Developers can change column types manually or let Delta Lake take care of it automatically during inserts and merges, which decreases operational overhead and maintains historical data fidelity. A small sketch of both features follows. Such innovations are an indication of a greater trend in the data world: systems need to change with the times, not oppose them.
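As a rough illustration of what VARIANT ingestion and type widening can look like from PySpark, here is a minimal, hedged sketch. It assumes a Spark 4.0 / Delta Lake 4.0 environment where the variant type and the type-widening table feature are available; the table names and the delta.enableTypeWidening property are illustrative, and exact syntax or availability may vary by platform and version.

Python

# Minimal sketch, assuming a SparkSession already configured with Delta Lake 4.0.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-4-features").getOrCreate()

# 1) Semi-structured data stored as VARIANT instead of a flattened schema or a string.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (
        event_id BIGINT,
        payload  VARIANT           -- raw JSON-like telemetry, no rigid schema
    ) USING DELTA
""")
spark.sql("""
    INSERT INTO events
    SELECT 1, PARSE_JSON('{"device": {"os": "ios"}, "clicks": 3}')
""")

# 2) Type widening: allow an INT column to grow into a BIGINT without a table rewrite.
spark.sql("CREATE TABLE IF NOT EXISTS clicks (user_id INT, total INT) USING DELTA")
spark.sql("ALTER TABLE clicks SET TBLPROPERTIES ('delta.enableTypeWidening' = 'true')")
spark.sql("ALTER TABLE clicks ALTER COLUMN total TYPE BIGINT")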
Boosting Reliability and Transactions With Coordinated Commits

As data transactions scale across organizations, transactional consistency across processes and users is of great importance. Delta Lake 4.0 introduces a groundbreaking innovation in this field with Coordinated Commits. This feature establishes a centralized commit coordination mechanism that ensures multiple users or systems updating the same Delta table remain in a synchronized state. Imagine a case where several data pipelines are updating various parts of a table from several clusters at the same time. Without coordination, there is a risk of inconsistencies and read anomalies. Coordinated Commits make sure that all changes are versioned and isolated, which introduces true multi-statement and multi-table transactional capabilities into the lakehouse context. Such a change is essential for organizations processing data in real time or running complex data transformation workflows, where data integrity is critical. It advances Delta Lake's vision of a highly concurrent, multi-user world and takes the platform a step closer to the full transactional prowess of traditional data warehouses.

Remote Interoperability: Delta Connect and the Role of Delta Kernel

In 2025, data platforms are increasingly distributed. Data practitioners want to interact with lakehouses from multiple tools and programming languages, often remotely and across multiple cloud environments. Delta Lake 4.0 introduces Delta Connect, a feature built on Spark Connect that separates the client interface from the data engine. This adds remote access to Delta tables from lightweight clients, which greatly facilitates connections from notebooks, APIs, and third-party services. Delta Connect makes it possible to write an application in Python or JavaScript that reads from and writes to Delta tables on remote Spark clusters. This flexibility enables more nimble development and provides real integration with modern cloud-native tooling.

What powers this smooth interoperation, however, is Delta Kernel. Initially introduced to unify and stabilize the core Delta table protocol, Delta Kernel now provides a collection of libraries, written in Java and Rust, that expose a clean and consistent interface to Delta tables. These libraries hide the internal complexities of partitioning, metadata processing, and deletion vectors, which makes it much simpler for external engines to natively support Delta. Projects such as Apache Flink and Apache Druid have already adopted Delta Kernel with impressive results. In Flink, with streamlined access to table metadata, Delta Sink pipelines can now start much faster. In the Rust ecosystem, delta-rs has embraced Delta Kernel to allow advanced table operations directly from Python and Rust environments. Together, Delta Connect and Delta Kernel are making Delta Lake the most accessible and engine-agnostic lakehouse offering available today. A brief sketch of connecting to Delta tables remotely follows.
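For a sense of what remote access might look like from a lightweight Python client, here is a minimal sketch using a Spark Connect-style remote session. The endpoint, bucket, and table path are placeholders, and the availability of Delta Connect depends on how the remote Spark 4.x cluster is configured.

Python

# Minimal sketch: a thin client talking to a remote Spark cluster via Spark Connect,
# reading and appending to a Delta table without a local Spark installation.
from pyspark.sql import SparkSession

# "sc://..." is an illustrative Spark Connect endpoint, not a real host.
spark = SparkSession.builder.remote("sc://spark-connect.example.com:15002").getOrCreate()

orders = spark.read.format("delta").load("s3://example-bucket/lakehouse/orders")
orders.filter("status = 'OPEN'").show(5)

# Writes go through the same remote session.
new_rows = spark.createDataFrame([(1001, "OPEN")], ["order_id", "status"])
new_rows.write.format("delta").mode("append").save("s3://example-bucket/lakehouse/orders")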
Smarter Performance: Predictive Optimization and Delta Tensor

Performance management in data lakes has always been a balancing act. Over time, small files, fragmented partitions, and metadata bloat can severely impact performance. Delta Lake addresses this with predictive optimization, a maintenance feature that automatically executes operations such as compaction according to observed workload patterns. Predictive optimization does not require data engineers to schedule OPTIMIZE or VACUUM commands manually, because it tracks the way data is queried and maintained. It performs optimizations only as needed, reducing storage costs, minimizing compute usage, and maintaining high query performance at all times. Such automation is a step toward self-healing data platforms that optimize themselves over time, much like autonomous databases.

Another invention with wide implications is Delta Tensor, a new capability focused on AI and machine learning workloads. As AI adoption soars, data scientists increasingly need to store high-dimensional data, such as vectors and tensors, directly in lakehouse tables. Delta Tensor brings support for storing multidimensional arrays in Delta tables with compact, sparse encodings. This makes Delta not only a framework for structured and semi-structured data but also a viable base for data-rich machine learning systems. As more machine learning and AI are baked into companies' core products, native support for tensor data in the data platform is a game-changer.

Conclusion

Moving through 2025, it's apparent that Delta Lake and its rapidly growing ecosystem have established a new standard for the way data is stored, processed, and operationalized. By integrating the scalability of data lakes with the reliability and performance of data warehouses, Delta Lake is transforming the landscape of modern data architecture. The adoption of Delta Lake 4.0 and Delta Kernel, whether by agile startups or global enterprises, signals a strategic move toward more intelligent, flexible, and interoperable data solutions. With increasing data volumes and changing analytical needs, these innovations are poised to become key pillars of the enterprise data platform of the future.

By Sairamakrishna BuchiReddy Karri
Human-AI Readiness

The expectations are cosmic. The investments are colossal. Amazon, Google, Meta, and Microsoft collectively spent over $251 billion on infrastructure investment to support AI in 2024, up 62% from 2023's $155 billion, and they plan to spend more than $300 billion in 2025. The prize for those who can provide "superior intelligence on tap," as some are now touting, is infinite. The AI ecosystem is exploding, with new startups and innovative offerings pouring out of global tech hubs. The technology isn't just evolving; it's erupting. The theory of AI adoption is also evolving. While everyone acknowledges that risk remains high and vigilance is necessary, concerns are shifting from Terminator-style apocalyptic fantasies to the practical realities of the global social disruption anticipated as AI's impact cascades through, well, everything. As is becoming clear, we'll be living next to and collaborating with AI interfaces in every form, from phones and smart glasses to robots and drones.

Current academic and industry research strongly supports the thesis that AI-human interaction is evolving toward collaborative teamwork rather than job displacement. The academic community has embraced the term "hybrid intelligence" to describe this phenomenon. Wharton research characterizes hybrid intelligence as "a transformative shift toward a more holistic, human-centered approach to technology and work." The World Economic Forum has introduced "collaborative intelligence" as a framework where "AI teammates will adapt and learn to achieve shared objectives with people." IBM frames this evolution as "an era of human-machine partnership that will define the modern workplace," indicating widespread recognition that superior outcomes emerge from combining human and AI strengths rather than replacing human capabilities.

Step back for a second and consider a possible future state of AI integration, in which each team in your organization is partnered with a "superior intelligence in the cloud." Teams will need to be competent at raising the right questions, considering AI responses, evaluating them against our criteria, and reaching consensus between AI and human teammates. Now consider the set of skills required to do this successfully. Preparing AI to collaborate will require training it on data that is pertinent to your business. While the generalized "knowledge on tap" model of public-facing AIs like Claude and Gemini is endlessly useful, specialized models trained on domain- and enterprise-specific data will be able to provide unique insights and combinations not apparent to humans. To collaborate well, AI interfaces will need to be dynamic, evolving into more capable partners as they experience and adapt to your team's style. To thrive in this evolving environment, organizations must embrace a new paradigm: human-AI readiness.

The Indispensable Human-AI Partnership

The current theory of AI readiness hypothesizes that machines will not replace knowledge workers, but rather enhance, automate, and rationalize their tasks. In this 'happy path' scenario, AI acts as a powerful augmentative force, applying its research capabilities, its interactive persona, and its ability to absorb, summarize, and interrogate data to aid every human endeavor.
AI can serve multiple roles for human collaborators:

A thoughtful "whiteboard," generating a dialog about possibilities and ideas
An administrative assistant, performing routine administrative tasks and enabling more time for innovation
An innovation lab, capable of producing prototype ideas, generating specifications or code, performing simulations, or conducting statistical analysis
A constructive critic, reviewing your creative output to ensure clarity and guiding it to its best presentation

As already indicated by the amount of creative content being produced with AI text, image, and video generators, in the happy-path world, AI becomes a launchpad for human creativity, guided by human intent. Whether AI is being applied in business, the arts, or the military, it has no intent; only humans can supply that. Humans retain control, driving the strategic and creative direction, selection, and refinement, while AI brings breadth of research and analysis capabilities, and the power to generate useful insights. The true power of this transformation, and the competitive advantage it confers, can only be unlocked if all employees are equipped and empowered to use AI effectively.

The challenge to this human-AI collaboration model is the widespread variation in AI literacy. Many companies find themselves in the experimentation stage of AI adoption, with limited enterprise-wide proficiency. A majority of workers (78%) want to learn to use AI more effectively, while large segments remain AI avoidant (22%) or merely AI familiar (39%), the category for those informally test-driving some AI tools but not yet integrating them into their work. Leaders, who are inundated with messaging telling them AI is coming to disrupt their business, show higher proficiency: 33% AI-literate and 30% AI-fluent. Generational and team-based gaps also exist, and teams like IT and marketing are more AI literate than sales and customer experience (CX). Very few of those surveyed believed that they were successfully integrating AI into their enterprise in a structured way.

A Structured Path to AI Maturity: Organizational Design and Strategic Transition

Achieving enterprise-wide AI enablement and realizing its full potential demands a strategic approach that goes beyond technological implementation. Enterprises that hope to gain market advantage from the strategic application of AI require a roadmap toward AI maturity, in which they integrate AI holistically across the enterprise, AI-enabling their data, infrastructure, software stack, and model selection and management, as well as the human and change-management disciplines we've discussed. Only a small fraction of firms globally (12%), labeled AI Achievers, have advanced their AI maturity enough to achieve superior growth and business transformation. For these Achievers, AI transformation is an imperative that has driven them to the highest level of urgency and commitment. Achieving AI maturity for these top performers is not defined by any single competency, but by their balanced approach to AI evolution in their enterprise. Accenture's research identifies five key success factors that distinguish AI Achievers:

Champion AI as a Strategic Priority for the Entire Organization, with Full Sponsorship from Leadership: AI Achievers are significantly more likely to have formal senior sponsorship for their AI strategies, with 83% having CEO and senior sponsorship compared to 56% of experimenters.
This executive buy-in is crucial, as strategies without it risk floundering due to competing initiatives. Bold AI strategies, even with modest beginnings, spur innovation and embed a culture of innovation across the organization. Leaders encourage experimentation and learning, implementing systems that help employees showcase innovations and seek feedback.

Invest Heavily in Talent to Get More from AI Investments: This is a critical step in bridging the "literacy gap" that holds many companies back from optimizing AI use. AI Achievers prioritize building AI literacy across their workforces, evident in the 78% of them that have mandatory AI training for most employees, from product developers to C-suite executives. This ensures that AI proficiency starts at the top and permeates the organization, making human-AI collaboration scalable. Achievers also proactively develop AI talent strategies. This systematic re-skilling and talent development is a core component of organizational design for the AI era.

Industrialize AI Tools and Teams to Create a Strong AI Core: An AI core is an operational data and AI platform that balances experimentation and execution, allowing firms to productize AI applications and seamlessly integrate AI into other systems. This directly addresses the "technology gap" where businesses lack proper AI communication tools and strategies. Achievers build this core by harnessing internal and external data, ensuring its trustworthiness, and storing it in a single enterprise-grade cloud platform with appropriate usage, monitoring, and security policies. They are also more likely to develop custom machine learning applications or partner with solution providers, tapping into developer networks to swiftly productionize and scale successful pilots. This industrialization ensures that AI isn't siloed but becomes a fundamental part of the business's operational systems.

Design AI Responsibly, from the Start: With the increasing deployment of AI, adhering to laws, regulations, and ethical norms is critical for building a sound data and AI foundation. AI Achievers prioritize being "responsible by design," proactively integrating ethical frameworks and clear usage policies from the outset. This commitment ensures that AI systems are developed and deployed with good intentions, empower employees, fairly impact customers and society, and engender trust. Organizations that demonstrate high-quality, trustworthy, and "regulation-ready" AI systems gain a significant competitive advantage, attracting and retaining customers while building investor confidence. This is crucial for navigating the "systems gap" by building trust and mitigating risks.

Prioritize Long- and Short-Term AI Investments: Achievers understand that the AI investment journey has no finish line and continuously increase their spending on data and AI. They plan to dedicate 34% of their tech budgets to AI development by 2024, up from 14% in 2018. Their investments focus on expanding the scope of AI for maximum impact and "cross-pollinating" solutions across the enterprise. This sustained investment ensures that the organization remains at the cutting edge, continuously improving its AI capabilities and fostering a culture of long-term innovation.

These success factors collectively form a comprehensive roadmap for enterprise-wide AI adoption.
The roadmap involves clear actions such as assigning AI business drivers, educating leadership, engaging employees through interactive sessions, showcasing early wins, launching tailored AI onboarding programs, promoting continuous learning, and creating acceptable-usage guidelines and policies. By standardizing tools, training, and processes, businesses can ensure that innovation up-levels all teams, not just a few departments.

Redefining Job Duties and Human-AI Interaction

As AI becomes deeply embedded in daily workflows, the nature of individual job duties and the very fabric of human-AI interaction will evolve. For technical professionals, understanding the nuances of how users engage with AI-infused systems is paramount for successful implementation and adoption. Because AI systems are probabilistic and continue to learn, they can exhibit unpredictable or inconsistent behaviors, potentially leading to confusion, distrust, or even safety issues. Designing for effective human-AI interaction is therefore crucial to ensure that people can understand, trust, and effectively engage with AI. By adhering to established human-AI interaction guidelines, technical professionals can design and deploy AI solutions that are not only powerful but also user-centric, fostering effective human-AI collaboration in every role. This shifts job duties away from manual execution and toward strategic oversight, creative direction, and problem-solving augmented by AI's capabilities.
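One common way to design that trust in, rather than bolt it on, is to act on model output automatically only when confidence is high and to route everything else to a person. The sketch below is a hypothetical illustration of that human-in-the-loop checkpoint; classify_with_confidence, Prediction, and REVIEW_THRESHOLD are assumed names, and the trivial rule inside the classifier exists only so the example runs.

```python
# Hypothetical human-in-the-loop routing: act automatically only when the model is
# confident; otherwise queue the item for human review. All names are illustrative.
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.80  # assumed cutoff; a real value would come from evaluation data


@dataclass
class Prediction:
    label: str
    confidence: float  # 0.0 to 1.0


def classify_with_confidence(text: str) -> Prediction:
    # Stand-in for a real model; a trivial keyword rule so the example is runnable.
    is_refund = "refund" in text.lower()
    return Prediction("refund_request" if is_refund else "other",
                      0.95 if is_refund else 0.55)


def handle(text: str) -> str:
    pred = classify_with_confidence(text)
    if pred.confidence >= REVIEW_THRESHOLD:
        return f"auto-routed as {pred.label}"
    # Low confidence: defer to a human so that trust and safety are preserved.
    return "queued for human review"


if __name__ == "__main__":
    print(handle("Customer asks for a refund on order 1234"))
    print(handle("Ambiguous note with no clear intent"))
```

The right threshold, and what "human review" looks like, depend on the risk of the task; the structural point is that the hand-off to a person is part of the design, which is exactly the shift toward oversight described above.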

Conclusion

The journey to human-AI readiness is a strategic imperative for every organization. It is a long-term shift that requires proactive planning, incremental adjustments, and a willingness to adapt. The future of business success lies in mastering the "art of AI maturity": integrating cutting-edge technology with thoughtful strategies, robust processes, and, most importantly, an empowered, AI-literate workforce. By championing AI from the top, investing heavily in talent, industrializing AI capabilities, designing responsibly, and making sustained investments, businesses can bridge existing gaps and truly transform their operations. The goal is to create an environment where humans and AI operate as a seamless team, unlocking unprecedented levels of creativity, productivity, and innovation, and, ultimately, securing a lasting competitive advantage.

By Rick Freedman
