Optimizing External Secrets Operator Traffic

By Upasana Sharma
In Kubernetes, a Secret is an object that stores sensitive information such as a password, token, or key. One good practice for Kubernetes secret management is to use a third-party secrets store provider to manage secrets outside of the cluster and to configure pods to access those secrets. Plenty of such third-party solutions are available, including:

- HashiCorp Vault
- Google Cloud Secret Manager
- AWS Secrets Manager
- Azure Key Vault

These third-party solutions, a.k.a. External Secrets Managers (ESMs), implement secure storage, secret versioning, fine-grained access control, auditing, and logging.

The External Secrets Operator (ESO) is an open-source solution for securely retrieving and synchronizing secrets from the ESM. The secrets retrieved from the ESM are injected into the Kubernetes environment as native Secret objects. ESO thus enables application developers to use Kubernetes Secret objects backed by enterprise-grade external secrets managers. An ESO implementation in a Kubernetes cluster primarily requires two resources:

- ClusterSecretStore, which specifies how to access the External Secrets Manager
- ExternalSecret, which specifies what data is to be fetched and stored as a Kubernetes Secret object

Secret retrieval is a one-time activity, but synchronization of secrets generates traffic at regular intervals. So it's important to follow the best practices listed below, which can optimize ESO traffic to the external secrets management systems.

Defining Refresh Interval for the ExternalSecret Object

Long-lived static secrets pose a security risk that can be addressed by adopting a secret rotation policy. Each time a secret gets rotated in the ESM, the change should be reflected in the corresponding Kubernetes Secret object. ESO supports automatic secret synchronization for such situations: secrets get synchronized after a specified time frame, called the "refresh interval," which is part of the ExternalSecret resource definition. It is advisable to opt for an optimum refresh interval value; e.g., a secret that's not likely to get modified often can have a refresh interval of one day instead of one hour or a few minutes. Remember, the more aggressive the refresh interval, the more traffic it will generate.
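For reference, here is a minimal sketch of where the attribute lives in an ExternalSecret definition; the names, paths, and the 24-hour value are illustrative assumptions, not settings taken from the article:

YAML

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: example-external-secret        # hypothetical name
spec:
  refreshInterval: "24h"               # sync once a day instead of every few minutes
  secretStoreRef:
    name: vault-backend                # assumed ClusterSecretStore name
    kind: ClusterSecretStore
  target:
    name: example-k8s-secret           # the native Secret object ESO keeps in sync
  data:
    - secretKey: password
      remoteRef:
        key: example/path              # assumed path in the external secrets manager
        property: password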
Defining Refresh Interval for the ClusterSecretStore Object

The refresh interval defined in the ClusterSecretStore (CSS) is the frequency with which the CSS validates itself against the ESM. If the refresh interval is not specified when defining a CSS object, the default refresh interval (which is specific to the ESM API implementation) is used. The default CSS refresh interval has been found to be a very aggressive value; i.e., the interaction with the ESM happens very frequently. For example, in the description of a sample CSS (with HashiCorp Vault as the ESM) that has no refresh interval value in its definition, the refresh interval shows up as five minutes, implying the resource approaches the ESM every five minutes and generates avoidable traffic.

The refresh interval attribute gets missed in most CSS definitions because:

- There is a discrepancy between the default value of the refresh interval for an ExternalSecret object and that for a ClusterSecretStore object, which can inadvertently lead to an un-optimized ClusterSecretStore implementation. The default refresh interval for the ExternalSecret object is zero, which signifies that refresh is disabled; i.e., the secret never gets synchronized automatically. The default refresh interval for the ClusterSecretStore object, by contrast, is ESM-specific; e.g., it is five minutes in the HashiCorp Vault scenario cited above.
- The refresh interval attribute is not present in the prominent samples/examples on the internet (e.g., check the ClusterSecretStore documentation). One can gain insight into this attribute via the command kubectl explain clustersecretstore.spec.

The significance of defining a refresh interval for the CSS can be seen by monitoring the traffic generated by a CSS object without a refresh interval in a test cluster that does not have any ESO object.

Using Cluster-Scoped External Secrets Over Namespace-Scoped External Secrets

The first ESO release was in May 2021. Back then, the only option was the namespace-scoped ExternalSecret resource. So even if the secret stored was global, an ExternalSecret object had to be defined for each namespace. ExternalSecret objects across all namespaces would get synchronized at the defined refresh interval, thereby generating traffic; the larger the number of namespaces, the more traffic they would generate. There was a dire need for a global ExternalSecret object accessible across different namespaces. To fill this gap, the cluster-level external secret resource, ClusterExternalSecret (CES), was introduced in April 2022 (v0.5.0). Opting for ClusterExternalSecret over ExternalSecret (where applicable) avoids redundant traffic. A sample CES definition, specific to HashiCorp Vault and a Kubernetes image pull secret, is shown below:

YAML

apiVersion: external-secrets.io/v1beta1
kind: ClusterExternalSecret
metadata:
  name: "sre-cluster-ext-secret"
spec:
  # The name to be used on the ExternalSecrets
  externalSecretName: sre-cluster-es
  # This is a basic label selector to select the namespaces to deploy ExternalSecrets to.
  # You can read more about label selectors here:
  # https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#resources-that-support-set-based-requirements
  namespaceSelector:  # mandatory -- not adding this will expose the external secret
    matchLabels:
      label: try_ces
  # How often the ClusterExternalSecret should reconcile itself.
  # This decides how often to check and make sure that the ExternalSecrets exist in the matching namespaces.
  refreshTime: "10h"
  # This is the spec of the ExternalSecrets to be created
  externalSecretSpec:
    secretStoreRef:
      name: vault-backend
      kind: ClusterSecretStore
    target:
      name: sre-k8secret-cluster-es
      template:
        type: kubernetes.io/dockerconfigjson
        data:
          .dockerconfigjson: "{{ .dockersecret | toString }}"
    refreshInterval: "24h"
    data:
      - secretKey: dockersecret
        remoteRef:
          key: imagesecret
          property: dockersecret

Conclusion

By following the best practices listed above, the External Secrets Operator traffic to the External Secrets Manager can be reduced significantly.

Practical Generators in Go 1.23 for Database Pagination

By Nikita Melnikov
The recent introduction of range functions in Go 1.23 marks a significant advancement in the language's capabilities. This new feature brings native generator-like functionality to Go, opening up new possibilities for writing efficient and elegant code. In this article, we will explore range functions and demonstrate their practical application through a real-world example: paginated database queries.

As a software engineer, I've experienced the critical importance of efficient data handling, especially when working with large datasets and performance-intensive applications. The techniques discussed here have broad applications across various domains, helping to optimize system resource usage and improve overall application responsiveness.

Understanding Database Pagination

Database pagination is an essential technique for efficiently managing large datasets by retrieving data in smaller, manageable chunks. This approach helps reduce memory consumption and optimizes resource utilization, making it particularly useful in applications where handling large amounts of data without overloading system resources is crucial.

In this article, we'll use Postgres as our database, leveraging its cursors for efficient query result pagination. Here's a brief overview of the syntax:

SQL

BEGIN;
DECLARE my_unique_cursor CURSOR FOR
    SELECT ... FROM ... WHERE ... ORDER BY ...;
FETCH 42 FROM my_unique_cursor; -- returns the first 42 rows
FETCH 24 FROM my_unique_cursor; -- returns the next 24 rows
ROLLBACK;

This approach allows us to declare a unique cursor for our query and then fetch rows in batches, providing a foundation for efficient data retrieval.

Go 1.23 Generators: A New Paradigm

Go 1.23 introduces range functions, a feature that simplifies iteration over custom data types and sets. To use this functionality, we create a custom function with one of the following signatures:

Go

func(yield func() bool)
func(yield func(V) bool)
func(yield func(K, V) bool)

For a comprehensive understanding of range functions, refer to the official Go documentation.

Implementing Pagination With Go 1.23

Let's examine a practical implementation of pagination using Go 1.23's new features. We'll start with a basic database schema:

SQL

CREATE TABLE test (
    id   SERIAL PRIMARY KEY,
    text VARCHAR(255) NOT NULL
);

INSERT INTO test (text) VALUES
    ('row 0'),
    ('row 1'),
    -- ... more rows ...
    ('row 10');

Now, let's define our Paginate function:

Go

func Paginate[T any](
    ctx context.Context,
    db *sql.DB,
    query string,
    batchSize int,
    decoder Decoder[T],
) (func(func(T, error) bool), error)

Here, Decoder[T] is defined as type Decoder[T any] func(rows *sql.Rows) (T, error), allowing for type-safe decoding of database rows.
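Before looking at the full implementation, here is a minimal sketch of the supporting pieces the listings below rely on but do not show. The Entry type, DecodeEntry, NewRandomCursor, and ReadPage here are illustrative guesses at their shape, not the article's actual code; they assume the Decoder type defined above and the imports "crypto/rand", "database/sql", and "encoding/hex".

Go

// Entry mirrors one row of the test table (assumed shape).
type Entry struct {
	ID   int
	Text string
}

// DecodeEntry is a Decoder[Entry] that scans a single row.
func DecodeEntry(rows *sql.Rows) (Entry, error) {
	var e Entry
	err := rows.Scan(&e.ID, &e.Text)
	return e, err
}

// NewRandomCursor returns a unique cursor name for the session.
func NewRandomCursor() (string, error) {
	var b [8]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	return "cursor_" + hex.EncodeToString(b[:]), nil
}

// ReadPage executes one FETCH, decodes every returned row, and reports
// whether the cursor may still have more rows (ok == false once a fetch
// comes back empty or an error occurs).
func ReadPage[T any](fetch func() (*sql.Rows, error), decoder Decoder[T]) ([]T, bool, error) {
	rows, err := fetch()
	if err != nil {
		return nil, false, err
	}
	defer rows.Close()

	var page []T
	for rows.Next() {
		item, err := decoder(rows)
		if err != nil {
			return nil, false, err
		}
		page = append(page, item)
	}
	if err := rows.Err(); err != nil {
		return nil, false, err
	}
	return page, len(page) > 0, nil
}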
The implementation of Paginate demonstrates the power of Go 1.23's range functions:

Go

func Paginate[T any](
	ctx context.Context,
	db *sql.DB,
	query string,
	batchSize int,
	decoder Decoder[T],
) (func(func(T, error) bool), error) {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return nil, fmt.Errorf("error starting transaction: %w", err)
	}

	cursor, err := NewRandomCursor()
	if err != nil {
		return nil, fmt.Errorf("error generating cursor: %w", err)
	}

	return func(yield func(T, error) bool) {
		defer func() { _ = tx.Rollback() }()

		_, err = tx.ExecContext(ctx, fmt.Sprintf("DECLARE %s CURSOR FOR %s", cursor, query))
		if err != nil {
			log.Printf("Error declaring cursor: %v", err)
			return
		}

		for {
			page, ok, err := ReadPage[T](
				func() (*sql.Rows, error) {
					return tx.QueryContext(ctx, fmt.Sprintf("FETCH %d FROM %s", batchSize, cursor))
				},
				decoder,
			)
			if err != nil {
				var unit T
				yield(unit, err)
			}

			for _, row := range page {
				if !yield(row, nil) {
					return
				}
			}

			if !ok {
				break
			}
		}
	}, nil
}

This implementation efficiently manages database transactions, cursor handling, and error management while utilizing Go 1.23's new range functions.

Practical Application

To demonstrate the practical use of our pagination function, consider the following main function:

Go

func main() {
	// Assume proper context and database setup.
	query := "SELECT id, text FROM test ORDER BY id"
	batchSize := 2

	pagination, err := Paginate[Entry](ctx, db, query, batchSize, DecodeEntry)
	if err != nil {
		log.Fatal(err)
	}

	for row, err := range pagination {
		if err != nil {
			log.Printf("Error fetching rows: %v", err)
			return
		}
		log.Printf("Row: %v", row)
	}
}

Executing this code yields results that clearly illustrate the pagination process:

Plain Text

Query 1
Row: {1 row 0}
Row: {2 row 1}
Query 2
Row: {3 row 2}
Row: {4 row 3}
... subsequent queries and rows ...

Conclusion

The introduction of range functions in Go 1.23 represents a significant step forward in the language's evolution. This feature enables developers to create more efficient and readable code, particularly when dealing with large datasets and complex data structures. In the context of database pagination, as demonstrated in this article, range functions allow for the creation of flexible and reusable solutions that can significantly enhance performance and resource management. This is particularly crucial in fields like fintech, where handling large volumes of data efficiently is often a key requirement.

The approach outlined here not only improves performance but also contributes to better code maintainability and readability. As Go continues to evolve, embracing features like range functions will be crucial for developers aiming to write more expressive and performant code. I encourage fellow developers to explore the full potential of Go 1.23's range functions in their projects. The possibilities for optimization and improved code structure are substantial and could lead to significant advancements in how we handle data-intensive operations.

Trend Report

Kubernetes in the Enterprise

In 2014, Kubernetes' first commit was pushed to production. And 10 years later, it is now one of the most prolific open-source systems in the software development space. So what made Kubernetes so deeply entrenched within organizations' systems architectures? Its promise of scale, speed, and delivery, that is — and Kubernetes isn't going anywhere any time soon. DZone's fifth annual Kubernetes in the Enterprise Trend Report dives further into the nuances and evolving requirements for the now 10-year-old platform. Our original research explored topics like architectural evolutions in Kubernetes, emerging cloud security threats, advancements in Kubernetes monitoring and observability, the impact and influence of AI, and more, results from which are featured in the research findings. As we celebrate a decade of Kubernetes, we also look toward ushering in its future, discovering how developers and other Kubernetes practitioners are guiding the industry toward a new era. In the report, you'll find insights like these from several of our community experts; these practitioners guide essential discussions around mitigating the Kubernetes threat landscape, observability lessons learned from running Kubernetes, considerations for effective AI/ML Kubernetes deployments, and much more.


Refcard #303: API Integration Patterns
By Thomas Jardinet

Refcard #389: Threat Detection
By Sudip Sengupta

More Articles

What Is a Data Pipeline?

The efficient flow of data from one location to another — from a SaaS application to a data warehouse, for example — is one of the most critical operations in today's data-driven enterprise. After all, useful analysis cannot begin until the data becomes available. Data flow can be precarious because so many things can go wrong during transport from one system to another: data can become corrupted, it can hit bottlenecks (causing latency), or data sources may conflict and/or generate duplicates. As the complexity of the requirements grows and the number of data sources multiplies, these problems increase in scale and impact. To address these challenges, organizations are turning to data pipelines as essential solutions for managing and optimizing data flow, ensuring that insights can be derived efficiently and effectively.

How Do Data Pipelines Work?

A data pipeline is software that eliminates many manual steps from the process and enables a smooth, automated flow of data from one station to the next. It starts by defining what, where, and how data is collected. It automates the processes involved in extracting, transforming, combining, validating, and loading data for further analysis and visualization. It provides end-to-end velocity by eliminating errors and combating bottlenecks or latency. It can process multiple data streams at once. In short, it is an absolute necessity for today's data-driven enterprise.

A data pipeline views all data as streaming data, and it allows for flexible schemas. Regardless of whether it comes from static sources (like a flat-file database) or from real-time sources (such as online retail transactions), the data pipeline divides each data stream into smaller chunks that it processes in parallel, conferring extra computing power. The data is then fed into processing engines that handle tasks like filtering, sorting, and aggregating the data. During this stage, transformations are applied to clean, normalize, and format the data, making it suitable for further use. Once the data has been processed and transformed, it is loaded into a destination, such as a data warehouse, database, data lake, or another application, such as a visualization tool. Think of it as the ultimate assembly line.
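To make the assembly-line idea concrete, here is a minimal, illustrative sketch of a three-stage pipeline in Go; the record shape and the stage logic are assumptions chosen for the example, not a reference to any particular product or the article's own code.

Go

package main

import (
	"fmt"
	"strings"
)

// record is a tiny stand-in for a unit of data moving through the pipeline.
type record struct {
	ID    int
	Value string
}

// extract emits records from a source (here, a hard-coded slice).
func extract(src []record) <-chan record {
	out := make(chan record)
	go func() {
		defer close(out)
		for _, r := range src {
			out <- r
		}
	}()
	return out
}

// transform cleans and normalizes each record as it streams past.
func transform(in <-chan record) <-chan record {
	out := make(chan record)
	go func() {
		defer close(out)
		for r := range in {
			r.Value = strings.TrimSpace(strings.ToLower(r.Value))
			out <- r
		}
	}()
	return out
}

// load writes each processed record to a destination (here, stdout).
func load(in <-chan record) {
	for r := range in {
		fmt.Printf("loaded %d: %q\n", r.ID, r.Value)
	}
}

func main() {
	src := []record{{1, "  Widget "}, {2, "GADGET"}}
	load(transform(extract(src)))
}

Each stage runs concurrently and hands records to the next over a channel, which is the same extract-transform-load flow described above, just at toy scale.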
Who Needs a Data Pipeline?

While a data pipeline is not a necessity for every business, this technology is especially helpful for those that:

- Generate, rely on, or store large amounts or multiple sources of data
- Maintain siloed data sources
- Require real-time or highly sophisticated data analysis
- Store data in the cloud

Here are a few examples of who might need a data pipeline:

- E-commerce companies: To process customer transaction data, track user behavior, and deliver personalized recommendations in real time
- Financial institutions: For real-time fraud detection, risk assessment, and aggregating data for regulatory reporting
- Healthcare organizations: To streamline patient data management, process medical records, and support data-driven clinical decision-making
- Media and entertainment platforms: For streaming real-time user interactions and content consumption data to optimize recommendations and advertisements
- Telecommunications providers: To monitor network traffic, detect outages, and ensure optimal service delivery

All of these industries rely on data pipelines to efficiently manage and extract value from large volumes of data. In fact, most of the companies you interface with on a daily basis — and probably your own — would benefit from a data pipeline.

5 Components of a Data Pipeline

A data pipeline is a series of processes that move data from its source to its destination, ensuring it is cleaned, transformed, and ready for analysis or storage. Each component plays a vital role in orchestrating the flow of data, from initial extraction to final output, ensuring data integrity and efficiency throughout the process. A data pipeline is made up of five components:

1. Data Sources: These are the origins of raw data, which can include databases, APIs, file systems, IoT devices, social media, and logs. They provide the input that fuels the data pipeline.
2. Processing Engines: These are systems or frameworks (e.g., Apache Spark, Flink, or Hadoop) responsible for ingesting, processing, and managing the data. They perform operations like filtering, aggregation, and computation at scale.
3. Transformations: This is where the raw data is cleaned, normalized, enriched, or reshaped to fit a desired format or structure. Transformations help make the data usable for analysis or storage.
4. Dependencies: These are the interconnections between various stages of the pipeline, such as task scheduling and workflow management (e.g., using tools like Apache Airflow or Luigi). Dependencies ensure that each stage runs in the correct sequence, based on the successful completion of prior tasks, enabling smooth, automated data flow.
5. Destinations: These are the systems where the processed data is stored, such as data warehouses (e.g., Amazon Redshift, Snowflake), databases, or data lakes. The data here is ready for use in reporting, analytics, or machine learning models.

Data Pipeline Architecture

Data pipeline architecture is the blueprint that defines how data flows from its origins to its final destination, guiding every step in the process. It outlines how raw data is collected, processed, and transformed before being stored or analyzed, ensuring that the data moves efficiently and reliably through each stage. This architecture connects various components, like data sources, processing engines, and storage systems, working in harmony to handle data at scale. A well-designed data pipeline architecture ensures smooth, automated data flow, allowing businesses to transform raw information into valuable insights in a timely and scalable way.

For example, a real-time streaming data pipeline might be used in financial markets, where data from stock prices, trading volumes, and news feeds is ingested in real time, processed using a system like Apache Kafka, transformed to detect anomalies or patterns, and then delivered to an analytics dashboard for real-time decision-making. In contrast, a batch data pipeline might involve an e-commerce company extracting customer order data from a database, transforming it to aggregate sales by region, and loading it into a data warehouse like Amazon Redshift for daily reporting and analysis. Both architectures serve different use cases but follow the same underlying principles of moving and transforming data efficiently.

(Image: SparkDatabox, CC BY-SA 4.0, via Wikimedia Commons)

Data Pipeline vs. ETL Pipeline

You may commonly hear the terms ETL and data pipeline used interchangeably. ETL stands for Extract, Transform, and Load. ETL systems extract data from one system, transform the data, and load the data into a database or data warehouse.
Legacy ETL pipelines typically run in batches, meaning that the data is moved in one large chunk at a specific time to the target system. Typically, this occurs in regularly scheduled intervals; for example, you might configure the batches to run at 12:30 a.m. every day when system traffic is low.

By contrast, "data pipeline" is a broader term that encompasses ETL as a subset. It refers to a system for moving data from one system to another. The data may or may not be transformed, and it may be processed in real time (streaming) instead of batches. When the data is streamed, it is processed in a continuous flow, which is useful for data that needs constant updating, such as data from a sensor monitoring traffic. In addition, the data may not be loaded to a database or data warehouse. It might be loaded to any number of targets, such as an AWS bucket or a data lake, or it might even trigger a webhook on another system to kick off a specific business process.

8 Data Pipeline Use Cases

Data pipelines are essential for a wide range of applications, enabling businesses to efficiently handle, process, and analyze data in various real-world scenarios. Here are just a few use cases for data pipelines:

1. Real-Time Analytics: Stream and process live data from sources like IoT devices, financial markets, or web applications to generate real-time insights and enable rapid decision-making.
2. Data Warehousing: Ingest and transform large volumes of raw data from various sources into a structured format, then load it into a data warehouse for business intelligence and reporting.
3. Machine Learning: Automate the extraction, transformation, and loading (ETL) of data to feed machine learning models, ensuring the models are trained on up-to-date and clean datasets.
4. Customer Personalization: Process customer behavior data from e-commerce or social platforms to deliver personalized recommendations or targeted marketing campaigns in real time.
5. Log Aggregation and Monitoring: Collect, process, and analyze system logs from multiple servers or applications to detect anomalies, monitor system health, or troubleshoot issues.
6. Data Migration: Transfer data between storage systems or cloud environments, transforming the data as necessary to meet the requirements of the new system.
7. Fraud Detection: Continuously ingest and analyze transaction data to detect suspicious activity or fraudulent patterns, enabling immediate responses.
8. Compliance and Auditing: Automatically gather and process data required for regulatory reporting, ensuring timely and accurate submissions to meet compliance standards.

The Benefits of Leveraging Data Pipelines

Data pipelines provide many benefits for organizations looking to derive meaningful insights from their data efficiently and reliably. These include:

- Automation: Data pipelines streamline the entire process of data ingestion, processing, and transformation, reducing manual tasks and minimizing the risk of errors.
- Scalability: They allow organizations to handle growing volumes of data efficiently, ensuring that processing remains fast and reliable even as data increases.
- Real-Time Processing: Data pipelines can be designed to handle real-time data streams, enabling immediate insights and faster decision-making for time-sensitive applications.
- Data Quality and Consistency: Automated transformations and validation steps in the pipeline help ensure that the data is clean, consistent, and ready for analysis.
- Improved Decision-Making: With faster and more reliable data processing, pipelines enable organizations to make informed decisions based on up-to-date information.
- Cost Efficiency: By automating data flows and leveraging scalable infrastructure, data pipelines reduce the resources and time needed to process and manage large datasets.
- Centralized Data Access: Pipelines consolidate data from various sources into a single, accessible destination like a data warehouse, making it easier to analyze and use across departments.
- Error Handling and Recovery: Many pipelines are designed with fault tolerance, meaning they can detect issues, retry failed tasks, and recover from errors without disrupting the entire process.

The Challenges of Employing Data Pipelines

While data pipelines provide powerful solutions for automating data flow and ensuring scalability, they come with their own set of challenges. Building and maintaining a robust data pipeline requires overcoming technical hurdles related to data integration, performance, and reliability. Organizations must carefully design pipelines to handle the growing complexity of modern data environments, ensuring that the pipeline remains scalable, resilient, and capable of delivering high-quality data. Understanding these challenges is crucial for effectively managing data pipelines and maximizing their potential. With that in mind, here are some common challenges of employing data pipelines:

- Data Integration Complexity: Integrating data from multiple, diverse sources can be challenging, as it often involves handling different formats, structures, and protocols.
- Scalability and Performance: As data volumes grow, pipelines must be designed to scale efficiently without compromising performance or speed, which can be difficult to achieve.
- Data Quality and Consistency: Ensuring clean, accurate, and consistent data throughout the pipeline requires rigorous validation, error handling, and monitoring.
- Maintenance and Updates: Data pipelines need regular maintenance to handle changes in data sources, formats, or business requirements, which can lead to operational overhead.
- Latency in Real-Time Systems: Achieving low-latency data processing in real-time systems is technically demanding, especially when handling large volumes of fast-moving data.

Future Trends in Data Pipelines

As data pipelines continue to evolve, new trends are emerging that reflect both the growing complexity of data environments and the need for more agile, intelligent, and efficient systems. With the explosion of data volume and variety, organizations are looking to future-proof their pipelines by incorporating advanced technologies like AI, cloud-native architectures, and real-time processing. These innovations are set to reshape how data pipelines are built, managed, and optimized, ensuring they can handle the increasing demands of modern data-driven businesses while maintaining security and compliance:

- Increased Use of AI and Machine Learning: Data pipelines will increasingly leverage AI and ML for automation in data cleaning, anomaly detection, and predictive analytics, reducing manual intervention and improving data quality.
- Real-Time Streaming Pipelines: The demand for real-time analytics is driving the shift from batch processing to real-time streaming pipelines, enabling faster decision-making for time-sensitive applications like IoT, finance, and e-commerce.
- Serverless and Cloud-Native Architectures: Cloud providers are offering more serverless data pipeline services, reducing the need for managing infrastructure and allowing organizations to scale pipelines dynamically based on demand.
- DataOps Integration: The rise of DataOps, focusing on collaboration, automation, and monitoring, is improving the efficiency and reliability of data pipelines by applying DevOps-like practices to data management.
- Edge Computing Integration: As edge computing grows, data pipelines will increasingly process data closer to the source (at the edge), reducing latency and bandwidth usage, particularly for IoT and sensor-driven applications.
- Improved Data Privacy and Security: As regulations around data privacy grow (e.g., GDPR, CCPA), pipelines will increasingly incorporate stronger data encryption, anonymization, and auditing mechanisms to ensure compliance and protect sensitive information.

These trends reflect the growing sophistication and adaptability of data pipelines to meet evolving business and technological demands.

Types of Data Pipeline Solutions

There are a number of different data pipeline solutions available, and each is well-suited to different purposes. For example, you might want to use cloud-native tools if you are attempting to migrate your data to the cloud. The following list shows the most popular types of pipelines available. Note that these categories are not mutually exclusive; you might have a data pipeline that is optimized for both cloud and real-time, for example.

- Batch: Batch processing is most useful when you want to move large volumes of data at a regular interval and you do not need to move data in real time. For example, it might be useful for integrating your marketing data into a larger system for analysis.
- Real-Time: These tools are optimized to process data in real time. Real-time processing is useful when you are processing data from a streaming source, such as data from financial markets or telemetry from connected devices.
- Cloud-Native: These tools are optimized to work with cloud-based data, such as data from AWS buckets. They are hosted in the cloud, allowing you to save money on infrastructure and expert resources because you can rely on the infrastructure and expertise of the vendor hosting your pipeline.
- Open Source: These tools are most useful when you need a low-cost alternative to a commercial vendor and you have the expertise to develop or extend the tool for your purposes. Open-source tools are often cheaper than their commercial counterparts but require expertise to use, because the underlying technology is publicly available and meant to be modified or extended by users.

Building vs. Buying: Choosing the Right Data Pipeline Solution

Okay, so you're convinced that your company needs a data pipeline. How do you get started?

Building a Data Pipeline

You could hire a team to build and maintain your own data pipeline in-house. Here's what it entails:

- Developing a way to monitor incoming data (whether file-based, streaming, or something else)
- Connecting to and transforming data from each source to match the format and schema of its destination
- Moving the data to the target database/data warehouse
- Adding and deleting fields and altering the schema as company requirements change
- Making an ongoing, permanent commitment to maintaining and improving the data pipeline

Count on the process being costly in terms of both resources and time.
You'll need experienced (and thus expensive) personnel, either hired or trained, and pulled away from other high-value projects and programs. It could take months to build, incurring significant opportunity costs. Lastly, it can be difficult to scale these types of solutions because you need to add hardware and people, which may be out of budget.

Buying a Data Pipeline Solution

A simpler, more cost-effective solution is to invest in a robust data pipeline. Here's why:

- You get immediate, out-of-the-box value, saving you the lead time involved in building an in-house solution.
- You don't have to pull resources from existing projects or products to build or maintain your data pipeline.
- If or when problems arise, you have someone you can trust to fix the issue, rather than having to pull resources off of other projects or failing to meet an SLA.
- It gives you an opportunity to cleanse and enrich your data on the fly.
- It enables real-time, secure analysis of data, even from multiple sources simultaneously, by storing the data in a cloud data warehouse.
- You can visualize data in motion.
- You get peace of mind from enterprise-grade security and a 100% SOC 2 Type II, HIPAA, and GDPR compliant solution.
- Schema changes and new data sources are easily incorporated.
- Built-in error handling means data won't be lost if loading fails.

Conclusion

Data pipelines have become essential tools for organizations seeking to maximize value from their data assets. These automated systems streamline data flow from source to destination, offering benefits such as scalability, real-time processing, and improved decision-making. While they present challenges like integration complexity and maintenance needs, the advantages far outweigh the drawbacks for data-driven businesses.

When considering a data pipeline solution, organizations must weigh the pros and cons of building in-house versus investing in pre-built solutions. While in-house development offers customization, pre-built solutions provide immediate value and scalability without the resource drain of ongoing maintenance.

As data continues to grow in importance, it's crucial that your organization takes the time to assess its data management needs. Explore the various data pipeline solutions available and consider how they align with your business goals. By implementing an effective data pipeline, you can transform raw data into a powerful driver of business insights and competitive advantage in our increasingly data-centric world.

By Garrett Alley
Probabilistic Graphical Models: A Gentle Introduction

What Are Probabilistic Graphical Models?

Probabilistic models represent complex systems by defining a joint probability distribution over multiple random variables, effectively capturing the uncertainty and dependencies within the system. However, as the number of variables increases, the joint distribution grows exponentially, making it computationally infeasible to handle directly. Probabilistic graphical models (PGMs) address this challenge by leveraging the conditional independence properties among variables and representing them using graph structures. These graphs allow for a more compact representation of the joint distribution, enabling the use of efficient graph-based algorithms for both learning and inference. This approach significantly reduces computational complexity, making PGMs a powerful tool for modeling complex, high-dimensional systems.

PGMs are extensively used in diverse domains such as medical diagnosis, natural language processing, causal inference, computer vision, and the development of digital twins. These fields require precise modeling of systems with many interacting variables, where uncertainty plays a significant role [1-3].

Definition: "Probabilistic graphical models (PGM) is a technique of compactly representing a joint distribution by exploiting dependencies between the random variables." [4]

This definition might seem complex at first, but it can be clarified by breaking down the core elements of PGMs:

Model

A model is a formal representation of a system or process, capturing its essential features and relationships. In the context of PGMs, the model comprises variables that represent different aspects of the system and the probabilistic relationships among them. This representation is independent of any specific algorithm or computational method used to process the model. Models can be developed using various techniques:

- Learning from data: Statistical and machine learning methods can be employed to infer the structure and parameters of the model from historical data.
- Expert knowledge: Human experts can provide insights into the system, which can be encoded into the model.
- Combination of both: Often, models are constructed using a mix of data-driven approaches and expert knowledge.

Algorithms are then used to analyze the model, answer queries, or perform tasks based on this representation.

Probabilistic

PGMs handle uncertainty by explicitly incorporating probabilistic principles. Uncertainty in these models can stem from several sources:

- Noisy data: Real-world data often includes errors and variability that introduce noise into the observations.
- Incomplete knowledge: We may not have access to all relevant information about a system, leading to partial understanding and predictions.
- Model limitations: Models are simplifications of reality and cannot capture every detail perfectly. Assumptions and simplifications can introduce uncertainty.
- Stochastic nature: Many systems exhibit inherent randomness and variability, which must be modelled probabilistically.

Graphical

The term "graphical" refers to the use of graphs to represent complex systems. In PGMs, graphs are used as a visual and computational tool to manage the relationships between variables:

- Nodes: Represent random variables or their states
- Edges: Represent dependencies or relationships between variables

Graphs provide a compact and intuitive way to capture and analyze the dependencies among a large number of variables.
This graphical representation allows for efficient computation and visualization, making it easier to work with complex systems.

Preliminary Concepts

Learning, Inference, and Sampling

PGMs are powerful for exploring and understanding complex domains. Their utility lies in three key operations [1]:

- Learning: This entails estimating the parameters of the probability distribution from data. This process allows the model to generalize from observed data and make predictions about unseen data.
- Inference: Inference is the process of answering queries about the model, typically in the form of conditional distributions. It involves determining the probability of certain outcomes given observed variables, which is crucial for decision-making and understanding dependencies within the model.
- Sampling: Sampling refers to the ability to draw samples from the probability distribution defined by the graphical model. This is important for tasks like simulation, approximation, and exploring the distribution's properties, and is also often used in approximate inference methods when exact inference is computationally infeasible.

Factors in PGMs

In PGMs, a factor is a fundamental concept used to represent and manipulate the relationships between random variables. A factor is a mathematical construct that assigns a value to each possible combination of values for a subset of random variables. This value could represent probabilities, potentials, or other numerical measures, depending on the context. The scope of a factor is the set of variables it depends on.

Types of Factors

- Joint distribution: Represents the full joint probability distribution over all variables in the scope
- Conditional Probability Distribution (CPD): Provides the probability of one variable given the values of others; it is often represented as a table, where each entry corresponds to a conditional probability value
- Potential function: In the context of Markov random fields, factors represent potential functions, which assign values to combinations of variables but may not necessarily be probabilities

Operations on Factors

- Factor product: Combines two factors by multiplying their values, resulting in a new factor that encompasses the union of their scopes
- Factor marginalization: Reduces the scope of a factor by summing out (marginalizing over) some variables, yielding a factor with a smaller scope
- Factor reduction: Focuses on a subset of the factor by setting specific values for certain variables, resulting in a reduced factor

Factors are crucial in PGMs for defining and computing high-dimensional probability distributions, as they allow for efficient representation and manipulation of complex probabilistic relationships.
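To make these operations concrete, here is a small worked sketch; the factor names and scopes are chosen for illustration and are not from the article. Given two factors \phi_1(A, B) and \phi_2(B, C):

- Factor product: \psi(A, B, C) = \phi_1(A, B) \cdot \phi_2(B, C), whose scope is the union \{A, B, C\}
- Factor marginalization: \tau(A, C) = \sum_{B} \psi(A, B, C), which sums B out of the scope
- Factor reduction: fixing B = b gives \psi_{B=b}(A, C) = \psi(A, b, C), a factor over the remaining variables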
Representation in PGMs

Representation of PGMs involves two components:

- A graphical structure that encodes dependencies among variables
- Probability distributions or factors that define the quantitative relationships between these variables

The choice of representation affects both the expressiveness of the model and the computational efficiency of inference and learning.

Bayesian Networks

A Bayesian network is used to represent causal relationships between variables. It consists of a directed acyclic graph (DAG) and a set of Conditional Probability Distributions (CPDs) associated with each of the random variables [4].

Key Concepts in Bayesian Networks

- Nodes and edges: Nodes represent random variables, and directed edges represent conditional dependencies between these variables. An edge from node A to node B indicates that A is a parent of B, i.e., B is conditionally dependent on A.
- Acyclic nature: The graph is acyclic, meaning there are no cycles, ensuring that the model represents a valid probability distribution.
- Conditional Probability Distributions (CPDs): In a Bayesian network, each node Xi has an associated CPD that defines the probability of Xi given its parents in the graph. These CPDs quantify how each variable depends on its parent variables. The overall joint probability distribution can then be decomposed into a product of these local CPDs.
- Conditional independence: The structure of the graph encodes conditional independence assumptions. Specifically, a node Xi is conditionally independent of its non-descendants given its parents. This assumption allows for the decomposition of the joint probability distribution into a product of conditional distributions.

This factorization allows the complex joint distribution to be efficiently represented and computed by leveraging the network's graphical structure.

Common Structures in Bayesian Networks

To better grasp how a directed acyclic graph (DAG) captures dependencies between variables, it's essential to understand some of the common structural patterns in Bayesian networks. These patterns influence how variables are conditionally independent or dependent, shaping the flow of information within the model. By identifying these structures, we can gain insights into the network's behavior and make more efficient inferences. The following table summarizes common structures in Bayesian networks and explains how they influence the conditional independence or dependence of variables [5, 6].

Example

As a motivating example, consider an email spam classification model where each feature, Xi, encodes whether a particular word is present, and the target, y, indicates whether the email is spam. To classify an email, we need to compute the joint probability distribution, P(Xi, y), which models the relationship between the features (words) and the target (spam status).

Figure 1 (below) illustrates two Bayesian network representations for this classification task. The network on the left represents Bayesian logistic regression, which models the relationship between the features and y in the most general form. This model captures potential dependencies between words and how they collectively influence the probability that an email is spam. In contrast, the network on the right shows the Naive Bayes model, which simplifies the problem by making a key assumption: the presence of each word in an email is conditionally independent of the presence of other words, given whether the email is spam or not. This conditional independence assumption reduces the model's complexity, as it requires far fewer parameters than a fully general model like Bayesian logistic regression.

Figure 1: Bayesian Networks
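To spell out the parameter savings hinted at above (the notation is assumed for illustration, not taken from the article): a Bayesian network factorizes the joint distribution as

P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}(X_i)),

where Pa(X_i) denotes the parents of X_i. Under the Naive Bayes assumption in the spam example, each word X_i has only y as its parent, so

P(y, X_1, \dots, X_n) = P(y) \prod_{i=1}^{n} P(X_i \mid y),

which requires on the order of n conditional distributions instead of a table that grows exponentially in n.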
Dynamic Bayesian Network (DBN)

A Dynamic Bayesian Network (DBN) is an extension of a Bayesian network that models sequences of variables over time. DBNs are particularly useful for representing temporal processes, where the state of a system evolves over time. A DBN consists of the following components:

- Time slices: Each time slice represents the state of the system at a specific point in time. Nodes in a time slice represent variables at that time, and edges within the slice capture dependencies at that same time.
- Temporal dependencies: Edges between nodes in successive time slices represent temporal dependencies, showing how the state of the system at one time step influences the state at the next time step. These dependencies allow the DBN to capture the dynamics of the system as it progresses through time.

DBNs combine intra-temporal dependencies (within a time slice) and inter-temporal dependencies (across time slices), allowing them to model complex temporal behaviours effectively. This dual structure is useful in applications like speech recognition, bioinformatics, and finance, where past states strongly influence future outcomes.

In DBNs, we often make use of the Markov assumption and time invariance to simplify the model's complexity while maintaining its predictive power.

The Markov assumption simplifies the DBN by assuming that the state of the system at time t + 1 depends only on the state at time t, ignoring any earlier states. This assumption reduces the complexity of the model by focusing only on the most recent state, making it computationally more feasible.

Time invariance implies that the dependencies between variables and the conditional probability distributions remain consistent across time slices. This means that the structure of the DBN and the parameters associated with each conditional distribution do not change over time. This assumption greatly reduces the number of parameters that need to be learned, making the DBN more tractable.

DBN Structure

A dynamic Bayesian network is represented by a combination of two Bayesian networks:

- An initial Bayesian network (BN0) over the initial state variables, which models the dependencies among the variables at the initial time slice (time 0). This network specifies the distribution over the initial states of the system.
- A two-time-slice Bayesian network (2TBN), which models the dependencies between the variables in two consecutive time slices. This network models the transition dynamics from time t to time t + 1, encoding how the state of the system evolves from one time step to the next.

Example: Consider a DBN over three variables (X1, X2, X3) evolving across time slices. This structure highlights both intra-temporal and inter-temporal dependencies: the relationships between the variables across different time slices, as well as the dependencies within the initial time slice.

Figure 2: DBN Representation

Hidden Markov Models

A Hidden Markov Model (HMM) is a simpler special case of a DBN and is widely used in various fields such as speech recognition, bioinformatics, and finance. While DBNs can model complex relationships among multiple variables, HMMs focus specifically on scenarios where the system can be represented by a single hidden state variable that evolves over time.

Markov Chain Foundation

Before delving deeper into HMMs, it is essential to understand the concept of a Markov chain, which forms the foundation for HMMs. A Markov chain is a mathematical model that describes a system that transitions from one state to another in a chain-like process. It is characterized by the following properties [7]:

- States: The system is in one of a finite set of states at any given time.
- Transition probabilities: The probability of transitioning from one state to another is determined by a set of transition probabilities.
- Initial state distribution: The probabilities associated with starting in each possible state at the initial time step.
- Markov property: The future state of the system depends only on the current state and not on the sequence of states that preceded it.

A Hidden Markov Model (HMM) extends the concept of a Markov chain by incorporating hidden states and observable emissions. While Markov chains directly model the transitions between states, HMMs are designed to handle situations where the states themselves are not directly observable; instead, we observe some output that is probabilistically related to these states. The key components of an HMM include:

- States: The different conditions or configurations that the system can be in at any given time. Unlike in a Markov chain, these states are hidden, meaning they are not directly observable.
- Observations: Each state generates an observation according to a probability distribution. These observations are the visible outputs that we can measure and use to infer the hidden states.
- Transition probabilities: The probability of moving from one state to another between consecutive time steps. These probabilities capture the temporal dynamics of the system, similar to those in a Markov chain.
- Emission probabilities: The probability of observing a particular observation given the current hidden state. This links the hidden states to the observable data, providing a mechanism to relate the underlying system behaviour to the observed data.
- Initial state distribution: The probabilities associated with starting in each possible hidden state at the initial time step.

Figure 3: Markov Chain vs Hidden Markov Model

An HMM can be visualized as a simplified version of a DBN with one hidden state variable and observable emissions at each time step. In essence, an HMM is designed to handle situations where the states of the system are hidden, but the observable data provides indirect information about the underlying process. This makes HMMs powerful tools for tasks like speech recognition, where the goal is to infer the most likely sequence of hidden states (e.g., phonemes) from a sequence of observed data (e.g., audio signals).

Markov Networks

While Bayesian networks are directed graphical models used to represent causal relationships, Markov networks, also known as Markov random fields, are undirected probabilistic graphical models. They are particularly useful when relationships between variables are symmetric or when cycles are present, as opposed to the acyclic structure required by Bayesian networks. Markov networks are ideal for modelling systems with mutual interactions between variables, making them popular in applications such as image processing, social networks, and spatial statistics.

Key Concepts in Markov Networks [5, 6]

Undirected Graphical Structure

In a Markov network, the relationships between random variables are represented by an undirected graph. Each node represents a random variable, while each edge represents a direct dependency or interaction between the connected variables. Since the edges are undirected, they imply that the relationship between the variables is symmetric — unlike Bayesian networks, where the edges indicate directed conditional dependencies.

Factors and Potentials

Instead of using Conditional Probability Distributions (CPDs) like Bayesian networks, Markov networks rely on factors or potential functions to describe the relationships between variables. A factor is a function that assigns a non-negative real number to each possible configuration of the variables involved.
These factors quantify the degree of compatibility between different states of the variables within a local neighbourhood or clique in the graph.

Cliques in Markov Networks

A clique is a subset of nodes in the graph that are fully connected. Cliques capture the local dependencies among variables: within a clique, the variables are not independent, and their joint distribution cannot be factored further. In Markov networks, potential functions are defined over cliques, capturing the joint compatibility of the variables in these fully connected subsets. The simplest cliques are pairwise cliques (two connected nodes), but larger cliques can also be defined in more complex Markov networks.

Markov Properties

The graph structure of a Markov network encodes various Markov properties, which dictate the conditional independence relationships among the variables:

- Pairwise Markov property: Two non-adjacent variables are conditionally independent given all other variables. Formally, for nodes X and Y, if they are not connected by an edge, they are conditionally independent given the rest of the nodes.
- Local Markov property: A variable is conditionally independent of all other variables in the graph given its neighbors (the variables directly connected to it by an edge). This reflects the idea that the dependency structure of a variable is fully determined by its local neighborhood in the graph.
- Global Markov property: Any two sets of variables are conditionally independent given a separating set. If a set of nodes separates two other sets of nodes in the graph, then the two sets are conditionally independent given the separating set.

Example: Consider the Markov network illustrated in Figure 4. The network consists of four variables, A, B, C, and D, represented by the nodes. The edges between these nodes are labelled with factors ϕ. These factors represent the level of association or dependency between each pair of connected variables. The joint probability distribution over all variables A, B, C, and D is computed as the product of all the pairwise factors in the network, along with a normalizing constant Z, which ensures the probability distribution is valid (i.e., sums to 1).

Figure 4: Markov Network
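Written out for a pairwise network over A, B, C, and D (assuming one factor per edge, with edges A–B, B–C, C–D, and D–A; the exact edge set is an assumption based on the description above), the joint distribution takes the form

P(A, B, C, D) = \frac{1}{Z} \, \phi_1(A, B) \, \phi_2(B, C) \, \phi_3(C, D) \, \phi_4(D, A),

with the normalizing constant (partition function)

Z = \sum_{A, B, C, D} \phi_1(A, B) \, \phi_2(B, C) \, \phi_3(C, D) \, \phi_4(D, A).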
Learning and Inference

Inference and learning are two critical components of PGMs, which will be explored in a follow-up article.

Conclusion

Probabilistic graphical models represent probability distributions and capture conditional independence structures using graphs. This allows the application of graph-based algorithms for both learning and inference. Bayesian networks are particularly useful for scenarios involving directed, acyclic dependencies, such as causal reasoning. Markov networks provide an alternative, especially suited for undirected, symmetric dependencies common in image and spatial data. These models can perform learning, inference, and decision-making in uncertain environments, and find applications in a wide range of fields such as healthcare, natural language processing, computer vision, and financial modeling.

References

1. Shrivastava, H. and Chajewska, U., 2023, September. Neural graphical models. In European Conference on Symbolic and Quantitative Approaches with Uncertainty (pp. 284-307). Cham: Springer Nature Switzerland.
2. Kapteyn, M.G., Pretorius, J.V. and Willcox, K.E., 2021. A probabilistic graphical model foundation for enabling predictive digital twins at scale. Nature Computational Science, 1(5), pp. 337-347.
3. Louizos, C., Shalit, U., Mooij, J.M., Sontag, D., Zemel, R. and Welling, M., 2017. Causal effect inference with deep latent-variable models. Advances in Neural Information Processing Systems, 30.
4. Ankan, A. and Panda, A., 2015, July. pgmpy: Probabilistic Graphical Models using Python. In SciPy (pp. 6-11).
5. Koller, D., n.d. Probabilistic Graphical Models [Online course]. Coursera. Available at: https://www.coursera.org/learn/probabilistic-graphical-models?specialization=probabilistic-graphical-models (Accessed: 9 August 2024).
6. Ermon Group, n.d. CS228 notes: Probabilistic graphical models. Available at: https://ermongroup.github.io/cs228-notes (Accessed: 9 August 2024).
7. Jurafsky, D. and Martin, J.H., 2024. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models. 3rd ed. Available at: https://web.stanford.edu/~jurafsky/slp3/ (Accessed: 25 August 2024).

By Salman Khan
Backpressure in Distributed Systems

An unchecked flood will sweep away even the strongest dam. – Ancient Proverb The quote above suggests that even the most robust and well-engineered dams cannot withstand the destructive forces of an unchecked and uncontrolled flood. Similarly, in the context of a distributed system, an unchecked caller can often overwhelm the entire system and cause cascading failures. In a previous article, I wrote about how a retry storm has the potential to take down an entire service if proper guardrails are not in place. Here, I'm exploring when a service should consider applying backpressure to its callers, how it can be applied, and what callers can do to deal with it. Backpressure As the name itself suggests, backpressure is a mechanism in distributed systems that refers to the ability of a system to throttle the rate at which data is consumed or produced to prevent overloading itself or its downstream components. A system applying backpressure on its caller is not always explicit, like in the form of throttling or load shedding, but sometimes also implicit, like slowing down its own system by adding latency to requests served without being explicit about it. Both implicit and explicit backpressure intend to slow down the caller, either when the caller is not behaving well or the service itself is unhealthy and needs time to recover. Need for Backpressure Let's take an example to illustrate when a system would need to apply backpressure. In this example, we're building a control plane service with three main components: a frontend where customer requests are received, an internal queue where customer requests are buffered, and a consumer app that reads messages from the queue and writes to a database for persistence. Figure 1: A sample control plane Producer-Consumer Mismatch Consider a scenario where actors/customers are hitting the front end at such a high rate that either the internal queue is full or the worker writing to the database is busy, leading to a full queue. In that case, requests can't be enqueued, so instead of dropping customer requests, it's better to inform the customers upfront. This mismatch can happen for various reasons, like a burst in incoming traffic or a slight glitch in the system where the consumer was down for some time but now has to work extra to drain the backlog accumulated during its downtime. Resource Constraints and Cascading Failures Imagine a scenario where your queue is approaching 100% of its capacity, but it's normally at 50%. To match this increase in the incoming rate, you scale up your consumer app and start writing to the database at a higher rate. However, the database can't handle this increase (e.g., due to limits on writes/sec) and breaks down. This breakdown will take down the whole system with it and increase the Mean Time To Recover (MTTR). Applying backpressure at appropriate places becomes critical in such scenarios. Missed SLAs Consider a scenario where data written to the database is processed every 5 minutes, which another application listens to keep itself up-to-date. Now, if the system is unable to meet that SLA for whatever reason, like the queue being 90% full and potentially taking up to 10 minutes to clear all messages, it's better to resort to backpressure techniques. You could inform customers that you're going to miss the SLA and ask them to try again later or apply backpressure by dropping non-urgent requests from the queue to meet the SLA for critical events/requests. 
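As a toy illustration of the producer-consumer mismatch described above, the sketch below models the frontend as a producer writing into a bounded internal queue and explicitly rejecting requests once the queue is full. The queue size, rates, and names are made-up assumptions, not part of the original design.
Python
import queue
import threading
import time

internal_queue = queue.Queue(maxsize=5)  # small bound so the effect is easy to see

def frontend(request_id):
    """Accept a request if the buffer has room; otherwise push back on the caller."""
    try:
        internal_queue.put_nowait(request_id)
        print(f"accepted request {request_id}")
    except queue.Full:
        # Explicit backpressure: tell the caller to retry later instead of silently dropping work
        print(f"rejected request {request_id}: queue full, retry later")

def consumer():
    """Slow consumer draining the queue, e.g., a worker writing to a database."""
    while True:
        internal_queue.get()
        time.sleep(0.5)  # simulated slow database write
        internal_queue.task_done()

threading.Thread(target=consumer, daemon=True).start()
for i in range(20):   # burst of incoming traffic
    frontend(i)
    time.sleep(0.05)  # the producer is much faster than the consumer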
Backpressure Challenges Based on what's described above, it seems like we should always apply backpressure, and there shouldn't be any debate about it. As true as it sounds, the main challenge is not around if we should apply backpressure but mostly around how to identify the right points to apply backpressure and the mechanisms to apply it that cater to specific service/business needs. Backpressure forces a trade-off between throughput and stability, made more complex by the challenge of load prediction. Identifying the Backpressure Points Find Bottlenecks/Weak Links Every system has bottlenecks. Some can withstand and protect themselves, and some can't. Think of a system where a large data plane fleet (thousands of hosts) depends on a small control plane fleet (fewer than 5 hosts) to receive configs persisted in the database, as highlighted in the diagram above. The big fleet can easily overwhelm the small fleet. In this case, to protect itself, the small fleet should have mechanisms to apply backpressure on the caller. Another common weak link in architecture is centralized components that make decisions about the whole system, like anti-entropy scanners. If they fail, the system can never reach a stable state and can bring down the entire service. Use System Dynamics: Monitors/Metrics Another common way to find backpressure points for your system is to have appropriate monitors/metrics in place. Continuously monitor the system's behavior, including queue depths, CPU/memory utilization, and network throughput. Use this real-time data to identify emerging bottlenecks and adjust the backpressure points accordingly. Creating an aggregate view through metrics or observers like performance canaries across different system components is another way to know that your system is under stress and should assert backpressure on its users/callers. These performance canaries can be isolated for different aspects of the system to find the choke points. Also, having a real-time dashboard on internal resource usage is another great way to use system dynamics to find the points of interest and be more proactive. Boundaries: The Principle of Least Astonishment The most obvious things to customers are the service surface areas with which they interact. These are typically APIs that customers use to get their requests served. This is also the place where customers will be least surprised in case of backpressure, as it clearly highlights that the system is under stress. It can be in the form of throttling or load shedding. The same principle can be applied within the service itself across different subcomponents and interfaces through which they interact with each other. These surfaces are the best places to exert backpressure. This can help minimize confusion and make the system's behavior more predictable. How to Apply Backpressure in Distributed Systems In the last section, we talked about how to find the right points of interest to assert backpressure. Once we know those points, here are some ways we can assert this backpressure in practice: Build Explicit Flow Control The idea is to make the queue size visible to your callers and let them control the call rate based on that. By knowing the queue size (or any resource that is a bottleneck), they can increase or decrease the call rate to avoid overwhelming the system. This kind of technique is particularly helpful where multiple internal components work together and behave well as much as they can without impacting each other. 
The equation below can be used at any time to calculate the caller rate. Note: The actual call rate will depend on various other factors, but the equation below should give a good idea. CallRate_new = CallRate_normal * (1 - (Q_currentSize / Q_maxSize)) Invert Responsibilities In some systems, it's possible to invert the flow so that callers don't explicitly send requests to the service; instead, the service requests work when it's ready to serve. This kind of technique gives the receiving service full control over how much it can do and can dynamically change the request size based on its latest state. You can employ a token bucket strategy where the receiving service fills the token bucket, and that tells the caller when and how much it can send to the server. Here is a sample algorithm the caller can use: # Service requests work if it has capacity if Tokens_available > 0: Work_request_size = min(Tokens_available, Work_request_size_max) # Request work, up to a maximum limit send_request_to_caller(Work_request_size) # Caller sends work if it has enough tokens if Tokens_available >= Work_request_size: send_work_to_service(Work_request_size) Tokens_available = Tokens_available - Work_request_size # Tokens are replenished at a certain rate Tokens_available = min(Tokens_available + Token_Refresh_Rate, Token_Bucket_size) Proactive Adjustments Sometimes, you know in advance that your system is going to get overwhelmed soon, and you take proactive measures like asking the caller to slow down the call volume and then slowly increase it. Think of a scenario where your downstream was down and rejected all your requests. During that period, you queued up all the work and are now ready to drain it to meet your SLA. If you drain it faster than the normal rate, you risk taking down the downstream services. To address this, you proactively lower the caller's limits or engage the caller to reduce its call volume and slowly open the floodgates. Throttling Restrict the number of requests a service can serve and discard requests beyond that. Throttling can be applied at the service level or the API level. This throttling is a direct indicator of backpressure for the caller to slow down the call volume. You can take this further and do priority throttling or fairness throttling to ensure that the least impact is seen by the customers. Load Shedding Throttling means discarding requests only when you breach some predefined limits. With load shedding, customer requests can still be discarded if the service faces stress and decides to proactively drop requests it has already promised to serve. This kind of action is typically the last resort for services to protect themselves and to let the caller know about it. Conclusion Backpressure is a critical challenge in distributed systems that can significantly impact performance and stability. Understanding the causes and effects of backpressure, along with effective management techniques, is crucial for building robust and high-performance distributed systems. When implemented correctly, backpressure can enhance a system's stability, reliability, and scalability, leading to an improved user experience. However, if mishandled, it can erode customer trust and even contribute to system instability. Proactively addressing backpressure through careful system design and monitoring is key to maintaining system health. While implementing backpressure may involve trade-offs, such as potentially impacting throughput, the benefits in terms of overall system resilience and user satisfaction are substantial.
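As a runnable companion to the flow-control equation above, here is a minimal Python sketch that scales a caller's request rate down as the service's queue fills up. The function and variable names are illustrative assumptions, not part of any real API.
Python
def adjusted_call_rate(normal_rate_per_sec: float, queue_size: int, queue_max: int) -> float:
    """Scale the call rate linearly with the remaining queue headroom."""
    headroom = 1.0 - (queue_size / queue_max)
    return max(0.0, normal_rate_per_sec * headroom)

# The fuller the queue, the slower the caller should go:
# roughly 100, 75, 50, 25, and 0 calls/sec as the queue approaches capacity
for depth in (0, 250, 500, 750, 1000):
    print(depth, adjusted_call_rate(normal_rate_per_sec=100.0, queue_size=depth, queue_max=1000))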

By Rajesh Pandey
Hoisting in JavaScript

In JavaScript, hoisting is when variable or function declarations are lifted to the top of their respective scope before execution. This implies that a variable or a function can be used before it is defined. Hoisting Variables With var The behavior of hoisting can be observed with an example of a variable declared with the help of var. Consider the following code example: JavaScript console.log(myVar); // Output: undefined var myVar = 10; console.log(myVar); // Output: 10 One might naively think that the console.log(myVar) before the declaration will throw an error because myVar is not declared yet. However, due to hoisting, the declaration of myVar is moved to the top of the scope, and the preceding code will run without any error. It behaves as if the code was written like this: JavaScript var myVar; console.log(myVar); // Output: undefined myVar = 10; console.log(myVar); // Output: 10 Only the declaration var myVar is hoisted; the assignment myVar = 10 is not. Therefore, when the variable is logged the first time, it reads as undefined because, while the variable name has been declared, it has not been assigned a value. The Problem With var A problem associated with var is that the scope of its variable is not limited to the block it’s declared in. This can cause undesired behavior when using loops or other control structures, such as if statements. JavaScript if (true) { var testVar = "I am inside the block"; } console.log(testVar); // Output: "I am inside the block" Even though testVar is declared inside the “if block”, it can be used outside. As noted above, var lacks block-level scope; it has function-level or global scope instead. Hoisting With let and const To overcome the problems that come with var, JavaScript introduced let and const in ES6. These keywords are also hoisted, but with a significant difference: the variables are not initialized until their declaration statement is reached during execution. Until then, the variables are in the Temporal Dead Zone (TDZ), and reading them will cause a ReferenceError. Let’s look at an example: JavaScript console.log(myLet); // ReferenceError: Cannot access 'myLet' before initialization let myLet = 5; In this case, the declaration of myLet is hoisted to the top of the block, but it is not assigned a value until the line let myLet = 5 executes. This means that while the variable is in scope — i.e., it has been declared — it cannot be used before it's assigned a value. The same behavior applies to const: JavaScript console.log(myConst); // ReferenceError: Cannot access 'myConst' before initialization const myConst = 10; Here, the const variable is declared but not yet initialized, so if the script tries to use it before its declaration line, it will throw a ReferenceError. Block Scoping With let and const If the scope of a variable should be limited to a block, let and const are recommended. JavaScript if (true) { let blockVar = "I am inside the block"; } console.log(blockVar); // ReferenceError: blockVar is not defined In this case, you can only reference blockVar within the block in which it is declared. As soon as you try to access it outside the if block, it will result in a ReferenceError. Function Hoisting Functions in JavaScript are also hoisted; however, there is a difference between function declarations and function expressions in terms of how they are hoisted.
Function Declarations In the case of function declarations, both the name and the body of the function are hoisted to the top of the scope. As a result, you can call a function even before its declaration: JavaScript Greet(); // Output: "Hello, world!" function Greet() { console.log("Hello, world!"); } In this example, the identifier for the Greet function is hoisted to the top, allowing access to its definition at runtime. Notably, the function call occurs before the actual definition in the code, yet it works due to hoisting. Function Expressions Function expressions are not fully hoisted, meaning you may encounter an error if you attempt to call a function expression before it is defined within its scope. Specifically, while the variable to which the function is assigned is hoisted, the actual function definition is not. JavaScript sayHello(); // TypeError: sayHello is not a function var sayHello = function () { console.log("Hello!"); }; This code behaves as if it were written like this: JavaScript var sayHello; sayHello(); // TypeError: sayHello is not a function sayHello = function () { console.log("Hello!"); }; In the example above, the declaration var sayHello gets hoisted to the top of the scope. However, when we try to execute the function assigned to sayHello, since it has not yet been assigned, it results in a TypeError. Thus, you cannot call a function expression before its definition.

By Ashim Upadhaya
DZone Annual Community Survey: What's in Your 2024 Tech Stack?

Are you a software developer or other tech professional? If you’re reading this, chances are pretty good that the answer is "yes." Long story short — we want DZone to work for you! We're asking that you take our annual community survey so we can better serve you! ^^ You can also enter the drawing for a chance to receive an exclusive DZone Swag Pack! The software development world moves fast, and we want to keep up! Across our community, we found that readers come to DZone for various reasons, including to learn about new development trends and technologies, find answers to help solve problems they have, connect with peers, publish their content, and expand their personal brand's audience. In order to continue helping the DZone Community reach goals such as these, we need to know more about you, your learning preferences, and your overall experience on dzone.com and with the DZone team. For this year's DZone Community research, our primary goals are to: Learn about developer tech preferences and habits Identify content types and topics that developers want to get more information on Share this data for public consumption! To support our Community research, we're focusing on several primary areas in the survey: You, including your experience, the types of software you work on, and the tools you use How you prefer to learn and what you want to learn more about on dzone.com The ways in which you engage with DZone, your content likes vs. dislikes, and your overall journey on dzone.com As a community-driven site, our relationships with our members and contributors are invaluable, and we want to make sure that we continue to serve our audience to the best of our ability. If you're curious to see the report from the 2023 Community survey, feel free to check it out here! Thank you in advance for your participation!—Your favorite DZone Content and Community team

By Dominique Roller
Open Source: A Pathway To Personal and Professional Growth

Open source can go beyond philanthropy: it’s a gateway to exponential learning, expanding your professional network, and propelling your software engineering career to the next level. In this article, I’ll explain why contributing to open-source projects is an excellent investment and share how to begin making your mark in the community. Why Invest Time in Open Source? Great, you’re still here! That means you’re curious about the open-source world and how it can shape your future. Before diving into how to contribute, let’s discuss why it’s worth your time, especially since many of us begin contributing in our free time. Open source isn’t just a philosophy or a community-driven mindset; it’s much more than that. It’s a vibrant, advanced software industry where powerful companies and brilliant minds converge to build, innovate, and drive progress. Open Source: A Modern Pillar of Software Engineering Open source often carries the misconception of being a volunteer-driven side hustle, but that’s far from the truth. It’s a critical element of the global software industry, embraced by tech giants and startups alike. Microsoft, once an open-source skeptic, is now a staunch advocate. IBM’s acquisition of Red Hat, the largest open-source company, for 34 billion dollars highlights the industry’s power and value. While the feel-good factor of helping others is undoubtedly there, open-source is also a sophisticated, high-demand industry. Many of today’s best practices — code reviews, automated testing, software documentation, and issue tracking — trace their origins back to the open-source world. Major organizations like Microsoft, PayPal, and Adobe have adopted inner-source practices, which essentially bring open-source methodologies inside their organizations. Some of the most significant software advancements, like databases (the most popular ones are open-source) and infrastructure tools like Kubernetes, have their roots in the open-source community. Open source connects people globally through shared methodologies, cutting-edge techniques, and a mission to build better software. Open-source components are woven into the very fabric of modern software development — making it hard to imagine the tech world without them. Six Reasons To Contribute To Open Source If you’re still wondering whether it’s worth the effort, let’s explore six compelling reasons why participating in open source can boost your career and broaden your horizons: 1. Learn From the Best By diving into open-source projects, you gain access to some of the most skilled engineers in the world. Experts from companies like IBM, Google, Red Hat, and more will review your code. It’s an incredible opportunity to learn directly from leaders in the tech industry. 2. Expand Your Experience Contributing to open source provides unique experiences, allowing you to collaborate on global, distributed projects that impact the world. Whether you’re an entry-level developer seeking growth or a senior engineer honing your skills, open source offers unparalleled learning opportunities. 3. Grow Your Network Working on open-source projects connects you with professionals from diverse backgrounds and organizations. These connections can lead to new job opportunities, collaborative ventures, or even the creation of your own company. 4. Boost Communication Skills Open-source work requires more than just coding — it demands effective communication. Engaging with the community, defending proposals, and leading discussions help refine your soft skills.
It is especially relevant if you’re aiming for leadership roles like Staff Engineer or Principal Engineer, where influence and communication are key. 5. Improve Language Skills Open-source projects provide non-native English speakers an excellent opportunity to practice and improve their English skills. Moreover, contributing internationally exposes you to other languages, helping you bridge communication gaps and break the ice in global interactions. Personally, open-source has allowed me to improve my English, French, Italian, and Spanish. 6. Stand Out Professionally The best job offers often come not from searching but from being sought after. Contributing to open-source makes you a part of a minor, elite group of engineers. Out of the millions of Java developers, how many are core contributors to the Java platform itself? That number is minimal, giving you an edge in the industry. In summary, contributing to open source enhances your influence as a software engineer, gives you access to unique opportunities, and helps you realize that code is just a part of the bigger picture. How To Start Contributing Contributing to open source takes time, especially if you aim to become a committer. It takes discipline, patience, and a willingness to learn constantly. But the good news is, it’s achievable. Here are some steps to help you get started: 1. Choose a Project You’re Passionate About The first step is finding a project that excites you, whether it’s something you use at work, want to learn more about or enjoy. Open-source contributions require long-term commitment, so pick a project you won’t mind spending time on regularly. 2. Introduce Yourself Once you’ve chosen a project, join the community through mailing lists, Slack, Discord, or other platforms. Introduce yourself and express your interest in helping. 3. Observe Before diving in, take the time to understand the project’s workflow. Watch how PRs are handled, read through the comments, and familiarize yourself with the code style and community dynamics. 4. Read the Documentation The documentation provides a window into the minds of the engineers who built the project. Reading it will help you understand the project profoundly and inspire you to contribute by improving the docs, especially if you notice areas that need clarification. 5. Be a Steward, Not Just a Contributor Adding new features is exciting, but maintaining and improving existing code is just as important. Embrace your role as a project steward and focus on reducing complexity rather than adding unnecessary functionality. 6. Take on the Unloved Tasks Every project has tasks nobody wants to do, such as updating documentation, adding tests, or cleaning up old code. These contributions are invaluable and great for getting your foot in the door. 7. Contribute Beyond Code Not all contributions are code-related. You can help with tutorials, articles, workshops, or even handle social media. Open-source is more than just writing code — it’s about building a community. Recommended Projects To Start With If you’re unsure where to start, consider contributing to one of these projects: Jakarta EE MicroProfile Jakarta Data Jakarta NoSQL MicroStream These are just a few projects I’m personally involved with, and I’d be happy to guide you along the way. If you have any questions, don’t hesitate to reach out! Conclusion Open source is a game-changer — not only in terms of technology but also in the opportunities it creates. 
It has transformed my life, allowing me to travel the world, meet incredible people, and build long-lasting friendships. The open-source community has become like family, from RV trips across the US to parachuting adventures and museum visits. Open source can do the same for you. It’s more than just code; it’s about building connections, mastering new skills, and making an impact far beyond your desk. If you’re nearby or at any open-source event, let me know! I’d love to meet up and share experiences.

By Otavio Santana DZone Core CORE
Mastering the Art of Data Engineering to Support Billion-Dollar Tech Ecosystems

Data reigns supreme as the currency of innovation, and it is a valuable one at that. In the multifaceted world of technology, mastering the art of data engineering has become crucial for supporting billion-dollar tech ecosystems. This sophisticated craft involves creating and maintaining data infrastructures capable of handling vast amounts of information with high reliability and efficiency. As companies push the boundaries of innovation, the role of data engineers has never been more critical. Specialists design systems that ensure seamless data flow, optimize performance, and provide the backbone for applications and services that millions of people use. The tech ecosystem’s health lies in the capable hands of those who develop it for a living. Its growth — or collapse — depends on how proficient one is at wielding the art of data engineering. The Backbone of Modern Technology Data engineering often plays the role of an unsung hero behind modern technology's seamless functionality. It involves a meticulous process of designing, constructing, and maintaining scalable data systems that can efficiently handle data's massive inflow and outflow. These systems form the backbone of tech giants, enabling them to provide uninterrupted services to their users. Data engineering makes certain that everything runs smoothly, from e-commerce platforms processing millions of transactions per day and social media networks handling real-time updates to navigation services providing live traffic updates. Building Resilient Infrastructures One of the primary challenges in data engineering is building resilient infrastructures that can withstand failures and protect data integrity. High availability environments are essential, as even minor downtimes can lead to significant disruptions and financial losses. Data engineers employ data replication, redundancy, and disaster recovery planning techniques to create robust systems. For instance, Massively Parallel Processing (MPP) databases like IBM Netezza and AWS (Amazon Web Services) Redshift have redefined how companies handle large-scale data operations, providing high-speed processing and reliability. Leveraging Massively Parallel Processing (MPP) Databases An MPP database is a group of servers working together as one entity. The first critical component of the MPP database is how data is stored across all nodes in the cluster. A data set is split across many segments and distributed across nodes based on the table's distribution key. While it may be intuitive to split data equally on all nodes to leverage all the resources in response to user queries, there is more to storing data for performance than that — namely, data skew and process skew. Data skew occurs when data is unevenly distributed across the nodes. This means that a node carrying more data has more work to do than a node holding less data for the same user request. The slowest node in the cluster always determines the cumulative response time of the cluster. Process skew, in contrast, arises when the work is unevenly distributed across the nodes: the data a user query needs is stored on only a few nodes. Consequently, only those specific nodes work in response to the user's query, whereas the other nodes sit idle (i.e., underutilization of cluster resources). A delicate balance must be achieved between how data is stored and accessed, preventing data skew and process skew.
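To see how the choice of distribution key produces (or avoids) data skew, here is a toy Python sketch that hashes rows onto nodes. The node count, key choices, and row counts are invented for illustration and do not model any specific MPP product.
Python
from collections import Counter

NODES = 4

def node_for(key) -> int:
    """Assign a row to a node by hashing its distribution key (greatly simplified)."""
    return hash(str(key)) % NODES

# Skewed key: 80% of the rows share one customer_id, so a single node ends up with most of the data
skewed_rows = ["customer_1"] * 8000 + [f"customer_{i}" for i in range(2000)]
# Better key: a nearly unique order_id spreads rows evenly across the nodes
even_rows = [f"order_{i}" for i in range(10000)]

print("skewed distribution:", Counter(node_for(k) for k in skewed_rows))
print("even distribution:  ", Counter(node_for(k) for k in even_rows))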
The balance between data stored and accessed can be achieved by understanding the data access patterns. Data must be shared using the same unique key across tables, which will be used chiefly for joining data between tables. The unique key will ensure even data distribution and that the tables often joined on the same unique key end up storing the data on the same nodes. This arrangement of data will lead to a much faster local data join (co-located join) than the need to move data across nodes to join to create a final dataset. Another performance enhancer is sorting the data during the loading process. Unlike traditional databases, MPP databases do not have an index. Instead, they eliminate unnecessary data block scans based on how the keys are sorted. Data must be loaded by defining the sort key, and user queries must use this sort key to avoid unnecessary scanning of data blocks. Driving Innovation With Advanced Technologies The field of data engineering never remains the same, with new technologies and methodologies emerging daily to address growing data demands. In recent years, adopting hybrid cloud solutions has become a power move. Companies can achieve greater flexibility, scalability, and cost efficiency by taking advantage of cloud services such as AWS, Azure, and GCP. Data engineers play a crucial role in evaluating these cloud offerings, determining their suitability for specific requirements, and implementing them to fine-tune performance. Moreover, automation and artificial intelligence (AI) are transforming data engineering, making processes more efficient by reducing human intervention. Data engineers are increasingly developing self-healing systems that detect issues and automatically take corrective actions. This proactive outlook decreases downtime and boosts the overall reliability of data infrastructures. Additionally, exhaustive telemetry monitors systems in real-time, enabling early detection of potential problems and the generation of swift resolutions. Navigating the Digital Tomorrows: The Internet of Things and the World of People As data volumes continue to grow tenfold, the future of data engineering promises even more upgrades and challenges. Emerging technologies such as quantum computing and edge computing are poised to modify the field, offering unprecedented processing power and efficiency. Data engineers must be able to see these trends coming from a mile away. As the industry moves into the future at record speed, the ingenuity of data engineers will remain a key point of the digital age, powering the applications that define both the Internet of Things and the world of people.

By Ashish Karalkar
Server-Side Rendering With Spring Boot

Understanding the shared steps in the project setup is crucial before delving into the specifics of each client-augmenting technology. My requirements from the last post were quite straightforward: I'll assume the viewpoint of a backend developer. No front-end build step: no TypeScript, no minification, etc. All dependencies are managed from the backend app, i.e., Maven It's important to note that the technology I'll be detailing, except Vaadin, follows a similar approach. Vaadin, with its unique paradigm, really stands out among the approaches. WebJars WebJars is a technology designed in 2012 by James Ward to handle these exact requirements. WebJars are client-side web libraries (e.g., jQuery and Bootstrap) packaged into JAR (Java Archive) files. Explicitly and easily manage the client-side dependencies in JVM-based web applications Use JVM-based build tools (e.g. Maven, Gradle, sbt, etc.) to download your client-side dependencies Know which client-side dependencies you are using Transitive dependencies are automatically resolved and optionally loaded via RequireJS Deployed on Maven Central Public CDN, generously provided by JSDelivr - WebJars website A WebJar is a regular JAR containing web assets. Adding a WebJar to a project's dependencies is nothing specific: XML <dependencies> <dependency> <groupId>org.webjars.npm</groupId> <artifactId>alpinejs</artifactId> <version>3.14.1</version> </dependency> </dependencies> The framework's responsibility is to expose the assets under a URL. For example, Spring Boot does it in the WebMvcAutoConfiguration class: Java public void addResourceHandlers(ResourceHandlerRegistry registry) { if (!this.resourceProperties.isAddMappings()) { logger.debug("Default resource handling disabled"); return; } addResourceHandler(registry, this.mvcProperties.getWebjarsPathPattern(), //1 "classpath:/META-INF/resources/webjars/"); addResourceHandler(registry, this.mvcProperties.getStaticPathPattern(), (registration) -> { registration.addResourceLocations(this.resourceProperties.getStaticLocations()); if (this.servletContext != null) { ServletContextResource resource = new ServletContextResource(this.servletContext, SERVLET_LOCATION); registration.addResourceLocations(resource); } }); } The default is "/webjars/**" Inside the JAR, you can reach assets by their respective path and name. The agreed-upon structure is to store the assets inside resources/webjars//. Here's the structure of the alpinejs-3.14.1.jar: Plain Text META-INF |_ MANIFEST.MF |_ maven.org.webjars.npm.alpinejs |_ resources.webjars.alpinejs.3.14.1 |_ builds |_ dist |_ cdn.js |_ cdn.min.js |_ src |_ package.json Within Spring Boot, you can access the non-minified version with /webjars/alpinejs/3.14.1/dist/cdn.js. Developers release client-side libraries quite often. When you change a dependency version in the POM, you must change the front-end path, possibly in multiple locations. It's boring, has no added value, and you risk missing a change. The WebJars Locator project aims to avoid all these issues by providing a path with no version, i.e., /webjars/alpinejs/dist/cdn.js. You can achieve this by adding the webjars-locator JAR to your dependencies: XML <dependencies> <dependency> <groupId>org.webjars.npm</groupId> <artifactId>alpinejs</artifactId> <version>3.14.1</version> </dependency> <dependency> <groupId>org.webjars</groupId> <artifactId>webjars-locator</artifactId> <version>0.52</version> </dependency> </dependencies> I'll use this approach for every front-end technology. 
I'll also add the Bootstrap CSS library to provide a better-looking user interface. Thymeleaf Thymeleaf is a server-side rendering technology. Thymeleaf is a modern server-side Java template engine for both web and standalone environments. Thymeleaf's main goal is to bring elegant natural templates to your development workflow — HTML that can be correctly displayed in browsers and also work as static prototypes, allowing for stronger collaboration in development teams. With modules for Spring Framework, a host of integrations with your favourite tools, and the ability to plug in your own functionality, Thymeleaf is ideal for modern-day HTML5 JVM web development — although there is much more it can do. - Thymeleaf I was still a consultant when I first learned about Thymeleaf. At the time, JSP was at the end of their life. JSF were trying to replace them; IMHO, they failed. I thought Thymeleaf was a fantastic approach: it allows you to see the results in a static environment at design time and in a server environment at development time. Even better, you can seamlessly move between one and the other using the same file. I've never seen this capability used. However, Spring Boot fully supports Thymeleaf. The icing on the cake: the latter is available via an HTML namespace on the page. If you didn't buy into JSF (spoiler: I didn't), Thymeleaf is today's go-to SSR templating language. Here's the demo sample from the website: HTML <table> <thead> <tr> <th th:text="#{msgs.headers.name}">Name</th> <th th:text="#{msgs.headers.price}">Price</th> </tr> </thead> <tbody> <tr th:each="prod: ${allProducts}"> <td th:text="${prod.name}">Oranges</td> <td th:text="${#numbers.formatDecimal(prod.price, 1, 2)}">0.99</td> </tr> </tbody> </table> Here is a Thymeleaf 101, in case you need to familiarise yourself with the technology. When you open the HTML file, the browser displays the regular value inside the tags, i.e., Name and Price. When you use it in the server, Thymeleaf kicks in and renders the value computed from th:text, #{msgs.headers.name} and #{msgs.headers.price}. The $ operator queries for a Spring bean of the same name passed to the model. ${prod.name} is equivalent to model.getBean("prod").getName()". The # calls a function. th:each allows for loops. Thymeleaf Integration With the Front-End Framework Most, if not all, front-end frameworks work with a client-side model. We need to bridge between the server-side model and the client-side one. The server-side code I'm using is the following: Kotlin data class Todo(val id: Int, var label: String, var completed: Boolean = false) //1 fun config() = beans { bean { mutableListOf( //2 Todo(1, "Go to the groceries", false), Todo(2, "Walk the dog", false), Todo(3, "Take out the trash", false) ) } bean { router { GET("/") { ok().render( //3 "index", //4 mapOf("title" to "My Title", "todos" to ref<List<Todo>>()) //5 ) } } } } Define the Todo class. Add an in-memory list to the bean factory. In a regular app, you'd use a Repository to read from the database Render an HTML template. The template is src/main/resources/templates/index.html with Thymeleaf attributes. Put the model in the page's context. Thymeleaf offers a th:inline="javascript" attribute on the `` tag. It renders the server-side data as JavaScript variables. The documentation explains it much better than I ever could: The first thing we can do with script inlining is writing the value of expressions into our scripts, like: /* ... var username = /*[[${session.user.name}]]*/ 'Sebastian'; ... 
/*]]>*/ The /*[[...]]*/ syntax, instructs Thymeleaf to evaluate the contained expression. But there are more implications here: Being a javascript comment (/*...*/), our expression will be ignored when displaying the page statically in a browser. The code after the inline expression ('Sebastian') will be executed when displaying the page statically. Thymeleaf will execute the expression and insert the result, but it will also remove all the code in the line after the inline expression itself (the part that is executed when displayed statically). - Thymeleaf documentation If we apply the above to our code, we can get the model attributes passed by Spring as: HTML <script th:inline="javascript"> /*<![CDATA[*/ window.title = /*[[${title}]]*/ 'A Title' window.todos = /*[[${todos}]]*/ [{ 'id': 1, 'label': 'Take out the trash', 'completed': false }] /*]]>*/ </script> When rendered server-side, the result is: HTML <script> /*<![CDATA[*/ window.title = "My Title"; window.todos = [{"id":1,"label":"Go to the groceries","completed":false},{"id":2,"label":"Walk the dog","completed":false},{"id":3,"label":"Take out the trash","completed":false}] /*]]>*/ </script> Summary In this post, I've described two components I'll be using throughout the rest of this series: WebJars manage client-side dependencies in your Maven POM. Thymeleaf is a templating mechanism that integrates well with Spring Boot. The complete source code for this post can be found on GitHub. Go Further WebJars Instructions for Spring Boot

By Nicolas Fränkel DZone Core CORE
Redefining Artifact Storage: Preparing for Tomorrow's Binary Management Needs

As software pipelines evolve, so do the demands on binary and artifact storage systems. While solutions like Nexus, JFrog Artifactory, and other package managers have served well, they are increasingly showing limitations in scalability, security, flexibility, and vendor lock-in. Enterprises must future-proof their infrastructure with a vendor-neutral solution that includes an abstraction layer, preventing dependency on any one provider and enabling agile innovation. The Current Landscape: Artifact and Package Manager Solutions There are several leading artifact and package management systems today, each with its own strengths and limitations. Let’s explore the key players: JFrog Artifactory A popular choice for managing binaries, JFrog integrates with many DevOps tools and supports a variety of package formats. However, the vendor lock-in issue with JFrog’s ecosystem can restrict enterprises from adopting new technologies or integrating alternative solutions without high migration costs. Sonatype Nexus Repository Another well-known artifact manager, Nexus is strong in managing open-source components and has a wide range of package format support. Its limitations include complex configurations and scalability challenges in handling extremely large datasets or AI-driven workloads. AWS CodeArtifact Amazon’s cloud-native artifact management solution is convenient for AWS users and offers seamless integration with other AWS services. However, it lacks the cross-cloud portability that enterprises require, effectively locking users into the AWS ecosystem. Azure Artifacts Similarly to AWS CodeArtifact, Azure Artifacts integrates well with Microsoft’s development tools and cloud services but lacks multi-cloud flexibility and comes with the risk of vendor lock-in for those not heavily invested in the Azure ecosystem. GitHub Packages GitHub’s artifact management feature is integrated with its CI/CD pipelines, offering a straightforward solution for small to mid-size projects. However, it’s limited in scope, lacks scalability, and is not built for enterprise-grade artifact management on a large scale. Google Artifact Registry Google's offering provides artifact management across multiple cloud platforms and regions, but as with AWS and Azure, it is tightly coupled to Google's ecosystem, limiting cross-cloud flexibility. Key Limitations Across Current Solutions Each of these systems has its place in the development ecosystem, but they come with inherent limitations: Scalability: As artifact sizes grow, many current systems face challenges in handling the increased data load, especially when dealing with machine learning models or containerized environments. Vendor lock-in: Most of these solutions are tightly coupled with their respective cloud or infrastructure ecosystems, limiting an enterprise's ability to migrate or adopt newer technologies across different environments without significant cost and disruption. Complexity: Some systems, such as Nexus, are challenging to configure and maintain, especially for organizations looking for simplicity and agility in their artifact management. Cross-platform integration: Many artifact management solutions are optimized for specific toolchains (e.g., GitHub, AWS, Azure), which can hinder flexibility and force teams to adopt vendor-specific workflows that may not be ideal. 
Next-Generation Solutions: The Future of Vendor-Neutral Artifact Storage To overcome these limitations, next-generation artifact management solutions must not only offer scalability, resiliency, toolchain integration, and automation but also be vendor-neutral and future-proof. An abstraction layer that decouples enterprises from any one vendor is essential to ensuring flexibility and adaptability. 1. Vendor-Neutral, Hyper-Scalable Platforms Next-gen solutions must scale horizontally across cloud providers and on-prem environments, allowing enterprises to manage binary growth without being tied to a single vendor’s infrastructure. An abstraction layer will give enterprises the flexibility to switch between clouds (e.g., AWS, Azure, Google Cloud) or combine them, avoiding lock-in while ensuring smooth operations. 2. Built-In Resiliency Across Clouds Future systems should automatically replicate data across clouds and regions, ensuring redundancy and availability no matter where the infrastructure resides. The resiliency of these platforms should be built independently of any single vendor to avoid dependency. 3. Seamless Integration With Modern Toolchains Next-generation solutions should integrate easily with any DevOps pipeline, CI/CD tool, or container orchestration platform, such as Jenkins, Kubernetes, and GitHub Actions, without forcing teams to adhere to vendor-specific configurations. Enterprises should be able to move artifacts between clouds and platforms without reconfiguring their entire toolchain. 4. Intelligence and Automation These systems must leverage AI to automate artifact lifecycle management, predicting storage needs and optimizing performance. Automated policies for archiving, cleanup, and resource management should be flexible and customizable without requiring specialized vendor-specific tools or contracts. 5. SBOM (Software Bill of Materials) and Security Integration Security is paramount, and SBOM will play a crucial role in ensuring transparency and compliance in software supply chains. A next-gen solution must offer native SBOM support without being limited by vendor ecosystems. By using a unified SBOM framework across different platforms, enterprises can ensure security without being locked into proprietary tools. 6. Binary Variability Management Handling binary variability is key as artifact versions proliferate. A next-gen system should offer version control and traceability across multiple environments and toolchains, ensuring that enterprises can easily switch between different versions or rollback to previous configurations. Vendor-neutral platforms will allow for this flexibility without locking enterprises into a specific solution. Outpacing Competitors: The Case for Vendor-Neutral Solutions While current platforms like Nexus, Artifactory, and cloud-native offerings each have their strengths, they all suffer from a common issue: vendor lock-in. Enterprises that rely on these platforms often find themselves constrained by limited integration options, high switching costs, and a lack of flexibility. By adopting a vendor-neutral solution with an abstraction layer, enterprises can avoid these pitfalls. This layer decouples binary management from the underlying infrastructure, giving organizations the freedom to innovate, scale, and shift between platforms as needed — without fear of vendor lock-in choking their capability to adapt to future technologies. 
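To make the idea of an abstraction layer concrete, here is a minimal Python sketch of a vendor-neutral artifact store interface with a swappable backend. The class and method names are hypothetical and do not correspond to any real product's API.
Python
from abc import ABC, abstractmethod
from pathlib import Path

class ArtifactStore(ABC):
    """Vendor-neutral contract that the rest of the pipeline codes against."""

    @abstractmethod
    def upload(self, name: str, version: str, path: Path) -> None: ...

    @abstractmethod
    def download(self, name: str, version: str, dest: Path) -> None: ...

class S3BackedStore(ArtifactStore):
    """One possible backend; an Artifactory- or GCS-backed class would implement the same contract."""

    def upload(self, name: str, version: str, path: Path) -> None:
        # A real implementation would call the provider SDK here
        print(f"uploading {path} as {name}:{version} to an object-storage bucket")

    def download(self, name: str, version: str, dest: Path) -> None:
        print(f"downloading {name}:{version} into {dest}")

def publish(store: ArtifactStore, name: str, version: str, path: Path) -> None:
    # CI/CD code depends only on the interface, so backends can be swapped without pipeline changes
    store.upload(name, version, path)

publish(S3BackedStore(), "payment-service", "1.4.2", Path("target/payment-service.jar"))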
Conclusion: The Future of Enterprise Artifact Storage As the software landscape continues to evolve, so too must our approach to binary and artifact storage. The next generation of artifact management systems must be scalable, secure, resilient, and most importantly, vendor-neutral. By incorporating SBOM, managing binary variability, and offering an abstraction layer that enables flexibility, these solutions will empower enterprises to stay agile and innovative in a rapidly changing world. In a future where vendor lock-in could stifle enterprise growth, adopting a neutral, flexible solution is the key to long-term success.

By Vishal Raina
Refining Your JavaScript Code: 10 Critical Mistakes to Sidestep

JavaScript, the backbone of modern web development, is a powerful and versatile language. JavaScript's flexibility and dynamic nature make it both a blessing and a curse for new developers. While it allows for rapid development and creativity, it also has quirks that can trip up the uninitiated. By familiarizing yourself with these common mistakes, you'll be better equipped to write clean, efficient, and bug-free code. Mistake 1: Not Declaring Variables Properly The Problem One of the most common mistakes beginners make is not properly declaring variables. JavaScript allows you to declare variables using var, let, or const. Failing to declare a variable properly can lead to unexpected behavior and hard-to-track bugs. Example JavaScript function myFunction() { a = 10; // Variable 'a' is not declared console.log(a); } myFunction(); console.log(a); // 'a' is now a global variable Explanation In the example above, the variable a is not declared using var, let, or const. As a result, it becomes a global variable, which can lead to conflicts and unintended side effects in your code. Solution Always declare your variables explicitly. Use let and const to ensure proper scoping. JavaScript function myFunction() { let a = 10; // Variable 'a' is properly declared console.log(a); } myFunction(); console.log(a); // ReferenceError: 'a' is not defined Mistake 2: Confusing == and === The Problem JavaScript has two types of equality operators: == (loose equality) and === (strict equality). Beginners often use == without understanding its implications, leading to unexpected type coercion. Example JavaScript console.log(5 == '5'); // true console.log(5 === '5'); // false Explanation The == operator performs type coercion, converting the operands to the same type before comparison. This can lead to misleading results. The === operator, on the other hand, does not perform type coercion and compares both the value and the type. Solution Use === to avoid unexpected type coercion and ensure more predictable comparisons. JavaScript console.log(5 === '5'); // false console.log(5 === 5); // true Mistake 3: Misunderstanding Asynchronous Code The Problem JavaScript is single-threaded but can handle asynchronous operations through callbacks, promises, and async/await. Beginners often misunderstand how asynchronous code works, leading to issues like callback hell or unhandled promises. Example JavaScript setTimeout(function() { console.log('First'); }, 1000); console.log('Second'); Explanation In the example, setTimeout is asynchronous and will execute after the synchronous code. Beginners might expect "First" to be logged before "Second," but the output will be "Second" followed by "First." Solution Understand and use promises and async/await to handle asynchronous operations more effectively. JavaScript function myAsyncFunction() { return new Promise((resolve) => { setTimeout(() => { resolve('First'); }, 1000); }); } async function execute() { const result = await myAsyncFunction(); console.log(result); console.log('Second'); } execute(); // Output: "First" "Second" Mistake 4: Not Understanding this The Problem The this keyword in JavaScript behaves differently compared to other languages. Beginners often misuse this, leading to unexpected results, especially in event handlers and callbacks. 
Example JavaScript const obj = { value: 42, getValue: function() { return this.value; } }; const getValue = obj.getValue; console.log(getValue()); // undefined Explanation In the example, getValue is called without its object context, so this does not refer to obj but to the global object (or undefined in strict mode). Solution Use arrow functions or bind the function to the correct context. JavaScript const obj = { value: 42, getValue: function() { return this.value; } }; const getValue = obj.getValue.bind(obj); console.log(getValue()); // 42 // Alternatively, using arrow function const obj2 = { value: 42, getValue: function() { return this.value; } }; const getValue = () => obj2.getValue(); console.log(getValue()); // 42 Mistake 5: Ignoring Browser Compatibility The Problem JavaScript behaves differently across various browsers. Beginners often write code that works in one browser but fails in others, leading to compatibility issues. Example JavaScript let elements = document.querySelectorAll('.my-class'); elements.forEach(function(element) { console.log(element); }); Explanation The NodeList returned by querySelectorAll has a forEach method in modern browsers but may not in older ones like Internet Explorer. Solution Use feature detection or polyfills to ensure compatibility across different browsers. JavaScript let elements = document.querySelectorAll('.my-class'); if (elements.forEach) { elements.forEach(function(element) { console.log(element); }); } else { for (let i = 0; i < elements.length; i++) { console.log(elements[i]); } } Mistake 6: Failing To Use let or const in Loops The Problem Beginners often use var in loops, leading to unexpected behavior due to variable hoisting and function scope issues. Example JavaScript for (var i = 0; i < 3; i++) { setTimeout(function() { console.log(i); }, 1000); } // Output: 3, 3, 3 Explanation Using var in the loop causes the variable i to be hoisted and shared across all iterations. When the setTimeout callbacks are executed, they all reference the final value of i. var has function scope, whereas let and const have block scope, making them more predictable in loop iterations. Solution Use let instead of var to create a block-scoped variable for each iteration. JavaScript for (let i = 0; i < 3; i++) { setTimeout(function() { console.log(i); }, 1000); } // Output: 0, 1, 2 Mistake 7: Not Handling Errors in Promises The Problem When working with promises, beginners often forget to handle errors, leading to unhandled rejections that can crash applications or cause silent failures. Example JavaScript fetch('https://api.example.com/data') .then(response => response.json()) .then(data => console.log(data)); Explanation If the fetch request fails or the response isn't valid JSON, the promise will reject, and without a .catch block, the error won't be handled. Solution Always add a .catch block to handle errors in promises. JavaScript fetch('https://api.example.com/data') .then(response => response.json()) .then(data => console.log(data)) .catch(error => console.error('Error:', error)); Mistake 8: Overusing Global Variables The Problem Beginners often rely too heavily on global variables, leading to code that is difficult to debug, maintain, and scale. Example JavaScript var counter = 0; function increment() { counter++; } function reset() { counter = 0; } Explanation Using global variables like counter increases the risk of conflicts and makes it hard to track where and how the variable is being modified. 
Solution Encapsulate variables within functions or modules to limit their scope. JavaScript function createCounter() { let counter = 0; return { increment: function() { counter++; return counter; }, reset: function() { counter = 0; }, }; } const counter = createCounter(); console.log(counter.increment()); // 1 counter.reset(); Mistake 9: Misusing Array Methods The Problem Beginners often misuse array methods like map, filter, and reduce, leading to inefficient or incorrect code. Example JavaScript let numbers = [1, 2, 3, 4]; numbers.map(num => num * 2); // [2, 4, 6, 8] numbers.filter(num => num % 2 === 0); // [2, 4] numbers.reduce((sum, num) => sum + num, 0); // 10 Explanation While the code above is correct, beginners might misuse these methods by not understanding their purpose. For example, using map when no transformation is needed or using reduce where filter would be more appropriate. Solution Understand the purpose of each array method and use them appropriately. Use map for transformation. Use filter for selecting items. Use reduce for aggregating values. Mistake 10: Forgetting to Return in Arrow Functions The Problem Beginners often forget that arrow functions with curly braces {} require an explicit return statement, leading to unexpected undefined results. Example JavaScript const double = (x) => { x * 2 }; console.log(double(4)); // undefined Explanation The arrow function above does not return anything because the return statement is missing inside the curly braces. Solution Either add a return statement or remove the curly braces to use an implicit return. JavaScript const double = (x) => x * 2; console.log(double(4)); // 8 // Or with explicit return const doubleExplicit = (x) => { return x * 2 }; console.log(doubleExplicit(4)); // 8 FAQs Why Is Declaring Variables Important? Properly declaring variables prevents them from becoming global and causing unintended side effects. It ensures your code is more predictable and easier to debug. What’s the Difference Between == and ===? The == operator performs type coercion, converting operands to the same type before comparison, which can lead to unexpected results. The === operator compares both value and type, providing more predictable comparisons. How Can I Avoid Callback Hell? You can avoid callback hell by using promises and async/await to handle asynchronous operations more cleanly and manageably. How Do I Properly Use this? Understanding the context in which this is used is crucial. Use bind or arrow functions to ensure this refers to the correct object context. Why Should I Care About Browser Compatibility? Ensuring your code works across different browsers prevents bugs and provides a consistent user experience. Use feature detection and polyfills to handle compatibility issues. Conclusion Avoiding these common JavaScript mistakes will help you write cleaner, more efficient code and save you from frustrating debugging sessions. Remember to declare variables properly, use strict equality checks, handle asynchronous code correctly, understand the this keyword, and ensure browser compatibility. By mastering these aspects, you'll be well on your way to becoming a proficient JavaScript developer.

By Raju Dandigam
