Enterprise Security
This year has seen a rise in the sophistication and nuance of approaches to security that far surpasses years prior, with software supply chains at the top of that list. Each year, DZone investigates the state of application security, and our global developer community is seeing both more automation and solutions for data protection and threat detection as well as a more common security-forward mindset that seeks to understand the why. In our 2023 Enterprise Security Trend Report, we dive deeper into the greatest advantages and threats to application security today, including the role of software supply chains, infrastructure security, threat detection, automation and AI, and DevSecOps. Featured in this report are insights from our original research and related articles written by members of the DZone Community — read on to learn more!
It wasn't long ago that I decided to ditch my Ubuntu-based distros for openSUSE, finding LEAP 15 to be a steadier, more rock-solid flavor of Linux for my daily driver. The trouble is, I hadn't yet been introduced to Linux Mint Debian Edition (LMDE), and that sound you hear is my heels clicking with joy.

LMDE 6 with the Cinnamon desktop

Allow me to explain. While I've been a long-time fan of Ubuntu, in recent years the addition of snaps (rather than system packages) and other Ubuntu-only features started to wear on me. I wanted straightforward networking, support for older hardware, and a desktop that didn't get in the way of my work. For years, Ubuntu provided that, and I installed it on everything from old netbooks and laptops to towers and IoT devices.

More recently, though, I decided to move to Debian, the upstream Linux distro on which Ubuntu (and derivatives like Linux Mint and others) are built. Unlike Ubuntu, Debian holds fast to a truly solid, stable, non-proprietary mindset — and I can still use the apt package manager I've grown accustomed to. That is, every bit of automation I use (Chef and Ansible mostly) works the same on Debian and Ubuntu.

I spent some years switching back and forth between the standard Ubuntu long-term releases and Linux Mint, a truly great Ubuntu-derived desktop Linux. Of course, there are many Debian-based distributions, but I stumbled across LMDE version 6, based on Debian GNU/Linux 12 "Bookworm" and known as Faye, and knew I was onto something truly special.

As with the Ubuntu version, LMDE comes with different desktop environments, including the robust Cinnamon, which provides a familiar environment for any Linux, Windows, or macOS user. It's intuitive, chock full of great features (like a multi-function taskbar), and it supports a wide range of customizations. However, it includes no snaps or other Ubuntuisms, and it is amazingly stable. That is, I've not had a single freeze-up or odd glitch, even when pushing it hard with Kdenlive video editing, KVM virtual machines, and Docker containers.

According to the folks at Linux Mint, "LMDE is also one of our development targets, as such it guarantees the software we develop is compatible outside of Ubuntu." That means if you're a traditional Linux Mint user, you'll find all the familiar capabilities and features in LMDE. After nearly six months of daily use, that's proven true.

As someone who likes to hang on to old hardware, LMDE extended its value to me by supporting both 64- and 32-bit systems. I've since installed it on a 2008 MacBook (32-bit), old ThinkPads, old Dell netbooks, and even a Toshiba Chromebook. Though most of these boxes have less than 3 gigabytes of RAM, LMDE performs well. Cinnamon isn't the lightest desktop around, but it runs smoothly on everything I have.

The running joke in the Linux world is that "next year" will be the year the Linux desktop becomes a true Windows and macOS replacement. With Debian Bookworm-powered LMDE, I humbly suggest next year is now.

To be fair, on some of my oldest hardware, I've opted for Bunsen. It, too, is a Debian derivative with 64- and 32-bit versions, and I'm using the BunsenLabs Linux Boron version, which uses the Openbox window manager and sips resources: about 400 megabytes of RAM and low CPU usage. With Debian at its core, it's stable and glitch-free.

Since deploying LMDE, I've also begun to migrate my virtual machines and containers to Debian 12. Bookworm is amazingly robust and works well on IoT devices, LXCs, and more. Since it, too, has long-term support, I feel confident about its stability — and security — over time. If you're a fan of Ubuntu and Linux Mint, you owe it to yourself to give LMDE a try. As a daily driver, it's truly hard to beat.
Why Go Cloud-Native?

Cloud-native technologies empower us to produce increasingly larger and more complex systems at scale. It is a modern approach to designing, building, and deploying applications that can fully capitalize on the benefits of the cloud. The goal is to allow organizations to innovate swiftly and respond effectively to market demands.

Agility and Flexibility

Organizations often migrate to the cloud for the enhanced agility and speed it offers. The ability to set up thousands of servers in minutes contrasts sharply with the weeks it typically takes for on-premises operations. Immutable infrastructure provides confidence in configurable and secure deployments and helps reduce time to market.

Scalable Components

Cloud-native applications are about more than just hosting applications on the cloud. The approach promotes the adoption of microservices, serverless, and containerized applications, and involves breaking down applications into several independent services. These services integrate seamlessly through APIs and event-based messaging, each serving a specific function.

Resilient Solutions

Orchestration tools manage the lifecycle of components, handling tasks such as resource management, load balancing, scheduling, restarts after internal failures, and provisioning and deploying resources to server cluster nodes. According to the 2023 annual survey conducted by the Cloud Native Computing Foundation, cloud-native technologies, particularly Kubernetes, have achieved widespread adoption within the cloud-native community. Kubernetes continues to mature, signifying its prevalence as a fundamental building block for cloud-native architectures.

Security-First Approach

Cloud-native culture integrates security as a shared responsibility throughout the entire IT lifecycle. Cloud-native promotes shifting security left in the process: security must be part of application development and infrastructure right from the start, not an afterthought. Even after product deployment, security should remain a top priority, with constant security updates, credential rotation, virtual machine rebuilds, and proactive monitoring.

Is Cloud-Native Right for You?

There isn't a one-size-fits-all strategy to determine whether becoming cloud-native is a wise option. The right approach depends on strategic goals and the nature of the application. Not every application needs to invest in developing a cloud-native model; instead, teams can take an incremental approach based on specific business requirements. There are three levels to an incremental approach when moving to a cloud-native environment.

Infrastructure-Ready Applications

This level involves migrating or rehosting existing on-premises applications to an Infrastructure-as-a-Service (IaaS) platform with minimal changes. Applications retain their original structure but are deployed on cloud-based virtual machines. It is usually the first approach suggested and is commonly referred to as "lift and shift." However, deploying a solution in the cloud that retains monolithic behavior or does not utilize the full capabilities of the cloud generally has limited merit.

Cloud-Enhanced Applications

This level allows organizations to leverage modern cloud technologies such as containers and cloud-managed services without significant changes to the application code. Streamlining development operations with DevOps processes results in faster and more efficient application deployment.
Utilizing container technology addresses issues related to application dependencies during multi-stage deployments. Applications can be deployed on IaaS or PaaS while leveraging additional cloud-managed services related to databases, caching, monitoring, and continuous integration and deployment pipelines.

Cloud-Native Applications

This advanced migration strategy is driven by the need to modernize mission-critical applications. Platform-as-a-Service (PaaS) solutions or serverless components are used to transition applications to a microservices or event-based architecture. Tailoring applications specifically for the cloud may involve writing new code or adapting applications to cloud-native behavior. Companies such as Netflix, Spotify, Uber, and Airbnb are the leaders of the digital era. They have demonstrated a model of disruptive competitive advantage by adopting cloud-native architecture. This approach fosters long-term agility and scalability.

Ready to Dive Deeper?

The Cloud Native Computing Foundation (CNCF) has a vibrant community driving the adoption of cloud-native technologies. Explore their website and resources to learn more about tools and best practices. All major cloud providers have published a Cloud Adoption Framework (CAF) that provides guidance and best practices for adopting the cloud and achieving business outcomes:

Azure Cloud Adoption Framework
AWS Cloud Adoption Framework
GCP Cloud Adoption Framework

Final Words

Cloud-native architecture is not just a trendy buzzword; it's a fundamental shift in how we approach software development in the cloud era. Each migration approach I discussed above has unique benefits, and the choice depends on specific requirements. Organizations can choose a single approach or combine components from multiple strategies. Hybrid approaches, incorporating on-premises and cloud components, are common, allowing for flexibility based on diverse application requirements. By adhering to cloud-native design principles, application architecture becomes resilient, adaptable to rapid changes, and easy to maintain.
In the realm of Java development, mastering concurrent programming is a quintessential skill for experienced software engineers. At the heart of Java's concurrency framework lies the ExecutorService, a sophisticated tool designed to streamline the management and execution of asynchronous tasks. This tutorial delves into the ExecutorService, offering insights and practical examples to harness its capabilities effectively.

Understanding ExecutorService

At its core, ExecutorService is an interface that abstracts the complexities of thread management, providing a versatile mechanism for executing concurrent tasks in Java applications. It represents a significant evolution from traditional thread management methods, enabling developers to focus on task execution logic rather than the intricacies of thread lifecycle and resource management. This abstraction facilitates a more scalable and maintainable approach to handling concurrent programming challenges.

ExecutorService Implementations

Java provides several ExecutorService implementations, each tailored for different scenarios:

FixedThreadPool: A thread pool with a fixed number of threads, ideal for scenarios where the number of concurrent tasks is known and stable.
CachedThreadPool: A flexible thread pool that creates new threads as needed, suitable for applications with a large number of short-lived tasks.
ScheduledThreadPoolExecutor: Allows for the scheduling of tasks to run after a specified delay or to execute periodically, fitting for tasks requiring precise timing or regular execution.
SingleThreadExecutor: Ensures tasks are executed sequentially in a single thread, preventing concurrent execution issues without the overhead of managing multiple threads.

Thread Pools and Thread Reuse

ExecutorService manages a pool of worker threads, which helps avoid the overhead of creating and destroying threads for each task. Thread reuse is a significant advantage because creating threads can be resource-intensive.

In-Depth Exploration of ExecutorService Mechanics

Fundamental Elements

At the core of ExecutorService's efficacy in concurrent task management lie several critical elements:

Task Holding Structure: At the forefront is the task holding structure, essentially a BlockingQueue, which queues tasks pending execution. The nature of the queue, such as a LinkedBlockingQueue for a stable thread count or a SynchronousQueue for a flexible CachedThreadPool, directly influences task processing and throughput.
Execution Threads: These threads within the pool are responsible for carrying out the tasks. Managed by the ExecutorService, these threads are either spawned or repurposed as dictated by the workload and the pool's configuration.
Execution Manager: The ThreadPoolExecutor, a concrete embodiment of ExecutorService, orchestrates the entire task execution saga. It regulates the threads' lifecycle, oversees task processing, and monitors key metrics like the size of the core and maximum pools, task queue length, and thread keep-alive times.
Thread Creation Mechanism: This mechanism, or Thread Factory, is pivotal in spawning new threads. By allowing customization of thread characteristics such as names and priorities, it enhances control over thread behavior and diagnostics. A sketch of a custom thread factory appears after this list.
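To make the thread-factory element above concrete, here is a minimal, hedged sketch of a custom ThreadFactory that names its threads and installs an uncaught-exception handler. The pool name, pool size, and handler behavior are illustrative assumptions, not part of the original article.

Java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

public class NamedThreadFactory implements ThreadFactory {
    private final String poolName;                       // illustrative prefix for thread names
    private final AtomicInteger counter = new AtomicInteger(1);

    public NamedThreadFactory(String poolName) {
        this.poolName = poolName;
    }

    @Override
    public Thread newThread(Runnable task) {
        Thread thread = new Thread(task, poolName + "-thread-" + counter.getAndIncrement());
        thread.setDaemon(false);                          // keep worker threads non-daemon
        thread.setUncaughtExceptionHandler((t, e) ->
                System.err.println("Uncaught exception in " + t.getName() + ": " + e));
        return thread;
    }

    public static void main(String[] args) {
        // Hypothetical usage: a fixed pool whose threads carry readable names for diagnostics.
        ExecutorService executor =
                Executors.newFixedThreadPool(4, new NamedThreadFactory("report-worker"));
        executor.execute(() -> System.out.println("Running on " + Thread.currentThread().getName()));
        executor.shutdown();
    }
}

Readable thread names like report-worker-thread-1 make thread dumps and log lines much easier to correlate with the pool that produced them.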
Operational Dynamics

The operational mechanics of ExecutorService underscore its proficiency in task management:

Initial Task Handling: Upon task receipt, ExecutorService assesses whether to commence immediate execution, queue the task, or reject it based on current conditions and configurations.
Queue Management: Tasks are queued if current threads are fully engaged and the queue can accommodate more tasks. The queuing mechanism hinges on the BlockingQueue type and the thread pool's settings.
Pool Expansion: Should the task defy queuing and the active thread tally is below the maximum threshold, ExecutorService might instantiate a new thread for this task.
Task Processing: Threads in the pool persistently fetch and execute tasks from the queue, adhering to a task processing cycle that ensures continuous task throughput.
Thread Lifecycle Management: Idle threads exceeding the keep-alive duration are culled, allowing the pool to contract when task demand wanes.
Service Wind-down: ExecutorService offers methods (shutdown and shutdownNow) for orderly or immediate service cessation, ensuring task completion and resource liberation.

Execution Regulation Policies

ExecutorService employs nuanced policies for task execution to maintain system equilibrium:

Overflow Handling Policies: When a task can neither be immediately executed nor queued, a RejectedExecutionHandler policy decides the next steps, like task discard or exception throwing, critical for managing task surges.
Thread Renewal: To counteract the unexpected loss of a thread due to unforeseen exceptions, the pool replenishes its threads, thereby preserving the pool's integrity and uninterrupted task execution.

Advanced Features and Techniques

ExecutorService extends beyond mere task execution, offering sophisticated features like:

Task Scheduling: ScheduledThreadPoolExecutor allows for precise scheduling, enabling tasks to run after a delay or at fixed intervals.
Future and Callable: These constructs allow for the retrieval of results from asynchronous tasks, providing a mechanism for tasks to return values and allowing the application to remain responsive.
Custom Thread Factories: Custom thread factories can be used to customize thread properties, such as names or priorities, enhancing manageability and debuggability.
Thread Pool Customization: Developers can extend ThreadPoolExecutor to fine-tune task handling, thread creation, and termination policies to fit specific application needs.

Practical Examples

Example 1: Executing Tasks Using a FixedThreadPool

Java
ExecutorService executor = Executors.newFixedThreadPool(5);
for (int i = 0; i < 10; i++) {
    // WorkerThread is a user-defined Runnable (implementation not shown here)
    Runnable worker = new WorkerThread("Task " + i);
    executor.execute(worker);
}
executor.shutdown();

This example demonstrates executing multiple tasks using a fixed thread pool, where each task is encapsulated in a Runnable object.

Example 2: Scheduling Tasks With ScheduledThreadPoolExecutor

Java
ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
Runnable task = () -> System.out.println("Task executed at: " + new Date());
scheduler.scheduleAtFixedRate(task, 0, 1, TimeUnit.SECONDS);

Here, tasks are scheduled to execute repeatedly with a fixed interval, showcasing the scheduling capabilities of ExecutorService.
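Before moving on to the next example, here is a supplementary, hedged sketch of the overflow-handling policies described above: a bounded ThreadPoolExecutor configured with the built-in CallerRunsPolicy. The pool sizes, queue capacity, and task count are illustrative assumptions, not values from the original article.

Java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedPoolExample {
    public static void main(String[] args) {
        // Core pool of 2 threads, growing to 4, with a queue capped at 10 tasks (illustrative numbers).
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                2, 4,
                30, TimeUnit.SECONDS,                        // keep-alive for threads above the core size
                new LinkedBlockingQueue<>(10),
                new ThreadPoolExecutor.CallerRunsPolicy());  // overflow policy: run the task on the submitting thread

        for (int i = 0; i < 50; i++) {
            final int taskId = i;
            executor.execute(() ->
                    System.out.println("Task " + taskId + " on " + Thread.currentThread().getName()));
        }
        executor.shutdown();
    }
}

CallerRunsPolicy provides natural backpressure: when the pool and queue are saturated, the submitting thread executes the task itself, which slows down submission instead of silently dropping work or throwing RejectedExecutionException.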
Example 3: Handling Future Results From Asynchronous Tasks

Java
ExecutorService executor = Executors.newCachedThreadPool();
Callable<String> task = () -> {
    TimeUnit.SECONDS.sleep(1);
    return "Result of the asynchronous computation";
};
Future<String> future = executor.submit(task);
System.out.println("Future done? " + future.isDone());
String result = future.get(); // Waits for the computation to complete
System.out.println("Future done? " + future.isDone());
System.out.println("Result: " + result);
executor.shutdown();

This example illustrates submitting a Callable task, managing its execution with a Future object, and retrieving the result asynchronously.

Insights: Key Concepts and Best Practices

To effectively utilize ExecutorService and make the most of its capabilities, it is imperative to grasp several fundamental principles and adhere to the recommended approaches:

Optimal Thread Pool Size

The selection of an appropriate thread pool size carries substantial significance. Too few threads may result in the underutilization of CPU resources, whereas an excessive number of threads can lead to resource contention and unwarranted overhead. Determining the ideal pool size hinges on variables such as the available CPU cores and the nature of the tasks at hand. Utilizing Runtime.getRuntime().availableProcessors() can aid in ascertaining the number of available CPU cores.

Task Prioritization

ExecutorService, by default, lacks inherent support for task priorities. In cases where task prioritization is pivotal, using a priority queue to manage tasks and manually assigning priorities becomes a viable approach.

Task Interdependence

In scenarios where tasks depend on one another, the Future objects returned upon submission of Callable tasks become instrumental. These Future objects permit the retrieval of task results and the ability to await their completion.

Effective Exception Handling

ExecutorService offers mechanisms to address exceptions that may arise during task execution. These exceptions can be managed within a try-catch block inside the task itself or by installing an UncaughtExceptionHandler on pool threads via a custom ThreadFactory supplied at thread pool creation.

Graceful Termination

It is imperative to shut down the ExecutorService when it is no longer required, ensuring the graceful release of resources. This can be achieved through the shutdown() method, which initiates an orderly shutdown, allowing submitted tasks to conclude their execution. Alternatively, the shutdownNow() method can be employed for the forceful termination of all running tasks. (A sketch of the common shutdown pattern appears after the conclusion below.)

Conclusion

In summary, Java's ExecutorService stands as a sophisticated and powerful framework for handling and orchestrating concurrent tasks, effectively simplifying the intricacies of thread management with a well-defined and efficient API. Delving deeper into its internal mechanics and core components sheds light on the operational dynamics of task management, queuing, and execution, offering developers critical insights that can significantly influence application optimization for superior performance and scalability. Utilizing ExecutorService to its full extent, from executing simple tasks to leveraging advanced functionalities like customizable thread factories and sophisticated rejection handlers, enables the creation of highly responsive and robust applications capable of managing numerous concurrent operations.
Adhering to established best practices, such as optimal thread pool sizing and implementing smooth shutdown processes, ensures applications remain reliable and efficient under diverse operational conditions. At its essence, ExecutorService exemplifies Java's dedication to providing comprehensive and high-level concurrency tools that abstract the complexities of raw thread management. As developers integrate ExecutorService into their projects, they tap into the potential for improved application throughput, harnessing the power of modern computing architectures and complex processing environments.
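To round out the graceful-termination practice above, here is a minimal, hedged sketch of the shutdown-then-await pattern; the timeout and the use of availableProcessors() for sizing are illustrative assumptions.

Java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class GracefulShutdownExample {
    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors()); // size the pool from available cores

        executor.execute(() -> System.out.println("Doing some work..."));

        executor.shutdown();                                  // stop accepting new tasks
        try {
            // Wait up to 30 seconds (illustrative) for in-flight tasks to finish.
            if (!executor.awaitTermination(30, TimeUnit.SECONDS)) {
                executor.shutdownNow();                       // force-cancel whatever is still running
            }
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt();               // preserve the interrupt status
        }
    }
}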
In today's cloud computing world, all types of logging data are extremely valuable. Logs can include a wide variety of data, including system events, transaction data, user activities, web browser logs, errors, and performance metrics. Managing logs efficiently is extremely important for organizations, but dealing with large volumes of data makes it challenging to detect anomalies and unusual patterns or predict potential issues before they become critical. Efficient log management strategies, such as implementing structured logging, using log aggregation tools, and applying machine learning for log analysis, are crucial for handling this data effectively.

One of the latest advancements in effectively analyzing large amounts of logging data is the machine learning (ML)-powered analytics provided by Amazon CloudWatch, a brand-new capability of the service. It is transforming the way organizations handle their log data, offering faster, more insightful, and automated log data analysis. This article specifically explores utilizing the machine learning-powered analytics of CloudWatch to overcome the challenges of effectively identifying hidden issues within log data. Before deep diving into some of these features, let's have a quick refresher on Amazon CloudWatch.

What Is Amazon CloudWatch?

It is an AWS-native monitoring and observability service that offers a whole suite of capabilities:

Monitoring: Tracks performance and operational health.
Data collection: Gathers logs, metrics, and events, providing a comprehensive view of AWS resources.
Unified operational view: Provides insights into applications running on AWS and on-premises servers.

Challenges With Log Data Analysis

Volume of Data

There's too much log data. In this modern era, applications emit a tremendous amount of log events. Log data can grow so rapidly that developers often find it difficult to identify issues within it; it is like finding a needle in a haystack.

Change Identification

Another common problem is a fundamental one that goes back as long as logs have been around: identifying what has changed in your logs.

Proactive Detection

Proactive detection is another common challenge. It's great if you can use logs to dive in when an application is having an issue, find the root cause, and fix it. But how do you know when those issues are occurring? How do you proactively detect them? Of course, you can implement metrics, alarms, etc., for the issues you know about. But there's always the problem of unknowns, so we often end up instrumenting observability and monitoring only for past issues.

Now, let's dive deep into the machine learning capabilities from CloudWatch that will help you overcome the challenges we have just discussed.

Machine Learning Capabilities From CloudWatch

Pattern Analysis

Imagine you are troubleshooting a real-time distributed application accessed by millions of customers globally and generating a significant amount of application logs. Analyzing tens of thousands of log events manually is challenging, and it can take forever to find the root cause. That is where the new AWS CloudWatch machine learning-based capability can quickly help by grouping log events into patterns within the Logs Insights page of CloudWatch. It is much easier to look through a limited number of patterns and quickly filter the ones that might be interesting or relevant based on the issue you are trying to troubleshoot.
It also allows you to expand a specific pattern to look for the relevant events, along with related patterns that might be pertinent. In simple words, pattern analysis is the automated grouping and categorization of your log events.

Comparison Analysis

How can we elevate pattern analysis to the next level? Now that we've seen how pattern analysis works, let's see how we can extend this feature to perform comparison analysis. Comparison analysis aims to solve the second challenge: identifying log changes. It lets you effectively profile your logs using patterns from one time period, compare them to the patterns extracted for another period, and analyze the differences. This helps answer the fundamental question of what changed in my logs. You can quickly compare your logs from a period while your application is having an issue to a known healthy period. Any changes between the two time periods are a strong indicator of the possible root cause of your problem.

CloudWatch Logs Anomaly Detection

Anomaly detection, in simple terms, is the process of identifying unusual patterns or behaviors in the logs that do not conform to expected norms. To use this feature, we need to first select the log group for the application and enable CloudWatch Logs anomaly detection for it. At that point, CloudWatch will train a machine-learning model on the expected patterns and the volume of each pattern associated with your application. CloudWatch takes five minutes to train the model using logs from your application, after which the feature becomes active and automatically starts surfacing anomalies any time they occur. A brand-new error message that wasn't there before, a sudden spike in volume, or a spike in HTTP 400s are some examples that would result in an anomaly being generated.

Generate Logs Insights Queries Using Generative AI

With this capability, you can give natural language commands to filter log events, and CloudWatch can generate queries using generative AI. If you are unfamiliar with the CloudWatch query language or are from a non-technical background, you can easily use this feature to generate queries and filter logs. It's an iterative process; you don't need to get precisely what you want from the first query. You can update and iterate the query based on the results you see. Let's look at a couple of examples (reconstructions of the generated queries appear after the explanations below).

Natural language prompt: "Check API Response Times"

In the auto-generated query from CloudWatch:

fields @timestamp, @message selects the timestamp and message fields from your logs.
| parse @message "Response Time: *" as responseTime parses the @message field to extract the value following the text "Response Time: " and labels it as responseTime.
| stats avg(responseTime) calculates the average of the extracted responseTime values.

Natural language prompt: "Please provide the duration of the ten invocations with the highest latency."

In the auto-generated query from CloudWatch:

fields @timestamp, @message, latency selects the @timestamp, @message, and latency fields from the logs.
| stats max(latency) as maxLatency by @message computes the maximum latency value for each unique message.
| sort maxLatency desc sorts the results in descending order based on the maximum latency, showing the highest values at the top.
| limit 10 restricts the output to the top 10 results with the highest latency values.
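Piecing together the explanations above, the two auto-generated queries likely look roughly like the following (reconstructed here for readability; the exact output CloudWatch produces may differ):

Plain Text
fields @timestamp, @message
| parse @message "Response Time: *" as responseTime
| stats avg(responseTime)

Plain Text
fields @timestamp, @message, latency
| stats max(latency) as maxLatency by @message
| sort maxLatency desc
| limit 10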
We can execute these queries in the CloudWatch “Logs Insights” query box to filter the log events from the application logs. These queries extract specific information from the logs, such as identifying errors, monitoring performance metrics, or tracking user activities. The query syntax might vary based on the particular log format and the information you seek.

Conclusion

CloudWatch's machine learning features offer a robust solution for managing the complexities of log data. These tools make log analysis more efficient and insightful, from automating pattern analysis to enabling anomaly detection. The addition of generative AI for query generation further democratizes access to these powerful insights.
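For readers who prefer running such queries programmatically rather than in the console, here is a minimal, hedged Python sketch using boto3's CloudWatch Logs client. The log group name and time window are illustrative assumptions, and error handling is omitted for brevity.

Python
import time
import boto3

logs = boto3.client("logs")

# Hypothetical log group; replace with one from your own account.
LOG_GROUP = "/aws/lambda/orders-api"

query = """
fields @timestamp, @message
| parse @message "Response Time: *" as responseTime
| stats avg(responseTime)
"""

# Query the last hour of logs (start/end are Unix timestamps in seconds).
end = int(time.time())
start = end - 3600

response = logs.start_query(
    logGroupName=LOG_GROUP,
    startTime=start,
    endTime=end,
    queryString=query,
)

# Poll until the query completes, then print each result row as a dict.
result = logs.get_query_results(queryId=response["queryId"])
while result["status"] in ("Scheduled", "Running"):
    time.sleep(1)
    result = logs.get_query_results(queryId=response["queryId"])

for row in result["results"]:
    print({field["field"]: field["value"] for field in row})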
In the ever-evolving landscape of the Financial Services Industry (FSI), organizations face a multitude of challenges that hinder their journey toward AI-driven transformation. Legacy systems, stringent regulations, data silos, and a lack of agility have created a chaotic environment in need of a more effective way to use and share data across the organization. In this two-part series, I delve into my own personal observations, break open the prevailing issues within the FSI, and closely inspect the factors holding back progress. I’ll also highlight the need to integrate legacy systems, navigate strict regulatory landscapes, and break down data silos that impede agility and hinder data-driven decision-making. Last but not least, I’ll introduce a proven data strategy approach in which adopting data streaming technologies will help organizations overhaul their data pipelines—enabling real-time data ingestion, efficient processing, and seamless integration of disparate systems. If you're interested in a practical use case that showcases everything in action, keep an eye out for our upcoming report. But before diving into the solution, let’s start by understanding the problem.

Complexity, Chaos, and Data Dilemmas

Stepping into larger financial institutions often reveals a fascinating insight: a two-decade technology progression unfolding before my eyes. The core business continues to rely on mainframe systems running COBOL, while a secondary layer of services acts as the gateway to access the core and an extension of service offerings that can’t be handled in the core system. Data is heavily batched and undergoes nightly ETL processes to facilitate transfer between these layers. Real-time data access poses challenges, demanding multiple attempts and queries through the gateway for even a simple status update. Data warehouses are established, serving as data dumping grounds through ETL, where nearly half of the data remains unused. Business Intelligence (BI) tools extract, transform, and analyze the data to provide valuable insights for business decisions and product design. Batch and distributed processing prevail due to the sheer volume of data to be handled, resulting in data silos and delayed reflection of changing trends.

In recent years, more agile approaches have emerged, with a data shift towards binary, key-value formats for better scalability on the cloud. However, due to the architectural complexity, data transfers between services have multiplied, leading to challenges in maintaining data integrity. Plus, these innovations primarily cater to new projects, leaving developers and internal users to navigate through multiple hoops within the system to accomplish tasks. Companies also find themselves paying the price for slow innovation and encounter high costs when implementing new changes. This is particularly true when it comes to AI-driven initiatives that demand a significant amount of data and swift action. Consequently, several challenges bubble to the surface and get in the way of progress, making it increasingly difficult for FSIs to adapt and prepare for the future. Here’s a breakdown of these challenges, the reality today, and the ideal state for FSIs.

Data silos
Reality: The decentralized nature of financial operations, or teams' geographical locations, means separate departments or business units maintain their own data and systems, implemented over the years, resulting in isolated data and making it difficult to collaborate. There have already been several attempts to break the silos, and the solutions have somehow contributed to one of the other problems below (i.e., data pipeline chaos).
Ideal state: A consolidated view of data across the organization, with the ability to quickly view and pull data when needed.

Legacy systems
Reality: FSIs often grapple with legacy systems that have been in place for many years. These systems usually lack the agility to adapt to changes quickly. As a result, accessing and actioning data from these legacy systems can be time-consuming, leading to delays and sometimes making it downright impossible to make good use of the latest data.
Ideal state: Data synchronization with the old systems and modernized ETL pipelines; migrate away from and retire the old process.

Data overload
Reality: With vast amounts of data from various sources, including transactions, customer interactions, market data, and more, it can be overwhelming, making it challenging to extract valuable insights and derive actionable intelligence. It often leads to high storage bills, and data is not fully used most of the time.
Ideal state: Infrastructural change to handle larger data ingestion, a planned data storage strategy, and a more cost-effective way to safely secure and store data with a sufficient failover and recovery plan.

Data pipeline chaos
Reality: Managing data pipelines within FSIs can be a complex endeavor. With numerous data sources, formats, and integration points, the data pipeline can become fragmented and chaotic. Inconsistent data formats, incompatible systems, and manual processes can introduce errors and inefficiencies, making it challenging to ensure smooth data flow and maintain data quality.
Ideal state: A data catalog: a centralized repository that serves as a comprehensive inventory and metadata management system for an organization's data assets. Reduced redundancy, improved efficiency, streamlined data flow, and the introduction of automation, monitoring, and regular inspection.

Open data initiatives
Reality: With the increasing need for partner collaboration and government open API projects, the FSI faces the challenge of adapting its data practices. The demand to share data securely and seamlessly with external partners and government entities is growing. FSIs must establish frameworks and processes to facilitate data exchange while ensuring privacy, security, and compliance with regulations.
Ideal state: Secure and well-defined APIs for data access that ensure data interoperability through common standards, plus version and access control over access points.

Clearly, there’s a lot stacked up against FSIs attempting to leap into the world of AI. Now, let’s zoom in on the different data pipelines organizations are using to move their data from point A to B and the challenges many teams are facing with them.

Understanding Batch, Micro-Batch, and Real-Time Data Pipelines

There are all sorts of ways to move data around. To keep things simple, I’ll distill the most common pipelines today into three categories:

Batch
Micro-batch
Real-time

1. Batch Pipelines

These are typically used when processing large volumes of data in scheduled “chunks” at a time—often in overnight processing, periodic data updates, or batch reporting. Batch pipelines are well-suited for scenarios where immediate data processing isn't crucial, and the output is usually a report, like for investment profiles and insurance claims. The main setbacks include processing delays, potentially outdated results, scalability complexities, managing task sequences, resource allocation issues, and limitations in providing real-time data insights.
I’ve witnessed an insurance customer running out of windows at night to run batches due to the sheer volume of data that needed processing (updating premiums, investment details, documents, agents’ commissions, etc.). Parallel processing and map-reduce are a few techniques to shorten the time, but they also introduce complexities, as both require the developer to understand the distribution of the data and its dependencies, and to be able to maneuver between map and reduce functions.

2. Micro-Batch Pipelines

Micro-batch pipelines are a variation of batch pipelines where data is processed in smaller, more frequent batches at regular intervals for lower latency and fresher results. They’re commonly used for financial trading insights, clickstream analysis, recommendation systems, underwriting, and customer churn predictions. Challenges with micro-batch pipelines include managing the trade-off between processing frequency and resource usage, handling potential data inconsistencies across micro-batches, and addressing the overhead of initiating frequent processing jobs while still maintaining efficiency and reliability.

3. Real-Time Pipelines

These pipelines process data as soon as it flows in. They offer minimal latency and are essential for applications requiring instant reactions, such as real-time analytics, transaction fraud detection, monitoring critical systems, interactive user experiences, continuous model training, and real-time predictions. However, real-time pipelines face challenges like handling high throughputs, maintaining consistently low latency, ensuring data correctness and consistency, managing resource scalability to accommodate varying workloads, and dealing with potential data integration complexities—all of which require robust architectural designs and careful implementation to deliver accurate and timely results.

To summarize, here’s the important information about all three pipelines:

Cadence
Batch: Scheduled longer intervals. Micro-batch: Scheduled short intervals. Real-time: Real time.

Data size
Batch: Large. Micro-batch: Small, defined chunks. Real-time: Large.

Scaling
Batch: Vertical. Micro-batch: Horizontal. Real-time: Horizontal.

Latency
Batch: High (hours/days). Micro-batch: Medium (seconds). Real-time: Low (milliseconds).

Datastore
Batch: Data warehouses, data lakes, databases, files. Micro-batch: Distributed file systems, data warehouses, databases. Real-time: Stream processing systems, data lakes, databases.

Open-source technologies
Batch: Apache Hadoop, MapReduce. Micro-batch: Apache Spark™. Real-time: Apache Kafka®, Apache Flink®.

Industry use case examples
Batch: Moving files (customer signature scans) and transferring data from the mainframe for core banking data or core insurance policy information; large datasets for ML. Micro-batch: Preparing near real-time business reports that need to consume data from large dataset lookups, such as generating risk management reviews for investments; daily market trend analysis. Real-time: Real-time transaction/fraud detection, instant claim approval, monitoring critical systems, and customer chatbot service.

As a side note, some may categorize pipelines as either ETL or ELT. ETL (Extract, Transform, and Load) transforms data on a separate processing server before moving it to the data warehouse. ELT (Extract, Load, and Transform) loads the data first and transforms it within the destination itself. It depends on the destination of the data: if it’s going to a data lake, you’ll see most pipelines doing ELT, whereas with a destination like a data warehouse or database, which requires data to be stored in a more structured manner, you will see more ETL.
In my opinion, all three pipelines should be using both techniques to convert data into the desired state.

Common Challenges of Working With Data Pipelines

Pipelines are scattered across departments, and IT teams implement them with various technologies and platforms. From my own experience working with on-site data engineers, here are some common challenges of working with data pipelines:

Difficulty Accessing Data

Unstructured data can be tricky. The lack of metadata makes it difficult to locate the desired data within the repository (like customer correspondence, emails, chat logs, and legal documents).
Certain data analytics tools or platforms may have strict requirements regarding the input data format, posing difficulties in converting the data to the required format. The result is multiple complex pipelines with lots of transform logic.
Stringent security measures and regulatory compliance can introduce additional steps and complexities in gaining access to the necessary data (personally identifiable data, health records for claims).

Noisy, “Dirty” Data

Data lakes are prone to issues like duplicated data.
Persistence of decayed or outdated data within the system can compromise the accuracy and reliability of AI models and insights.
Input errors during data entry that were not caught and filtered (the biggest waste of data processing troubleshooting time).
Data mismatches between different datasets and inconsistencies in data (incorrect reports and pipeline errors).

Performance

Large volumes of data and a lack of efficient storage and processing power.
Methods of retrieving data, such as APIs whose request/response model isn't ideal for large volumes of data ingestion.
The location of relevant data within the system and where it is stored heavily impacts how frequently data can be processed, plus the latency and cost of retrieving it.

Data Visibility (Data Governance and Metadata)

Inadequate metadata results in a lack of clarity regarding the availability, ownership, and usage of data assets.
It is difficult to determine the existence and availability of specific data, impeding effective data usage and analysis.

Troubleshooting

Identifying inconsistencies, addressing data quality problems, or troubleshooting data processing failures can be time-consuming and complex.

During the process of redesigning the data framework for AI, both predictive and generative, I’ll address the primary pain points for data engineers and also help solve some of the biggest challenges plaguing the FSI today.

Taking FSIs From Point A to AI

Looking through a data lens, the AI-driven world can be dissected into two primary categories: inference and machine learning. These domains differ in their data requirements and usage. Machine learning needs comprehensive datasets derived from historical, operational, and real-time sources, enabling the training of more accurate models. Incorporating real-time data into the dataset enhances the model and facilitates agile and intelligent systems. Inference prioritizes real-time focus, leveraging ML-generated models to respond to incoming events, queries, and requests.

Building a generative AI model is a major undertaking. For FSI, it makes sense to reuse an existing model (foundation model) with some fine-tuning in specific areas to fit your use case. The fine-tuning will require you to provide a high-quality, high-volume dataset. The old saying still holds true: garbage in, garbage out. If the data isn’t reliable to begin with, you’ll inevitably end up with unreliable AI.
In my opinion, to prepare for the best AI outcome possible, it’s crucial to set up the following foundations:

Data infrastructure: You need a robust, low-latency, high-throughput framework to transfer and store vast volumes of financial data for efficient data ingestion, storage, processing, and retrieval. It should support distributed and cloud computing and prioritize network latency, storage costs, and data safety.
Data quality: To provide better data for determining the model, it’s best to go through data cleansing, normalization, de-duplication, and validation processes to remove inconsistencies, errors, and redundancies.

Now, if I were to say that there’s a simple solution, I would either be an exceptional genius capable of solving world crises or blatantly lying. However, given the complexity we already have, it’s best to focus on generating the datasets required for ML and streamline the data needed for the inference phase to make decisions. Then, you can gradually address the issues caused by the current data being overly disorganized. Taking one domain at a time, solving business users’ problems first, and not being overly ambitious is the fastest path to success. But we’ll leave that for the next post.

Summary

Implementing a data strategy in the financial services industry can be intricate due to factors such as legacy systems and the consolidation of other businesses. Introducing AI into this mix can pose performance challenges, and some businesses might struggle to prepare data for machine learning applications. In my next post, I’ll walk you through a proven data strategy approach to streamline your troublesome data pipelines for real-time data ingestion, efficient processing, and seamless integration of disparate systems.
The popularity of Kubernetes (K8s) as the de facto orchestration platform for the cloud is not showing any sign of pause. A graph from the 2023 Kubernetes Security Report by the security company Wiz clearly illustrates the trend: as adoption continues to soar, so do the security risks and, most importantly, the attacks threatening K8s clusters. One such threat comes in the form of long-lived service account tokens. In this blog, we are going to dive deep into what these tokens are, their uses, the risks they pose, and how they can be exploited. We will also advocate for the use of short-lived tokens for a better security posture.

Service account tokens are bearer tokens (a type of token mostly used for authentication in web applications and APIs) used by service accounts to authenticate to the Kubernetes API. Service accounts provide an identity for processes (applications) that run in a Pod, enabling them to interact with the Kubernetes API securely. Crucially, these tokens are long-lived: when a service account is created, Kubernetes automatically generates a token and stores it indefinitely as a Secret, which can be mounted into pods and used by applications to authenticate API requests.

Note: in more recent versions, including Kubernetes v1.29, API credentials are obtained directly by using the TokenRequest API and are mounted into Pods using a projected volume. The tokens obtained using this method have bounded lifetimes and are automatically invalidated when the Pod they are mounted into is deleted.

As a reminder, the Kubelet on each node is responsible for mounting service account tokens into pods so they can be used by applications within those pods to authenticate to the Kubernetes API when needed. If you need a refresher on K8s components, look here.

The Utility of Service Account Tokens

Service account tokens are essential for enabling applications running on Kubernetes to interact with the Kubernetes API. They are used to deploy applications, manage workloads, and perform administrative tasks programmatically. For instance, a Continuous Integration/Continuous Deployment (CI/CD) tool like Jenkins would use a service account token to deploy new versions of an application or roll back a release.

The Risks of Longevity

While service account tokens are indispensable for automation within Kubernetes, their longevity can be a significant risk factor. Long-lived tokens, if compromised, give attackers ample time to explore and exploit a cluster. Once in the hands of an attacker, these tokens can be used to gain unauthorized access, elevate privileges, exfiltrate data, or even disrupt the entire cluster's operations. Here are a few leak scenarios that could lead to some serious damage:

Misconfigured access rights: A pod or container may be misconfigured to have broader file system access than necessary. If a token is stored on a shared volume, other containers or malicious pods that have been compromised could potentially access it.
Insecure transmission: If the token is transmitted over the network without proper encryption (like sending it over HTTP instead of HTTPS), it could be intercepted by network sniffing tools.
Code repositories: Developers might inadvertently commit a token to a public or private source code repository. If the repository is public or becomes exposed, the token is readily available to anyone who accesses it.
Logging and monitoring systems: Tokens might get logged by applications or monitoring systems and could be exposed if logs are not properly secured or if verbose logging is accidentally enabled.
Insider threat: A malicious insider with access to the Kubernetes environment could extract the token and use it or leak it intentionally.
Application vulnerabilities: If an application running within the cluster has vulnerabilities (e.g., a Remote Code Execution flaw), an attacker could exploit this to gain access to the pod and extract the token.

How Could an Attacker Exploit Long-Lived Tokens?

Attackers can collect long-lived tokens through network eavesdropping, exploiting vulnerable applications, or leveraging social engineering tactics. With these tokens, they can manipulate Kubernetes resources at will. Here is a non-exhaustive list of potential abuses:

Abuse the cluster's (often barely limited) infrastructure resources for cryptocurrency mining or as part of a botnet.
With API access, deploy malicious containers, alter running workloads, exfiltrate sensitive data, or even take down the entire cluster.
If the token has broad permissions, use it to modify roles and bindings to elevate privileges within the cluster.
Create additional resources that provide persistent access (a backdoor) to the cluster, making it harder to remove the attacker's presence.
Access sensitive data stored in the cluster, or accessible through it, leading to data theft or leakage.

Why Aren’t Service Account Tokens Short-Lived by Default?

Short-lived tokens are a security best practice in general, particularly for managing access to very sensitive resources like the Kubernetes API. They reduce the window of opportunity for attackers to exploit a token and facilitate better management of permissions as application access requirements change. Automating token rotation limits the impact of a potential compromise and aligns with the principle of least privilege — granting only the access necessary for a service to operate.

The problem is that implementing short-lived tokens comes with some overhead. First, implementing short-lived tokens typically requires a more complex setup. You need an automated process to handle token renewal before it expires. This may involve additional scripts or Kubernetes operators that watch for token expiration and request new tokens as necessary. This often means integrating a secret management system that can securely store and automatically rotate the tokens, which adds a new dependency for system configuration and maintenance.

Note: it goes without saying that using a secrets manager with Kubernetes is highly recommended, even for non-production workloads. But the overhead should not be underestimated.

Second, software teams running their CI/CD workers on top of the cluster will need adjustments to support dynamic retrieval and injection of these tokens into the deployment process. This could require changes in the pipeline configuration and additional error handling to manage potential token expiration during a pipeline run, which can be a true headache. And secrets management is just the tip of the iceberg. You will also need monitoring and alerts if you want to troubleshoot renewal failures. Fine-tuning token expiry time could break the deployment process, requiring immediate attention to prevent downtime or deployment failures.
Finally, there could also be performance considerations, as many more API calls are needed to retrieve new tokens and update the relevant Secrets.

By default, Kubernetes opts for a straightforward setup by issuing service account tokens without a built-in expiration. This approach simplifies initial configuration but lacks the security benefits of token rotation. It is the Kubernetes admin’s responsibility to configure more secure practices by implementing short-lived tokens and the necessary infrastructure for their rotation, thereby enhancing the cluster's security posture.

Mitigation Best Practices

For many organizations, the additional overhead is justified by the security improvements. Tools like service mesh implementations (e.g., Istio), secret managers (e.g., CyberArk Conjur), or cloud provider services can manage the lifecycle of short-lived certificates and tokens, helping to reduce the overhead. Additionally, recent versions of Kubernetes offer features like the TokenRequest API, which can automatically rotate tokens and project them into running pods.

Even without any additional tool, you can mitigate the risks by limiting the service account auto-mount feature. To do so, you can opt out of the default API credential automounting with a single flag in the service account or pod configuration. Here are two examples.

For a ServiceAccount:

YAML
apiVersion: v1
kind: ServiceAccount
metadata:
  name: build-robot
automountServiceAccountToken: false
...

And for a specific Pod:

YAML
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  serviceAccountName: build-robot
  automountServiceAccountToken: false
  ...

The bottom line is that if an application does not need to access the K8s API, it should not have a token mounted. This also limits the number of service account tokens an attacker can access if the attacker manages to compromise any of the Kubernetes hosts. Okay, you might say, but how do we enforce this policy everywhere? Enter Kyverno, a policy engine designed for K8s.

Enforcement With Kyverno

Kyverno allows cluster administrators to manage, validate, mutate, and generate Kubernetes resources based on custom policies. To prevent the creation of long-lived service account tokens, one can define the following Kyverno policy:

YAML
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: deny-secret-service-account-token
spec:
  validationFailureAction: Enforce
  background: false
  rules:
    - name: check-service-account-token
      match:
        any:
        - resources:
            kinds:
              - Secret
      validate:
        cel:
          expressions:
            - message: "Long lived API tokens are not allowed"
              expression: >
                object.type != "kubernetes.io/service-account-token"

This policy ensures that only Secrets that are not of type kubernetes.io/service-account-token can be created, effectively blocking the creation of long-lived service account tokens!

Applying the Kyverno Policy

To apply this policy, you need to have Kyverno installed on your Kubernetes cluster (tutorial). Once Kyverno is running, you can apply the policy by saving the above YAML to a file and using kubectl to apply it:

kubectl apply -f deny-secret-service-account-token.yaml

After applying this policy, any attempt to create a Secret that is a service account token of the prohibited type will be denied, enforcing a safer token lifecycle management practice.

Wrap Up

In Kubernetes, managing the lifecycle and access of service account tokens is a critical aspect of cluster security.
By preferring short-lived tokens over long-lived ones and enforcing policies with tools like Kyverno, organizations can significantly reduce the risk of token-based security incidents. Stay vigilant, automate security practices, and ensure your Kubernetes environment remains robust against threats.
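As a practical footnote to the wrap-up above, recent kubectl versions (v1.24+) can request a short-lived token on demand through the TokenRequest API mentioned earlier. The command below is a minimal, hedged illustration; the service account name and duration are just examples:

kubectl create token build-robot --duration=10m

The returned token expires after the requested duration, so even if it leaks, the window for abuse is small compared to a long-lived Secret-based token.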
Here's how to use AI and API Logic Server to create complete running systems in minutes:

Use ChatGPT for Schema Automation: create a database schema from natural language.
Use open-source API Logic Server: create working software with one command.
App Automation: a multi-page, multi-table admin app.
API Automation: a JSON:API with CRUD for each table, plus filtering, sorting, optimistic locking, and pagination.
Customize the project with your IDE:
Logic Automation using rules: declare spreadsheet-like rules in Python for multi-table derivations and constraints - 40X more concise than code.
Use Python and standard libraries (Flask, SQLAlchemy) and debug in your IDE.
Iterate your project:
Revise your database design and logic.
Integrate with B2B partners and internal systems.

This process leverages your existing IT infrastructure: your IDE, GitHub, the cloud, your database… open source. Let's see how.

1. AI: Schema Automation

You can use an existing database or create a new one with ChatGPT or your database tools. Use ChatGPT to generate SQL commands for database creation:

Plain Text
Create a sqlite database for customers, orders, items and product
Hints: use autonum keys, allow nulls, Decimal types, foreign keys, no check constraints.
Include a notes field for orders.
Create a few rows of only customer and product data.
Enforce the Check Credit requirement:
Customer.Balance <= CreditLimit
Customer.Balance = Sum(Order.AmountTotal where date shipped is null)
Order.AmountTotal = Sum(Items.Amount)
Items.Amount = Quantity * UnitPrice
Store the Items.UnitPrice as a copy from Product.UnitPrice

Note the hint above. As we've heard, "AI requires adult supervision." The hint was required to get the desired SQL. Copy the generated SQL commands into a file, say, sample_ai.sql. Then, create the database:

sqlite3 sample_ai.sqlite < sample_ai.sql

2. API Logic Server: Create

Given a database (whether or not it's created from AI), API Logic Server creates an executable, customizable project with the following single command:

$ ApiLogicServer create --project_name=sample_ai --db_url=sqlite:///sample_ai.sqlite

This creates a project you can open with your IDE, such as VSCode (see below). The project is now ready to run; press F5. It reflects the automation provided by the create command:

API Automation: a self-serve API ready for UI developers, and
App Automation: an admin app ready for back-office data maintenance and business user collaboration.

Let's explore the App and API Automation from the create command.

App Automation

App Automation means that ApiLogicServer create builds a multi-page, multi-table admin app automatically. This does not consist of hundreds of lines of complex HTML and JavaScript; it's a simple YAML file that's easy to customize. It is ready for business user collaboration and back-office data maintenance... in minutes.

API Automation

API Automation means that ApiLogicServer create builds a JSON:API automatically. Your API provides an endpoint for each table, with related data access, pagination, optimistic locking, filtering, and sorting. It would take days to months to create such an API using frameworks. UI app developers can use the API to create custom apps immediately, using Swagger to design their API call and copying the URI into their JavaScript code. APIs are thus self-serve: no server coding is required. Custom app dev is unblocked: Day 1.

3. Customize

So, we have working software in minutes.
It's running, but we really can't deploy it until we have logic and security, which brings us to customization. Projects are designed for customization, using standards: Python, frameworks (e.g., Flask, SQLAlchemy), and your IDE for code editing and debugging. Not only Python code but also rules.

Logic Automation Logic Automation means that you can declare spreadsheet-like rules using Python. Such logic maintains database integrity with multi-table derivations, constraints, and security. Rules are 40X more concise than traditional code and can be extended with Python. Rules are an executable design. Use your IDE (code completion, etc.) to replace 200 lines of code with the five spreadsheet-like rules sketched at the end of this article. Note they map exactly to our natural language design.

1. Debugging Logic declarations are debugged like ordinary code: execution pauses at a breakpoint in the debugger, where we can examine the state and execute step by step. Note the logging for inserting an Item: each line represents a rule firing and shows the complete state of the row.

2. Chaining: Multi-Table Transaction Automation Note that it's a multi-table transaction, as indicated by the log indentation. This is because, like a spreadsheet, rules automatically chain, including across tables.

3. 40X More Concise The five spreadsheet-like rules represent the same logic as 200 lines of code, shown here. That's a remarkable 40X decrease in the backend half of the system.

4. Automatic Re-use The logic above, perhaps conceived for Place Order, applies automatically to all transactions: deleting an order, changing items, moving an order to a new customer, etc. This reduces code and promotes quality (no missed corner cases).

5. Automatic Optimizations SQL overhead is minimized by pruning and by eliminating expensive aggregate queries. These optimizations can have an orders-of-magnitude impact. This is because the rule engine is not based on a Rete algorithm but is highly optimized for transaction processing and integrated with the SQLAlchemy ORM (Object-Relational Mapper).

6. Transparent Rules are an executable design. Note they map exactly to our natural language design (shown in comments), readable by business users. This complements running screens to facilitate agile collaboration.

Security Automation Security Automation means you activate login-access security and declare grants (using Python) to control row access for user roles. Here, we filter out less active accounts for users with the sales role, granting access only to customers with a credit limit above 3000: Grant( on_entity = models.Customer, to_role = Roles.sales, filter = lambda : models.Customer.CreditLimit > 3000, filter_debug = "CreditLimit > 3000")

4. Iterate: Rules + Python So, we have completed our one-day project. The working screens and rules facilitate agile collaboration, which leads to agile iterations. Automation helps here, too: not only are spreadsheet-like rules 40X more concise, but they meaningfully simplify iterations and maintenance. Let's explore this with two changes. Requirement 1: Green Discounts Plain Text Give a 10% discount for carbon-neutral products for 10 items or more. Requirement 2: Application Integration Plain Text Send new Orders to Shipping using a Kafka message. Enable B2B partners to place orders with a custom API.

Revise Data Model In this example, a schema change was required to add the Product.CarbonNeutral column. This affects the ORM models, the API, etc. So, we want these updated but retain our customizations.
This is supported using the ApiLogicServer rebuild-from-database command, which updates existing projects to a revised schema while preserving customizations.

Iterate Logic: Add Python We revise our logic to apply the discount and send the Kafka message (see the sketch at the end of this article).

Extend API We can also extend our API with a new B2BOrder endpoint using standard Python and Flask. Note: Kafka is not activated in this example. To explore a runnable tutorial for application integration with Kafka, click here.

Notes on Iteration This illustrates some significant aspects of how logic supports iteration.

Maintenance Automation Along with perhaps documentation, one of the tasks programmers most loathe is maintenance. That's because it's not about writing code, but archaeology: deciphering code someone else wrote, just so you can add four or five lines that will hopefully be called and function correctly. Logic Automation changes that with Maintenance Automation, which means: Rules automatically order their execution (and optimizations) based on system-discovered dependencies. Rules are automatically reused for all relevant transactions. So, to alter logic, you just "drop a new rule in the bucket," and the system ensures it's called in the proper order and reused across all the relevant use cases.

Extensibility: With Python In the first case, we needed to do some if/else testing, and it was more convenient to add a dash of Python. While this is pretty simple Python, used as a 4GL, you have the full power of object-oriented Python and its many libraries. For example, our extended API leverages Flask and open-source libraries for Kafka messages.

Rebuild: Logic Preserved Recall we were able to iterate the schema and use the ApiLogicServer rebuild-from-database command. This updates the existing project, preserving customizations.

5. Deploy API Logic Server provides scripts to create Docker images from your project. You can deploy these to the cloud or your local server. For more information, see here.

Summary In minutes, you've used ChatGPT and API Logic Server to convert an idea into working software. It required only five rules and a few dozen lines of Python. The process is simple: Create the schema with ChatGPT. Create the project with ApiLogicServer: a self-serve API to unblock UI developers (Day 1) and an Admin app for business user collaboration (Day 1). Customize the project: with rules, 40X more concise than code, and with Python, for complete flexibility. Iterate the project in your IDE to implement new requirements; prior customizations are preserved. It all works with standard tooling: Python, your IDE, and container-based deployment. You can execute the steps in this article with the detailed tutorial: click here.
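For readers who want to see roughly what these declarations look like, here is a sketch of the five check-credit rules plus the Requirement 1 discount, written against the LogicBank-style Rule API that API Logic Server generates in a project's declare_logic.py. The model and column names (Customer.Balance, Order.AmountTotal, Item.Amount, Product.CarbonNeutral, and so on) follow the schema created earlier but are assumptions, not the article's exact code:

Python
# Illustrative sketch only -- names and signatures are assumptions, not the article's exact code
from decimal import Decimal
from logic_bank.logic_bank import Rule
from database import models  # SQLAlchemy models generated by ApiLogicServer


def declare_logic():
    # Check Credit: Customer.Balance <= CreditLimit
    Rule.constraint(validate=models.Customer,
                    as_condition=lambda row: row.Balance <= row.CreditLimit,
                    error_msg="balance ({row.Balance}) exceeds credit ({row.CreditLimit})")

    # Customer.Balance = Sum(Order.AmountTotal where date shipped is null)
    Rule.sum(derive=models.Customer.Balance,
             as_sum_of=models.Order.AmountTotal,
             where=lambda row: row.ShippedDate is None)

    # Order.AmountTotal = Sum(Item.Amount)
    Rule.sum(derive=models.Order.AmountTotal, as_sum_of=models.Item.Amount)

    # Item.UnitPrice is copied from Product.UnitPrice
    Rule.copy(derive=models.Item.UnitPrice, from_parent=models.Product.UnitPrice)

    # Item.Amount = Quantity * UnitPrice, extended with the Requirement 1 green discount
    def derive_amount(row: models.Item, old_row, logic_row):
        amount = row.Quantity * row.UnitPrice
        if row.Product.CarbonNeutral and row.Quantity >= 10:  # assumes an Item-to-Product relationship named Product
            amount = amount * Decimal("0.9")  # 10% discount for carbon-neutral products
        return amount

    Rule.formula(derive=models.Item.Amount, calling=derive_amount)

The point to notice is that the discount is just a small Python function plugged into a rule; the engine still decides when it fires and in what order, as described under Maintenance Automation above.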
This article describes how to implement a Raft Server consensus module in C++20 without using any additional libraries. The narrative is divided into three main sections: A comprehensive overview of the Raft algorithm A detailed account of the Raft Server's development A description of a custom coroutine-based network library The implementation makes use of the robust capabilities of C++20, particularly coroutines, to present an effective and modern methodology for building a critical component of distributed systems. This exposition not only demonstrates the practical application and benefits of C++20 coroutines in sophisticated programming environments, but it also provides an in-depth exploration of the challenges and resolutions encountered while building a consensus module from the ground up, such as Raft Server. The Raft Server and network library repositories, miniraft-cpp and coroio, are available for further exploration and practical applications. Introduction Before delving into the complexities of the Raft algorithm, let’s consider a real-world example. Our goal is to develop a network key-value storage (K/V) system. In C++, this can be easily accomplished by using an unordered_map<string, string>. However, in real-world applications, the requirement for a fault-tolerant storage system increases complexity. A seemingly simple approach could entail deploying three (or more) machines, each hosting a replica of this service. The expectation may be for users to manage data replication and consistency. However, this method can result in unpredictable behaviors. For example, it is possible to update data using a specific key and then retrieve an older version later. What users truly want is a distributed system, potentially spread across multiple machines, that runs as smoothly as a single-host system. To meet this requirement, a consensus module is typically placed in front of the K/V storage (or any similar service, hereafter referred to as the "state machine"). This configuration ensures that all user interactions with the state machine are routed exclusively through the consensus module, rather than direct access. With this context in mind, let us now look at how to implement such a consensus module, using the Raft algorithm as an example. Raft Overview In the Raft algorithm, there are an odd number of participants known as peers. Each peer keeps its own log of records. There is one peer leader, and the others are followers. Users direct all requests (reads and writes) to the leader. When a write request to change the state machine is received, the leader logs it first before forwarding it to the followers, who also log it. Once the majority of peers have successfully responded, the leader considers this entry to be committed, applies it to the state machine, and notifies the user of its success. The Term is a key concept in Raft, and it can only grow. The Term changes when there are system changes, such as a change in leadership. The log in Raft has a specific structure, with each entry consisting of a Term and a Payload. The term refers to the leader who wrote the initial entry. The Payload represents the changes to be made to the state machine. Raft guarantees that two entries with the same index and term are identical. Raft logs are not append-only and may be truncated. For example, in the scenario below, leader S1 replicated two entries before crashing. S2 took the lead and began replicating entries, and S1's log differed from those of S2 and S3. 
As a result, the last entry in the S1 log will be removed and replaced with a new one.

Raft RPC API Let us examine the Raft RPC. It's worth noting that the Raft API is quite simple, with just two calls. We'll begin by looking at the leader election API. It is important to note that Raft ensures that there can only be one leader per term. There may also be terms without a leader, for example, if an election fails. To ensure that only one election occurs, a peer saves its vote in a persistent variable called VotedFor. The election RPC is called RequestVote and has three parameters: Term, LastLogIndex, and LastLogTerm. The response contains Term and VoteGranted. Notably, every request contains Term, and in Raft, peers reject requests whose Term is older than their own. When a peer initiates an election, it sends a RequestVote request to the other peers and collects their votes. If the majority of the responses are positive, the peer advances to the leader role. Now let's look at the AppendEntries request. It accepts parameters such as Term, PrevLogIndex, PrevLogTerm, and Entries, and the response contains Term and Success. If the Entries field in the request is empty, it acts as a heartbeat. When an AppendEntries request is received, a follower checks the term of its log entry at PrevLogIndex. If it matches PrevLogTerm, the follower appends Entries to its log beginning at PrevLogIndex + 1 (any existing entries after PrevLogIndex are removed). If the terms do not match, the follower returns Success=false. In this case, the leader retries the request, decrementing PrevLogIndex by one. When a peer receives a RequestVote request, it compares the request's (LastLogTerm, LastLogIndex) pair with the term and index of its own most recent log entry. If its own pair is less than or equal to the requestor's, the peer returns VoteGranted=true.

State Transitions in Raft Raft's state transitions work as follows. Each peer begins in the Follower state. If a Follower does not receive AppendEntries within a set timeout, it increments its Term and moves to the Candidate state, triggering an election. A peer can move from the Candidate state to the Leader state if it wins the election, or return to the Follower state if it receives an AppendEntries request. A Candidate restarts the election, remaining a Candidate with a new Term, if it does not transition to either Follower or Leader within the timeout period. If a peer in any state receives an RPC request with a Term greater than its current one, it moves to the Follower state.

Commit Let us now consider an example that demonstrates how Raft is not as simple as it may appear. I took this example from Diego Ongaro's dissertation. S1 was the leader in Term 2, where it replicated two entries before crashing. Following this, S5 took the lead in Term 3, added an entry, and then crashed. Next, S2 took over leadership in Term 4, replicated the entry from Term 2, added its own entry for Term 4, and then crashed. This results in two possible outcomes: S5 reclaims leadership and truncates the entries from Term 2, or S1 regains leadership and commits the entries from Term 2. The entries from Term 2 are securely committed only after they are covered by a subsequent entry from a new leader. This example demonstrates how the Raft algorithm operates in a dynamic and often unpredictable set of circumstances. The sequence of events, which includes multiple leaders and crashes, demonstrates the complexity of maintaining a consistent state across a distributed system.
This complexity is not immediately apparent, but it becomes important in situations involving leader changes and system failures. The example emphasizes the importance of a robust and well-thought-out approach to dealing with such complexities, which is precisely what Raft seeks to address.

Additional Materials For further study and a deeper understanding of Raft, I recommend the following materials: the original Raft paper, which is ideal for implementation; Diego Ongaro's PhD dissertation, which provides more in-depth insights; and Maxim Babenko's lecture, which goes into even greater detail.

Raft Implementation Let's now move on to the Raft server implementation, which, in my opinion, benefits greatly from C++20 coroutines. In my implementation, the Persistent State is stored in memory. However, in real-world scenarios, it should be saved to disk. I'll talk more about the MessageHolder later. It functions similarly to a shared_ptr, but is specifically designed to handle Raft messages, ensuring efficient management and processing of these communications.

C++ struct TState { uint64_t CurrentTerm = 1; uint32_t VotedFor = 0; std::vector<TMessageHolder<TLogEntry>> Log; };

In the Volatile State, I labeled entries with L for "leader", F for "follower", or C for "candidate" to clarify their use. The CommitIndex denotes the last log entry that was committed. In contrast, LastApplied is the most recent log entry applied to the state machine, and it is always less than or equal to the CommitIndex. The NextIndex is important because it identifies the next log entry to be sent to a peer. Similarly, MatchIndex tracks the highest log entry known to be replicated on each peer. The Votes section contains the IDs of peers who voted for me. Timeouts are an important aspect to manage: HeartbeatDue and RpcDue manage leader timeouts, while ElectionDue handles follower timeouts.

C++ using TTime = std::chrono::time_point<std::chrono::steady_clock>; struct TVolatileState { uint64_t CommitIndex = 0; // L,F uint64_t LastApplied = 0; // L,F std::unordered_map<uint32_t, uint64_t> NextIndex; // L std::unordered_map<uint32_t, uint64_t> MatchIndex; // L std::unordered_set<uint32_t> Votes; // C std::unordered_map<uint32_t, TTime> HeartbeatDue; // L std::unordered_map<uint32_t, TTime> RpcDue; // L TTime ElectionDue; // F };

Raft API My implementation of the Raft algorithm has two classes. The first is INode, which denotes a peer. This class includes two methods: Send, which stores outgoing messages in an internal buffer, and Drain, which handles actual message dispatch. TRaft is the second class, and it manages the current peer's state. It also includes two methods: Process, which handles incoming messages, and ProcessTimeout, which must be called on a regular basis to manage timeouts, such as the leader election timeout. Users of these classes should use the Process, ProcessTimeout, and Drain methods as necessary. INode's Send method is invoked internally within the Raft class, ensuring that message handling and state management are seamlessly integrated within the Raft framework.

C++ struct INode { virtual ~INode() = default; virtual void Send(TMessageHolder<TMessage> message) = 0; virtual void Drain() = 0; }; class TRaft { public: TRaft(uint32_t node, const std::unordered_map<uint32_t, std::shared_ptr<INode>>& nodes); void Process(TTime now, TMessageHolder<TMessage> message, const std::shared_ptr<INode>& replyTo = {}); void ProcessTimeout(TTime now); };

Raft Messages Now let's look at how I send and read Raft messages.
Instead of using a serialization library, I read and send raw structures in TLV format. This is what the message header looks like: C++ struct TMessage { uint32_t Type; uint32_t Len; char Value[0]; }; For additional convenience, I've introduced a second-level header: C++ struct TMessageEx: public TMessage { uint32_t Src = 0; uint32_t Dst = 0; uint64_t Term = 0; }; This includes the sender's and receiver's ID in each message. With the exception of LogEntry, all messages inherit from TMessageEx. LogEntry and AppendEntries are implemented as follows: C++ struct TLogEntry: public TMessage { static constexpr EMessageType MessageType = EMessageType::LOG_ENTRY; uint64_t Term = 1; char Data[0]; }; struct TAppendEntriesRequest: public TMessageEx { static constexpr EMessageType MessageType = EMessageType::APPEND_ENTRIES_REQUEST; uint64_t PrevLogIndex = 0; uint64_t PrevLogTerm = 0; uint32_t Nentries = 0; }; To facilitate message handling, I use a class called MessageHolder, reminiscent of a shared_ptr: C++ template<typename T> requires std::derived_from<T, TMessage> struct TMessageHolder { T* Mes; std::shared_ptr<char[]> RawData; uint32_t PayloadSize; std::shared_ptr<TMessageHolder<TMessage>[]> Payload; template<typename U> requires std::derived_from<U, T> TMessageHolder<U> Cast() {...} template<typename U> requires std::derived_from<U, T> auto Maybe() { ... } }; This class includes a char array containing the message itself. It may also include a Payload (which is only used for AppendEntry), as well as methods for safely casting a base-type message to a specific one (the Maybe method) and unsafe casting (the Cast method). Here is a typical example of using the MessageHolder: C++ void SomeFunction(TMessageHolder<TMessage> message) { auto maybeAppendEntries = message.Maybe<TAppendEntriesRequest>(); if (maybeAppendEntries) { auto appendEntries = maybeAppendEntries.Cast(); } // if we are sure auto appendEntries = message.Cast<TAppendEntriesRequest>(); // usage with overloaded operator-> auto term = appendEntries->Term; auto nentries = appendEntries->Nentries; // ... } And a real-life example in the Candidate state handler: C++ void TRaft::Candidate(TTime now, TMessageHolder<TMessage> message) { if (auto maybeResponseVote = message.Maybe<TRequestVoteResponse>()) { OnRequestVote(std::move(maybeResponseVote.Cast())); } else if (auto maybeRequestVote = message.Maybe<TRequestVoteRequest>()) { OnRequestVote(now, std::move(maybeRequestVote.Cast())); } else if (auto maybeAppendEntries = message.Maybe<TAppendEntriesRequest>()) { OnAppendEntries(now, std::move(maybeAppendEntries.Cast())); } } This design approach improves the efficiency and flexibility of message handling in Raft implementations. Raft Server Let's discuss the Raft server implementation. The Raft server will set up coroutines for network interactions. First, we'll look at the coroutines that handle message reading and writing. The primitives used for these coroutines are discussed later in the article, along with an analysis of the network library. The writing coroutine is responsible for writing messages to the socket, whereas the reading coroutine is slightly more complex. To read, it must first retrieve the Type and Len variables, then allocate an array of Len bytes, and finally, read the rest of the message. This structure facilitates the efficient and effective management of network communications within the Raft server. 
C++ template<typename TSocket> TValueTask<void> TMessageWriter<TSocket>::Write(TMessageHolder<TMessage> message) { co_await TByteWriter(Socket).Write(message.Mes, message->Len); auto payload = std::move(message.Payload); for (uint32_t i = 0; i < message.PayloadSize; ++i) { co_await Write(std::move(payload[i])); } co_return; } template<typename TSocket> TValueTask<TMessageHolder<TMessage>> TMessageReader<TSocket>::Read() { decltype(TMessage::Type) type; decltype(TMessage::Len) len; auto s = co_await Socket.ReadSome(&type, sizeof(type)); if (s != sizeof(type)) { /* throw */ } s = co_await Socket.ReadSome(&len, sizeof(len)); if (s != sizeof(len)) { /* throw */} auto mes = NewHoldedMessage<TMessage>(type, len); co_await TByteReader(Socket).Read(mes->Value, len - sizeof(TMessage)); auto maybeAppendEntries = mes.Maybe<TAppendEntriesRequest>(); if (maybeAppendEntries) { auto appendEntries = maybeAppendEntries.Cast(); auto nentries = appendEntries->Nentries; mes.InitPayload(nentries); for (uint32_t i = 0; i < nentries; i++) mes.Payload[i] = co_await Read(); } co_return mes; } To launch a Raft server, create an instance of the RaftServer class and call the Serve method. The Serve method starts two coroutines. The Idle coroutine is responsible for periodically processing timeouts, whereas InboundServe manages incoming connections. C++ class TRaftServer { public: void Serve() { Idle(); InboundServe(); } private: TVoidTask InboundServe(); TVoidTask InboundConnection(TSocket socket); TVoidTask Idle(); } Incoming connections are received via the accept call. Following this, the InboundConnection coroutine is launched, which reads incoming messages and forwards them to the Raft instance for processing. This configuration ensures that the Raft server can efficiently handle both internal timeouts and external communication. C++ TVoidTask InboundServe() { while (true) { auto client = co_await Socket.Accept(); InboundConnection(std::move(client)); } co_return; } TVoidTask InboundConnection(TSocket socket) { while (true) { auto mes = co_await TMessageReader(client->Sock()).Read(); Raft->Process(std::chrono::steady_clock::now(), std::move(mes), client); Raft->ProcessTimeout(std::chrono::steady_clock::now()); DrainNodes(); } co_return; } The Idle coroutine works as follows: it calls the ProcessTimeout method every sleep second. It's worth noting that this coroutine uses asynchronous sleep. This design enables the Raft server to efficiently manage time-sensitive operations without blocking other processes, improving the server's overall responsiveness and performance. C++ while (true) { Raft->ProcessTimeout(std::chrono::steady_clock::now()); DrainNodes(); auto t1 = std::chrono::steady_clock::now(); if (t1 > t0 + dt) { DebugPrint(); t0 = t1; } co_await Poller.Sleep(t1 + sleep); } The coroutine was created for sending outgoing messages and is designed to be simple. It repeatedly sends all accumulated messages to the socket in a loop. In the event of an error, it starts another coroutine that is responsible for connecting (via the connect function). This structure ensures that outgoing messages are handled smoothly and efficiently while remaining robust through error handling and connection management. 
C++ try { while (!Messages.empty()) { auto tosend = std::move(Messages); Messages.clear(); for (auto&& m : tosend) { co_await TMessageWriter(Socket).Write(std::move(m)); } } } catch (const std::exception& ex) { Connect(); } co_return; With the Raft Server implemented, these examples show how coroutines greatly simplify development. While I haven't looked into Raft's implementation (trust me, it's much more complex than the Raft Server), the overall algorithm is not only simple but also compact in design. Next, we'll look at some Raft Server examples. Following that, I'll describe the network library I created from scratch specifically for the Raft Server. This library is critical to enabling efficient network communication within the Raft framework. Here's an example of launching a Raft cluster with three nodes. Each instance receives its own ID as an argument, as well as the other instances' addresses and IDs. In this case, the client communicates exclusively with the leader. It sends random strings while keeping a set number of in-flight messages and waiting for their commitment. This configuration depicts the interaction between the client and the leader in a multi-node Raft environment, demonstrating the algorithm's handling of distributed data and consensus. Shell $ ./server --id 1 --node 127.0.0.1:8001:1 --node 127.0.0.1:8002:2 --node 127.0.0.1:8003:3 ... Candidate, Term: 2, Index: 0, CommitIndex: 0, ... Leader, Term: 3, Index: 1080175, CommitIndex: 1080175, Delay: 2:0 3:0 MatchIndex: 2:1080175 3:1080175 NextIndex: 2:1080176 3:1080176 .... $ ./server --id 2 --node 127.0.0.1:8001:1 --node 127.0.0.1:8002:2 --node 127.0.0.1:8003:3 ... $ ./server --id 3 --node 127.0.0.1:8001:1 --node 127.0.0.1:8002:2 --node 127.0.0.1:8003:3 ... Follower, Term: 3, Index: 1080175, CommitIndex: 1080175, ... $ dd if=/dev/urandom | base64 | pv -l | ./client --node 127.0.0.1:8001:1 >log1 198k 0:00:03 [159.2k/s] [ <=> I measured the commit latency for configurations of both 3-node and 5-node clusters. As expected, the latency is higher for the 5-node setup: 3 Nodes 50 percentile (median): 292,872 ns 80 percentile: 407,561 ns 90 percentile: 569,164 ns 99 percentile: 40,279,001 ns 5 Nodes 50 percentile (median): 425,194 ns 80 percentile: 672,541 ns 90 percentile: 1,027,669 ns 99 percentile: 38,578,749 ns I/O Library Let's now look at the I/O library that I created from scratch and used in the Raft server's implementation. I began with the example below, taken from cppreference.com, which is an implementation of an echo server: C++ task<> tcp_echo_server() { char data[1024]; while (true) { std::size_t n = co_await socket.async_read_some(buffer(data)); co_await async_write(socket, buffer(data, n)); } } An event loop, a socket primitive, and methods like read_some/write_some (named ReadSome/WriteSome in my library) were required for my library, as well as higher-level wrappers such as async_write/async_read (named TByteReader/TByteWriter in my library). To implement the ReadSome method of the socket, I had to create an Awaitable as follows: C++ auto ReadSome(char* buf, size_t size) { struct TAwaitable { bool await_ready() { return false; /* always suspend */ } void await_suspend(std::coroutine_handle<> h) { poller->AddRead(fd, h); } int await_resume() { return read(fd, b, s); } TSelect* poller; int fd; char* b; size_t s; }; return TAwaitable{Poller_,Fd_,buf,size}; } When co_await is called, the coroutine suspends because await_ready returns false. 
In await_suspend, we capture the coroutine_handle and pass it along with the socket handle to the poller. When the socket is ready, the poller calls the coroutine_handle to restart the coroutine. Upon resumption, await_resume is called, which performs a read and returns the number of bytes read to the coroutine. The WriteSome, Accept, and Connect methods are implemented in a similar manner. The Poller is set up as follows: C++ struct TEvent { int Fd; int Type; // READ = 1, WRITE = 2; std::coroutine_handle<> Handle; }; class TSelect { void Poll() { for (const auto& ch : Events) { /* FD_SET(ReadFds); FD_SET(WriteFds);*/ } pselect(Size, ReadFds, WriteFds, nullptr, ts, nullptr); for (int k = 0; k < Size; ++k) { if (FD_ISSET(k, WriteFds)) { Events[k].Handle.resume(); } // ... } } std::vector<TEvent> Events; // ... }; I keep an array of pairs (socket descriptor, coroutine handle) that are used to initialize structures for the poller backend (in this case, select). Resume is called when the coroutines corresponding to ready sockets wake up. This is applied in the main function as follows: C++ TSimpleTask task(TSelect& poller) { TSocket socket(0, poller); char buffer[1024]; while (true) { auto readSize = co_await socket.ReadSome(buffer, sizeof(buffer)); } } int main() { TSelect poller; task(poller); while (true) { poller.Poll(); } } We start a coroutine (or coroutines) that enters sleep mode on co_await, and control is then passed to an infinite loop that invokes the poller mechanism. If a socket becomes ready within the poller, the corresponding coroutine is triggered and executed until the next co_await. To read and write Raft messages, I needed to create high-level wrappers over ReadSome/WriteSome, similar to: C++ TValueTask<T> Read() { T res; size_t size = sizeof(T); char* p = reinterpret_cast<char*>(&res); while (size != 0) { auto readSize = co_await Socket.ReadSome(p, size); p += readSize; size -= readSize; } co_return res; } // usage T t = co_await Read<T>(); To implement these, I needed to create a coroutine that also functions as an Awaitable. The coroutine is made up of a pair: coroutine_handle and promise. The coroutine_handle is used to manage the coroutine from the outside, whereas the promise is for internal management. The coroutine_handle can include Awaitable methods, which allow the coroutine's result to be awaited with co_await. The promise can be used to store the result returned by co_return and to awaken the calling coroutine. In coroutine_handle, within the await_suspend method, we store the coroutine_handle of the calling coroutine. Its value will be saved in the promise: C++ template<typename T> struct TValueTask : std::coroutine_handle<> { bool await_ready() { return !!this->promise().Value; } void await_suspend(std::coroutine_handle<> caller) { this->promise().Caller = caller; } T await_resume() { return *this->promise().Value; } using promise_type = TValuePromise<T>; }; Within the promise itself, the return_value method will store the returned value. The calling coroutine is woken up with an awaitable, which is returned in final_suspend. This is because the compiler, after co_return, invokes co_await on final_suspend. 
C++ template<typename T> struct TValuePromise { void return_value(const T& t) { Value = t; } std::suspend_never initial_suspend() { return {}; } // resume Caller here TFinalSuspendContinuation<T> final_suspend() noexcept; std::optional<T> Value; std::coroutine_handle<> Caller = std::noop_coroutine(); }; In await_suspend, the calling coroutine can be returned, and it will be automatically awakened. It is important to note that the called coroutine will now be in a sleeping state, and its coroutine_handle must be destroyed with destroy to avoid a memory leak. This can be accomplished, for example, in the destructor of TValueTask. C++ template<typename T> struct TFinalSuspendContinuation { bool await_ready() noexcept { return false; } std::coroutine_handle<> await_suspend( std::coroutine_handle<TValuePromise<T>> h) noexcept { return h.promise().Caller; } void await_resume() noexcept { } }; With the library description completed, I ported the libevent benchmark to it to ensure its performance. This benchmark generates a chain of N Unix pipes, each one linked to the next. It then initiates 100 write operations into the chain, which continues until there are 1000 total write calls. The image below depicts the benchmark's runtime as a function of N for various backends of my library (coroio) versus libevent. This test demonstrates that my library performs similarly to libevent, confirming its efficiency and effectiveness in managing I/O operations. Conclusion In closing, this article has described the implementation of a Raft server using C++20 coroutines, emphasizing the convenience and efficiency provided by this modern C++ feature. The custom I/O library, which was written from scratch, is critical to this implementation because it effectively handles asynchronous I/O operations. The performance of the library was validated against the libevent benchmark, demonstrating its competency. For those interested in learning more about or using these tools, the I/O library is available at coroio, and the Raft library at miniraft-cpp (linked at the beginning of the article). Both repositories provide a detailed look at how C++20 coroutines can be used to build robust, high-performance distributed systems.
The history of DevOps is worth reading about in a few good books. On that topic, "The Phoenix Project," self-characterized as "a novel of IT and DevOps," is often mentioned as a must-read. Yet for practitioners like myself, a more hands-on book is "The DevOps Handbook" (which shares Kim as an author, in addition to Debois, Willis, and Humble); it recounts some of the watershed moments in the evolution of software engineering and provides good references for implementation. This book describes how to replicate the transformation explained in The Phoenix Project and provides case studies. In this brief article, I will use my notes on this great book to give a concise history of DevOps, add my personal experience and opinion, and establish a link to Cloud Development Environments (CDEs), i.e., the practice of providing access to, and running, development environments online as a service for developers. In particular, I explain how the use of CDEs concludes the effort of bringing DevOps "fully online." Explaining the benefits of this shift in development practices, plus a few personal notes, is my main contribution in this brief article. Before clarifying the link between DevOps and CDEs, let's first dig into the chain of events and technical contributions that led to today's main methodology for delivering software.

The Agile Manifesto The Agile Manifesto, created in 2001, set forth values and principles in response to more cumbersome software development methodologies like Waterfall and the Rational Unified Process (RUP). One of the manifesto's core principles emphasizes the importance of delivering working software frequently, ranging from a few weeks to a couple of months, with a preference for shorter timescales. The Agile movement's influence expanded in 2008 during the Agile Conference in Toronto, where Andrew Shafer suggested applying Agile principles to IT infrastructure rather than just to the application code. This idea was further propelled by a 2009 presentation at the Velocity Conference, where a paper from Flickr demonstrated the impressive feat of "10 deployments a day" using Dev and Ops collaboration. Inspired by these developments, Patrick Debois organized the first DevOps Days in Belgium, effectively coining the term "DevOps." This marked a significant milestone in the evolution of software development and operational practices, blending Agile's swift adaptability with a more inclusive approach to the entire IT infrastructure.

The Three Ways of DevOps and the Principles of Flow All the concepts that I have discussed so far are today embodied in the "Three Ways of DevOps," i.e., the foundational principles that guide the practices and processes in DevOps. In brief, these principles focus on: Improving the flow of work (First Way), i.e., eliminating bottlenecks, reducing batch sizes, and accelerating the workflow from development to production; Amplifying feedback loops (Second Way), i.e., quickly and accurately collecting information about any issues or inefficiencies in the system; and Fostering a culture of continuous learning and experimentation (Third Way). Following the leads from Lean Manufacturing and Agile, it is easy to understand what led to the definition of these three principles. I delve more deeply into each of them in this conference presentation.
For the current discussion, though, i.e., how DevOps history leads to Cloud Development Environments, we just need to look at the First Way, the principle of flow, to understand the causative link. Chapter 9 of the DevOps Handbook explains that the technologies of version control and containerization are central to implementing DevOps flows and establishing a reliable and consistent development process. At the center of enabling the flow is the practice of incorporating all production artifacts into version control to serve as a single source of truth. This enables the recreation of the entire production environment in a repeatable and documented fashion. It ensures that production-like code development environments can be automatically generated and entirely self-serviced without requiring manual intervention from Operations. The significance of this approach becomes evident at release time, which is often the first time that an application's behavior is observed in a production-like setting, complete with realistic load and production data sets. To reduce the likelihood of issues, developers are encouraged to operate production-like environments on their workstations, created on-demand and self-serviced through mechanisms such as virtual images or containers, utilizing tools like Vagrant or Docker. Putting these environments under version control allows the entire pre-production and build processes to be recreated. Note that production-like environments really refer to environments that, in addition to having the same infrastructure and application configuration as the real production environments, also contain additional applications and layers necessary for development.

Figure: Developers are encouraged to operate production-like environments on their workstations, using mechanisms such as virtual images or containers, to reduce the likelihood of execution issues in production.

From Developer Workstations to a CDE Platform The notion of self-service is already emphasized in the DevOps Handbook as a key enabler of the principle of flow. Using 2016 technology, this is realized by downloading environments to the developers' workstations from a registry (such as DockerHub) that provides pre-configured, production-like environments as files (dubbed infrastructure-as-code). Docker is often the tool used to implement this function. Starting from this operation, developers create an application, in effect, as follows: (1) they access and copy files with development environment information to their machines, (2) add source code to it in local storage, and (3) build the application locally using their workstation's computing resources. This is illustrated in the left part of the figure below. Once the application works correctly, the source code is sent ("pushed") to a central code repository, and the application is built and deployed online, i.e., using cloud-based resources and applications such as CI/CD pipelines. The three development steps listed above are, in effect, the only operations, in addition to the authoring of source code using an IDE, that are "local," i.e., they use the workstation's physical storage and computing resources. All the rest of the DevOps operations are performed using web-based applications, consumed as-a-service by developers and operators (even when these applications are self-hosted by the organization). The basic goal of Cloud Development Environments is to move these development steps online as well.
To do that, CDE platforms, in essence, provide the following basic services, illustrated in the right part of the figure below: Manage development environments online as containers or virtual machines, such that developers can access them fully built and configured, substituting step (1) above; then provide a mechanism for authoring source code online, i.e., inside the development environment using an IDE or a terminal, substituting step (2); and finally provide a way to execute build commands inside the development environment (via the IDE or terminal), substituting step (3).

Figure: (left) The classic development data flow requires the use of local workstation resources. (right) The cloud development data flow replaces local storage and computing while keeping a similar developer experience. On each side, the operations are (1) accessing environment information, (2) adding code, and (3) building the application.

Note that the replacement of step (2) can be done in several ways. For example, the IDE can be browser-based (aka a Cloud IDE), or a locally installed IDE can implement a way to remotely author the code in the remote environment. It is also possible to use a console text editor, such as vim, via a terminal. I cannot conclude this discussion without mentioning that multiple containerized environments are often used for testing on the workstation, in particular in combination with the main containerized development environment. Hence, cloud IDE platforms need to reproduce the capability to run containerized environments inside the Cloud Development Environment (itself a containerized environment). If this recursion is a bit complicated to grasp, don't worry; we have reached the end of the discussion and can move to the conclusion.

What Comes Out of Using Cloud Development Environments in DevOps A good way to conclude this discussion is to summarize the benefits of moving development environments from the developers' workstations online using CDEs. As a result, the use of CDEs for DevOps leads to the following advantages: Streamlined Workflow: CDEs enhance the workflow by removing data from the developer's workstation and decoupling the hardware from the development process. This ensures the development environment is consistent and not limited by local hardware constraints. Environment Definition: With CDEs, version control becomes more robust, as it can unify not only the environment definition but also all the tools attached to the workflow, leading to a standardized development process and consistency across teams throughout the organization. Centralized Environments: The self-service aspect is improved by centralizing the production, maintenance, and evolution of environments based on distributed development activities. This allows developers to quickly access and manage their environments without manual work from Operations. Asset Utilization: Migrating the consumption of computing resources from local hardware to centralized and shared cloud resources not only lightens the load on local machines but also leads to more efficient use of organizational resources and potential cost savings. Improved Collaboration: Ubiquitous access to development environments, secured by embedded security measures in the access mechanisms, allows organizations to cater to a diverse group of developers, including internal, external, and temporary workers, fostering collaboration across various teams and geographies.
Scalability and Flexibility: CDEs offer scalable cloud resources that can be adjusted to project demands, facilitating the management of multiple containerized environments for testing and development and thus supporting the distributed nature of modern software development teams. Enhanced Security and Observability: Centralizing development environments in the cloud not only improves security (more about secure CDEs) but also provides immediate observability due to their online nature, allowing for real-time monitoring and management of development activities. By integrating these aspects, CDEs become a solution for modern, and in particular cloud-native, software development, and they align with the DevOps principles of improving flow, feedback, and continuous learning. In an upcoming article, I will discuss the contributions of CDEs across all three ways of DevOps. In the meantime, you're welcome to share your feedback with me.
Last year, I wrote a post on OpenTelemetry Tracing to understand more about the subject. I also created a demo around it, which featured the following components: the Apache APISIX API Gateway, a Kotlin/Spring Boot service, a Python/Flask service, and a Rust/Axum service. I've recently improved the demo to deepen my understanding and want to share my learning.

Using a Regular Database In the initial demo, I didn't bother with a regular database. Instead: the Kotlin service used the embedded Java H2 database, the Python service used the embedded SQLite, and the Rust service used hard-coded data in a hash map. I replaced all of them with a regular PostgreSQL database, with a dedicated schema for each. The OpenTelemetry agent added a new span when connecting to the database on the JVM and in Python. For the JVM, it's automatic when one uses the Java agent. In Python, one needs to install the relevant package (see the next section).

OpenTelemetry Integrations in Python Libraries Python requires you to explicitly add the package that instruments a specific library for OpenTelemetry. For example, the demo uses Flask; hence, we should add the Flask integration package. However, it can become a pretty tedious process. Yet, once you've installed opentelemetry-distro, you can "sniff" installed packages and install the relevant integration: Shell pip install opentelemetry-distro opentelemetry-bootstrap -a install For the demo, it installs the following: Plain Text opentelemetry_instrumentation-0.41b0.dist-info opentelemetry_instrumentation_aws_lambda-0.41b0.dist-info opentelemetry_instrumentation_dbapi-0.41b0.dist-info opentelemetry_instrumentation_flask-0.41b0.dist-info opentelemetry_instrumentation_grpc-0.41b0.dist-info opentelemetry_instrumentation_jinja2-0.41b0.dist-info opentelemetry_instrumentation_logging-0.41b0.dist-info opentelemetry_instrumentation_requests-0.41b0.dist-info opentelemetry_instrumentation_sqlalchemy-0.41b0.dist-info opentelemetry_instrumentation_sqlite3-0.41b0.dist-info opentelemetry_instrumentation_urllib-0.41b0.dist-info opentelemetry_instrumentation_urllib3-0.41b0.dist-info opentelemetry_instrumentation_wsgi-0.41b0.dist-info The above setup adds a new automated trace for database connections.

Gunicorn on Flask Every time I started the Flask service, it showed a warning in red that it shouldn't be used in production. While it's unrelated to OpenTelemetry, and though nobody complained, I was not too fond of it. For this reason, I added a "real" HTTP server. I chose Gunicorn, for no other reason than that my knowledge of the Python ecosystem is still shallow. The server is a runtime concern. We only need to change the Dockerfile slightly: Dockerfile RUN pip install gunicorn ENTRYPOINT ["opentelemetry-instrument", "gunicorn", "-b", "0.0.0.0", "-w", "4", "app:app"] The -b option refers to binding; you can attach to a specific IP. Since I'm running Docker, I don't know the IP, so I bind to any. The -w option specifies the number of workers. Finally, the app:app argument sets the module and the application, separated by a colon. Gunicorn usage doesn't impact OpenTelemetry integrations.

Heredocs for the Win You may benefit from this if you write a lot of Dockerfiles. Every Docker layer has a storage cost. Hence, inside a Dockerfile, one tends to avoid unnecessary layers. For example, the two following snippets yield the same results.
Dockerfile RUN pip install pip-tools RUN pip-compile RUN pip install -r requirements.txt RUN pip install gunicorn RUN opentelemetry-bootstrap -a install RUN pip install pip-tools \ && pip-compile \ && pip install -r requirements.txt \ && pip install gunicorn \ && opentelemetry-bootstrap -a install The first snippet creates five layers, while the second creates only one; however, the first is more readable than the second. With heredocs, we get a more readable syntax that still creates a single layer: Dockerfile RUN <<EOF pip install pip-tools pip-compile pip install -r requirements.txt pip install gunicorn opentelemetry-bootstrap -a install EOF Heredocs are a great way to write more readable and more optimized Dockerfiles. Try them!

Explicit API Call on the JVM In the initial demo, I showed two approaches: the first uses auto-instrumentation, which requires no additional action; the second uses manual instrumentation with Spring annotations. In the improved version, I wanted to demo an explicit call with the API. The use case is analytics and uses a message queue: I get the trace data from the HTTP call and create a message with such data so the subscriber can use it as a parent. First, we need to add the OpenTelemetry API dependency to the project. We inherit the version from the Spring Boot Starter parent POM: XML <dependency> <groupId>io.opentelemetry</groupId> <artifactId>opentelemetry-api</artifactId> </dependency> At this point, we can access the API. OpenTelemetry offers a static method to get an instance: Kotlin val otel = GlobalOpenTelemetry.get() At runtime, the agent will work its magic to return the instance. Here's a simplified class diagram focused on tracing: In turn, the flow goes something like this: Kotlin val otel = GlobalOpenTelemetry.get() //1 val tracer = otel.tracerBuilder("ch.frankel.catalog").build() //2 val span = tracer.spanBuilder("AnalyticsFilter.filter") //3 .setParent(Context.current()) //4 .startSpan() //5 // Do something here span.end() //6 1. Get the underlying OpenTelemetry instance. 2. Get the tracer builder and "build" the tracer. 3. Get the span builder. 4. Add the span to the whole chain. 5. Start the span. 6. End the span; after this step, the data is sent to the configured OpenTelemetry endpoint.

Adding a Message Queue When I did the talk based on the post, attendees frequently asked whether OpenTelemetry would work with messages such as MQ or Kafka. While I thought it was the case in theory, I wanted to make sure of it: I added a message queue in the demo under the pretense of analytics. The Kotlin service will publish a message to an MQTT topic on each request. A NodeJS service will subscribe to the topic.

Attaching OpenTelemetry Data to the Message So far, OpenTelemetry automatically reads the context to find out the trace ID and the parent span ID. Whatever the approach, auto-instrumentation or manual, annotations-based or explicit, the library takes care of it. I didn't find any existing similar automation for messaging; we need to code our way in. The gist of OpenTelemetry context propagation is the traceparent HTTP header. We need to read it and send it along with the message. First, let's add the MQTT API to the project: XML <dependency> <groupId>org.eclipse.paho</groupId> <artifactId>org.eclipse.paho.mqttv5.client</artifactId> <version>1.2.5</version> </dependency> Interestingly enough, the API doesn't allow access to the traceparent directly. However, we can reconstruct it via the SpanContext class. I'm using MQTT v5 for my message broker.
Note that v5 allows metadata to be attached to the message; when using v3, the message itself needs to wrap it. Kotlin val spanContext = span.spanContext //1 val message = MqttMessage().apply { properties = MqttProperties().apply { val traceparent = "00-${spanContext.traceId}-${spanContext.spanId}-${spanContext.traceFlags}" //2 userProperties = listOf(UserProperty("traceparent", traceparent)) //3 } qos = options.qos isRetained = options.retained val hostAddress = req.remoteAddress().map { it.address.hostAddress }.getOrNull() payload = Json.encodeToString(Payload(req.path(), hostAddress)).toByteArray() //4 } val client = MqttClient(mqtt.serverUri, mqtt.clientId) //5 client.publish(mqtt.options, message) //6 1. Get the span context. 2. Construct the traceparent from the span context, according to the W3C Trace Context specification. 3. Set the message metadata. 4. Set the message body. 5. Create the client. 6. Publish the message.

Getting OpenTelemetry Data From the Message The subscriber is a new component based on NodeJS. First, we configure the app to use the OpenTelemetry trace exporter: JavaScript const sdk = new NodeSDK({ resource: new Resource({[SemanticResourceAttributes.SERVICE_NAME]: 'analytics'}), traceExporter: new OTLPTraceExporter({ url: `${collectorUri}/v1/traces` }) }) sdk.start() The next step is to read the metadata, recreate the context from the traceparent, and create a span. JavaScript client.on('message', (aTopic, payload, packet) => { if (aTopic === topic) { console.log('Received new message') const data = JSON.parse(payload.toString()) const userProperties = {} if (packet.properties['userProperties']) { //1 const props = packet.properties['userProperties'] for (const key of Object.keys(props)) { userProperties[key] = props[key] } } const activeContext = propagation.extract(context.active(), userProperties) //2 const tracer = trace.getTracer('analytics') const span = tracer.startSpan( //3 'Read message', {attributes: {path: data['path'], clientIp: data['clientIp']}}, activeContext, ) span.end() //4 } }) 1. Read the metadata. 2. Recreate the context from the traceparent. 3. Create the span. 4. End the span. For the record, I tried to migrate to TypeScript, but when I did, I didn't receive the message. Help or hints are very welcome!

Apache APISIX for Messaging Though it's not common knowledge, Apache APISIX can proxy HTTP calls as well as UDP and TCP messages. It only offers a few plugins for this at the moment, but it will add more in the future; an OpenTelemetry one will surely be part of it. In the meantime, let's prepare for it. The first step is to configure Apache APISIX to allow both HTTP and TCP: YAML apisix: proxy_mode: http&stream #1 stream_proxy: tcp: - addr: 9100 #2 tls: false 1. Configure APISIX for both modes. 2. Set the TCP port. The next step is to configure TCP routing: YAML upstreams: - id: 4 nodes: "mosquitto:1883": 1 #1 stream_routes: #2 - id: 1 upstream_id: 4 plugins: mqtt-proxy: #3 protocol_name: MQTT protocol_level: 5 #4 1. Define the MQTT queue as the upstream. 2. Define the "streaming" route; APISIX treats everything that's not HTTP as streaming. 3. Use the MQTT proxy. Note that APISIX also offers a Kafka-based one. 4. Set the MQTT version; for versions above 3, it should be 5. Finally, we can replace the MQTT URLs in the Docker Compose file with APISIX URLs.

Conclusion I've described several items I added to improve my OpenTelemetry demo in this post. While most are indeed related to OpenTelemetry, some of them aren't. I may add another component in yet another stack: a front-end.
The complete source code for this post can be found on GitHub.
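As a closing aside, the Flask and SQLAlchemy instrumentation that opentelemetry-instrument wires up automatically can also be configured explicitly in code, which is handy when you want full control over the exporter. Here is a minimal sketch, assuming the instrumentation packages installed earlier and an OTLP/HTTP collector reachable at a placeholder URL; it is an illustration, not the demo's actual setup:

Python
from flask import Flask
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

# Send spans to an OTLP/HTTP collector; the endpoint below is a placeholder
provider = TracerProvider(resource=Resource.create({"service.name": "flask-service"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://collector:4318/v1/traces")))
trace.set_tracer_provider(provider)

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)  # one server span per incoming HTTP request
SQLAlchemyInstrumentor().instrument()    # spans for database calls made through SQLAlchemy

@app.route("/health")
def health():
    return {"status": "ok"}

The agent-based approach described above needs none of this code, which is precisely its appeal; the explicit version simply shows what the integration packages do on your behalf.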