Database Systems
This data-forward, analytics-driven world would be lost without its database and data storage solutions. As more organizations continue to transition their software to cloud-based systems, demand for database innovation and enhancements has climbed to new heights. We have entered a new era of the "Modern Database," where databases must both store data and ensure that data is prepped and primed securely for insights and analytics, integrity and quality, and microservices and cloud-based architectures. In our 2023 Database Systems Trend Report, we explore these database trends, assess current strategies and challenges, and provide forward-looking assessments of the database technologies most commonly used today. Further, readers will find insightful articles — written by several of our very own DZone Community experts — that cover hand-selected topics, including what "good" database design is, database monitoring and observability, and how to navigate the realm of cloud databases.
In part one of this two-part series, we looked at how walletless dApps smooth the web3 user experience by abstracting away the complexities of blockchains and wallets. Thanks to account abstraction from Flow and the Flow Wallet API, we can easily build walletless dApps that enable users to sign up with credentials that they're accustomed to using (such as social logins or email accounts). We began our walkthrough by building the backend of our walletless dApp. Here in part two, we'll wrap up our walkthrough by building the front end. Here we go! Create a New Next.js Application Let's use the Next.js framework so we have the frontend and backend in one application. On our local machine, we will use create-next-app to bootstrap our application. This will create a new folder for our Next.js application. We run the following command: Shell $ npx create-next-app flow_walletless_app Some options will appear; you can mark them as follows (or as you prefer!). Make sure to choose No for using Tailwind CSS and the App Router. This way, your folder structure and style references will match what I demo in the rest of this tutorial. Shell ✔ Would you like to use TypeScript with this project? ... Yes ✔ Would you like to use ESLint with this project? ... No ✔ Would you like to use Tailwind CSS with this project? ... No <-- IMPORTANT ✔ Would you like to use `src/` directory with this project? ... No ✔ Use App Router (recommended)? ... No <-- IMPORTANT ✔ Would you like to customize the default import alias? ... No Start the development server. Shell $ npm run dev The application will run on port 3001 because the default port (3000) is occupied by our wallet API running through Docker. Set Up Prisma for Backend User Management We will use the Prisma library as an ORM to manage our database. When a user logs in, we store their information in the database as a User entity. This contains the user's email, token, Flow address, and other information. The first step is to install the Prisma dependencies in our Next.js project: Shell $ npm install prisma --save-dev To use Prisma, we need to initialize the Prisma Client. Run the following command: Shell $ npx prisma init The above command will create two files: prisma/schema.prisma: The main Prisma configuration file, which will host the database configuration .env: Will contain the database connection URL and other environment variables Configure the Database Used by Prisma We will use SQLite as the database for our Next.js application. Open the schema.prisma file and change the datasource db settings as follows: Shell datasource db { provider = "sqlite" url = env("DATABASE_URL") } Then, in our .env file for the Next.js application, we will change the DATABASE_URL field. Because we're using SQLite, we need to define the location (which, for SQLite, is a file) where the database will be stored in our application: Shell DATABASE_URL="file:./dev.db" Create a User Model Models represent entities in our app. The model describes how the data should be stored in our database. Prisma takes care of creating tables and fields. Let's add the following User model in our schema.prisma file: Shell model User { id Int @id @default(autoincrement()) email String @unique name String? flowWalletJobId String? flowWalletAddress String? createdAt DateTime @default(now()) updatedAt DateTime @updatedAt } With our model created, we need to synchronize with the database.
For this, Prisma offers a command: Shell $ npx prisma db push Environment variables loaded from .env Prisma schema loaded from prisma/schema.prisma Datasource "db": SQLite database "dev.db" at "file:./dev.db" SQLite database dev.db created at file:./dev.db -> Your database is now in sync with your Prisma schema. Done in 15ms After successfully pushing our users table, we can use Prisma Studio to track our database data. Run the command: Shell $ npx prisma studio Set up the Prisma Client That's it! Our entity and database configuration are complete. Now let's go to the client side. We need to install the Prisma client dependencies in our Next.js app. To do this, run the following command: Shell $ npm install @prisma/client Generate the client from the Prisma schema file: Shell $ npx prisma generate Create a folder named lib in the root folder of your project. Within that folder, create a file named prisma.ts. This file will host the client connection. Paste the following code into that file: TypeScript // lib/prisma.ts import { PrismaClient } from '@prisma/client'; let prisma: PrismaClient; if (process.env.NODE_ENV === "production") { prisma = new PrismaClient(); } else { let globalWithPrisma = global as typeof globalThis & { prisma: PrismaClient; }; if (!globalWithPrisma.prisma) { globalWithPrisma.prisma = new PrismaClient(); } prisma = globalWithPrisma.prisma; } export default prisma; Build the Next.js Application Frontend Functionality With our connection on the client part finalized, we can move on to the visual part of our app! Replace the contents of the pages/index.tsx file with the following code: TypeScript // pages/index.tsx import styles from "@/styles/Home.module.css"; import { Inter } from "next/font/google"; import Head from "next/head"; const inter = Inter({ subsets: ["latin"] }); export default function Home() { return ( <> <Head> <title>Create Next App</title> <meta name="description" content="Generated by create next app" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <link rel="icon" href="/favicon.ico" /> </Head> <main className={styles.main}> <div className={styles.card}> <h1 className={inter.className}>Welcome to Flow Walletless App!</h1> <div style={{ display: "flex", flexDirection: "column", gap: "20px", margin: "20px", }}> <button style={{ padding: "20px", width: 'auto' }}>Sign Up</button> <button style={{ padding: "20px" }}>Sign Out</button> </div> </div> </main> </> ); } This gives us the basics we need to illustrate the creation of wallets and accounts! The next step is to configure the Google client to use the Google API to authenticate users. Set up Use of Google OAuth for Authentication We will need Google credentials. For that, open your Google console. Click Create Credentials and select the OAuth Client ID option. Choose Web Application as the application type and define a name for it. We will use the same name: flow_walletless_app. Add http://localhost:3001/api/auth/callback/google as the authorized redirect URI. Click on the Create button. A modal should appear with the Google credentials. We will need the Client ID and Client secret to use in our .env file shortly. Next, we’ll add the next-auth package.
To do this, run the following command: Shell $ npm i next-auth Open the .env file and add the following new environment variables to it: Shell GOOGLE_CLIENT_ID= <GOOGLE CLIENT ID> GOOGLE_CLIENT_SECRET=<GOOGLE CLIENT SECRET> NEXTAUTH_URL=http://localhost:3001 NEXTAUTH_SECRET=<YOUR NEXTAUTH SECRET> Paste in your copied Google Client ID and Client Secret. The NextAuth secret can be generated via the terminal with the following command: Shell $ openssl rand -base64 32 Copy the result, which should be a random string of letters, numbers, and symbols. Use this as your value for NEXTAUTH_SECRET in the .env file. Configure NextAuth to Use Google Next.js allows you to create serverless API routes without creating a full backend server. Each file under api is treated like an endpoint. Inside the pages/api/ folder, create a new folder called auth. Then create a file in that folder, called [...nextauth].ts, and add the code below: TypeScript // pages/api/auth/[...nextauth].ts import NextAuth from "next-auth" import GoogleProvider from "next-auth/providers/google"; export default NextAuth({ providers: [ GoogleProvider({ clientId: process.env.GOOGLE_CLIENT_ID as string, clientSecret: process.env.GOOGLE_CLIENT_SECRET as string, }) ], }) Update _app.tsx file to use NextAuth SessionProvider Modify the _app.tsx file found inside the pages folder by adding the SessionProvider from the NextAuth library. Your file should look like this: TypeScript // pages/_app.tsx import "@/styles/globals.css"; import { SessionProvider } from "next-auth/react"; import type { AppProps } from "next/app"; export default function App({ Component, pageProps }: AppProps) { return ( <SessionProvider session={pageProps.session}> <Component {...pageProps} /> </SessionProvider> ); } Update the Main Page To Use NextAuth Functions Let us go back to our index.tsx file in the pages folder. We need to import the functions from the NextAuth library and use them to log users in and out. Our updated index.tsx file should look like this: TypeScript // pages/index.tsx import styles from "@/styles/Home.module.css"; import { Inter } from "next/font/google"; import Head from "next/head"; import { useSession, signIn, signOut } from "next-auth/react"; const inter = Inter({ subsets: ["latin"] }); export default function Home() { const { data: session } = useSession(); console.log("session data", session); const signInWithGoogle = () => { signIn(); }; const signOutWithGoogle = () => { signOut(); }; return ( <> <Head> <title>Create Next App</title> <meta name="description" content="Generated by create next app" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <link rel="icon" href="/favicon.ico" /> </Head> <main className={styles.main}> <div className={styles.card}> <h1 className={inter.className}>Welcome to Flow Walletless App!</h1> <div style={{ display: "flex", flexDirection: "column", gap: "20px", margin: "20px", }}> <button onClick={signInWithGoogle} style={{ padding: "20px", width: "auto" }}>Sign Up</button> <button onClick={signOutWithGoogle} style={{ padding: "20px" }}>Sign Out</button> </div> </div> </main> </> ); } Build the “Create User” Endpoint Let us now create a users folder underneath pages/api. Inside this new folder, create a file called index.ts.
This file is responsible for: Creating a user (first we check if this user already exists) Calling the Wallet API to create a wallet for this user Calling the Wallet API to retrieve the job status (using the stored jobId) if the User entity does not yet have an address These actions are performed within the handle function, which calls the checkWallet function. Paste the following snippet into your index.ts file: TypeScript // pages/api/users/index.ts import { User } from "@prisma/client"; import { BaseNextRequest, BaseNextResponse } from "next/dist/server/base-http"; import prisma from "../../../lib/prisma"; export default async function handle( req: BaseNextRequest, res: BaseNextResponse ) { const userEmail = JSON.parse(req.body).email; const userName = JSON.parse(req.body).name; try { const user = await prisma.user.findFirst({ where: { email: userEmail, }, }); if (user == null) { await prisma.user.create({ data: { email: userEmail, name: userName, flowWalletAddress: null, flowWalletJobId: null, }, }); } else { await checkWallet(user); } } catch (e) { console.log(e); } } const checkWallet = async (user: User) => { const jobId = user.flowWalletJobId; const address = user.flowWalletAddress; if (address != null) { return; } if (jobId != null) { const request: any = await fetch(`http://localhost:3000/v1/jobs/${jobId}`, { method: "GET", }); const jsonData = await request.json(); if (jsonData.state === "COMPLETE") { const address = jsonData.result; await prisma.user.update({ where: { id: user.id, }, data: { flowWalletAddress: address, }, }); return; } if (jsonData.state === "FAILED") { const request: any = await fetch("http://localhost:3000/v1/accounts", { method: "POST", }); const jsonData = await request.json(); await prisma.user.update({ where: { id: user.id, }, data: { flowWalletJobId: jsonData.jobId, }, }); return; } } if (jobId == null) { const request: any = await fetch("http://localhost:3000/v1/accounts", { method: "POST", }); const jsonData = await request.json(); await prisma.user.update({ where: { id: user.id, }, data: { flowWalletJobId: jsonData.jobId, }, }); return; } }; POST requests to the api/users path will result in calling the handle function. We’ll get to that shortly, but first, we need to create another endpoint for retrieving existing user information. Build the “Get User” Endpoint We’ll create another file in the pages/api/users folder, called getUser.ts. This file is responsible for finding a user in our database based on their email. Copy the following snippet and paste it into getUser.ts: TypeScript // pages/api/users/getUser.ts import prisma from "../../../lib/prisma"; export default async function handle( req: { query: { email: string; }; }, res: any ) { try { const { email } = req.query; const user = await prisma.user.findFirst({ where: { email: email, }, }); return res.json(user); } catch (e) { console.log(e); } } And that's it! With these two files in the pages/api/users folder, we are ready for our Next.js application frontend to make calls to our backend. Add “Create User” and “Get User” Functions to Main Page Now, let’s go back to the pages/index.tsx file to add the new functions that will make the requests to the backend.
Replace the contents of the index.tsx file with the following snippet: TypeScript // pages/index.tsx import styles from "@/styles/Home.module.css"; import { Inter } from "next/font/google"; import Head from "next/head"; import { useSession, signIn, signOut } from "next-auth/react"; import { useEffect, useState } from "react"; import { User } from "@prisma/client"; const inter = Inter({ subsets: ["latin"] }); export default function Home() { const { data: session } = useSession(); const [user, setUser] = useState<User | null>(null); const signInWithGoogle = () => { signIn(); }; const signOutWithGoogle = () => { signOut(); }; const getUser = async () => { const response = await fetch( `/api/users/getUser?email=${session?.user?.email}`, { method: "GET", } ); const data = await response.json(); setUser(data); return data?.flowWalletAddress != null ? true : false; }; console.log(user) const createUser = async () => { await fetch("/api/users", { method: "POST", body: JSON.stringify({ email: session?.user?.email, name: session?.user?.name }), }); }; useEffect(() => { if (session) { getUser(); createUser(); } }, [session]); return ( <> <Head> <title>Create Next App</title> <meta name="description" content="Generated by create next app" /> <meta name="viewport" content="width=device-width, initial-scale=1" /> <link rel="icon" href="/favicon.ico" /> </Head> <main className={styles.main}> <div className={styles.card}> <h1 className={inter.className}>Welcome to Flow Walletless App!</h1> <div style={{ display: "flex", flexDirection: "column", gap: "20px", margin: "20px", }}> {user ? ( <div> <h5 className={inter.className}>User Name: {user.name}</h5> <h5 className={inter.className}>User Email: {user.email}</h5> <h5 className={inter.className}>Flow Wallet Address: {user.flowWalletAddress ? user.flowWalletAddress : 'Creating address...'}</h5> </div> ) : ( <button onClick={signInWithGoogle} style={{ padding: "20px", width: "auto" }} > Sign Up </button> )} <button onClick={signOutWithGoogle} style={{ padding: "20px" }}> Sign Out </button> </div> </div> </main> </> ); } We have added two functions: getUser searches the database for a user matching the logged-in email. createUser creates a user or updates it if it does not have an address yet. We also added a useEffect that checks whether the user is logged in with their Google account. If so, the getUser function is called, returning true if the user exists and already has a wallet address, and the createUser function is called, which makes the necessary checks and calls. Test Our Next.js Application Finally, we restart our Next.js application with the following command: Shell $ npm run dev You can now sign in with your Google account, and the app will make the necessary calls to our wallet API to create a Flow Testnet address! This is the first step in the walletless Flow process! By following these instructions, your app will create users and accounts in a way that is convenient for the end user. But the wallet API does not stop there. You can do much more with it, such as execute and sign transactions, run scripts to fetch data from the blockchain, and more. Conclusion Account abstraction and walletless onboarding in Flow offer developers a unique solution. By being able to delegate control over accounts, Flow allows developers to create applications that provide users with a seamless onboarding experience. This will hopefully lead to greater adoption of dApps and a new wave of web3 users.
The JVM is an excellent platform for monkey-patching. Monkey patching is a technique used to dynamically update the behavior of a piece of code at run-time. A monkey patch (also spelled monkey-patch, MonkeyPatch) is a way to extend or modify the runtime code of dynamic languages (e.g. Smalltalk, JavaScript, Objective-C, Ruby, Perl, Python, Groovy, etc.) without altering the original source code. — Wikipedia I want to demo several approaches for monkey-patching in Java in this post. As an example, I'll use a sample for-loop. Imagine we have a class and a method. We want to call the method multiple times without doing it explicitly. The Decorator Design Pattern While the Decorator Design Pattern is not monkey-patching, it's an excellent introduction to it anyway. Decorator is a structural pattern described in the foundational book, Design Patterns: Elements of Reusable Object-Oriented Software. The decorator pattern is a design pattern that allows behavior to be added to an individual object, dynamically, without affecting the behavior of other objects from the same class. — Decorator pattern Our use-case is a Logger interface with a dedicated console implementation: We can implement it in Java like this: Java public interface Logger { void log(String message); } public class ConsoleLogger implements Logger { @Override public void log(String message) { System.out.println(message); } } Here's a simple, configurable decorator implementation: Java public class RepeatingDecorator implements Logger { //1 private final Logger logger; //2 private final int times; //3 public RepeatingDecorator(Logger logger, int times) { this.logger = logger; this.times = times; } @Override public void log(String message) { for (int i = 0; i < times; i++) { //4 logger.log(message); } } } Must implement the interface Underlying logger Loop configuration Call the method as many times as necessary Using the decorator is straightforward: Java var logger = new ConsoleLogger(); var threeTimesLogger = new RepeatingDecorator(logger, 3); threeTimesLogger.log("Hello world!"); The Java Proxy The Java Proxy is a generic decorator that allows attaching dynamic behavior: Proxy provides static methods for creating objects that act like instances of interfaces but allow for customized method invocation. — Proxy Javadoc The Spring Framework uses Java Proxies a lot. It's the case of the @Transactional annotation. If you annotate a method, Spring creates a Java Proxy around the encasing class at runtime. When you call it, Spring calls the proxy instead. Depending on the configuration, it opens the transaction or joins an existing one, then calls the actual method, and finally commits (or rollbacks). 
The API is simple; we can write the following handler: Java public class RepeatingInvocationHandler implements InvocationHandler { private final Logger logger; //1 private final int times; //2 public RepeatingInvocationHandler(Logger logger, int times) { this.logger = logger; this.times = times; } @Override public Object invoke(Object proxy, Method method, Object[] args) throws Exception { if (method.getName().equals("log") && args.length == 1 && args[0] instanceof String) { //3 for (int i = 0; i < times; i++) { method.invoke(logger, args[0]); //4 } } return null; } } Underlying logger Loop configuration Check that every requirement is upheld Call the initial method on the underlying logger Here's how to create the proxy: Java var logger = new ConsoleLogger(); var proxy = (Logger) Proxy.newProxyInstance( //1-2 Main.class.getClassLoader(), new Class[]{Logger.class}, //3 new RepeatingInvocationHandler(logger, 3)); //4 proxy.log("Hello world!"); Create the Proxy object. We must cast to Logger, as the API was created before generics and it returns an Object. Array of interfaces the object needs to conform to. Pass our handler. Instrumentation Instrumentation is the capability of the JVM to transform bytecode before it loads it via a Java agent. Two Java agent flavors are available: static, with the agent passed on the command line when you launch the application; and dynamic, which allows connecting to a running JVM and attaching an agent to it via the Attach API. Note that it represents a huge security issue and has been drastically limited in the latest JDK. The Instrumentation API's surface is limited: the API exposes the user to low-level bytecode manipulation via byte arrays. It would be unwieldy to do it directly. Hence, real-life projects rely on bytecode manipulation libraries. ASM has been the traditional library for this, but it seems that Byte Buddy has superseded it. Note that Byte Buddy uses ASM but provides a higher-level abstraction. The Byte Buddy API is outside the scope of this blog post, so let's dive directly into the code: Java public class Repeater { public static void premain(String arguments, Instrumentation instrumentation) { //1 var withRepeatAnnotation = isAnnotatedWith(named("ch.frankel.blog.instrumentation.Repeat")); //2 new AgentBuilder.Default() //3 .type(declaresMethod(withRepeatAnnotation)) //4 .transform((builder, typeDescription, classLoader, module, domain) -> builder //5 .method(withRepeatAnnotation) //6 .intercept( //7 SuperMethodCall.INSTANCE //8 .andThen(SuperMethodCall.INSTANCE) .andThen(SuperMethodCall.INSTANCE)) ).installOn(instrumentation); //3 } } Required signature; it's similar to the main method, with the added Instrumentation argument Create a matcher for elements annotated with the @Repeat annotation. The DSL reads fluently even if you don't know it (I don't). Byte Buddy provides a builder to create the Java agent Match all types that declare a method with the @Repeat annotation Transform the class accordingly Transform methods annotated with @Repeat Replace the original implementation with the following Call the original implementation three times The next step is to create the Java agent package. A Java agent is a regular JAR with specific manifest attributes.
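For completeness, the @Repeat annotation that the matcher above looks for (and that the AspectJ aspect further down reads through repeat.times()) is not shown in the original snippets. A minimal declaration that would satisfy both usages could look like the following sketch; the runtime retention and the default value of times() are assumptions, not taken from the original project:
Java
// Hypothetical declaration of the @Repeat annotation referenced by the agent and the aspect.
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

@Retention(RetentionPolicy.RUNTIME) // assumed: runtime retention so the tooling can see it on loaded classes
@Target(ElementType.METHOD)
public @interface Repeat {
    int times() default 3; // read by the AspectJ aspect via repeat.times(); the default is an assumption
}
A method would then simply be annotated with @Repeat(times = 3) to opt into the repetition behavior.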
Let's configure Maven to build the agent: XML <plugin> <artifactId>maven-assembly-plugin</artifactId> <!--1--> <configuration> <descriptorRefs> <descriptorRef>jar-with-dependencies</descriptorRef> <!--2--> </descriptorRefs> <archive> <manifestEntries> <Premain-Class>ch.frankel.blog.instrumentation.Repeater</Premain-Class> <!--3--> </manifestEntries> </archive> </configuration> <executions> <execution> <goals> <goal>single</goal> </goals> <phase>package</phase> <!--4--> </execution> </executions> </plugin> Use the Maven Assembly plugin. Create a JAR containing all dependencies. Set the Premain-Class manifest entry to the agent class. Bind the execution to the package phase. Testing is more involved, as we need two different codebases, one for the agent and one for the regular code with the annotation. Let's create the agent first: Shell mvn install We can then run the app with the agent: Shell java -javaagent:/Users/nico/.m2/repository/ch/frankel/blog/agent/1.0-SNAPSHOT/agent-1.0-SNAPSHOT-jar-with-dependencies.jar \ #1 -cp ./target/classes #2 ch.frankel.blog.instrumentation.Main #3 Run Java with the agent created in the previous step. The JVM will run the premain method of the class configured in the agent. Configure the classpath. Set the main class. Aspect-Oriented Programming The idea behind AOP is to apply some code across different unrelated object hierarchies - cross-cutting concerns. It's a valuable technique in languages that don't allow traits: code you can graft onto third-party objects/classes. Fun fact: I learned about AOP before Proxy. AOP relies on two main concepts: an aspect is the transformation applied to code, while a pointcut matches where the aspect applies. In Java, AOP's historical implementation is the excellent AspectJ library. AspectJ provides two approaches, known as weaving: build-time weaving, which transforms the compiled bytecode, and runtime weaving, which relies on the above instrumentation. Either way, AspectJ uses a specific format for aspects and pointcuts. Before Java 5, the format looked like Java but not quite; for example, it used the aspect keyword. With Java 5, one can use annotations in regular Java code to achieve the same goal. We need an AspectJ dependency: XML <dependency> <groupId>org.aspectj</groupId> <artifactId>aspectjrt</artifactId> <version>1.9.19</version> </dependency> Like Byte Buddy, AspectJ also uses ASM underneath. Here's the code: Java @Aspect //1 public class RepeatingAspect { @Pointcut("@annotation(repeat) && call(* *(..))") //2 public void callAt(Repeat repeat) {} //3 @Around("callAt(repeat)") //4 public Object around(ProceedingJoinPoint pjp, Repeat repeat) throws Throwable { //5 for (int i = 0; i < repeat.times(); i++) { //6 pjp.proceed(); //7 } return null; } } Mark this class as an aspect Define the pointcut; every call to a method annotated with @Repeat Bind the @Repeat annotation to the repeat name used in the pointcut expression above Define the aspect applied to the call site; it's an @Around, meaning that we need to call the original method explicitly The signature uses a ProceedingJoinPoint, which references the original method, as well as the @Repeat annotation Loop over as many times as configured Call the original method At this point, we need to weave the aspect. Let's do it at build-time.
For this, we can add the AspectJ build plugin: XML <plugin> <groupId>org.codehaus.mojo</groupId> <artifactId>aspectj-maven-plugin</artifactId> <executions> <execution> <goals> <goal>compile</goal> <!--1--> </goals> </execution> </executions> </plugin> Bind execution of the plugin to the compile phase To see the demo in effect: Shell mvn compile exec:java -Dexec.mainClass=ch.frankel.blog.aop.Main Java Compiler Plugin Last, it's possible to change the generated bytecode via a Java compiler plugin, introduced in Java 6 as JSR 269. From a bird's eye view, plugins involve hooking into the Java compiler to manipulate the AST in three phases: parse the source code into multiple ASTs, analyze further into Element, and potentially generate source code. The documentation could be less sparse. I found the following Awesome Java Annotation Processing. Here's a simplified class diagram to get you started: I'm too lazy to implement the same as above with such a low-level API. As the expression goes, this is left as an exercise to the reader. If you are interested, I believe the DocLint source code is a good starting point. Conclusion I described several approaches to monkey-patching in Java in this post: the Proxy class, instrumentation via a Java Agent, AOP via AspectJ, and javac compiler plugins. To choose one over the other, consider the following criteria: build-time vs. runtime, complexity, native vs. third-party, and security concerns. To Go Further Monkey patch Guide to Java Instrumentation Byte Buddy Creating a Java Compiler Plugin Awesome Java Annotation Processing Maven AspectJ plugin
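For readers who want a concrete starting point for the Java compiler plugin approach described above, the public entry point is the com.sun.source.util.Plugin interface. The sketch below is only a skeleton that hooks into the compilation lifecycle on a recent JDK; it performs no AST rewriting, and the plugin name and event kind chosen here are illustrative assumptions rather than anything from the original post:
Java
// Minimal javac plugin skeleton. It would be registered through a
// META-INF/services/com.sun.source.util.Plugin file and activated with -Xplugin:RepeatPlugin.
import com.sun.source.util.JavacTask;
import com.sun.source.util.Plugin;
import com.sun.source.util.TaskEvent;
import com.sun.source.util.TaskListener;

public class RepeatPlugin implements Plugin {

    @Override
    public String getName() {
        return "RepeatPlugin"; // the name used after -Xplugin:
    }

    @Override
    public void init(JavacTask task, String... args) {
        task.addTaskListener(new TaskListener() {
            @Override
            public void finished(TaskEvent event) {
                if (event.getKind() == TaskEvent.Kind.PARSE) {
                    // The parsed compilation unit is available via event.getCompilationUnit();
                    // rewriting it to repeat @Repeat-annotated calls would require the
                    // internal com.sun.tools.javac.tree API, which is out of scope here.
                }
            }
        });
    }
}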
This is an article from DZone's 2023 Database Systems Trend Report.For more: Read the Report Good database design is essential to ensure data accuracy, consistency, and integrity and that databases are efficient, reliable, and easy to use. The design must address the storing and retrieving of data quickly and easily while handling large volumes of data in a stable way. An experienced database designer can create a robust, scalable, and secure database architecture that meets the needs of modern data systems. Architecture and Design A modern data architecture for microservices and cloud-native applications involves multiple layers, and each one has its own set of components and preferred technologies. Typically, the foundational layer is constructed as a storage layer, encompassing one or more databases such as SQL, NoSQL, or NewSQL. This layer assumes responsibility for the storage, retrieval, and management of data, including tasks like indexing, querying, and transaction management. To enhance this architecture, it is advantageous to design a data access layer that resides above the storage layer but below the service layer. This data access layer leverages technologies like object-relational mapping or data access objects to simplify data retrieval and manipulation. Finally, at the topmost layer lies the presentation layer, where the information is skillfully presented to the end user. The effective transmission of data through the layers of an application, culminating in its presentation as meaningful information to users, is of utmost importance in a modern data architecture. The goal here is to design a scalable database with the ability to handle a high volume of traffic and data while minimizing downtime and performance issues. By following best practices and addressing a few challenges, we can meet the needs of today's modern data architecture for different applications. Figure 1: Layered architecture Considerations By taking into account the following considerations when designing a database for enterprise-level usage, it is possible to create a robust and efficient system that meets the specific needs of the organization while ensuring data integrity, availability, security, and scalability. One important consideration is the data that will be stored in the database. This involves assessing the format, size, complexity, and relationships between data entities. Different types of data may require specific storage structures and data models. For instance, transactional data often fits well with a relational database model, while unstructured data like images or videos may require a NoSQL database model. The frequency of data retrieval or access plays a significant role in determining the design considerations. In read-heavy systems, implementing a cache for frequently accessed data can enhance query response times. Conversely, the emphasis may be on lower data retrieval frequencies for data warehouse scenarios. Techniques such as indexing, caching, and partitioning can be employed to optimize query performance. Ensuring the availability of the database is crucial for maintaining optimal application performance. Techniques such as replication, load balancing, and failover are commonly used to achieve high availability. Additionally, having a robust disaster recovery plan in place adds an extra layer of protection to the overall database system. As data volumes grow, it is essential that the database system can handle increased loads without compromising performance. 
Employing techniques like partitioning, sharding, and clustering allows for effective scalability within a database system. These approaches enable the efficient distribution of data and workload across multiple servers or nodes. Data security is a critical consideration in modern database design, given the rising prevalence of fraud and data breaches. Implementing robust access controls, encryption mechanisms for sensitive personally identifiable information, and conducting regular audits are vital for enhancing the security of a database system. In transaction-heavy systems, maintaining consistency in transactional data is paramount. Many databases provide features such as appropriate locking mechanisms and transaction isolation levels to ensure data integrity and consistency. These features help to prevent issues like concurrent data modifications and inconsistencies. Challenges Determining the most suitable tool or technology for our database needs can be a challenge due to the rapid growth and evolving nature of the database landscape. With different types of databases emerging daily and even variations among vendors offering the same type, it is crucial to plan carefully based on your specific use cases and requirements. By thoroughly understanding our needs and researching the available options, we can identify the right tool with the appropriate features to meet our database needs effectively. Polyglot persistence is a consideration that arises from the demand of certain applications, leading to the use of multiple SQL or NoSQL databases. Selecting the right databases for transactional systems, ensuring data consistency, handling financial data, and accommodating high data volumes pose challenges. Careful consideration is necessary to choose the appropriate databases that can fulfill the specific requirements of each aspect while maintaining overall system integrity. Integrating data from different upstream systems, each with its own structure and volume, presents a significant challenge. The goal is to achieve a single source of truth by harmonizing and integrating the data effectively. This process requires comprehensive planning to ensure compatibility and future-proofing the integration solution to accommodate potential changes and updates. Performance is an ongoing concern in both applications and database systems. Every addition to the database system can potentially impact performance. To address performance issues, it is essential to follow best practices when adding, managing, and purging data, as well as properly indexing, partitioning, and implementing encryption techniques. By employing these practices, you can mitigate performance bottlenecks and optimize the overall performance of your database system. Considering these factors will contribute to making informed decisions and designing an efficient and effective database system for your specific requirements. Advice for Building Your Architecture Goals for a better database design should include efficiency, scalability, security, and compliance. In the table below, each goal is accompanied by its corresponding industry expectation, highlighting the key aspects that should be considered when designing a database for optimal performance, scalability, security, and compliance. GOALS FOR DATABASE DESIGN Goal Industry Expectation Efficiency Optimal performance and responsiveness of the database system, minimizing latency and maximizing throughput. Efficient handling of data operations, queries, and transactions. 
Scalability Ability to handle increasing data volumes, user loads, and concurrent transactions without sacrificing performance. Scalable architecture that allows for horizontal or vertical scaling to accommodate growth. Security Robust security measures to protect against unauthorized access, data breaches, and other security threats. Implementation of access controls, encryption, auditing mechanisms, and adherence to industry best practices and compliance regulations. Compliance Adherence to relevant industry regulations, standards, and legal requirements. Ensuring data privacy, confidentiality, and integrity. Implementing data governance practices and maintaining audit trails to demonstrate compliance. Table 1 When building your database architecture, it's important to consider several key factors to ensure the design is effective and meets your specific needs. Start by clearly defining the system's purpose, data types, volume, access patterns, and performance expectations. Consider clear requirements that provide clarity on the data to be stored and the relationships between the data entities. This will help ensure that the database design aligns with quality standards and conforms to your requirements. Also consider normalization, which enables efficient storage use by minimizing redundant data, improves data integrity by enforcing consistency and reliability, and facilitates easier maintenance and updates. Selecting the right database model or opting for polyglot persistence support is crucial to ensure the database aligns with your specific needs. This decision should be based on the requirements of your application and the data it handles. Planning for future growth is essential to accommodate increasing demand. Consider scalability options that allow your database to handle growing data volumes and user loads without sacrificing performance. Alongside growth, prioritize data protection by implementing industry-standard security recommendations and ensuring appropriate access levels are in place and encourage implementing IT security measures to protect the database from unauthorized access, data theft, and security threats. A good back-up system is a testament to the efficiency of a well-designed database. Regular backups and data synchronization, both on-site and off-site, provide protection against data loss or corruption, safeguarding your valuable information. To validate the effectiveness of your database design, test the model using sample data from real-world scenarios. This testing process will help validate the performance, reliability, and functionality of the database system you are using, ensuring it meets your expectations. Good documentation practices play a vital role in improving feedback systems and validating thought processes and implementations during the design and review phases. Continuously improving documentation will aid in future maintenance, troubleshooting, and system enhancement efforts. Primary and secondary keys contribute to data integrity and consistency. Use indexes to optimize database performance by indexing frequently queried fields and limiting the number of fields returned in queries. Regularly backing up the database protects against data loss during corruption, system failure, or other unforeseen circumstances. Data archiving and purging practices help remove infrequently accessed data, reducing the size of the active dataset. Proper error handling and logging aid in debugging, troubleshooting, and system maintenance. 
Regular maintenance is crucial for growing database systems. Plan and schedule regular backups, perform performance tuning, and stay up to date with software upgrades to ensure optimal database performance and stability. Conclusion Designing a modern data architecture that can handle the growing demands of today's digital world is not an easy job. However, if you follow best practices and take advantage of the latest technologies and techniques, it is very much possible to build a scalable, flexible, and secure database. It just requires the right mindset and your commitment to learning and improving with a proper feedback loop. Additional reading: Semantic Modeling for Data: Avoiding Pitfalls and Breaking Dilemmas by Panos Alexopoulos Learn PostgreSQL: Build and manage high-performance database solutions using PostgreSQL 12 and 13 by Luca Ferrari and Enrico Pirozzi Designing Data-Intensive Applications by Martin Kleppmann
In this article, we delve into the exciting realm of containerizing Helidon applications, followed by deploying them effortlessly to a Kubernetes environment. To achieve this, we'll harness the power of JKube’s Kubernetes Maven Plugin, a versatile tool for Java applications for Kubernetes deployments that has recently been updated to version 1.14.0. What's exciting about this release is that it now supports the Helidon framework, a Java Microservices gem open-sourced by Oracle in 2018. If you're curious about Helidon, we've got some blog posts to get you up to speed: Building Microservices With Oracle Helidon Ultra-Fast Microservices: When MicroStream Meets Helidon Helidon: 2x Productivity With Microprofile REST Client In this article, we will closely examine the integration between JKube’s Kubernetes Maven Plugin and Helidon. Here's a sneak peek of the exciting journey we'll embark on: We'll kick things off by generating a Maven application from Helidon Starter Transform your Helidon application into a nifty Docker image. Craft Kubernetes YAML manifests tailored for your Helidon application. Apply those manifests to your Kubernetes cluster. We'll bundle those Kubernetes YAML manifests into a Helm Chart. We'll top it off by pushing that Helm Chart to a Helm registry. Finally, we'll deploy our Helidon application to Red Hat OpenShift. An exciting aspect worth noting is that JKube’s Kubernetes Maven Plugin can be employed with previous versions of Helidon projects as well. The only requirement is to provide your custom image configuration. With this latest release, Helidon users can now easily generate opinionated container images. Furthermore, the plugin intelligently detects project dependencies and seamlessly incorporates Kubernetes health checks into the generated manifests, streamlining the deployment process. Setting up the Project You can either use an existing Helidon project or create a new one from Helidon Starter. If you’re on JDK 17 use 3.x version of Helidon. Otherwise, you can stick to Helidon 2.6.x which works with older versions of Java. In the starter form, you can choose either Helidon SE or Helidon Microprofile, choose application type, and fill out basic details like project groupId, version, and artifactId. Once you’ve set your project, you can add JKube’s Kubernetes Maven Plugin to your pom.xml: XML <plugin> <groupId>org.eclipse.jkube</groupId> <artifactId>kubernetes-maven-plugin</artifactId> <version>1.14.0</version> </plugin> Also, the plugin version is set to 1.14.0, which is the latest version at the time of writing. You can check for the latest version on the Eclipse JKube releases page. It’s not really required to add the plugin if you want to execute it directly from some CI pipeline. You can just provide a fully qualified name of JKube’s Kubernetes Maven Plugin while issuing some goals like this: Shell $ mvn org.eclipse.jkube:kubernetes-maven-plugin:1.14.0:resource Now that we’ve added the plugin to the project, we can start using it. Creating Container Image (JVM Mode) In order to build a container image, you do not need to provide any sort of configuration. First, you need to build your project. Shell $ mvn clean install Then, you just need to run k8s:build goal of JKube’s Kubernetes Maven Plugin. By default, it builds the image using the Docker build strategy, which requires access to a Docker daemon. 
If you have access to a docker daemon, run this command: Shell $ mvn k8s:build If you don’t have access to any docker daemon, you can also build the image using the Jib build strategy: Shell $ mvn k8s:build -Djkube.build.strategy=jib You will notice that Eclipse JKube has created an opinionated container image for your application based on your project configuration. Here are some key points about JKube’s Kubernetes Maven Plugin to observe in this zero configuration mode: It used quay.io/jkube/jkube-java as a base image for the container image It added some labels to the container image (picked from pom.xml) It exposed some ports in the container image based on the project configuration It automatically copied relevant artifacts and libraries required to execute the jar in the container environment. Creating Container Image (Native Mode) In order to create a container image for the native executable, we need to generate the native executable first. In order to do that, let’s build our project in the native-image profile (as specified in Helidon GraalVM Native Image documentation): Shell $ mvn package -Pnative-image This creates a native executable file in the target folder of your project. In order to create a container image based on this executable, we just need to run k8s:build goal but also specify native-image profile: Shell $ mvn k8s:build -Pnative-image Like JVM mode, Eclipse JKube creates an opinionated container image but uses a lightweight base image: registry.access.redhat.com/ubi8/ubi-minimal and exposes only the required ports by application. Customizing Container Image as per Requirements Creating a container image with no configuration is a really nice way to get started. However, it might not suit everyone’s use case. Let’s take a look at how to configure various aspects of the generated container image. You can override basic aspects of the container image with some properties like this: Property Name Description jkube.generator.name Change Image Name jkube.generator.from Change Base Image jkube.generator.tags A comma-separated value of additional tags for the image If you want more control, you can provide a complete XML configuration for the image in the plugin configuration section: XML <plugin> <groupId>org.eclipse.jkube</groupId> <artifactId>kubernetes-maven-plugin</artifactId> <version>${jkube.version}</version> <configuration> <images> <image> <name>${project.artifactId}:${project.version}</name> <build> <from>openjdk:11-jre-slim</from> <ports>8080</ports> <assembly> <mode>dir</mode> <targetDir>/deployments</targetDir> <layers> <layer> <id>lib</id> <fileSets> <fileSet> <directory>${project.basedir}/target/libs</directory> <outputDirectory>libs</outputDirectory> <fileMode>0640</fileMode> </fileSet> </fileSets> </layer> <layer> <id>app</id> <files> <file> <source>${project.basedir}/target/${project.artifactId}.jar</source> <outputDirectory>.</outputDirectory> </file> </files> </layer> </layers> </assembly> <cmd>java -jar /deployments/${project.artifactId}.jar</cmd> </build> </image> </images> </configuration> </plugin> The same is also possible by providing your own Dockerfile in the project base directory. 
Kubernetes Maven Plugin automatically detects it and builds a container image based on its content: Dockerfile FROM openjdk:11-jre-slim COPY maven/target/helidon-quickstart-se.jar /deployments/ COPY maven/target/libs /deployments/libs CMD ["java", "-jar", "/deployments/helidon-quickstart-se.jar"] EXPOSE 8080 Pushing the Container Image to Quay.io: Once you’ve built a container image, you most likely want to push it to some public or private container registry. Before pushing the image, make sure you’ve renamed your image to include the registry name and registry user. If I want to push an image to Quay.io in the namespace of a user named rokumar, this is how I would need to rename my image: Shell $ mvn k8s:build -Djkube.generator.name=quay.io/rokumar/%a:%v %a and %v correspond to project artifactId and project version. For more information, you can check the Kubernetes Maven Plugin Image Configuration documentation. Once we’ve built an image with the correct name, the next step is to provide credentials for our registry to JKube’s Kubernetes Maven Plugin. We can provide registry credentials via the following sources: Docker login Local Maven Settings file (~/.m2/settings.xml) Provide it inline using jkube.docker.username and jkube.docker.password properties Once you’ve configured your registry credentials, you can issue the k8s:push goal to push the image to your specified registry: Shell $ mvn k8s:push Generating Kubernetes Manifests In order to generate opinionated Kubernetes manifests, you can use k8s:resource goal from JKube’s Kubernetes Maven Plugin: Shell $ mvn k8s:resource It generates Kubernetes YAML manifests in the target directory: Shell $ ls target/classes/META-INF/jkube/kubernetes helidon-quickstart-se-deployment.yml helidon-quickstart-se-service.yml JKube’s Kubernetes Maven Plugin automatically detects if the project contains io.helidon:helidon-health dependency and adds liveness, readiness, and startup probes: YAML $ cat target/classes/META-INF/jkube/kubernetes//helidon-quickstart-se-deployment. yml | grep -A8 Probe livenessProbe: failureThreshold: 3 httpGet: path: /health/live port: 8080 scheme: HTTP initialDelaySeconds: 0 periodSeconds: 10 successThreshold: 1 -- readinessProbe: failureThreshold: 3 httpGet: path: /health/ready port: 8080 scheme: HTTP initialDelaySeconds: 0 periodSeconds: 10 successThreshold: 1 Applying Kubernetes Manifests JKube’s Kubernetes Maven Plugin provides k8s:apply goal that is equivalent to kubectl apply command. It just applies the resources generated by k8s:resource in the previous step. Shell $ mvn k8s:apply Packaging Helm Charts Helm has established itself as the de facto package manager for Kubernetes. You can package generated manifests into a Helm Chart and apply it on some other cluster using Helm CLI. You can generate a Helm Chart of generated manifests using k8s:helm goal. The interesting thing is that JKube’s Kubernetes Maven Plugin doesn’t rely on Helm CLI for generating the chart. Shell $ mvn k8s:helm You’d notice Helm Chart is generated in target/jkube/helm/ directory: Shell $ ls target/jkube/helm/helidon-quickstart-se/kubernetes Chart.yaml helidon-quickstart-se-0.0.1-SNAPSHOT.tar.gz README.md templates values.yaml Pushing Helm Charts to Helm Registries Usually, after generating a Helm Chart locally, you would want to push it to some Helm registry. JKube’s Kubernetes Maven Plugin provides k8s:helm-push goal for achieving this task. 
But first, we need to provide registry details in plugin configuration: XML <plugin> <groupId>org.eclipse.jkube</groupId> <artifactId>kubernetes-maven-plugin</artifactId> <version>1.14.0</version> <configuration> <helm> <snapshotRepository> <name>ChartMuseum</name> <url>http://example.com/api/charts</url> <type>CHARTMUSEUM</type> <username>user1</username> </snapshotRepository> </helm> </configuration> </plugin> JKube’s Kubernetes Maven Plugin supports pushing Helm Charts to ChartMuseum, Nexus, Artifactory, and OCI registries. You have to provide the applicable Helm repository type and URL. You can provide the credentials via environment variables, properties, or ~/.m2/settings.xml. Once you’ve all set up, you can run k8s:helm-push goal to push chart: Shell $ mvn k8s:helm-push -Djkube.helm.snapshotRepository.password=yourpassword Deploying To Red Hat OpenShift If you’re deploying to Red Hat OpenShift, you can use JKube’s OpenShift Maven Plugin to deploy your Helidon application to an OpenShift cluster. It contains some add-ons specific to OpenShift like S2I build strategy, support for Routes, etc. You also need to add the JKube’s OpenShift Maven Plugin plugin to your pom.xml. Maybe you can add it in a separate profile: XML <profile> <id>openshift</id> <build> <plugins> <plugin> <groupId>org.eclipse.jkube</groupId> <artifactId>openshift-maven-plugin</artifactId> <version>${jkube.version}</version> </plugin> </plugins> </build> </profile> Then, you can deploy the application with a combination of these goals: Shell $ mvn oc:build oc:resource oc:apply -Popenshift Conclusion In this article, you learned how smoothly you can deploy your Helidon applications to Kubernetes using Eclipse JKube’s Kubernetes Maven Plugin. We saw how effortless it is to package your Helidon application into a container image and publish it to some container image registry. We can alternatively generate Helm Charts of our Kubernetes YAML manifests and publish Helm Charts to some Helm registry. In the end, we learned about JKube’s OpenShift Maven Plugin, which is specifically designed for Red Hat OpenShift users who want to deploy their Helidon applications to Red Hat OpenShift. You can find the code used in this blog post in this GitHub repository. In case you’re interested in knowing more about Eclipse JKube, you can check these links: Documentation Github Issue Tracker StackOverflow YouTube Channel Twitter Gitter Chat
Agile estimation plays a pivotal role in Agile project management, enabling teams to gauge the effort, time, and resources necessary to accomplish their tasks. Precise estimations empower teams to efficiently plan their work, manage expectations, and make well-informed decisions throughout the project's duration. In this article, we delve into various Agile estimation techniques and best practices that enhance the accuracy of your predictions and pave the way for your team's success. The Essence of Agile Estimation Agile estimation is an ongoing, iterative process that takes place at different levels of detail, ranging from high-level release planning to meticulous sprint planning. The primary objective of Agile estimation is to provide just enough information for teams to make informed decisions without expending excessive time on analysis and documentation. Designed to be lightweight, collaborative, and adaptable, Agile estimation techniques enable teams to rapidly adjust their plans as new information emerges or priorities shift. Prominent Agile Estimation Techniques 1. Planning Poker Planning Poker is a consensus-driven estimation technique that employs a set of cards with pre-defined numerical values, often based on the Fibonacci sequence (1, 2, 3, 5, 8, 13, etc.). Each team member selects a card representing their estimate for a specific task, and all cards are revealed simultaneously. If there is a significant discrepancy in estimates, team members deliberate their reasoning and repeat the process until a consensus is achieved. 2. T-Shirt Sizing T-shirt sizing is a relative estimation technique that classifies tasks into different "sizes" according to their perceived complexity or effort, such as XS, S, M, L, and XL. This method allows teams to swiftly compare tasks and prioritize them based on their relative size. Once tasks are categorized, more precise estimation techniques can be employed if needed. 3. User Story Points User story points serve as a unit of measurement to estimate the relative effort required to complete a user story. This technique entails assigning a point value to each user story based on its complexity, risk, and effort, taking into account factors such as workload, uncertainty, and potential dependencies. Teams can then use these point values to predict the number of user stories they can finish within a given timeframe. 4. Affinity Estimation Affinity Estimation is a technique that involves grouping tasks or user stories based on their similarities in terms of effort, complexity, and size. This method helps teams quickly identify patterns and relationships among tasks, enabling them to estimate more efficiently. Once tasks are grouped, they can be assigned a relative point value or size category. 5. Wideband Delphi The Wideband Delphi method is a consensus-based estimation technique that involves multiple rounds of anonymous estimation and feedback. Team members individually provide estimates for each task, and then the estimates are shared anonymously with the entire team. Team members discuss the range of estimates and any discrepancies before submitting revised estimates in subsequent rounds. This process continues until a consensus is reached. Risk Management in Agile Estimation Identify and Assess Risks Incorporate risk identification and assessment into your Agile estimation process. Encourage team members to consider potential risks associated with each task or user story, such as technical challenges, dependencies, or resource constraints. 
By identifying and assessing risks early on, your team can develop strategies to mitigate them, leading to more accurate estimates and a smoother project execution. Assign Risk Factors Assign risk factors to tasks or user stories based on their level of uncertainty or potential impact on the project. These risk factors can be numerical values or qualitative categories (e.g., low, medium, high) that help your team prioritize tasks and allocate resources effectively. Incorporating risk factors into your estimates can provide a more comprehensive understanding of the work involved and help your team make better-informed decisions. Risk-Based Buffering Include risk-based buffering in your Agile estimation process by adding contingency buffers to account for uncertainties and potential risks. These buffers can be expressed as additional time, resources, or user story points, and they serve as a safety net to ensure that your team can adapt to unforeseen challenges without jeopardizing the project's success. Monitor and Control Risks Continuously monitor and control risks throughout the project lifecycle by regularly reviewing your risk assessments and updating them as new information becomes available. This proactive approach allows your team to identify emerging risks and adjust their plans accordingly, ensuring that your estimates remain accurate and relevant. Learn From Risks Encourage your team to learn from the risks encountered during the project and use this knowledge to improve their estimation and risk management practices. Conduct retrospective sessions to discuss the risks faced, their impact on the project, and the effectiveness of the mitigation strategies employed. By learning from past experiences, your team can refine its risk management approach and enhance the accuracy of future estimates. By incorporating risk management into your Agile estimation process, you can help your team better anticipate and address potential challenges, leading to more accurate estimates and a higher likelihood of project success. This approach also fosters a culture of proactive risk management and continuous learning within your team, further enhancing its overall effectiveness and adaptability. Best Practices for Agile Estimation Foster Team Collaboration Efficient Agile estimation necessitates input from all team members, as each individual contributes unique insights and perspectives. Promote open communication and collaboration during estimation sessions to ensure everyone's opinions are considered and to cultivate a shared understanding of the tasks at hand. Utilize Historical Data Draw upon historical data from previous projects or sprints to inform your estimations. Examining past performance can help teams identify trends, patterns, and areas for improvement, ultimately leading to more accurate predictions in the future. Velocity and Capacity Planning Incorporate team velocity and capacity planning into your Agile estimation process. Velocity is a measure of the amount of work a team can complete within a given sprint or iteration, while capacity refers to the maximum amount of work a team can handle. By considering these factors, you can ensure that your estimates align with your team's capabilities and avoid overcommitting to work. Break Down Large Tasks Large tasks or user stories can be challenging to estimate accurately. Breaking them down into smaller, more manageable components can make the estimation process more precise and efficient. 
Additionally, this approach helps teams better understand the scope and complexity of the work involved, leading to more realistic expectations and improved planning. Revisit Estimates Regularly Agile estimation is a continuous process, and teams should be prepared to revise their estimates as new information becomes available or circumstances change. Periodically review and update your estimates to ensure they remain accurate and pertinent throughout the project lifecycle. Acknowledge Uncertainty Agile estimation recognizes the inherent uncertainty in software development. Instead of striving for flawless predictions, focus on providing just enough information to make informed decisions and be prepared to adapt as necessary. Establish a Baseline Create a baseline for your estimates by selecting a well-understood task or user story as a reference point. This baseline can help teams calibrate their estimates and ensure consistency across different tasks and projects. Pursue Continuous Improvement Consider Agile estimation as an opportunity for ongoing improvement. Reflect on your team's estimation accuracy and pinpoint areas for growth. Experiment with different techniques and practices to discover what works best for your team and refine your approach over time. Conclusion Agile estimation is a vital component of successful Agile project management. By employing the appropriate techniques and adhering to best practices, teams can enhance their ability to predict project scope, effort, and duration, resulting in more effective planning and decision-making. Keep in mind that Agile estimation is an iterative process, and teams should continuously strive to learn from their experiences and refine their approach for even greater precision in the future.
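To put the velocity and capacity planning guidance above into numbers, here is a minimal TypeScript sketch of velocity-based forecasting with a risk buffer. The sprint history, backlog size, and 15% buffer are illustrative assumptions, not recommended values:

TypeScript
// Historical velocity (story points completed per sprint); hypothetical numbers.
const completedPointsPerSprint = [23, 31, 27, 25, 29];

const averageVelocity =
  completedPointsPerSprint.reduce((sum, points) => sum + points, 0) /
  completedPointsPerSprint.length;

// Risk-based buffering: pad the remaining work to absorb uncertainty.
const riskBuffer = 0.15; // assumed 15% contingency
const remainingBacklogPoints = 180;

const sprintsNeeded = Math.ceil(
  (remainingBacklogPoints * (1 + riskBuffer)) / averageVelocity
);

console.log(`Average velocity: ${averageVelocity.toFixed(1)} points/sprint`);
console.log(`Forecast: roughly ${sprintsNeeded} sprints to clear the backlog`);

With an average velocity of 27 points and a 15% buffer, the sketch forecasts 8 sprints for a 180-point backlog; the point of the exercise is the calculation pattern, not the specific numbers.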
Sorting is a fundamental operation in computer science and is crucial for organizing and processing large sets of data efficiently. There are numerous sorting algorithms available, each with its unique characteristics and trade-offs. Whether you’re a beginner programmer or an experienced developer, understanding sorting algorithms is essential for optimizing your code and solving real-world problems efficiently. Sorting algorithms play a crucial role in computer science and programming, enabling efficient organization and retrieval of data. In this article, we will dive into the world of sorting algorithms, exploring their various types, their strengths, and their best use cases. Understanding these algorithms will empower you to choose the most suitable sorting technique for your specific requirements. What Are Sorting Algorithms? Sorting algorithms are algorithms designed to arrange elements in a specific order, typically ascending or descending. They are fundamental tools in computer science and play a vital role in data organization and retrieval. Sorting algorithms take an unsorted collection of elements and rearrange them according to a predetermined criterion, allowing for easier searching, filtering, and analysis of data. The primary goal of sorting algorithms is to transform a disordered set of elements into a sequence that follows a specific order. The order can be based on various factors, such as numerical value, alphabetical order, or custom-defined criteria. Sorting algorithms operate on different data structures, including arrays, lists, trees, and more. These algorithms come in various types, each with its own set of characteristics, efficiency, and suitability for different scenarios. Some sorting algorithms are simple and easy to implement, while others are more complex but offer improved performance for larger datasets. The choice of sorting algorithm depends on factors such as the size of the dataset, the expected order of the input, stability requirements, memory constraints, and desired time complexity. Sorting algorithms are not limited to a specific programming language or domain. They are widely used in a range of applications, including databases, search algorithms, data analysis, graph algorithms, and more. Understanding sorting algorithms is essential for developers and computer scientists, as it provides the foundation for efficient data manipulation and retrieval. Types of Sorting Algorithms Bubble Sort Bubble Sort is a simple and intuitive algorithm that repeatedly swaps adjacent elements if they are in the wrong order. It continues this process until the entire list is sorted. While easy to understand and implement, Bubble Sort has a time complexity of O(n²) in the worst case, making it inefficient for large datasets. It is primarily useful for educational purposes or when dealing with small datasets. Insertion Sort Insertion Sort works by dividing the list into a sorted and an unsorted part. It iterates through the unsorted part, comparing each element to the elements in the sorted part and inserting it at the correct position. Insertion Sort has a time complexity of O(n²) in the worst case but performs better than Bubble Sort in practice, particularly for partially sorted or small datasets. Selection Sort Selection Sort divides the list into a sorted and an unsorted part, similar to Insertion Sort. However, instead of inserting elements, it repeatedly finds the minimum element from the unsorted part and swaps it with the first element of the unsorted part. 
Selection Sort has a time complexity of O(n²) and is generally less efficient than Insertion Sort or more advanced algorithms. It is mainly used for educational purposes or small datasets. Merge Sort Merge Sort is a divide-and-conquer algorithm that recursively divides the list into smaller halves, sorts them, and then merges them back together. It has a time complexity of O(n log n), making it more efficient than the previous algorithms for large datasets. Merge Sort is known for its stability (preserving the order of equal elements) and is widely used in practice. Quick Sort Quick Sort, another divide-and-conquer algorithm, selects a “pivot” element and partitions the list around it such that all elements less than the pivot come before it, and all elements greater come after it. The algorithm then recursively sorts the two partitions. Quick Sort has an average time complexity of O(n log n), but it can degrade to O(n²) in the worst case. However, its efficient average-case performance and in-place sorting make it a popular choice for sorting large datasets. Heap Sort Heap Sort uses a binary heap data structure to sort the elements. It first builds a heap from the input list, then repeatedly extracts the maximum element (root) and places it at the end of the sorted portion. Heap Sort has a time complexity of O(n log n) and is often used when a guaranteed worst-case performance is required. Radix Sort Radix Sort is a non-comparative algorithm that sorts elements by processing individual digits or bits of the elements. It works by grouping elements based on each digit’s value and repeatedly sorting them until the entire list is sorted. Radix Sort has a time complexity of O(k * n), where k is the number of digits or bits in the input elements. It is particularly efficient for sorting integers or fixed-length strings. Choosing the Right Sorting Algorithm Choosing the right sorting algorithm depends on several factors, including the characteristics of the data set, the desired order, time complexity requirements, stability considerations, and memory constraints. Here are some key considerations to help you make an informed decision: Input Size: Consider the size of your data set. Some sorting algorithms perform better with smaller data sets, while others excel with larger inputs. For small data sets, simple algorithms like Bubble Sort or Insertion Sort may be sufficient. However, for larger data sets, more efficient algorithms like Merge Sort, Quick Sort, or Heap Sort are generally preferred due to their lower time complexity. Input Order: Take into account the initial order of the data set. If the data is already partially sorted or nearly sorted, algorithms like Insertion Sort or Bubble Sort can be advantageous as they have better performance under these conditions. They tend to have a lower time complexity when dealing with partially ordered inputs. Stability: Consider whether the stability of the sorting algorithm is important for your use case. A stable sorting algorithm preserves the relative order of elements with equal keys. If maintaining the original order of equal elements is crucial, algorithms like Merge Sort or Insertion Sort are stable options, while Quick Sort is not inherently stable. Time Complexity: Analyze the time complexity requirements for your application. Different sorting algorithms have varying time complexities. For example, Bubble Sort and Insertion Sort have average and worst-case time complexities of O(n²), making them less efficient for large data sets. 
Merge Sort and Heap Sort have average and worst-case time complexities of O(n log n), offering better performance for larger data sets. Quick Sort has an average time complexity of O(n log n), but its worst-case time complexity can reach O(n²) in certain scenarios. Memory Usage: Consider the memory requirements of the sorting algorithm. In-place algorithms modify the original data structure without requiring significant additional memory. Algorithms like Insertion Sort, Quick Sort, and Heap Sort can be implemented in place, which is beneficial when memory usage is a concern. On the other hand, algorithms like Merge Sort require additional memory proportional to the input size, as they create temporary arrays during the merging process. Specialized Requirements: Depending on the specific characteristics of your data or the desired order, there may be specialized sorting algorithms that offer advantages. For example, Radix Sort is useful for sorting integers or strings based on individual digits or characters. Conclusion In computer science and programming, sorting algorithms are essential for the effective manipulation and analysis of data. Although some of the most popular sorting algorithms were described in this article, it's crucial to remember that there are a variety of other variations and specialized algorithms that are also accessible. The sorting algorithm to use depends on a number of variables, including the dataset's size, distribution, memory requirements, and desired level of time complexity. Making educated selections and optimizing your code for particular contexts requires an awareness of the fundamentals and traits of various sorting algorithms. Overall, sorting algorithms are powerful tools that enable efficient organization and retrieval of data. They allow us to transform unordered collections into ordered sequences, facilitating faster and easier data processing in various computational tasks.
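To make these trade-offs concrete, here is a minimal TypeScript sketch of two of the algorithms discussed above: Insertion Sort, which is quadratic but effective for small or nearly sorted inputs, and Merge Sort, which is O(n log n) and stable at the cost of extra memory. The implementations are illustrative rather than tuned for production use:

TypeScript
// Insertion Sort: O(n²) worst case, efficient for small or nearly sorted inputs.
function insertionSort(input: number[]): number[] {
  const arr = [...input]; // copy so the original array is left untouched
  for (let i = 1; i < arr.length; i++) {
    const current = arr[i];
    let j = i - 1;
    while (j >= 0 && arr[j] > current) {
      arr[j + 1] = arr[j]; // shift larger elements one position to the right
      j--;
    }
    arr[j + 1] = current;
  }
  return arr;
}

// Merge Sort: O(n log n), stable, uses O(n) auxiliary memory during merging.
function mergeSort(input: number[]): number[] {
  if (input.length <= 1) return input;
  const mid = Math.floor(input.length / 2);
  const left = mergeSort(input.slice(0, mid));
  const right = mergeSort(input.slice(mid));
  const merged: number[] = [];
  let i = 0;
  let j = 0;
  while (i < left.length && j < right.length) {
    // "<=" keeps equal elements in their original order, which is what makes the sort stable.
    merged.push(left[i] <= right[j] ? left[i++] : right[j++]);
  }
  return merged.concat(left.slice(i), right.slice(j));
}

console.log(insertionSort([5, 2, 8, 1, 9])); // [1, 2, 5, 8, 9]
console.log(mergeSort([5, 2, 8, 1, 9]));     // [1, 2, 5, 8, 9]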
I remember the first time I saw a demonstration of Ruby on Rails. With very little effort, demonstrators created a full-stack web application that could be used for real business purposes. I was impressed – especially when I thought about how much time it took me to deliver similar solutions using the Seam and Struts frameworks. Ruby was created in 1993 to be an easy-to-use scripting language that also included object-oriented features. Ruby on Rails took things to the next level in the mid 2000s – arriving at the right time to become the tech-of-choice for the initial startup efforts of Twitter, Shopify, GitHub, and Airbnb. I began to ask the question, “Is it possible to have a product, like Ruby on Rails, without needing to worry about the infrastructure or underlying data tier?” That’s when I discovered the Zipper platform. About Zipper Zipper is a platform for building web services using simple TypeScript functions. You use Zipper to create applets (not related to Java, though they share the same name), which are then built and deployed on Zipper’s platform. The coolest thing about Zipper is that it lets you focus on coding your solution using TypeScript, and you don’t need to worry about anything else. Zipper takes care of: User interface Infrastructure to host your solution Persistence layer APIs to interact with your applet Authentication Although the platform is currently in beta, it’s open for consumers to use. At the time I wrote this article, there were four templates in place to help new adopters get started: Hello World – a basic applet to get you started CRUD Template – offers a ToDo list where items can be created, viewed, updated, and deleted Slack App Template – provides an example on how to interact with the Slack service AI-Generated Code – expresses your solution in human language and lets AI create an applet for you There is also a gallery on the Zipper platform that provides applets that can be forked in the same manner as Git-based repositories. I thought I would put the Zipper platform to the test and create a ballot applet. HOA Ballot Use Case The homeowner’s association (HOA) concept started to gain momentum in the United States back in the 20th century. Subdivisions formed HOAs to handle things like the care of common areas and for establishing rules and guidelines for residents. Their goal is to maintain the subdivision’s quality of living as a whole, long after the home builder has finished development. HOAs often hold elections to allow homeowners to vote on the candidate they feel best matches their views and perspectives. In fact, last year I published an article on how an HOA ballot could be created using Web3 technologies. For this article, I wanted to take the same approach using Zipper. Ballot Requirements The requirements for the ballot applet are: As a ballot owner, I need the ability to create a list of candidates for the ballot. As a ballot owner, I need the ability to create a list of registered voters. As a voter, I need the ability to view the list of candidates. As a voter, I need the ability to cast one vote for a single candidate. As a voter, I need the ability to see a current tally of votes that have been cast for each candidate. Additionally, I thought some stretch goals would be nice too: As a ballot owner, I need the ability to clear all candidates. As a ballot owner, I need the ability to clear all voters. As a ballot owner, I need the ability to set a title for the ballot. 
As a ballot owner, I need the ability to set a subtitle for the ballot. Designing the Ballot Applet To start working on the Zipper platform, I navigated to Zipper's website and clicked the Sign In button. Next, I selected an authentication source: Once logged in, I used the Create Applet button from the dashboard to create a new applet: A unique name is generated, but that can be changed to better identify your use case. For now, I left all the defaults the same and pushed the Next button – which allowed me to select from four different templates for applet creation. I started with the CRUD template because it provides a solid example of how the common create, view, update, and delete flows work on the Zipper platform. Once the code was created, the screen appears as shown below: With a fully functional applet in place, we can now update the code to meet the HOA ballot use requirements. Establish Core Elements For the ballot applet, the first thing I wanted to do was update the types.ts file as shown below: TypeScript export type Candidate = { id: string; name: string; votes: number; }; export type Voter = { email: string; name: string; voted: boolean; }; I wanted to establish constant values for the ballot title and subtitle within a new file called constants.ts: TypeScript export class Constants { static readonly BALLOT_TITLE = "Sample Ballot"; static readonly BALLOT_SUBTITLE = "Sample Ballot Subtitle"; }; To allow only the ballot owner to make changes to the ballot, I used the Secrets tab for the applet to create an owner secret with the value of my email address. Then I introduced a common.ts file which contained the validateRequest() function: TypeScript export function validateRequest(context: Zipper.HandlerContext) { if (context.userInfo?.email !== Deno.env.get('owner')) { return ( <> <Markdown> {`### Error: You are not authorized to perform this action`} </Markdown> </> ); } }; This way I could pass in the context to this function to make sure only the value in the owner secret would be allowed to make changes to the ballot and voters. Establishing Candidates After understanding how the ToDo item was created in the original CRUD applet, I was able to introduce the create-candidate.ts file as shown below: TypeScript import { Candidate } from "./types.ts"; import { validateRequest } from "./common.ts"; type Input = { name: string; }; export async function handler({ name }: Input, context: Zipper.HandlerContext) { validateRequest(context); const candidates = (await Zipper.storage.get<Candidate[]>("candidates")) || []; const newCandidate: Candidate = { id: crypto.randomUUID(), name: name, votes: 0, }; candidates.push(newCandidate); await Zipper.storage.set("candidates", candidates); return newCandidate; } For this use case, we just need to provide a candidate name, but the Candidate object contains a unique ID and the number of votes received. While here, I went ahead and wrote the delete-all-candidates.ts file, which removes all candidates from the key/value data store: TypeScript import { validateRequest } from "./common.ts"; type Input = { force: boolean; }; export async function handler( { force }: Input, context: Zipper.HandlerContext ) { validateRequest(context); if (force) { await Zipper.storage.set("candidates", []); } } At this point, I used the Preview functionality to create Candidate A, Candidate B, and Candidate C: Registering Voters With the ballot ready, I needed the ability to register voters for the ballot. 
So I added a create-voter.ts file with the following content: TypeScript import { Voter } from "./types.ts"; import { validateRequest } from "./common.ts"; type Input = { email: string; name: string; }; export async function handler( { email, name }: Input, context: Zipper.HandlerContext ) { validateRequest(context); const voters = (await Zipper.storage.get<Voter[]>("voters")) || []; const newVoter: Voter = { email: email, name: name, voted: false, }; voters.push(newVoter); await Zipper.storage.set("voters", voters); return newVoter; } To register a voter, I decided to provide inputs for email address and name. There is also a boolean property called voted which will be used to enforce the vote-only-once rule. Like before, I went ahead and created the delete-all-voters.ts file: TypeScript import { validateRequest } from "./common.ts"; type Input = { force: boolean; }; export async function handler( { force }: Input, context: Zipper.HandlerContext ) { validateRequest(context); if (force) { await Zipper.storage.set("voters", []); } } Now that we were ready to register some voters, I registered myself as a voter for the ballot: Creating the Ballot The last thing I needed to do was establish the ballot. This involved updating the main.ts as shown below: TypeScript import { Constants } from "./constants.ts"; import { Candidate, Voter } from "./types.ts"; type Input = { email: string; }; export async function handler({ email }: Input) { const voters = (await Zipper.storage.get<Voter[]>("voters")) || []; const voter = voters.find((v) => v.email == email); const candidates = (await Zipper.storage.get<Candidate[]>("candidates")) || []; if (email && voter && candidates.length > 0) { return { candidates: candidates.map((candidate) => { return { Candidate: candidate.name, Votes: candidate.votes, actions: [ Zipper.Action.create({ actionType: "button", showAs: "refresh", path: "vote", text: `Vote for ${candidate.name}`, isDisabled: voter.voted, inputs: { candidateId: candidate.id, voterId: voter.email, }, }), ], }; }), }; } else if (!email) { <> <h4>Error:</h4> <p> You must provide a valid email address in order to vote for this ballot. </p> </>; } else if (!voter) { return ( <> <h4>Invalid Email Address:</h4> <p> The email address provided ({email}) is not authorized to vote for this ballot. </p> </> ); } else { return ( <> <h4>Ballot Not Ready:</h4> <p>No candidates have been configured for this ballot.</p> <p>Please try again later.</p> </> ); } } export const config: Zipper.HandlerConfig = { description: { title: Constants.BALLOT_TITLE, subtitle: Constants.BALLOT_SUBTITLE, }, }; I added the following validations as part of the processing logic: The email property must be included or else a “You must provide a valid email address in order to vote for this ballot” message will be displayed. The email value provided must match a registered voter or else a “The email address provided is not authorized to vote for this ballot” message will be displayed. There must be at least one candidate to vote on or else a “No candidates have been configured for this ballot” message will be displayed. If the registered voter has already voted, the voting buttons will be disabled for all candidates on the ballot. 
The main.ts file contains a button for each candidate, all of which call the vote.ts file, displayed below: TypeScript import { Candidate, Voter } from "./types.ts"; type Input = { candidateId: string; voterId: string; }; export async function handler({ candidateId, voterId }: Input) { const candidates = (await Zipper.storage.get<Candidate[]>("candidates")) || []; const candidate = candidates.find((c) => c.id == candidateId); const candidateIndex = candidates.findIndex(c => c.id == candidateId); const voters = (await Zipper.storage.get<Voter[]>("voters")) || []; const voter = voters.find((v) => v.email == voterId); const voterIndex = voters.findIndex(v => v.email == voterId); if (candidate && voter) { candidate.votes++; candidates[candidateIndex] = candidate; voter.voted = true; voters[voterIndex] = voter; await Zipper.storage.set("candidates", candidates); await Zipper.storage.set("voters", voters); return `${voter.name} successfully voted for ${candidate.name}`; } return `Could not vote. candidate=${ candidate }, voter=${ voter }`; } At this point, the ballot applet was ready for use. HOA Ballot In Action For each registered voter, I would send them an email with a link similar to what is listed below: https://squeeking-echoing-cricket.zipper.run/run/main.ts?email=some.email@example.com The link would be customized to provide the appropriate email address for the email query parameter. Clicking the link runs the main.ts file and passes in the email parameter, avoiding the need for the registered voter to have to type in their email address. The ballot appears as shown below: I decided to cast my vote for Candidate B. Once I pushed the button, the ballot was updated as shown: The number of votes for Candidate B increased by one, and all of the voting buttons were disabled. Success! Conclusion Looking back on the requirements for the ballot applet, I realized I was able to meet all of the criteria, including the stretch goals in about two hours—and this included having a UI, infrastructure, and deployment. The best part of this experience was that 100% of my time was focused on building my solution, and I didn’t need to spend any time dealing with infrastructure or even the persistence store. My readers may recall that I have been focused on the following mission statement, which I feel can apply to any IT professional: “Focus your time on delivering features/functionality that extends the value of your intellectual property. Leverage frameworks, products, and services for everything else.” - J. Vester The Zipper platform adheres to my personal mission statement 100%. In fact, they have been able to take things a step further than Ruby on Rails did, because I don’t have to worry about where my service will run or what data store I will need to configure. Using the applet approach, my ballot is already deployed and ready for use. If you are interested in giving applets a try, simply login to zipper.dev and start building. Currently, using the Zipper platform is free. Give the AI-Generated Code template a try, as it is really cool to provide a paragraph of what you want to build and see how closely the resulting applet matches what you have in mind. If you want to give my ballot applet a try, it is also available to fork in the Zipper gallery. Have a really great day!
Verification and validation are two distinct processes often used in various fields, including software development, engineering, and manufacturing. They are both used to ensure that the software meets its intended purpose, but they do so in different ways. Verification Verification is the process of checking whether the software meets its specifications. It answers the question: "Are we building the product right?" This means checking that the software does what it is supposed to do, according to the requirements that were defined at the start of the project. Verification is typically done by static testing, which means that the software is not actually executed. Instead, the code is reviewed, inspected, or walked through to ensure that it meets the specifications. Validation Validation is the process of checking whether the software meets the needs of its users. It answers the question: "Are we building the right product?" This means checking that the software is actually useful and meets the expectations of the people who will be using it. Validation is typically done by dynamic testing, which means that the software is actually executed and tested with real data. Here are some typical examples of verification and validation: Verification: Checking the code of a software program to make sure that it follows the correct syntax and that all of the functions are implemented correctly Validation: Testing a software program with real data to make sure that it produces the correct results Verification: Reviewing the design documents for a software system to make sure that they are complete and accurate Validation: Conducting user acceptance testing (UAT) to make sure that a software system meets the needs of its users When To Use Conventionally, verification should be done early in the software development process, while validation should be done later. This is because verification can help to identify and fix errors early on, which can save time and money in the long run. Validation is also important, but it can be done after the software is mostly complete since it involves real-world testing and feedback. Another approach would be to start verification and validation as early as possible and iterate. Small, incremental verification steps can be followed by validation whenever possible. Such iterations between verification and validation can be used throughout the development phase. The reasoning behind this approach is that both verification and validation may help to identify and fix errors early. Weather Forecasting App Imagine a team of software engineers developing a weather forecasting app. They have a specification that states, "The app should display the current temperature and a 5-day weather forecast accurately." During the testing phase, they meticulously review the code, check the algorithms, and ensure that the app indeed displays the temperature and forecast data correctly according to their specifications. If everything aligns with the specification, the app passes verification because it meets the specified criteria. Now, let's shift our focus to the users of this weather app. They download the app, start using it, and provide feedback. Some users report that while the temperature and forecasts are accurate, they find the user interface confusing and difficult to navigate. Others suggest that the app should provide more detailed hourly forecasts. This feedback pertains to the user experience and user satisfaction, rather than specific technical specifications. 
Verification confirms that the app meets the technical requirements related to temperature and forecast accuracy, but validation uncovers issues with the user interface and user needs. The app may pass verification but fail validation because it doesn't fully satisfy the true needs and expectations of its users. This highlights that validation focuses on whether the product meets the actual needs and expectations of the users, which may not always align with the initial technical specifications. Social Media App Let's say you are developing a new social media app. The verification process would involve ensuring that the app meets the specified requirements, such as the ability to create and share posts, send messages, and add friends. This could be done by reviewing the app's code, testing its features, and comparing it to the requirements document. The validation process would involve ensuring that the app meets the needs of the users. This could be done by conducting user interviews, surveys, and usability testing. For example, you might ask users how they would like to be able to share posts, or what features they would like to see added to the app. In this example, verification would ensure that the app is technically sound, while validation would ensure that it is user-friendly and meets the needs of the users. Online Payment Processing App A team of software engineers is developing an online payment processing app. For verification, they would verify that the code for processing payments, calculating transaction fees, and handling currency conversions has been correctly implemented according to the app's design specifications. They would also ensure that the app adheres to industry security standards, such as the Payment Card Industry Data Security Standard (PCI DSS), by verifying that encryption protocols, access controls, and authentication mechanisms are correctly integrated. They would also confirm that the user interface functions as intended, including verifying that the payment forms collect necessary information and that error messages are displayed appropriately. To validate the online payment processing software, they would use it in actual payment transactions. One case would be to process real payment transactions to confirm that the software can handle various types of payments, including credit cards, digital wallets, and international transactions, without errors. Another case would be to evaluate the user experience, checking if users can easily navigate the app, make payments, and receive confirmation without issues. Predicting Brain Activity Using fMRI A neuroinformatics software app is developed to predict brain activity based on functional magnetic resonance imaging (fMRI) data. Verification would verify that the algorithms used for preprocessing fMRI data, such as noise removal and motion correction, are correctly translated into code. You would also ensure that the user interface functions as specified, and that data input and output formats adhere to the defined standards, such as the Brain Imaging Data Structure (BIDS). Validation would compare the predicted brain activity patterns generated by the software to the actual brain activity observed in the fMRI scans. Additionally, you might compare the software's predictions to results obtained using established methods or ground truth data to evaluate its accuracy. 
Validation in this context ensures that the software not only runs without internal errors (as verified) but also that it reliably and accurately performs its primary function of predicting brain activity based on fMRI data. This step helps determine if the software can be trusted for scientific or clinical purposes. Predicting the Secondary Structure of RNA Molecules Imagine you are a bioinformatician working on a software tool that predicts the secondary structure of RNA molecules. Your software takes an RNA sequence as input and predicts the most likely folding pattern. For verification, you want to verify that your RNA secondary structure prediction software calculates free energy values accurately using the algorithms described in the scientific literature. You compare the software's implementation against the published algorithms and validate that the code follows the expected mathematical procedures precisely. In this context, verification ensures that your software performs the intended computations correctly and follows the algorithmic logic accurately. To validate your RNA secondary structure prediction software, you would run it on a diverse set of real-world RNA sequences with known secondary structures. You would then compare the software's predictions against experimental data or other trusted reference tools to check if it provides biologically meaningful results and if its accuracy is sufficient for its intended purpose. The Light Switch in a Conference Room Consider a light switch in a conference room. Verification asks whether the lighting meets the requirements. The requirements might state that "the lights in front of the projector screen can be controlled independently of the other lights in the room." If the requirements are written down and the lights cannot be controlled independently, then the lighting fails verification. This is because the implementation does not meet the requirements. Validation asks whether the users are satisfied with the lighting. This is a more subjective question, and it is not always easy to measure satisfaction with a single metric. For example, even if the lights can be controlled independently, the users may still be dissatisfied if the lights are too bright or too dim. Wrapping Up Verification is usually a more technical activity that uses knowledge about software artifacts, requirements, and specifications. Validation usually depends on domain knowledge, that is, knowledge of the application for which the software is written. For example, validation of medical device software requires knowledge from healthcare professionals, clinicians, and patients. It is important to note that verification and validation are not mutually exclusive. In fact, they are complementary processes. Verification ensures that the software is built correctly, while validation ensures that the software is useful. By combining verification and validation, we can be more confident that our product will make customers happy.
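As a small illustration of the distinction, the TypeScript sketch below contains a verification-style check against a hypothetical specification ("temperatures are displayed in °C, rounded to one decimal place"); the function and spec are invented for this example. Validation cannot be reduced to such an assertion, because it depends on real users and real data confirming that the behavior is actually useful:

TypeScript
// Hypothetical spec: "Temperatures are displayed in °C, rounded to one decimal place."
function formatTemperature(celsius: number): string {
  return `${celsius.toFixed(1)} °C`;
}

// Verification: are we building the product right? Check the implementation against the spec.
function verifyFormatTemperature(): void {
  const result = formatTemperature(21.456);
  if (result !== "21.5 °C") {
    throw new Error(`Spec violated: expected "21.5 °C" but got "${result}"`);
  }
}

verifyFormatTemperature(); // passes: the implementation matches the written spec

// Validation: are we building the right product? That question is answered by users,
// e.g., UAT feedback such as "the value is accurate, but the screen is hard to read,"
// which no assertion against the spec above can capture.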
This is an article from DZone's 2023 Automated Testing Trend Report.For more: Read the Report Artificial intelligence (AI) has revolutionized the realm of software testing, introducing new possibilities and efficiencies. The demand for faster, more reliable, and efficient testing processes has grown exponentially with the increasing complexity of modern applications. To address these challenges, AI has emerged as a game-changing force, revolutionizing the field of automated software testing. By leveraging AI algorithms, machine learning (ML), and advanced analytics, software testing has undergone a remarkable transformation, enabling organizations to achieve unprecedented levels of speed, accuracy, and coverage in their testing endeavors. This article delves into the profound impact of AI on automated software testing, exploring its capabilities, benefits, and the potential it holds for the future of software quality assurance. An Overview of AI in Testing This introduction aims to shed light on the role of AI in software testing, focusing on key aspects that drive its transformative impact. Figure 1: AI in testing Elastically Scale Functional, Load, and Performance Tests AI-powered testing solutions enable the effortless allocation of testing resources, ensuring optimal utilization and adaptability to varying workloads. This scalability ensures comprehensive testing coverage while maintaining efficiency. AI-Powered Predictive Bots AI-powered predictive bots are a significant advancement in software testing. Bots leverage ML algorithms to analyze historical data, patterns, and trends, enabling them to make informed predictions about potential defects or high-risk areas. By proactively identifying potential issues, predictive bots contribute to more effective and efficient testing processes. Automatic Update of Test Cases With AI algorithms monitoring the application and its changes, test cases can be dynamically updated to reflect modifications in the software. This adaptability reduces the effort required for test maintenance and ensures that the test suite remains relevant and effective over time. AI-Powered Analytics of Test Automation Data By analyzing vast amounts of testing data, AI-powered analytical tools can identify patterns, trends, and anomalies, providing valuable information to enhance testing strategies and optimize testing efforts. This data-driven approach empowers testing teams to make informed decisions and uncover hidden patterns that traditional methods might overlook. Visual Locators Visual locators, a type of AI application in software testing, focus on visual elements such as user interfaces and graphical components. AI algorithms can analyze screenshots and images, enabling accurate identification of and interaction with visual elements during automated testing. This capability enhances the reliability and accuracy of visual testing, ensuring a seamless user experience. Self-Healing Tests AI algorithms continuously monitor test execution, analyzing results and detecting failures or inconsistencies. When issues arise, self-healing mechanisms automatically attempt to resolve the problem, adjusting the test environment or configuration. This intelligent resilience minimizes disruptions and optimizes the overall testing process. What Is AI-Augmented Software Testing? AI-augmented software testing refers to the utilization of AI techniques — such as ML, natural language processing, and data analytics — to enhance and optimize the entire software testing lifecycle. 
It involves automating test case generation, intelligent test prioritization, anomaly detection, predictive analysis, and adaptive testing, among other tasks. By harnessing the power of AI, organizations can improve test coverage, detect defects more efficiently, reduce manual effort, and ultimately deliver high-quality software with greater speed and accuracy. Benefits of AI-Powered Automated Testing AI-powered software testing offers a plethora of benefits that revolutionize the testing landscape. One significant advantage lies in its codeless nature, thus eliminating the need to memorize intricate syntax. Embracing simplicity, it empowers users to effortlessly create testing processes through intuitive drag-and-drop interfaces. Scalability becomes a reality as the workload can be efficiently distributed among multiple workstations, ensuring efficient utilization of resources. The cost-saving aspect is remarkable as minimal human intervention is required, resulting in substantial reductions in workforce expenses. With tasks executed by intelligent bots, accuracy reaches unprecedented heights, minimizing the risk of human errors. Furthermore, this automated approach amplifies productivity, enabling testers to achieve exceptional output levels. Irrespective of the software type — be it a web-based desktop application or mobile application — the flexibility of AI-powered testing seamlessly adapts to diverse environments, revolutionizing the testing realm altogether. Figure 2: Benefits of AI for test automation Mitigating the Challenges of AI-Powered Automated Testing AI-powered automated testing has revolutionized the software testing landscape, but it is not without its challenges. One of the primary hurdles is the need for high-quality training data. AI algorithms rely heavily on diverse and representative data to perform effectively. Therefore, organizations must invest time and effort in curating comprehensive and relevant datasets that encompass various scenarios, edge cases, and potential failures. Another challenge lies in the interpretability of AI models. Understanding why and how AI algorithms make specific decisions can be critical for gaining trust and ensuring accurate results. Addressing this challenge requires implementing techniques such as explainable AI, model auditing, and transparency. Furthermore, the dynamic nature of software environments poses a challenge in maintaining AI models' relevance and accuracy. Continuous monitoring, retraining, and adaptation of AI models become crucial to keeping pace with evolving software systems. Additionally, ethical considerations, data privacy, and bias mitigation should be diligently addressed to maintain fairness and accountability in AI-powered automated testing. AI models used in testing can sometimes produce false positives (incorrectly flagging a non-defect as a defect) or false negatives (failing to identify an actual defect). Balancing precision and recall of AI models is important to minimize false results. AI models can exhibit biases and may struggle to generalize new or uncommon scenarios. Adequate training and validation of AI models are necessary to mitigate biases and ensure their effectiveness across diverse testing scenarios. Human intervention plays a critical role in designing test suites by leveraging their domain knowledge and insights. 
They can identify critical test cases, edge cases, and scenarios that require human intuition or creativity, while leveraging AI to handle repetitive or computationally intensive tasks. Continuous improvement would be possible by encouraging a feedback loop between human testers and AI systems. Human experts can provide feedback on the accuracy and relevance of AI-generated test cases or predictions, helping improve the performance and adaptability of AI models. Human testers should play a role in the verification and validation of AI models, ensuring that they align with the intended objectives and requirements. They can evaluate the effectiveness, robustness, and limitations of AI models in specific testing contexts. AI-Driven Testing Approaches AI-driven testing approaches have ushered in a new era in software quality assurance, revolutionizing traditional testing methodologies. By harnessing the power of artificial intelligence, these innovative approaches optimize and enhance various aspects of testing, including test coverage, efficiency, accuracy, and adaptability. This section explores the key AI-driven testing approaches, including differential testing, visual testing, declarative testing, and self-healing automation. These techniques leverage AI algorithms and advanced analytics to elevate the effectiveness and efficiency of software testing, ensuring higher-quality applications that meet the demands of the rapidly evolving digital landscape: Differential testing assesses discrepancies between application versions and builds, categorizes the variances, and utilizes feedback to enhance the classification process through continuous learning. Visual testing utilizes image-based learning and screen comparisons to assess the visual aspects and user experience of an application, thereby ensuring the integrity of its look and feel. Declarative testing expresses the intention of a test using a natural or domain-specific language, allowing the system to autonomously determine the most appropriate approach to execute the test. Self-healing automation automatically rectifies element selection in tests when there are modifications to the user interface (UI), ensuring the continuity of reliable test execution. Key Considerations for Harnessing AI for Software Testing Many contemporary test automation tools infused with AI provide support for open-source test automation frameworks such as Selenium and Appium. AI-powered automated software testing encompasses essential features such as auto-code generation and the integration of exploratory testing techniques. Open-Source AI Tools To Test Software When selecting an open-source testing tool, it is essential to consider several factors. Firstly, it is crucial to verify that the tool is actively maintained and supported. Additionally, it is critical to assess whether the tool aligns with the skill set of the team. Furthermore, it is important to evaluate the features, benefits, and challenges presented by the tool to ensure they are in line with your specific testing requirements and organizational objectives. 
A few popular open-source options include, but are not limited to:

Carina – AI-driven, free forever, scriptless approach to automate functional, performance, visual, and compatibility tests
TestProject – Offered the industry's first free Appium AI tools in 2021, expanding upon the AI tools for Selenium that they had previously introduced in 2020 for self-healing technology
Cerberus Testing – A low-code and scalable test automation solution that offers a self-healing feature called Erratum and has a forever-free plan

Designing Automated Tests With AI and Self-Testing AI has made significant strides in transforming the landscape of automated testing, offering a range of techniques and applications that revolutionize software quality assurance. Some of the prominent techniques and algorithms are provided in the tables below, along with the purposes they serve:

KEY TECHNIQUES AND APPLICATIONS OF AI IN AUTOMATED TESTING (Table 1): Key Technique – Applications
Machine learning – Analyze large volumes of testing data, identify patterns, and make predictions for test optimization, anomaly detection, and test case generation
Natural language processing – Facilitate the creation of intelligent chatbots, voice-based testing interfaces, and natural language test case generation
Computer vision – Analyze image and visual data in areas such as visual testing, UI testing, and defect detection
Reinforcement learning – Optimize test execution strategies, generate adaptive test scripts, and dynamically adjust test scenarios based on feedback from the system under test

KEY ALGORITHMS USED FOR AI-POWERED AUTOMATED TESTING (Table 2): Algorithm – Purpose – Applications
Clustering algorithms – Segmentation – k-means and hierarchical clustering are used to group similar test cases, identify patterns, and detect anomalies
Sequence generation models (recurrent neural networks or transformers) – Text classification and sequence prediction – Trained to generate sequences such as test scripts or sequences of user interactions for log analysis
Bayesian networks – Dependencies and relationships between variables – Test coverage analysis, defect prediction, and risk assessment
Convolutional neural networks – Image analysis – Visual testing
Evolutionary algorithms (genetic algorithms) – Natural selection – Optimize test case generation, test suite prioritization, and test execution strategies by applying genetic operators like mutation and crossover on existing test cases to create new variants, which are then evaluated based on fitness criteria
Decision trees, random forests, support vector machines, and neural networks – Classification – Classification of software components
Variational autoencoders and generative adversarial networks – Generative AI – Used to generate new test cases that cover different scenarios or edge cases by test data generation, creating synthetic data that resembles real-world scenarios

Real-World Examples of AI-Powered Automated Testing AI-powered visual testing platforms perform automated visual validation of web and mobile applications. They use computer vision algorithms to compare screenshots and identify visual discrepancies, enabling efficient visual testing across multiple platforms and devices. NLP and ML are combined to generate test cases from plain English descriptions. They automatically execute these test cases, detect bugs, and provide actionable insights to improve software quality.
Self-healing capabilities are also provided by automatically adapting test cases to changes in the application's UI, improving test maintenance efficiency. Quantum AI-Powered Automated Testing: The Road Ahead The future of quantum AI-powered automated software testing holds great potential for transforming the way testing is conducted. Figure 3: Transition of automated testing from AI to Quantum AI Quantum computing's ability to handle complex optimization problems can significantly improve test case generation, test suite optimization, and resource allocation in automated testing. Quantum ML algorithms can enable more sophisticated and accurate models for anomaly detection, regression testing, and predictive analytics. Quantum computing's ability to perform parallel computations can greatly accelerate the execution of complex test scenarios and large-scale test suites. Quantum algorithms can help enhance security testing by efficiently simulating and analyzing cryptographic algorithms and protocols. Quantum simulation capabilities can be leveraged to model and simulate complex systems, enabling more realistic and comprehensive testing of software applications in various domains, such as finance, healthcare, and transportation. Parting Thoughts AI has significantly revolutionized the traditional landscape of testing, enhancing the effectiveness, efficiency, and reliability of software quality assurance processes. AI-driven techniques such as ML, anomaly detection, NLP, and intelligent test prioritization have enabled organizations to achieve higher test coverage, early defect detection, streamlined test script creation, and adaptive test maintenance. The integration of AI in automated testing not only accelerates the testing process but also improves overall software quality, leading to enhanced customer satisfaction and reduced time to market. As AI continues to evolve and mature, it holds immense potential for further advancements in automated testing, paving the way for a future where AI-driven approaches become the norm in ensuring the delivery of robust, high-quality software applications. Embracing the power of AI in automated testing is not only a strategic imperative but also a competitive advantage for organizations looking to thrive in today's rapidly evolving technological landscape.
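To illustrate the self-healing automation idea described in this article, here is a minimal TypeScript sketch of a locator fallback strategy. The Driver interface, locator shape, and element type are hypothetical and do not correspond to any specific tool's API; real self-healing tools typically rank fallback locators using attributes learned from earlier runs:

TypeScript
// Hypothetical shapes for illustration only; not a real framework's API.
type Locator = { strategy: "id" | "css" | "xpath"; value: string };
type UIElement = { click: () => Promise<void> };

interface Driver {
  find(locator: Locator): Promise<UIElement | null>;
}

// Try the primary locator first, then fall back to alternatives when the UI has changed.
async function findWithSelfHealing(
  driver: Driver,
  primary: Locator,
  fallbacks: Locator[]
): Promise<UIElement> {
  for (const locator of [primary, ...fallbacks]) {
    const element = await driver.find(locator);
    if (element) {
      if (locator !== primary) {
        // Report the healed locator so the test suite can be updated later.
        console.warn(`Healed locator: ${primary.value} -> ${locator.value}`);
      }
      return element;
    }
  }
  throw new Error(`Element not found after trying ${1 + fallbacks.length} locators`);
}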
This is an article from DZone's 2023 Automated Testing Trend Report.For more: Read the Report One of the core capabilities that has seen increased interest in the DevOps community is observability. Observability improves monitoring in several vital ways, making it easier and faster to understand business flows and allowing for enhanced issue resolution. Furthermore, observability goes beyond an operations capability and can be used for testing and quality assurance. Testing has traditionally faced the challenge of identifying the appropriate testing scope. "How much testing is enough?" and "What should we test?" are questions each testing executive asks, and the answers have been elusive. There are fewer arguments about testing new functionality; while not trivial, you know the functionality you built in new features and hence can derive the proper testing scope from your understanding of the functional scope. But what else should you test? What is a comprehensive general regression testing suite, and what previous functionality will be impacted by the new functionality you have developed and will release? Observability can help us with this as well as the unavoidable defect investigation. But before we get to this, let's take a closer look at observability. What Is Observability? Observability is not monitoring with a different name. Monitoring is usually limited to observing a specific aspect of a resource, like disk space or memory of a compute instance. Monitoring one specific characteristic can be helpful in an operations context, but it usually only detects a subset of what is concerning. All monitoring can show is that the system looks okay, but users can still be experiencing significant outages. Observability aims to make us see the state of the system by making data flows "observable." This means that we can identify when something starts to behave out of order and requires our attention. Observability combines logs, metrics, and traces from infrastructure and applications to gain insights. Ideally, it organizes these around workflows instead of system resources and, as such, creates a functional view of the system in use. Done correctly, it lets you see what functionality is being executed and how frequently, and it enables you to identify performance characteristics of the system and workflow. Figure 1: Observability combines metrics, logs, and traces for insights One benefit of observability is that it shows you the actual system. It is not biased by what the designers, architects, and engineers think should happen in production. It shows the unbiased flow of data. The users, over time (and sometimes from the very first day), find ways to use the system quite differently from what was designed. Observability makes such changes in behavior visible. Observability is incredibly powerful in debugging system issues as it allows us to navigate the system to see where problems occur. Observability requires a dedicated setup and some contextual knowledge similar to traceability. Traceability is the ability to follow a system transaction over time through all the different components of our application and infrastructure architecture, which means you have to have common information like an ID that enables this. OpenTelemetry is an open standard that can be used and provides useful guidance on how to set this up. Observability makes identifying production issues a lot easier. And we can use observability for our benefit in testing, too. 
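As a minimal sketch of the kind of setup OpenTelemetry guides you toward, the TypeScript snippet below starts the Node SDK with a console exporter and wraps a business operation in a span. The package choices, service and span names, and the console exporter are assumptions for illustration; a real deployment would export to an observability backend and add instrumentation for HTTP, databases, and so on:

TypeScript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { ConsoleSpanExporter } from "@opentelemetry/sdk-trace-base";
import { trace } from "@opentelemetry/api";

// Start the SDK once at process startup; spans are printed to the console here.
const sdk = new NodeSDK({ traceExporter: new ConsoleSpanExporter() });
sdk.start();

const tracer = trace.getTracer("checkout-service");

// Wrapping a workflow in a span makes it observable as part of a trace.
async function processOrder(orderId: string): Promise<string> {
  return tracer.startActiveSpan("processOrder", async (span) => {
    span.setAttribute("order.id", orderId); // a common ID is what enables traceability
    // ... business logic would run here ...
    span.end();
    return "ok";
  });
}

processOrder("order-42").then((status) => console.log(status));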
Observability of Testing: How to Look Left Two aspects of observability make it useful in the testing context: Its ability to make the actual system usage observable and its usefulness in finding problem areas during debugging. Understanding the actual system behavior is most directly useful during performance testing. Performance testing is the pinnacle of testing since it tries to achieve as close to the realistic peak behavior of a system as possible. Unfortunately, performance testing scenarios are often based on human knowledge of the system instead of objective information. For example, performance testing might be based on the prediction of 10,000 customer interactions per hour during a sales campaign based on the information of the sales manager. Observability information can help define the testing scenarios by using the information to look for the times the system was under the most stress in production and then simulate similar situations in the performance test environment. We can use a system signature to compare behaviors. A system signature in the context of observability is the set of values for logs, metrics, and traces during a specific period. Take, for example, a marketing promotion for new customers. The signature of the system should change during that period to show more new account creations with its associated functionality and the related infrastructure showing up as being more "busy." If the signature does not change during the promotion, we would predict that we also don't see the business metrics move (e.g., user sign-ups). In this example, the business metrics and the signature can be easily matched. Figure 2: A system behaving differently in test, which shows up in the system signature In many other cases, this is not true. Imagine an example where we change the recommendation engine to use our warehouse data going forward. We expect the system signature to show increased data flows between the recommendation engine and our warehouse system. You can see how system signatures and the changes of the system signature can be useful for testing; any differences in signature between production and the testing systems should be explainable by the intended changes of the upcoming release. Otherwise, investigation is required. In the same way, information from the production observability system can be used to define a regression suite that reflects the functionality most frequently used in production. Observability can give you information about the workflows still actively in use and which workflows have stopped being relevant. This information can optimize your regression suite both from a maintenance perspective and, more importantly, from a risk perspective, making sure that core functionality, as experienced by the user, remains in a working state. Implementing observability in your test environments means you can use the power of observability for both production issues and your testing defects. It removes the need for debugging modes to some degree and relies upon the same system capability as production. This way, observability becomes how you work across both dev and ops, which helps break down silos. Observability for Test Insights: Looking Right In the previous section, we looked at using observability by looking left or backward, ensuring we have kept everything intact. Similarly, we can use observability to help us predict the success of the features we deliver. Think about a new feature you are developing. 
Observability for Test Insights: Looking Right

In the previous section, we looked at using observability by looking left, or backward, to ensure we have kept everything intact. Similarly, we can use observability to help us predict the success of the features we deliver. Think about a new feature you are developing. During the test cycles, we see how this new feature changes the workflows, which shows up in our observability solution. We can see the new feature being used and other features changing in usage as a result. The signature of our application has changed when we consider the logs, traces, and metrics of our system in test. Once we go live, we predict that the signature of the production system will change in a very similar way. If that happens, we will be happy. But what if the signature of the production system does not change as predicted?

Let's take an example: We created a new feature that leverages information from previous bookings to better serve our customers by allocating similar seats and menu options. During testing, we exercised the new feature with our test data set and saw an increase in access to the bookings database while the customer booking was being collated. Once we go live, we realize that the workflows are not utilizing the customer booking database, and we leverage the information from our observability tooling to investigate. We have found a case where the users are either not using our new feature or not using it in the expected way. In either case, this information allows us to investigate further to see whether more change management is required for the users or whether our feature simply does not solve the problem in the way we intended.

Another way to use observability is to evaluate the performance impact of your changes on the system signature in test; comparing this afterwards with the production system signature can give valuable insights and prevent overall performance degradation.

Our testing efforts (and the associated predictions) have now become a valuable tool for the business to evaluate the success of a feature, which elevates testing to a business tool and a real value investment.

Figure 3: Using observability in test by looking left and looking right

Conclusion

While the popularity of observability is a somewhat recent development, it is exciting to see what benefits it can bring to testing. It creates objectivity for defining testing efforts and for evaluating results against the actual system behavior in production. It also provides value to the developer, tester, and business communities, which makes it a valuable tool for breaking down barriers. Using the same practices and tools across communities drives a common culture; after all, culture is nothing but repeated behaviors.