DZone Spotlight

Thursday, October 16
Python Development With Asynchronous SQLite and PostgreSQL

By Craig Oda
After years of working from the comfort of Python and Django, I moved to the wild asynchronous world of FastAPI to improve latency in web-based AI applications. I started with FastAPI and built an open-source stack called FastOpp, which adds command-line and web tools similar to Django's. Initially, things went smoothly using SQLite and aiosqlite to add AsyncIO support to SQLite. I used SQLAlchemy as my Object Relational Mapper (ORM) and Alembic as the database migration tool. Everything seemed to work easily, so I added a Python script to provide something similar to Django's manage.py migrate. As things were going smoothly, I added Pydantic for data validation and connected Pydantic to the SQLAlchemy models with SQLModel. Although I was pulling in open source packages that I wasn't that familiar with, the packages were popular, and I didn't have problems during initial use.

Django comes with an opinionated stack of stable, time-tested tools, which I was starting to miss. However, I was really attracted to FastAPI features such as auto-documentation of APIs and the async-first philosophy. I continued forward by integrating SQLAdmin for a pre-configured web admin panel for SQLAlchemy. I also implemented FastAPI Users. At this point, I ran into problems using FastAPI Users in the same way I used Django authentication, and I got my first glimpse of the complex world outside Django's comprehensive ecosystem. I ended up implementing my own JWT authentication and using FastAPI Users for the password hashing mechanism.

The FastAPI project has a full-stack-fastapi-template that I assessed as a starting point. I chose not to use it because my primary goal was to use Jinja2Templates for an application that streams from an LLM with heavy database use — both SQL and, eventually, a vector database using pgvector with PostgreSQL or, for simpler deployments, FAISS with SQLite and the FTS5 extension. My goal is to provide a more Django-like experience for FastAPI and, in the future, to use FastAPI's built-in API support and auto-documentation instead of implementing something like Django REST framework, which I've found difficult to set up for automatic documentation of API endpoints.

I've considered for a long time whether it's better to just use Django with asyncio from the beginning and not build a Django-like interface around FastAPI. In Django 6, there is some support for background tasks. My primary motivation for moving to FastAPI came when I was using Django for asynchronous communication with LLM endpoints. Although Django works fine with asynchronous communication, its default synchronous style created a number of problems for me. For average people like me, it's difficult to keep a method asynchronous and avoid any calls to libraries or communication channels, like database access, that might be synchronous. At that point, I wanted to simplify my code architecture, so I committed to FastAPI and made my code asynchronous from the beginning. It seemed simple: I thought I just needed to use an asynchronous driver with PostgreSQL and everything would work. I was wrong.

Problems Moving to Asynchronous Database Connections

psycopg2, psycopg3, or asyncpg

The default way to connect to PostgreSQL from Python for many people is psycopg2. It is a very proven driver and the default in most Django applications. Unfortunately, it is synchronous. The most common asynchronous PostgreSQL connector is asyncpg, but I couldn't get it to work in my deployment to Leapcell.

As Leapcell had a psycopg2 example for SQLAlchemy, I used psycopg2 and rewrote the database connection to be synchronous while keeping everything around the connection asynchronous. As the latency with the LLM is much higher than the latency with the database, this seemed like a reasonable solution at the time. I just had to wait for the database to send back the response, and then I was free to deal with other asynchronous problems, such as LLM query and Internet search status updates. The database latency was likely going to be less than 1,500 ms for most queries, which was okay for my application.

Using a synchronous connection to the database is fine in theory, and I'm sure that other, more experienced Python developers can easily solve this problem and keep the synchronous and asynchronous code nicely separated with clean use of async and await. However, I ran into problems organizing my code to use synchronous connections to the database within asynchronous methods that were talking to the LLM and storing the history in the database. As I was familiar with async/await from using Dart for many years, I was surprised I was having these problems. They might have been due to my lack of experience in understanding which pre-made Python modules return synchronous versus asynchronous responses. I think other Python developers might understand my pain. To keep an asynchronous database connection for both SQLite and PostgreSQL, I moved from the synchronous psycopg2 back to asyncpg.
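For reference, here is a minimal sketch of the kind of async engine and session setup described above, assuming SQLAlchemy 2.x's asyncio API with aiosqlite in development and asyncpg in production. The URLs, table name, and query are placeholders, not the actual FastOpp code.

Python
# Minimal sketch: async SQLAlchemy engines for SQLite (dev) and PostgreSQL (prod).
# URLs, table name, and query are illustrative placeholders.
from sqlalchemy import text
from sqlalchemy.ext.asyncio import async_sessionmaker, create_async_engine

dev_engine = create_async_engine("sqlite+aiosqlite:///./app.db")
prod_engine = create_async_engine("postgresql+asyncpg://user:password@host:5432/dbname")

SessionLocal = async_sessionmaker(prod_engine, expire_on_commit=False)

async def count_users() -> int:
    # Every database call is awaited, so the event loop stays free to handle
    # LLM streaming and other concurrent work while the query is in flight.
    async with SessionLocal() as session:
        result = await session.execute(text("SELECT count(*) FROM users"))
        return result.scalar_one()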
SSL Security Not Needed in SQLite, But Needed in PostgreSQL Production

The asyncpg connector worked fine in development, but not in production. Although establishing an SSL network connection seems obvious, I hadn't really appreciated it because I had been deploying to sites like Fly.io, Railway, and DigitalOcean Droplets with SQLite. For small prototype applications, SQLite works surprisingly well with FastAPI.

I was trying to deploy to the free hobby tier of Leapcell to set up a tutorial for students who didn't want to pay, or didn't want to put their credit card into a hosting service, just to go through a tutorial. There's no way to write to the project file system on the Leapcell service engine, so Leapcell recommends using Object Storage and PostgreSQL for persistent data. They offer a free tier that is pretty generous for PostgreSQL.

Leapcell requires SSL communication between their PostgreSQL database and their engine, which they call the service. Unfortunately, the SSL-mode syntax differs between psycopg2 and asyncpg: I couldn't just add ?sslmode=require to the end of the connection URL, and Leapcell did not have an example for asyncpg. Likely due to my limited skills, I wasn't able to modify my application completely enough to put the SSL connections in all the required places. In order to just use the connection URL with sslmode=require, I decided to use psycopg3.
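For reference, here is a sketch of one way to wire this up with SQLAlchemy's psycopg (v3) async dialect. The URL is a placeholder, and the prepare_threshold setting — one way to turn off server-side prepared statements, which the next section explains — should be verified against your psycopg and SQLAlchemy versions; this is not the exact FastOpp configuration.

Python
# Sketch: async engine using the psycopg (v3) dialect. The libpq-style
# sslmode=require parameter can ride along in the URL, unlike with asyncpg.
# prepare_threshold=None is one way to disable automatic server-side prepared
# statements (see the next section); confirm support in your psycopg version.
from sqlalchemy.ext.asyncio import create_async_engine

DATABASE_URL = "postgresql+psycopg://user:password@host:5432/dbname?sslmode=require"

engine = create_async_engine(
    DATABASE_URL,
    connect_args={"prepare_threshold": None},
    pool_pre_ping=True,  # optional: recycle dead pooled connections
)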
Prepared Statements Caused the Application to Crash With SQLAlchemy

As I needed an async ORM in Python, I used SQLAlchemy, which I didn't have much experience with initially. I didn't realize that even though I wasn't writing prepared statements in my Python application, the communication between psycopg and PostgreSQL was creating prepared statements on the server. Due to the way connections are pooled on Leapcell, I had to disable the prepared statements. It took me a while to isolate the problem and then implement the fix.

The problem never occurred when using SQLite because SQLite runs prepared statements in the same process, using the same memory space as the Python program. This is different from PostgreSQL, where the network and session state can change. As I was worried about the performance impact of disabling prepared statements, I did some research, and it appears that SQLAlchemy does statement caching on the Python side. The real-world impact of disabling prepared statements on the PostgreSQL server appears to be negligible.

Summary

Using SQLite in asynchronous mode has been quite easy. Getting PostgreSQL to work has been more difficult. There were three areas I had trouble with for PostgreSQL:

• Asynchronous connection – how to write asynchronous Python code effectively to await the returned data.
• Security – how to deal with both SQLite, which doesn't require SSL, and PostgreSQL in production, which does.
• Prepared statements – I needed to learn to rely on SQLAlchemy's statement caching instead of the built-in prepared statements on the PostgreSQL server.

I like FastAPI, and there are many huge advantages to using it that I saw in the first hour of use. I'm going to continue using it instead of Django. However, I'm starting to really appreciate how much Django shielded me from the infrastructure setup for my applications. FastAPI is unopinionated in areas such as the database, connectors, authentication, and models, and I find it difficult to gain expertise in every one of them. Thus, I am focusing on a smaller set of open source components that work with FastAPI to gain a deeper understanding of their use. I feel that many other Python developers are on a similar journey to experiment more with asynchronous Python web applications. I would appreciate feedback and ideas on which open source components or techniques to use to build effective asynchronous AI applications.

Resources

• FastOpp – Open source stack I am building around FastAPI
• FastAPI – A better Flask
Distributed Locking in Cloud-Native Applications: Ensuring Consistency Across Multiple Instances

By Navin Kaushik
Overview

I am sure most of us have used some kind of locking during development, or have faced issues of incorrect results in states that are difficult to reproduce. Things are not that complex when we need to manage locking within a process, or even across multiple processes on the same machine. It is also very common these days that most of us are building cloud-native applications/services, where there are multiple instances of the service[s] for high availability or load balancing. With multiple instances of a service, things become trickier when you need to make sure that certain operations are performed in a synchronized manner — not across multiple threads/processes, but across multiple pods/nodes in a cloud-native environment. In this blog, this aspect is covered in detail so that you are aware of the challenges and can choose from the available options with due diligence. The primary focus is on the concepts: identifying the purpose of locking, the challenges, the available options, and the decision-making factors.

Assumption

It is assumed that the reader is aware of race conditions and why it's important to take care of them in certain scenarios. Another assumption is that there are multiple instances of the application in a distributed environment, creating concurrent requests where locking is required to perform the task in the expected manner.

Use Cases/Problem Statement

There are two types of motivations for which you may require locking:

1. Efficiency

In this case, you want to prevent multiple copies of an operation/job from running. Nothing breaks if it happens, but it wastes resources; it doesn't lead to incorrect results/states in the end. For example, consider a job that copies files from one folder to another. Since multiple instances are running, all may try to perform this operation at the same time. While this may not lead to a corrupt state, there is no need to perform the operation concurrently, as the final result would be the same.

2. Correctness

In this case, if concurrent operation is allowed on the same state, it may lead to incorrect results and must be prevented. Basically, cases where a race condition may lead to an incorrect result must be prevented. For example, if you open two online banking sessions and try to transfer money at the same time from your bank account, locking must be in place to prevent incorrect results.

Locking levels may differ: sometimes locking is taken care of by the underlying framework, like a database, and sometimes you need to do it explicitly because of a business workflow/use case that is not addressed by the underlying technology/framework. For example:

Taken care of by the database itself: UPDATE employee SET daalchini_balance = daalchini_balance - order_amount WHERE employeeid = xyz. In this case, even if you run this query concurrently, the result would be correct, as the database itself takes care of locking.

Explicit locking: Imagine a scenario where a parking sticker is to be given to an employee if not already taken and if parking stickers are available. There are two tables: one table with the total stickers, and another table containing the details of each employee and their car, along with sticker details.
We need to make sure that even if multiple concurrent calls come from the same employee to take the sticker, the sticker is issued only if it has not already been given. You may need to take an explicit lock to prevent a race condition when checking whether the sticker has already been taken by an employee, and then issue it only if it has not been given and stickers are available — i.e., a read-and-update scenario.

Notes

Please note that a lock may be required even in scenarios where there is no database; the database-based example was used here for simplicity. Please also note that the lock needs to be efficient, as it has a cost associated with it in terms of latency/throughput:

• Lock time should be as short as possible.
• Locks should be as granular as possible, i.e., don't block things that can be done concurrently. A row-level lock is preferable to a table-level lock wherever possible, and sometimes a combination works best — as in the parking sticker example: lock at the business-logic level on employeeid, and leverage the database's implicit locks to update records and achieve the desired state.

Our focus here is not on atomicity, so that part is not discussed, but atomicity needs to be maintained wherever required.

Challenges

Providing locking in a distributed environment — where multiple service instances run on the same or different nodes, even different zones/regions — is more challenging because the locking information needs to be highly available and fault-tolerant. It may be provided by an external service; you may write your own locking mechanism, either (in a complex way) within the same service or as an external locking service; or you may use an existing third-party service that is already part of your ecosystem, like MySQL, GCP Cloud Storage, Redis, etc. Whatever mechanism is used, it must cope with the basic characteristics of a distributed environment (from the current context's point of view):

• There may be network delays/disconnectivity.
• There may be data loss during failover, as replication is mostly asynchronous for performance reasons.
• Any instance of a service can go down at any point in time.

Some Scenarios

Imagine instance 1 of a service takes the lock from an external service and, due to some issue, can't proceed with its work for a while; meanwhile, a timeout happens, and instance 2 of the service gets the lock. Now both are able to change the same state, which may lead to an incorrect result. This means you have to be very careful with the timeout: if the original owner is unable to release the lock before the timeout period, it must either finish/commit its work before the timeout or roll back/undo it, but it must not commit after the timeout. Also, the longer the timeout, the longer another instance is blocked from getting the lock if one instance crashes.

Imagine instance 1 gets a lock, and after that a failover happens at the locking server. The lock information may be lost, as replication happens asynchronously and the locking information may not have been propagated to other nodes. As a result, instance 2 can also acquire the lock and alter the same state, leading to an incorrect result.
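To make the timeout and ownership concerns concrete, here is a minimal single-node sketch in Python using redis-py. The key name, TTL, and token scheme are illustrative, and this is not the full Redlock algorithm discussed in the options below.

Python
# Minimal single-node Redis lock sketch using redis-py. Key names, token, and
# timeout are illustrative; this is not a full Redlock implementation.
import uuid
import redis

r = redis.Redis(host="localhost", port=6379)

def acquire(lock_name: str, ttl_seconds: int = 30) -> str | None:
    token = str(uuid.uuid4())
    # SET ... NX EX: only one instance can create the key; it expires
    # automatically so a crashed owner cannot hold the lock forever.
    if r.set(lock_name, token, nx=True, ex=ttl_seconds):
        return token
    return None

# Release only if we still own the lock, so an instance whose work outlived
# the timeout does not delete a lock now held by another instance.
RELEASE_SCRIPT = """
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
else
    return 0
end
"""

def release(lock_name: str, token: str) -> bool:
    return bool(r.eval(RELEASE_SCRIPT, 1, lock_name, token))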
Distributed Lock Options

RDBMS: Check whether your database provider offers a locking option independent of your schema/tables — in other words, you can use an RDBMS purely as a distributed lock manager (DLM) even if you don't use it as a database. For example, MySQL offers named/user-level locks to synchronize multiple threads/instances. We have used this because MySQL was already part of our ecosystem; it is very simple to use and meets our requirements. It is a good fit, especially when locking is required for correctness. Reference: https://dev.mysql.com/doc/refman/5.7/en/locking-functions.html

Redis: You may use Redis for distributed locking as well. It is simple if you want to use it for efficiency, as it's quite fast, and in the rare case where multiple instances get hold of the lock at the same time, that is acceptable. If you want to use it for correctness, then you may need the Redlock algorithm/implementation. Whether the algorithm is supported depends on whether you are using self-hosted Redis or managed Redis (e.g., on GCP). References:
https://redis.io/docs/reference/patterns/distributed-locks/ (correctness)
https://developpaper.com/implementation-code-based-on-redis-distributed-lock/ (efficiency)
https://dzone.com/articles/distributed-lock-implementation-with-redis

Google Cloud Storage: This is a very interesting way to leverage Google Cloud Storage for locking. If you are using GCP as a cloud provider and Google Cloud Storage is in your ecosystem, and you don't have other options like MySQL/Redis, you may go for it. Reference: https://www.fullstaq.com/knowledge-hub/blogs/a-robust-distributed-locking-algorithm-based-on-google-cloud-storage

There are many other options, like Zookeeper, Hazelcast, etcd, HashiCorp Consul, etc.

Warning

• Locking has a cost associated with it; choose it wisely based on your needs, and identify whether locking is required for efficiency or correctness.
• Use locks as granularly as possible and keep their duration as short as possible.
• It is highly recommended to do performance testing with concurrent requests that work on the same state.

Recommendations

If efficiency is the goal, you may go with Redis; if correctness is required, you may go with MySQL — provided these are already used in your ecosystem.

Summary

I hope you found this blog useful. If you are already using distributed locking in your project, do share the mechanism used and your experience so far in the comment section.

References

https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html
https://www.alibabacloud.com/blog/the-technical-practice-of-distributed-locks-in-a-storage-system_597141
https://davidecerbo.medium.com/everything-i-know-about-distributed-locks-2bf54de2df71
https://tanzu.vmware.com/developer/guides/spring-integration-lock/

Trend Report

Kubernetes in the Enterprise

Over a decade in, Kubernetes is the central force in modern application delivery. However, as its adoption has matured, so have its challenges: sprawling toolchains, complex cluster architectures, escalating costs, and the balancing act between developer agility and operational control. Beyond running Kubernetes at scale, organizations must also tackle the cultural and strategic shifts needed to make it work for their teams.

As the industry pushes toward more intelligent and integrated operations, platform engineering and internal developer platforms are helping teams address issues like Kubernetes tool sprawl, while AI continues cementing its usefulness for optimizing cluster management, observability, and release pipelines.

DZone's 2025 Kubernetes in the Enterprise Trend Report examines the realities of building and running Kubernetes in production today. Our research and expert-written articles explore how teams are streamlining workflows, modernizing legacy systems, and using Kubernetes as the foundation for the next wave of intelligent, scalable applications. Whether you're on your first prod cluster or refining a globally distributed platform, this report delivers the data, perspectives, and practical takeaways you need to meet Kubernetes' demands head-on.

Refcard #387

Getting Started With CI/CD Pipeline Security

By Sudip Sengupta

Refcard #216

Java Caching Essentials

By Granville Barnett

More Articles

*You* Can Shape Trend Reports: Join DZone's Database Systems Research

Hey, DZone Community! We have an exciting year of research ahead for our beloved Trend Reports. And once again, we are asking for your insights and expertise (anonymously if you wish) — readers just like you drive the content we cover in our Trend Reports. Check out the details for our research survey below.

Database Systems Research

With databases powering nearly every modern application nowadays, how are developers and organizations utilizing, managing, and evolving these systems — across usage, architecture, operations, security, and emerging trends like AI and real-time analytics? Take our short research survey (~10 minutes) to contribute to our upcoming Trend Report. Oh, and did we mention that anyone who takes the survey could be one of the lucky four to win an e-gift card of their choosing? We're diving into key topics such as:

• The databases and query languages developers rely on
• Experiences and challenges with cloud migration
• Practices and tools for data security and observability
• Data processing architectures and the role of real-time analytics
• Emerging approaches like vector and AI-assisted databases

Join the Database Systems Research

Over the coming month, we will compile and analyze data from hundreds of respondents; results and observations will be featured in the "Key Research Findings" of our upcoming Trend Report. Your responses help inform the narrative of our Trend Reports, so we truly cannot do this without you. Stay tuned for each report's launch and see how your insights align with the larger DZone Community. We thank you in advance for your help!

—The DZone Content and Community team

By DZone Editorial
Indexing Across Data Models: From Tables to Documents to Text

Every modern software application relies on a database to persist and manage its data. The choice of database technology is largely influenced by the application's data model and its read and write throughput. For large datasets, query efficiency is critical. An inefficient query that works on a small dataset can quickly turn into a performance bottleneck when scaled to hundreds of thousands or millions of data points. While query optimization helps, it alone cannot guarantee high throughput. Factors such as data modeling, normalization, partitioning strategies, indexing, and even hardware resources all play a role in determining how quickly a system can serve reads and process writes. In this article, we will focus on indexes — one of the most powerful techniques for improving overall database performance. We'll explore what indexes are, how they work, and how they can dramatically influence both read and write throughput in real-world systems.

Indexes

Imagine you are visiting a library to find a book to read. The library has organized its books into sections, e.g., science, fiction, or autobiographies. If you are interested in fiction, you head straight to that section, skipping the others. Within the fiction section, suppose the books are arranged alphabetically by title on different racks. If you are looking for a book that starts with "L," you go directly to the rack labeled "L" instead of scanning every shelf. This is similar to how indexes work in databases. Just like the sections in a library help you quickly find the right book, a database index helps the system quickly find the right data — without searching through everything.

Imagine if the library didn't have any sections or alphabetical arrangement. You would have to check each book one by one until you found the one you were looking for. The bigger the library, the longer it would take. In computing terms, this kind of search is called O(n) — the time it takes grows in direct proportion to the number of items. So, if the library doubles in size, your search time roughly doubles too! That's why indexes are so important in databases — they help avoid this slow, manual search and make data retrieval efficient and much faster.

Definition

In its most minimal form, an index can be defined as additional metadata, derived from the primary data, that serves as an indicator to efficiently locate that data. In the library analogy, the category and title-wise sections serve as the metadata to quickly locate the required book. A typical index has the following characteristics:

• Indexes only utilize the primary data without affecting it, i.e., whether indexed or not, the primary data stays as-is.
• While indexes improve reads, they lower write throughput, as the indexes require updates for every change in the primary data.
• There can be multiple indexes on the same data for different ways of looking it up. E.g., a book could be indexed independently by its category, title, author, publisher, etc. An index could even be a combination of these, which is termed a composite index.
• Most indexes are chosen manually so that the application stays in control instead of the tools themselves.
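To make the library analogy concrete before walking through the index types, here is a small sketch using Python's built-in sqlite3 module. The table, rows, and index name are invented, and the exact EXPLAIN QUERY PLAN wording varies by SQLite version.

Python
# A minimal sketch of what an index buys you, using Python's built-in sqlite3
# module. Table, column, and index names are made up for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO books (title, category) VALUES (?, ?)",
    [(f"Book {i}", "fiction" if i % 2 else "science") for i in range(10_000)],
)

def plan(query: str) -> str:
    # EXPLAIN QUERY PLAN reports whether SQLite will scan the table or use an index.
    return conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()[0][-1]

print(plan("SELECT * FROM books WHERE title = 'Book 42'"))
# Without an index on title: something like "SCAN books" (O(n) over all rows)

conn.execute("CREATE INDEX idx_books_title ON books (title)")
print(plan("SELECT * FROM books WHERE title = 'Book 42'"))
# With the index: something like "SEARCH books USING INDEX idx_books_title (title=?)"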
Type of Indexes

As the primary purpose of an index is to boost reads, the read (or query) pattern determines the index type and its structure for optimal results. For key-value data, a hash-based index suffices if data is queried via its key only. If the query requires multiple attributes of the data (i.e., the value), then indexes based on a log-structured merge-tree (LSM) or B-tree are used. Moreover, if multi-dimensional range queries are required, then R-tree-based indexes are needed. Let's explore the various index types and their corresponding use cases.

Hash Index

Hash indexes use a hash function to map keys to a hash bucket. This enables constant-time, i.e., O(1), lookup for an exact match. It is best suited for key-value data lookups with unique keys. Since hashing doesn't maintain order, hash indexes are unsuitable for any range-based queries.

Typical Hash Indexes

Popular tools that utilize hash-based indexes include the MySQL MEMORY (HEAP) engine, which defaults to hash indexes for in-memory tables; Redis, known for its core data structures that deliver microsecond-level access; and PostgreSQL, where hash indexes are an alternative to the default B-trees, optimized for equality checks.

Log-Structured Merge Tree (LSM)

LSM tree indexes maintain an in-memory index structure for all writes, often a balanced tree such as a red-black tree. To make writes durable, they are also written sequentially to a write-ahead log. On reaching a certain threshold, the in-memory index data is flushed to disk as an immutable, sorted file called a Sorted String Table (SSTable), which acts as the core index. To serve reads, the in-memory index is consulted first, followed by the SSTables. For more efficient lookups, Bloom filters are used to rule out SSTables that cannot contain the requested data. Moreover, a background process ensures that stale entries are periodically purged from SSTables via merge and compaction.

Typical Log-Structured Merge-Tree Indexes

As the LSM tree is designed for write-heavy workloads and achieves excellent write throughput, these indexes are mainly suitable for large-scale key-value stores (e.g., Cassandra, LevelDB, RocksDB). Depending on how the data is spread across multiple SSTables, read throughput may be slower than that of other indexes. Note: Apache Lucene uses an LSM-like design and supports fuzzy text search up to a certain edit distance.

B-Tree

A B-tree is a balanced tree structure where data is stored in sorted order across nodes called pages. Each node may have multiple children, ensuring the tree remains shallow and lookups are fast. Searching, inserting, and deleting involve traversing from the root down to a leaf, usually in O(log n) time. A B-tree index is much like a library index — instead of scanning every book, you follow the guideposts until you locate the exact book you need.

Typical B-Tree Indexes

A B-tree is ideal for read-heavy workloads and range queries. Thus, it is the most common index in relational databases (e.g., MySQL's InnoDB, PostgreSQL, Oracle) where predictable query performance is critical. It's worth noting that most databases actually use a B+ tree variant, where only leaf nodes store row pointers and internal nodes only guide navigation. This makes range scans even faster.
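As a loose illustration of what sorted order buys — binary search over an in-memory Python list, not an actual B-tree with pages on disk — the following sketch shows the O(log n) lookup idea; the data is invented.

Python
# Loose analogy only (not a real B-tree): keeping keys sorted lets a lookup
# halve the search range at each step, the same O(log n) idea a B-tree applies
# to disk pages. The data here is invented.
import bisect

sorted_ids = list(range(0, 1_000_000, 3))      # sorted keys, as in an index

def contains(key: int) -> bool:
    pos = bisect.bisect_left(sorted_ids, key)  # binary search: O(log n)
    return pos < len(sorted_ids) and sorted_ids[pos] == key

print(contains(300))   # True  (300 is a multiple of 3)
print(contains(301))   # False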
Rectangle-Tree (R-Tree)

A rectangle-tree (R-tree) is a hierarchical, balanced tree structure designed to index multi-dimensional spatial data like points, rectangles, polygons, or geographic coordinates. An R-tree can be considered the spatial equivalent of a B-tree, which handles one-dimensional data only. In an R-tree, each node represents a minimum bounding rectangle (MBR), or bounding box (BBOX), that encloses its children. The root node covers the entire dataset, and internal nodes contain the MBRs pointing to child nodes. The leaf nodes store the actual object data or references to it. So when a range or geospatial query is made, only the relevant bounding rectangles are traversed, thus eliminating a full traversal. R-tree search, insert, and delete are O(log n), with a worst case of O(n) only if bounding boxes overlap heavily.

Typical Rectangle Tree Indexes

An R-tree is most efficient for range queries such as "all restaurants within 2 km" or nearest-neighbor searches like "closest hospital to the current location." PostgreSQL (with PostGIS) uses R-tree-over-GiST spatial indexes for geospatial queries. SQLite has a built-in R-tree module for geometry-heavy apps. Many GIS systems, including ArcGIS, QGIS, and other mapping platforms, rely on R-trees for fast rendering and querying. Note: Elasticsearch uses a BKD tree to support geospatial query indexing; internally, this BKD tree is converted to an R-tree.

Bitmap Index

A bitmap index is a special type of database index that uses bit arrays (bitmaps) instead of traditional tree structures. Each distinct value in a column gets its own bitmap (a sequence of 1s and 0s), where each bit corresponds to a row: 1 means the row contains the value, and 0 means it does not.

Typical Bitmap Indexes

Bitmap indexes are best suited for low-cardinality columns, i.e., those with few distinct values, e.g., gender, boolean flags, marital status, region codes, etc. Moreover, bitwise operations (AND, OR, NOT) on bitmaps are extremely efficient. Bitmap indexes are commonly used in data warehouses and read-heavy systems where queries often combine multiple filters. Oracle Database offers native bitmap indexing, while PostgreSQL supports bitmap scans (but not persistent bitmap indexes by default). Apache Hive and Spark SQL use bitmap-based techniques in their respective query optimization layers.

Inverted Index

An inverted index is a data structure that maps terms (words) to the list of documents (or locations) where they appear. In an inverted index, each unique word in the dataset is stored once. Before being stored, the word is normalized, i.e., lowercased, stemmed, or removed if it's a stop word. For each word, a list of the document IDs (and sometimes positions) where that word occurs is stored. This enables a query to quickly return the documents in which the specified word is present.

Typical Inverted Indexes

Inverted indexes are the backbone of modern search engines. They are used in log aggregation platforms (e.g., Splunk, the ELK stack) to query terabytes of machine data in near real time. Moreover, some RDBMSs (like PostgreSQL's GIN index) and NoSQL systems (Elasticsearch, Apache Lucene, Solr, etc.) use inverted indexes for full-text search.
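As a toy illustration of the term-to-documents mapping described above, here is a minimal inverted index in Python; the documents, tokenizer, and stop-word list are simplified placeholders.

Python
# Toy inverted index: maps each normalized term to the set of document IDs that
# contain it. Tokenization and normalization are deliberately simplified.
from collections import defaultdict

STOP_WORDS = {"the", "a", "is", "to"}

docs = {
    1: "The quick brown fox",
    2: "A quick guide to indexes",
    3: "Brown bears like honey",
}

index: dict[str, set[int]] = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        if word not in STOP_WORDS:
            index[word].add(doc_id)

# Query: documents containing both "quick" AND "brown"
print(index["quick"] & index["brown"])   # {1}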
Vector Index

A vector index is a data structure designed to store and search high-dimensional vectors efficiently. Instead of exact matches (like traditional indexes), it enables similarity search — finding items, e.g., "Find me all the products in the image that look like this one," or "Suggest songs that feel like this track." While B-trees unlocked fast relational queries, vector indexes are unlocking fast semantic queries. Vector indexes handle high-dimensional vector embeddings — mathematical representations of text, images, audio, or user behavior. At their core, vector indexes organize embeddings in a way that makes Approximate Nearest Neighbor (ANN) search efficient. Typical implementations include the Inverted File Index (IVF), Hierarchical Navigable Small World graphs (HNSW), and Product Quantization (PQ).

Typical Vector Indexes

Typical usage of vector indexes includes recommendation systems, computer vision — such as image or video retrieval by "visual similarity" — and AI memory and retrieval-augmented generation (RAG), where large language models (LLMs) retrieve relevant context from vast knowledge bases. Several modern tools and databases implement vector indexing under the hood, e.g., Facebook AI Similarity Search (FAISS) — an open-source library for ANN search — and Annoy (Spotify), optimized for recommendation systems.

Secondary Index

A secondary index is an additional index created on non-primary-key attribute(s) of a table. Unlike the primary index (which is tied to the primary key and usually unique), a secondary index can be built on any column(s), even those with duplicate values. Whenever data is inserted, updated, or deleted, the secondary index must also be updated — so there is a trade-off between faster reads and slower writes. Typically, secondary indexes are used to speed up queries that filter or sort on non-primary-key columns. They also allow efficient access paths for multiple attributes in the same data. Secondary indexes are especially helpful in large datasets where scanning the whole table would be too slow. In traditional relational databases, PostgreSQL supports B-tree, hash, GIN, and BRIN secondary indexes. MySQL/MariaDB has secondary indexes on InnoDB tables. Oracle and SQL Server rely heavily on secondary indexes for query optimization. Among NoSQL databases, MongoDB allows secondary indexes on fields beyond _id, as does Couchbase, which calls them "Global Secondary Indexes" (GSI). In Elasticsearch, its inverted index structure serves a similar purpose for text search.

Composite or Concatenated Index

A composite index (also called a concatenated index or multi-column index) is an index built on two or more columns of a table, instead of just one. It stores a combined, ordered structure of multiple columns, allowing the database to quickly locate rows based on those columns together. The sequence of columns in the index is critical. E.g., an index on (last_name, first_name) can efficiently support queries filtering by last_name alone, or by both last_name and first_name — but not by first_name alone. Having individual indexes on each column is not the same; the optimizer may not combine them as efficiently as a single composite index. A composite index is typically useful when queries often filter, sort, or join on the same set of columns. Composite indexes are supported in all major relational databases. PostgreSQL uses B-tree composite indexes with advanced variations like multicolumn GiST indexes. SQL Server supports composite clustered and non-clustered indexes, with options like included columns. Oracle supports concatenated indexes with optimizer hints for fine-tuning. Even NoSQL systems like MongoDB support compound indexes, applying similar principles.
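To see the column-order rule in action, here is a small sketch using Python's sqlite3 module; the table and index names are invented, and the exact plan wording varies by SQLite version.

Python
# Sketch showing that a composite index on (last_name, first_name) helps a
# query on last_name but not one on first_name alone. Names are invented;
# exact EXPLAIN QUERY PLAN wording varies by SQLite version.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (last_name TEXT, first_name TEXT, dept TEXT)")
conn.execute("CREATE INDEX idx_emp_name ON employees (last_name, first_name)")

def plan(query: str) -> str:
    return conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()[0][-1]

print(plan("SELECT * FROM employees WHERE last_name = 'Smith'"))
# Expect an index search, e.g. "SEARCH employees USING INDEX idx_emp_name (last_name=?)"

print(plan("SELECT * FROM employees WHERE first_name = 'Ada'"))
# Expect a full scan, e.g. "SCAN employees" — the leading column is missing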
Clustered Index

A clustered index is a way of organizing data in a database table so that the rows are physically stored on disk in the same order as the index. Since a table's rows can only be stored in one order, only one clustered index per table is possible. Queries that search by the clustered index key (e.g., primary key, date, or ID) are very efficient because the data is already sorted that way. It's usually implemented as a B-tree, where leaf nodes contain the actual data rows. A clustered index can be on a single column or on multiple columns (i.e., a composite index), too. Clustered indexes are best suited for range queries, since the rows are stored in order and the database can quickly scan a contiguous block. Queries with ORDER BY or GROUP BY on the clustered column often skip extra sorting steps. Since the clustered index is often built on the primary key, fetching rows by ID is lightning fast. However, clustered indexes can slow down insert-heavy workloads (because new rows must be slotted into the correct physical position) and make updates to indexed columns more expensive. SQL Server and MySQL (InnoDB engine) always cluster data around the primary key. PostgreSQL doesn't maintain clustered indexes automatically, but the CLUSTER command can be used to physically reorder a table. Oracle uses Index-Organized Tables (IOTs), which are essentially clustered indexes.

Conclusion

While most software professionals don't need to master every internal detail of database indexes, understanding their fundamentals is highly valuable. Indexes are not just abstract data structures; they directly shape how efficiently applications handle queries, scale under load, and balance read/write performance. With this knowledge, engineers and architects can make more informed choices about which database engines or index types best suit their data models and query patterns. However, real-world performance depends on data distribution, workload, and hardware. That's why benchmarking with actual datasets is essential before adopting or switching indexes. Thoughtful selection, validated by testing, ensures systems remain efficient, resilient, and scalable in practice.

References and Further Reading

• Log-structured_merge-tree
• B-tree
• GiST
• PostGIS
• Geographic_information_system
• Postgres GIN
• R-tree
• Inverted_index
• Vector_space_model
• Designing Data-Intensive Applications

By Ammar Husain
Senior Developers, What to Read Next?

Recently, one of my best friends — who is also one of the smartest developers I have the luck to know — asked me what book he should read next to further develop his skills. It took me some time to gather my thoughts, and the result might be useful for others, too. Spoiler alert: I could not find a single book that I would call the one to read as a senior developer. Instead, I summarized the books that I found good for one reason or another. As that implies, this is a subjective list; feel free to agree or disagree with my choices, and feel free to leave a comment or contact me in any other way to share your thoughts.

First of all, why read books? We are in 2025 — everything important can be summarized in a 160-character message, and nobody has the mental capacity to consume anything longer than four seconds. Jokes aside, it is a valid concern that books can get outdated; following tech websites can help you stay up to date. (AI, a currently hot topic, is a really good example of this — I am not aware of many well-written and still up-to-date books on it.) While this is true, I still have two reasons why people should read physical books:

• Because physical books usually go much deeper into topics than shorter publications. They tend to present not only the direct results, but also the ideas, thoughts, and assumptions behind advice (i.e., how you should write code) or engineering decisions (i.e., why a given framework's API was built in a particular way). If you want to learn more, these aspects are far more important than knowing a whole API by heart or knowing the best practices without understanding the reasoning behind them.
• Because you will remember better: at least in my experience, I remember far better which book I read something in than which blog I read it on. I can remember the font, the size of the book, and the number of pages on the left and right side; therefore, when I look for something I read on paper, I usually find it faster than something I read online. This might not apply to everyone, but according to my colleagues, it applies to most of us.

As a side note, I will not link any online shop for the books, but I will provide the ISBN numbers, and you can look them up in the shop of your choice.

The Core

Let's start with my advice on two books. I know my friend has already read them, but I was unsure whether he had the physical copies, too: Clean Code (ISBN 978-0132350884) and Clean Architecture (978-0134494166) by Robert C. Martin. You do not have to agree with everything in them, but I expect every single developer in my team to know the concepts and arguments listed in these two. As a side note, the second edition of Clean Code (978-0135398579) will be released soon (October 27, 2025), and I am already excited to get it. In general, I think it is a good idea to re-read these books every couple of years.

Technical Books

I do not believe that any of the following books would be 100% new and unknown to developers who have been around for a while. Still, they might contain aspects that you have not thought through until now, so they could be a good addition to your library.

I really liked The Software Craftsman (978-0134052502) because it places software development in a different context and discusses internal processes, interviewing, personal dedication, and other aspects.
This book will not help you with low-level coding details, but it can provide insights into the industry you are working in, which definitely helps you improve as a developer, too.

Get Your Hands Dirty on Clean Architecture (978-1805128373) managed to surprise me. It has a really good section about layered architecture, and it complements the Clean Architecture book really well by detailing some code-level aspects, which can be really helpful if you are not used to actually writing code according to clean architecture standards.

Kent Beck's book, Test Driven Development (978-0321146533), is simply the best book I've seen on TDD. I really liked how he demonstrated the strength of TDD with multiple examples, even writing an xUnit tool.

The next book does not strictly belong to this list, as it is not meant for senior developers with years, maybe even decades, of experience, but I find its writing style really good. Java By Comparison (978-1680502879) aims to help developers with less experience learn best practices and avoid common mistakes. If you, as a senior, are asked by a junior what to read, this could be a really good pick.

Clean Agile (978-0135781869) provides insights into how R. C. Martin remembers the early days of the agile movement — how it evolved, the situations in which agile methodology fails to help, and, in general, what it was originally intended for. Reading this can heavily improve your added value in agile projects, simply by giving you a better understanding of the methodology itself, and maybe you can help your team be more efficient, regardless of how strictly you follow a given ruleset.

Non-Technical Books

The books mentioned up to this point were written by tech people for tech people, which is ultimately a valid approach because we should learn from each other. The following books do not fall into the same category: they are not necessarily written by technical people and are meant for more than just technical audiences. I still recommend them.

Humble Pi: When Math Goes Wrong in the Real World (978-0593084694) is a super entertaining book — you would not expect anything else if you are familiar with the author. Beyond being entertaining, it brings attention to responsibility, which I find really important for a software developer to be aware of. Some mistakes lead to a lot of money being lost. Some mistakes lead to people's lives ending earlier. I do not want to make all of us paranoid, but from time to time, everyone should consider what can happen if something in their code malfunctions. In the same book, I found some interesting details on topics I was not even aware of, and on how complex they can be. My favorite part was about calendars: everyone knows that dealing with time zones and the various ways of counting days and years can be tricky. But I was not aware it could be this tricky, or how much of a struggle it was when sending a message from one European city to another was not a matter of seconds but weeks.

Have you ever felt you are Surrounded by Idiots (978-1785042188)? If you work as a developer and have never felt this, please contact me. I mean it — I want to know where you work and how to apply. This book describes people and groups. I assume you are a person and you work in some sort of group; therefore, this book could be useful for understanding others and the dynamics in which they are most efficient. I do not believe this book will help you resolve conflicts.
But it can help you understand the reasons for conflicts.

The last book I recommend is Thinking, Fast and Slow (978-0374533557). This one covers many topics that are not, or are only marginally, related to software development. Still, understanding how people make decisions and what the limits of rational behaviour are can help you a lot if you want to practice some self-reflection. And I believe most of us developers could practice a bit more of it.

Wrapping Up

Feel free to pick any book from this list for the reasons I mentioned or for any other reason you may have, or pick any other book you believe will help you become a better developer. My main message is simply this: please consume high-quality sources, so we do not fall into the same mistakes over and over again.

By Daniel Buza
Where Stale Data Hides Inside Your Architecture (and How to Spot It)

Every system collects stale data over time — that part is obvious. What's less obvious is how much of it your platform will accumulate and, more importantly, whether it builds up in places it never should. That's no longer just an operational issue but an architectural one. In my experience, I've often found stale data hiding in corners nobody thinks about. On the surface, they look harmless, but over time, they start shaping system behavior in ways that are hard to ignore. And it's not just a rare edge case: studies show that, on average, more than half of all organizational data ends up stale. That means the risks are not occasional but systemic, quietly spreading across critical parts of the platform.

The impact isn't limited to performance. Outdated records interfere with correctness, break consistency across services, and complicate debugging. What is more, stale data quietly consumes storage and processing resources, increasing operational costs. Based on what I've seen in enterprise platforms, I can point to several hidden spots that deserve far more attention than they usually get.

Where Stale Data Finds Room to Hide

My team often joins enterprise projects with goals like improving performance or reducing costs. Each time, the same lesson surfaces: by examining the spots below, platforms become leaner, faster, and far easier to maintain.

Cache Layers as Hidden Conflict Zones

Stale data often hides not in caching itself but in the gaps between cache layers. When application, storefront, and CDN caches don't align, the system starts serving conflicting versions of the truth, like outdated prices or mismatched product images. In one enterprise ecommerce platform, we traced product inconsistencies back to five overlapping cache levels that overwrote each other unpredictably — a classic case of caching mistakes. The fix required reproducing the conflicts with architects and tightening configurations. A clear warning sign that your cache may hide stale data is when problems vanish after cache purges, only to return later. It often means the layers are competing rather than cooperating.

Synchronization Jobs That Drift

Another source of stale data is asynchronous synchronization. On paper, delayed updates look harmless, as background jobs will "catch up later." In practice, those delays create a silent drift between systems. For example, users of a jewelry platform saw outdated loyalty points after login because updates were queued asynchronously. Customers assumed their balances had disappeared, support calls surged, and debugging became guesswork. The issue was fixed by forcing a back-end check whenever personal data pages were opened. A common signal is when user-facing data only appears correct after manual refreshes or additional interactions.

Historical Transaction Data That Never Leaves

One of the heaviest anchors for enterprise systems is transactional history that stays in production far longer than it should. Databases are built to serve current workloads, not to carry the full weight of years of completed orders and returns. This is exactly what my team encountered in a European beauty retail platform: the production database had accumulated years of records, slowing queries, bloating indexes, and dragging overnight batch jobs while costs crept higher. The fix was smart archiving: moving old records out of production and deleting them once the retention periods expired.
A telling signal is when routine reports or nightly jobs begin stretching into business hours without clear functional changes.

Legacy Integrations as Silent Data Carriers

Integrations with legacy systems often look stable because they "just work." The trouble is that over time, those connections become blind spots. Data is passed along through brittle transformations, copied into staging tables, or synchronized with outdated protocols. At first, the mismatches are too small to notice, but they slowly build into systemic inconsistencies that are painful to trace. A signal worth watching is when integrations are left undocumented, or when no one on the team can explain why a particular sync job still runs. That usually means it's carrying stale data along with it.

Backups With Hidden Liabilities

Backups are the one place everyone assumes data is safe. The paradox is that safety can turn into fragility when outdated snapshots linger for years. Restoring them may quietly inject obsolete records back into production or test systems, undermining consistency at the very moment resilience is needed most. The architectural pain lies in rising storage costs and the risk of corrupted recovery. A simple indicator is when backup retention policies are unclear or unlimited. If "keep everything forever" is the default, stale data has already found its way into your disaster recovery plan.

Having seen the corners where stale data tends to accumulate, the next question is: how do you tell when it's quietly active in yours?

Spotting the Signals of Stale Data

Over the years, I've learned to watch for patterns like these:

• Lagging reality: Dashboards or analytics that consistently trail behind real events, even when pipelines look healthy.
• Phantom bugs: Issues that disappear after retries or re-deployments, only to return without code changes.
• Inconsistent truths: Two systems show different values for the same entity — prices, stock, balances — without a clear root cause.
• Process creep: Batch jobs or syncs that take longer every month, even when business volume hasn't grown at the same pace.
• Operational tells: Teams relying on manual purges, ad-hoc scripts, or "refresh and check again" advice as standard troubleshooting steps.

Signals spotted, hiding places uncovered — the next question is obvious: what do you actually do about it? Here is some practical advice.

Keeping Data Fresh by Design

Preventing stale data requires making freshness an architectural principle. It often starts with centralized cache management, because without a single policy for invalidation and refresh, caches across layers will drift apart. From there, real-time synchronization becomes critical, as relying on overnight jobs or delayed pipelines almost guarantees inconsistencies will creep in. But even when data moves in real time, correctness can't be assumed. Automated quality checks, from anomaly detection to schema validation, are what keep silent errors from spreading across systems. And finally, no system operates in isolation. Imports and exports from external sources need fail-safes: guardrails that reject corrupted or outdated feeds before they poison downstream processes. Taken together, these practices shift data freshness from reactive firefighting to proactive governance, ensuring systems stay fast, consistent, and trustworthy.
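As a minimal sketch of what a single freshness policy can look like in code — the TTL value, helper names, and in-process dictionary are illustrative stand-ins for a real cache layer, not a production pattern:

Python
# Sketch of one "single policy" for cache freshness: every cached value carries
# a TTL, and writes go through one helper so no layer decides on its own.
# The in-process dict stands in for a real cache; names are illustrative.
import time

CACHE_TTL_SECONDS = 60
_cache: dict[str, tuple[float, object]] = {}

def cache_get(key: str):
    entry = _cache.get(key)
    if entry is None:
        return None
    stored_at, value = entry
    if time.time() - stored_at > CACHE_TTL_SECONDS:
        _cache.pop(key, None)   # expired: treat as a miss
        return None
    return value

def cache_put(key: str, value) -> None:
    _cache[key] = (time.time(), value)

def on_write(key: str, value) -> None:
    # Single invalidation path: update the source of truth first (omitted),
    # then refresh the cache so readers never see the stale version.
    cache_put(key, value)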
Fresh Data as an Ongoing Architectural Discipline

In my experience, the cost of stale data rarely hits all at once — it creeps in. Performance slows a little, compliance checks get harder, and customer trust erodes one mismatch at a time. That's why I see data freshness not as a cleanup task but as an ongoing architectural discipline. The good news is you don't need to fix everything at once. Start by asking where stale data is most visible in your system today and treat that as your entry point to building resilience.

By Andreas Kozachenko
Infusing AI into Your Java Applications

Artificial intelligence (AI) is becoming increasingly pervasive. As an Enterprise Java developer, you might be wondering what value AI can add to your business applications, what tools Java provides to easily do that, and what skills and knowledge you might need to learn. In this article, we equip you with the basic knowledge and skills that you need to start exploring the capabilities of AI to build intelligent and responsive Enterprise Java applications.

When we talk about AI in this article, we mean getting responses from a large language model (LLM) based on a request that the Java application sends to the LLM. In our article's example, we create a simple chatbot that customers can ask for planetary tourist destination recommendations, and then use to book a spaceship to visit them. We demonstrate using Java frameworks like LangChain4j with Quarkus to efficiently interact with LLMs and create satisfying applications for end users.

Hello (AI) World: Getting an LLM to Respond to a Prompt

The first version of our spaceship rental application will build a chatbot that interacts with customers using natural language. It should answer any customer questions about planets they wish to visit in the solar system. For the full application code, see the spaceship rental step-01 directory in the GitHub repository.

The chatbot sends the customer's questions to the application, which interacts with the LLM to help process the natural language questions and to respond to the customer. For the AI-related parts of the application, we create just two files:

• An AI service, CustomerSupportAgent.java, which builds a prompt informing the LLM about our solar system's planets and instructs the LLM to answer questions from customers.
• A WebSocket endpoint, ChatWebSocket.java, which receives the user's messages from the chatbot.

AI services are Java interfaces that provide a layer of abstraction. When using LangChain4j, these interfaces make LLM interaction easier. AI services are an integration point, so in a real application, you would need to consider security, observability, and fault tolerance of the connections and interactions with the LLM. As well as handling LLM connection details (stored separately in the application.properties configuration file), an AI service builds the prompts and manages chat memory for the requests it sends to the LLM.

The prompt is built from two pieces of information in the AI service: the system message and the user message. System messages are typically used by developers to give the LLM contextual information and instructions for handling the request, often including examples that you want the LLM to follow when generating its response. User messages provide the LLM with application user requests.

The CustomerSupportAgent interface is registered as the AI service in the application. It defines the messages used to build the prompt and sends the prompt to the LLM:

Java
@SessionScoped
@RegisterAiService
public interface CustomerSupportAgent {

    @SystemMessage("""
            You are a friendly, but terse customer service agent for Rocket's Cosmic Cruisers,
            a spaceship rental shop. You answer questions from potential guests about the
            different planets they can visit. If asked about the planets, only use info
            from the fact sheet below.
            """ + PlanetInfo.PLANET_FACT_SHEET)
    String chat(String userMessage);
}

Let's look at what this code is doing.
The @SessionScoped annotation maintains the session for the duration of the web service connection and maintains the chat memory for the duration of the conversation. The @RegisterAIService annotation registers an interface as an AI service. LangChain4j automatically implements the interface. The @SystemMessage annotation tells the LLM how to behave when responding to the prompt. When the end user types a message in the chatbot, the WebSocket endpoint passes the message to the chat() method in the AI service. There is no @UserMessage annotation specified in our AI service interface, so the AI service implementation automatically creates a user message with the chat() method parameter value (in this case the userMessage parameter). The AI service adds the user’s message to the system message to build a prompt that it sends to the LLM, then displays the response from the LLM in the chatbot interface. Note that, for readability, the planet information has been placed in a separate PlanetInfo class. Alternatively, you could place the planet information directly in the system message. The ChatWebSocket class defines a WebSocket endpoint for the application’s chatbot UI to interact with: Java @WebSocket(path = "/chat/batch") public class ChatWebSocket { private final CustomerSupportAgent customerSupportAgent; public ChatWebSocket(CustomerSupportAgent customerSupportAgent) { this.customerSupportAgent = customerSupportAgent; } @OnOpen public String onOpen() { return "Welcome to Rocket's Cosmic Cruisers! How can I help you today?"; } @OnTextMessage public String onTextMessage(String message) { return customerSupportAgent.chat(message); } } The CustomerSupportAgent interface uses constructor injection to automatically provide a reference to the AI service. When the end user types a message in the chatbot, the onTextMessage() method passes the message to the AI service chat() method. For example, if the user asks, "What's a good planet to visit if I want to see volcanoes?", the application responds with a recommendation and why the user might like to visit there, as a fan of volcanoes: The Spaceship Rental application chatbot Providing an Illusion of Memory As you continue your conversation with the chatbot, it might seem as though it is aware of previous messages exchanged, that is, the context of your conversation. When you talk to another person, you take for granted that they remember what you (and they) last said. Requests to an LLM are stateless, though, so each response is generated solely based on the information contained within the request prompt. To maintain context in a conversation, the AI service uses chat memory, through LangChain4j, to store prior user messages and the chatbot’s responses. By default, the Quarkus LangChain4j extension stores the chat in memory, and the AI service manages the chat memory (for example, by dropping or summarizing the oldest messages) as needed to remain within the memory limits. LangChain4j by itself would require you to first configure a memory provider, but that is not needed when using the Quarkus LangChain4j extension. This gives a practical illusion of memory to end users and improves the user experience so they can enter follow-on messages without needing to repeat everything they previously said. The user chatbot experience can also be improved by streaming the responses from the LLM. Streaming Responses for a More Responsive User Experience You might notice responses to your chat message window take time to generate and then appear all at once. 
To improve the chatbot’s perceived responsiveness, we can modify the code to return each token of the response as it is generated. This approach, called streaming, allows users to start reading a partial response before the entire response is available. For the full application code, see the GitHub spaceship rental step-02 directory. Changing our application to stream the chatbot response is easy. First, we’ll update the CustomerSupportAgent interface to add a method that returns an instance of the SmallRye Mutiny Multi<String> interface: Java @SessionScoped @RegisterAiService @SystemMessage(""" You are a friendly, but terse customer service agent for Rocket's Cosmic Cruisers, a spaceship rental shop. You answer questions from potential guests about the different planets they can visit. If asked about the planets, only use info from the fact sheet below. """ + PlanetInfo.PLANET_FACT_SHEET) public interface CustomerSupportAgent { String chat(String userMessage); Multi<String> streamChat(String userMessage); } Moving the @SystemMessage annotation to the interface means that the annotation doesn’t have to be added to each of the methods in the interface. The streamChat() method returns the LLM’s response to the chat window one token at a time (instead of waiting to display the full response all at once). We also need to call the new streamChat() method from a WebSocket endpoint. To preserve both batch and stream functionality, we create a new ChatWebSocketStream class that exposes the /chat/stream WebSocket endpoint: Java @WebSocket(path = "/chat/stream") public class ChatWebSocketStream { private final CustomerSupportAgent customerSupportAgent; public ChatWebSocketStream(CustomerSupportAgent customerSupportAgent) { this.customerSupportAgent = customerSupportAgent; } @OnOpen public String onOpen() { return "Welcome to Rocket's Cosmic Cruisers! How can I help you today?"; } @OnTextMessage public Multi<String> onStreamingTextMessage(String message) { return customerSupportAgent.streamChat(message); } } The customerSupportAgent.streamChat() call invokes the AI service to send the user message to the LLM. After making some minor tweaks to the UI, we can now toggle streaming on and off in our chatbot: The application with the new streaming platform enabled With streaming enabled, each token (each word, or part-word) produced by the LLM is immediately returned to the chat interface. Generating Structured Outputs From Unstructured Data Up to this point, the LLM’s outputs have been intended for the application’s end user. But what if, instead, we want the LLM’s output to be used directly by our application? When the LLM responds to a request, the AI service that mediates the interaction with the LLM can return structured outputs, which are formats that are more structured than a String, such as POJOs, lists of POJOs, and native types. Returning structured outputs significantly simplifies the integration of an LLM’s output with your Java code because it enforces that the output received by the application from the AI service maps to your Java object’s predefined schema. Let’s demonstrate the usefulness of structured outputs by helping the end user select a spaceship from our fleet that meets their needs. For the full application code, see the GitHub spaceship rental step-03 directory. 
We begin by creating a simple Spaceship record to store information about each individual spaceship in the fleet: Java record Spaceship(String name, int maxPassengers, boolean hasCargoBay, List<String> allowedDestinations) { } Similarly, to represent the user’s query about the spaceships in our fleet, we create a SpaceshipQuery record, which is based on the information the user provided in the chat: Java @Description("A request for a compatible spaceship") public record SpaceshipQuery(int passengers, boolean hasCargo, List<String> destinations) { } The Fleet class populates several Spaceship objects and provides a way to filter out those that do not match the user. Next, we update the CustomerSupportAgent interface to take the user’s message (unstructured text) to create a structured output in the form of the SpaceshipQuery record. To accomplish this feat, we only need to set the return type for a new extractSpaceshipAttributes() method in our AI service to be a SpaceshipQuery: Java SpaceshipQuery extractSpaceshipAttributes(String userMessage); Under the covers, LangChain4j automatically generates a request to the LLM including a JSON schema representation of the desired response. LangChain4j deserializes the JSON-formatted response from the LLM and uses it to return a SpaceshipQuery record, as requested. We also need to know whether the user’s input is about one of our spaceships, or about some other topic. This filtering is accomplished using a simpler, structured output request that returns a boolean: Java @SystemMessage(""" You are a friendly, but terse customer service agent for Rocket's Cosmic Cruisers, a spaceship rental shop. Respond with 'true' if the user message is regarding spaceships in our rental fleet, and 'false' otherwise. """) boolean isSpaceshipQuery(String userMessage); Our last addition to the CustomerSupportAgent interface enables the agent to provide a spaceship suggestion based on our fleet and the user’s request, with and without streaming: Java @UserMessage(""" Given the user's query regarding available spaceships for a trip {message}, provide a well-formed, clear and concise response listing our applicable spaceships. Only use the spaceship fleet data from {compatibleSpaceships} for your response. """) String suggestSpaceships(String message, List<Spaceship> compatibleSpaceships); @UserMessage(""" Given the user's query regarding available spaceships for a trip {message}, provide a well-formed, clear and concise response listing our applicable spaceships. Only use the spaceship fleet data from {compatibleSpaceships} for your response. """) Multi<String> streamSuggestSpaceships(String message, List<Spaceship> compatibleSpaceships); } Our last step is to update the ChatWebSocket and ChatWebSocketStream classes to first check if the user’s query is about spaceships in our fleet. If so, the customer support agent creates a SpaceshipQuery record by extracting the information from the user’s message and then responds with suggested spaceships from the fleet that are compatible with the user’s request. 
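The Fleet class itself lives in the step-03 directory rather than being reproduced here; purely as an illustration, a minimal findCompatibleSpaceships might look like the sketch below, where the fleet entries are hypothetical examples and not data from the sample project.

Java

import java.util.List;

// A minimal, hypothetical sketch of the Fleet helper used in the next step.
// The ship names and attributes are invented for illustration only.
public class Fleet {

    private static final List<Spaceship> FLEET = List.of(
            new Spaceship("Stellar Swift", 4, false, List.of("Mars", "Venus")),
            new Spaceship("Cosmo Hauler", 8, true, List.of("Mars", "Jupiter", "Saturn")));

    public static List<Spaceship> findCompatibleSpaceships(SpaceshipQuery query) {
        return FLEET.stream()
                .filter(s -> s.maxPassengers() >= query.passengers())                    // enough seats
                .filter(s -> !query.hasCargo() || s.hasCargoBay())                       // cargo bay only if needed
                .filter(s -> s.allowedDestinations().containsAll(query.destinations()))  // covers every requested destination
                .toList();
    }
}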
The updated code is similar for both the ChatWebSocket and ChatWebSocketStream classes, so only the ChatWebSocket class is shown here: Java @OnTextMessage public String onTextMessage(String message) { boolean isSpaceshipQuery = customerSupportAgent.isSpaceshipQuery(message); if (isSpaceshipQuery) { SpaceshipQuery userQuery = customerSupportAgent.extractSpaceshipAttributes(message); List<Spaceship> spaceships = Fleet.findCompatibleSpaceships(userQuery); return customerSupportAgent.suggestSpaceships(message, spaceships); } else return customerSupportAgent.chat(message); } With these updates, the customer support agent is ready to use the structured outputs to provide the user with spaceship suggestions: The application providing the user with spaceship suggestions based on the structured output With that, we have completed an AI-infused Java chatbot application that provides planetary tourism recommendations and spaceship rentals. To continue learning, experiment with the full code of our sample application alongside the Quarkus with LangChain4j docs. More on These AI Concepts We’ve discussed various AI concepts throughout this article. If you want to know more about any of them, here is a quick explainer. Large Language Models (LLMs) When we talk about AI in this article, we generally mean getting responses from a large language model. LLMs are machine learning models that are trained to generate a sequence of outputs based on a sequence of inputs (often text inputs and outputs, but some multi-modal LLMs can work with images, audio, or video). LLMs can perform a wide variety of tasks, such as summarizing a document, translating between languages, fact extraction, writing code, etc. This task of creating new content from the input is what’s referred to as Generative AI, or GenAI. You can infuse such capabilities into your application as needed. Making Requests to LLMs: Prompts, Chat Memory, and Tokens How you request information from an LLM influences not only the response you get back from the LLM but also the end user’s experience and the application's running costs. Prompts Sending a request to an LLM, whether from application code or as an end-user in a chat interface, involves writing a prompt. A prompt is the information (usually, but not always, text) to which the LLM responds. If you think of communicating with an LLM like communicating with another person, how you phrase your request is important to making sure the other person (or the LLM, in this case) understands what you want to know. For example, ensuring that you give the context of the request before going on to ask for a specific piece of information, and not providing lots of irrelevant information to confuse the listener. Chat Memory Unlike when you are talking to another person, LLMs are stateless and don’t remember the previous request, so everything you need the LLM to take into consideration needs to be in your request: the prompt, any previous requests and responses (the chat memory), and any tools you provide to help the LLM respond. However, providing too much information to the LLM in the prompt can potentially complicate the request. It can also be costly. Tokens LLMs convert the words in your prompt into a sequence of tokens. Most hosted LLMs charge usage based on the number of tokens in the request and response. A token can represent a whole word or a part of a word. For example, the word "unbelievable" is typically split into multiple tokens: "un", "bel", and "ievable". 
The more tokens that you include in the request, especially when you include all the chat memory, the greater the potential cost of running the application. Providing all the chat memory in a request can make requests both costly and less clear. Requests to LLMs are limited in length, so it’s important to manage the chat memory and how much information is included in the request. This can be helped a lot by the Java frameworks that you use, such as LangChain4j with Quarkus, which we use for the sample application in this article. LangChain4j and Quarkus Frameworks LangChain4j is an open-source Java framework that manages interactions between Java applications and LLMs. For example, LangChain4j, through the concept of AI services, stores and helps you to manage chat memory, so that you can keep requests to the LLM efficient, focused, and less expensive. Quarkus is a modern, cloud-native, open-source Java framework optimized for developer productivity, running in containerized environments, and with fast startup and low memory usage. The LangChain4j extensions to Quarkus simplify the configuration of connecting to and interacting with LLMs in AI-infused Java applications. The LangChain4j project can be used with other Java application frameworks, including Open Liberty, Spring Boot, and Micronaut. MicroProfile and Jakarta EE are also working together with LangChain4j to provide an open standards-based programming model for developing AI applications. The Sample Application You can find the complete sample application that we demonstrate throughout this article on GitHub. The application is written in Java and runs on Quarkus using the Quarkus LangChain4j extensions. Conclusion Infusing AI into Java applications enhances the application’s capabilities and the end-user's experience. With the help of Java frameworks like Quarkus and LangChain4j to simplify interactions with LLMs, Java developers can easily infuse AI into business applications. Writing AI-infused applications in Java means you're working in Java’s robust, enterprise-ready ecosystem, which not only helps you to easily interact with AI models, but also makes it easy for the applications to benefit from enterprise essentials such as performance, security, observability, and testing. The field of AI is rapidly evolving. By mastering the concepts and technologies in this article, you can stay ahead of the curve and start exploring how AI can help you build intelligent and engaging Java applications. Experiment with the full code of our sample application alongside the Quarkus with LangChain4j docs. If you’d like to learn more, try this tutorial on how to extend the knowledge of the LLM with content from PDF documents by using retrieval-augmented generation (RAG): Build an AI-powered document assistant with Quarkus and LangChain4j. Thanks to Red Hatters Clement Escoffier, Markus Eisele, and Georgios Andrianakis for valuable review comments. This article was originally published on InfoQ on Aug 15, 2025.

By Don Bourne
Advanced Snowflake SQL for Data Engineering Analytics

Snowflake is a cloud-native data platform known for its scalability, security, and excellent SQL engine, making it ideal for modern analytics workloads. This article takes a deep dive into advanced SQL queries for online retail analytics, using Snowflake’s capabilities to surface insights for trend analysis, customer segmentation, and user journey mapping. It covers seven practical queries, each with a query flow, a BI visualization, a system architecture diagram, and sample inputs/outputs based on a sample online retail dataset.

Why Snowflake?

Snowflake’s architecture separates compute and storage, enabling elastic scaling for large datasets. It supports semi-structured data (e.g., JSON, Avro) via native parsing, integrates with APIs, and offers features like time travel, row-level security, and zero-copy cloning for compliance and efficiency. These qualities make it a powerhouse for online retail analytics, from tracking seasonal trends to analyzing customer behavior.

Scenario Context

The examples below use a pseudo online retail platform, "ShopSphere," which tracks customer interactions (logins, purchases) and transaction values. The dataset includes two tables:

event_log: Records user events (e.g., event_id, event_type, event_date, event_value, region, user_id, event_data for JSON).
user: Stores user details (e.g., user_id, first_name, last_name).

The queries sit in a relatable business scenario, with sample data reflecting varied transaction amounts and regional differences. All sample data is synthetic, designed to demonstrate query logic in an online retail setting.

Getting Started With Snowflake

To follow along, create a Snowflake database and load the sample tables. Below is the SQL to set up the event_log and user tables:

SQL

CREATE TABLE event_log (
    event_id INT,
    event_type STRING,
    event_date DATE,
    event_value DECIMAL(10,2),
    region STRING,
    user_id INT,
    event_data VARIANT
);

CREATE TABLE user (
    user_id INT PRIMARY KEY,
    first_name STRING NOT NULL,
    last_name STRING NOT NULL
);

Insert the sample data provided in each query section. Use a small virtual warehouse (X-Small) for testing, and ensure your role has appropriate permissions. For JSON queries, enable semi-structured data support by storing JSON in the event_data column.

Advanced SQL Queries

Below are seven advanced SQL queries showcasing Snowflake’s strengths, each with a query flow diagram, sample input/output, and Snowflake-specific enhancements. The queries build progressively, from basic aggregations to complex user journey analysis and JSON parsing, ensuring a logical flow for analyzing ShopSphere’s data.

1. Grouping Data by Year and Quarter

This query aggregates events by year and quarter to analyze seasonal trends, critical for inventory planning or marketing campaigns.

Query:

SQL

SELECT
    EXTRACT(YEAR FROM event_date) AS year,
    EXTRACT(QUARTER FROM event_date) AS quarter,
    COUNT(*) AS event_count,
    SUM(event_value) AS total_value
FROM event_log
GROUP BY year, quarter
ORDER BY year, quarter;

Explanation: The query extracts the year and quarter from event_date, counts events, and sums transaction values per group. Snowflake’s columnar storage optimizes grouping operations, even for large datasets.

Snowflake Enhancements

Scalability: Handles millions of rows with auto-scaling compute.
Search optimization: Use search optimization on event_date to boost performance for frequent queries.
Clustering: Cluster on event_date for faster aggregations (both settings are sketched below).
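For example, the clustering and search optimization settings mentioned above can be applied with statements like the following. Treat this as a starting sketch rather than a tuning recommendation — whether either feature pays off depends on table size and query patterns.

SQL

-- Organize micro-partitions by event date to speed up date-based aggregations
ALTER TABLE event_log CLUSTER BY (event_date);

-- Enable the search optimization service for selective lookups on this table
ALTER TABLE event_log ADD SEARCH OPTIMIZATION;

-- Check how well the table is clustered on the chosen key
SELECT SYSTEM$CLUSTERING_INFORMATION('event_log', '(event_date)');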
Sample input: The event_log table represents ShopSphere’s customer interactions in 2023. event_id event_type event_date event_value region user_id 1 Login 2023-01-15 0.00 US 101 2 Purchase 2023-02-20 99.99 EU 102 3 Login 2023-03-25 0.00 Asia 103 4 Purchase 2023-04-10 149.50 US 101 5 Login 2023-05-05 0.00 EU 102 6 Purchase 2023-06-15 75.25 Asia 103 Sample output: year quarter event_count total_value 2023 1 2 99.99 2023 2 3 224.75 2023 3 1 0.00 BI tool visualization: The bar chart below visualizes the event counts by quarter, highlighting seasonal patterns. Query flow: 2. Calculating Running Totals for Purchases Running totals track cumulative transaction values, useful for monitoring sales trends or detecting anomalies. Query: SQL --Running totals track cumulative transaction values, useful for monitoring sales trends or detecting anomalies. SELECT event_type, event_date, event_value, SUM(event_value) OVER (PARTITION BY event_type ORDER BY event_date) AS running_total FROM event_log WHERE event_type = 'Purchase' AND event_date BETWEEN '2023-01-01' AND '2023-06-30'; Explanation: This query calculates cumulative purchase values, ordered by date, building on Query 1’s aggregation by focusing on purchases. Snowflake’s window functions ensure efficient processing. Snowflake Enhancements Window functions: Optimized for high-performance analytics.Time travel: Use AT (OFFSET => -30) to query historical data.Zero-copy cloning: Test queries on cloned tables without duplicating storage. Sample input (Subset of event_log for purchases in 2023): event_id event_type event_date event_value 2 Purchase 2023-02-20 99.99 4 Purchase 2023-04-10 149.50 6 Purchase 2023-06-15 75.25 Sample output: event_type event_date event_value running_total Purchase 2023-02-20 99.99 99.99 Purchase 2023-04-10 149.50 249.49 Purchase 2023-06-15 75.25 324.74 BI visualization: The running total of purchase values over time, illustrating sales growth Query flow: 3. Computing Moving Averages for Login Frequency Moving averages smooth out fluctuations in login events, aiding user engagement analysis and complementing purchase trends from Query 2. Query: SQL SELECT event_date, COUNT(*) AS login_count, AVG(COUNT(*)) OVER (ORDER BY event_date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS three_day_avg FROM event_log WHERE event_type = 'Login' GROUP BY event_date; Explanation: This query calculates a three-day moving average of daily login counts. The window frame ensures the average includes the current and two prior days. Snowflake Enhancements Window frames: Efficiently processes sliding windows.Materialized views: Precompute aggregates for faster reporting.Data sharing: Share results securely with marketing teams. Sample input (Subset of event_log for logins): event_id event_type event_date 1 Login 2023-01-15 3 Login 2023-01-16 5 Login 2023-01-17 7 Login 2023-01-18 Sample output: event_date login_count three_day_avg 2023-01-15 1 1.00 2023-01-16 1 1.00 2023-01-17 1 1.00 2023-01-18 1 1.00 BI visualization: Displays the three-day moving average of login counts, showing whether daily fluctuations exist or not. Query flow: 4. Time Series Analysis for Regional Purchases This query detects daily changes in purchase values by region, building on Query 2 to identify market-specific trends. 
Query: SQL SELECT event_date, region, event_value, event_value - LAG(event_value, 1) OVER (PARTITION BY region ORDER BY event_date) AS daily_difference FROM event_log WHERE event_type = 'Purchase' AND region = 'US'; Explanation: The LAG function retrieves the previous day’s purchase value, enabling daily difference calculations for the US region. Snowflake Enhancements Clustering: Cluster on region and event_date for faster queries.Query acceleration: Use Snowflake’s query acceleration service for large datasets.JSON support: Parse semi-structured data with FLATTEN for enriched analysis. Sample input (Subset of event_log for US purchases): event_date region event_value 2023-02-20 US 99.99 2023-04-10 US 149.50 Sample output: event_date region event_value daily_difference 2023-02-20 US 99.99 NULL 2023-04-10 US 149.50 49.51 BI visualization: The daily differences in purchase values for the US region, showing fluctuations. Query flow: 5. Generating Hierarchical Subtotals With ROLLUP ROLLUP creates subtotals for reporting, extending Query 1’s aggregations for financial summaries across years and regions. Query: SQL SELECT EXTRACT(YEAR FROM event_date) AS year, region, SUM(event_value) AS total_value FROM event_log WHERE event_type = 'Purchase' GROUP BY ROLLUP (year, region) ORDER BY year, region; Explanation: ROLLUP generates subtotals for each year and region, with NULL indicating higher-level aggregations (e.g., total per year or grand total). Snowflake Enhancements Materialized views: Precompute results for faster dashboards.Dynamic warehouses: Scale compute for complex aggregations.Security: Apply row-level security for region-specific access. Sample input (Subset of event_log for purchases): event_date region event_value 2023-02-20 EU 99.99 2023-04-10 US 149.50 2023-06-15 Asia 75.25 Sample output: year region total_value 2023 Asia 75.25 2023 EU 99.99 2023 US 149.50 2023 NULL 324.74 NULL NULL 324.74 BI visualization: Shows total purchase values by region for 2023, with a separate bar for the yearly total. Query flow: 6. Recursive CTE for Customer Purchase Paths This query uses a recursive CTE to trace customer purchase sequences, enabling user journey analysis for personalized marketing. Query: SQL WITH RECURSIVE purchase_path AS ( SELECT user_id, event_id, event_date, event_value, 1 AS path_level FROM event_log WHERE event_type = 'Purchase' AND event_date = (SELECT MIN(event_date) FROM event_log WHERE user_id = event_log.user_id AND event_type = 'Purchase') UNION ALL SELECT e.user_id, e.event_id, e.event_date, e.event_value, p.path_level + 1 FROM event_log e JOIN purchase_path p ON e.user_id = p.user_id AND e.event_date > p.event_date AND e.event_type = 'Purchase' ) SELECT u.user_id, u.first_name, u.last_name, p.event_date, p.event_value, p.path_level FROM purchase_path p JOIN user u ON p.user_id = u.user_id ORDER BY u.user_id, p.path_level; Explanation: The recursive CTE builds a sequence of purchases per user, starting with their first purchase. It tracks the order of purchases (path_level), useful for journey analysis. Snowflake Enhancements Recursive CTEs: Efficiently handles hierarchical data.Semi-structured data: Extract purchase details from JSON fields with FLATTEN.Performance: Optimize with clustering on user_id and event_date. 
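The semi-structured data enhancement listed above can be sketched with LATERAL FLATTEN. This is purely an illustration: it assumes a hypothetical items array (with sku and quantity fields) inside the event_data column, which is not part of the sample dataset used in this article.

SQL

-- Explode a hypothetical "items" array so each purchased item becomes its own row
SELECT
    e.user_id,
    e.event_date,
    f.value:sku::STRING   AS sku,       -- hypothetical field, for illustration only
    f.value:quantity::INT AS quantity   -- hypothetical field, for illustration only
FROM event_log e,
     LATERAL FLATTEN(input => e.event_data:items) f
WHERE e.event_type = 'Purchase';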
Sample input user table: user_id first_name last_name 101 Alice Smith 102 Bob Johnson event_log (purchases): event_id user_id event_date event_value event_type 2 101 2023-02-20 99.99 Purchase 4 101 2023-04-10 149.50 Purchase 6 102 2023-06-15 75.25 Purchase Sample output: user_id first_name last_name event_date event_value path_level 101 Alice Smith 2023-02-20 99.99 1 101 Alice Smith 2023-04-10 149.50 2 102 Bob Johnson 2023-06-15 75.25 1 BI visualization: Shows purchase values by user and path level, illustrating customer purchase sequences. Query flow: 7. Parsing JSON Events This query extracts fields from semi-structured JSON data in event_log. Query: SQL SELECT e.event_date, e.event_data:product_id::INT AS product_id, e.event_data:category::STRING AS category FROM event_log e WHERE e.event_type = 'Purchase' AND e.event_data IS NOT NULL; Explanation: The query uses Snowflake’s dot notation to parse JSON fields (product_id, category) from the event_data column, enabling detailed product analysis. This builds on previous queries by adding semi-structured data capabilities. Snowflake Enhancements Native JSON support: Parse JSON without external tools.Schema-on-read: Handle evolving JSON schemas dynamically.Performance: Use VARIANT columns for efficient JSON storage. Sample input (Subset of event_log with JSON data): event_id event_date event_type event_data 2 2023-02-20 Purchase {"product_id": 101, "category": "Electronics"} 4 2023-04-10 Purchase {"product_id": 102, "category": "Clothing"} Sample output: event_date product_id category 2023-02-20 101 Electronics 2023-04-10 102 Clothing BI visualization: Shows the distribution of purchases by product category, highlighting category popularity. Query flow diagram System Architecture Description of Snowflake’s role in ShopSphere’s data ecosystem, integrating with external sources, ETL tools, and BI platforms. Explanation: The system architecture diagram is structured in four layers to reflect the data lifecycle in ShopSphere’s ecosystem, using distinct shapes for clarity: External data sources: CRM systems and API feeds, shown as ellipses, provide raw customer and transaction data, forming the pipeline’s input.Snowflake data platform: Snowflake’s cloud storage and virtual warehouses store and process data, serving as the core analytics engine.ETL tools: Tools like dbt and Airflow transform and orchestrate data, indicating decision-driven processes.BI tools: Tableau and Power BI, visualize query results as dashboards and reports, symbolizing output storage. Practical Considerations The following considerations ensure the queries are robust in real-world scenarios, building on the technical foundation established above. Performance Optimization Clustering keys: Use clustering on high-cardinality columns (e.g., user_id, event_date) to improve query performance for large datasets.Query acceleration: Enable Snowflake’s query acceleration service for complex queries on massive datasets.Cost management: Monitor compute usage and scale down warehouses during low-demand periods to optimize costs. Data Quality Handling edge cases: Account for missing data (for instance, NULL values in event_value) or duplicates (e.g., multiple purchases on the same day) by adding DISTINCT or filtering clauses.Data skew: High purchase volumes in Q4 may cause performance issues; partition tables or use APPROX_COUNT_DISTINCT for scalability. 
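As a concrete example of the edge-case handling mentioned above, the sketch below keeps a single purchase per user per day and treats missing amounts as zero. The tie-breaking rule (highest event_id wins) is an assumption for illustration, not something defined by the ShopSphere scenario.

SQL

-- Deduplicate same-day purchases per user and guard against NULL amounts
SELECT
    user_id,
    event_date,
    COALESCE(event_value, 0) AS event_value
FROM event_log
WHERE event_type = 'Purchase'
QUALIFY ROW_NUMBER() OVER (
            PARTITION BY user_id, event_date
            ORDER BY event_id DESC) = 1;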
Security and Compliance

Row-level security: Implement policies to restrict access to sensitive data (for example, region-specific results).
Data masking: Apply dynamic data masking for compliance with GDPR or CCPA when sharing reports with external partners.

Conclusion

Snowflake’s advanced SQL capabilities, combined with its scalable architecture and features like time travel, semi-structured data support, and zero-copy cloning, make it a powerful online retail analytics platform. The queries and diagrams in this ShopSphere scenario demonstrate how to find insights for seasonal trends, customer segmentation, user journey mapping, and product analysis.

Business Impact

These queries enable ShopSphere to optimize operations and drive growth:

Query 1’s seasonal trends informed a 15% increase in Q4 inventory, boosting sales.
Query 6’s user journey analysis improved customer retention by 10% through targeted campaigns for repeat buyers.
Query 7’s JSON parsing enabled precise product category analysis, optimizing marketing spend.

Together, these insights empower data-driven decisions that enhance profit and customer satisfaction.

By Ram Ghadiyaram DZone Core CORE
CNCF Triggers a Platform Parity Breakthrough for Arm64 and x86

The Challenge Developing open-source software for deployment on Arm64 architecture requires a robust continuous integration and continuous deployment (CI/CD) environment. Yet, there has historically been a disparity between the levels of support for Arm64 and traditional x86 processor architectures, with Arm64 usually at a disadvantage. Developers of infrastructure components for multiple architectures have certain expectations of their work environments: Consistency of the tools and methods they use across platforms, so they don’t have to adopt different development procedures just to adopt a less prevalent platform.Performance from their platforms and support mechanisms, so their deployment schemes don’t suffer from speed deficiency when they choose to support multiple platforms.Testing coverage so the very same tests for efficiency, compliance, and security apply to all platforms simultaneously and without substantial differentiation.Maintainability, enabling developers to automate their integration and redevelopment processes so they apply to all platforms without alteration. Product managers for these same components have these same requirements, plus at least two more: Platform coverage capability, so that technical account managers (TAM) may have the skills and readiness they need to respond to customer needs.Support tiering capability, enabling TAM and other IT personnel to classify their levels of software support according to their capability to respond to urgent or emerging customer issues. The Solution Working in collaboration with both Ampere and infrastructure provider Equinix, open-source developer Alex Ellis made available his Actuated CI/CD platform to some of the most critical open-source projects in the cloud-native software ecosystem. Actuated takes GitHub self-hosted automation processes demonstrated by security engineers to be inherently vulnerable to malicious attack, and runs them in microVMs abstracted from the public Internet. Implementation Several key open-source Cloud Native Computing Foundation projects took advantage of an Actuated environment to run all of their GitHub Actions for Arm64. This environment is based on Ampere® Altra® processors made available with the help of infrastructure provider Equinix. The success of this initiative was instrumental in prompting GitHub to implement full support of the Arm64 architecture with GitHub Actions. Now, developers who had been running Arm64 build processes in QEMU emulation environments on x86 architectures can relocate those processes to Arm64 on bare metal. Self-Hosted Runners for GitHub Actions on ARM64 GitHub dominates the hosting of software projects these days. The most popular way that GitHub-hosted projects generate builds and releases for continuous integration is with the platform’s built-in CI toolset, GitHub Actions. The most important role played by the GitHub Actions CI/CD platform is automating software development pipelines. The party responsible for triggering any GitHub Action is a runner. It’s an agent running on a server, waiting for something to do and eager to do it once it’s given the assignment. It’s assigned a job from the workflow and tasked with getting it done. GitHub is a complete software deployment platform. As such, it hosts its own runners, each of which is adapted to its specified target environment and architecture. Until recently, GitHub did not offer hosted runner environments for Arm64. 
Projects that wanted to generate Arm64-native builds did have an option — the self-hosted runner. GitHub users could install an agent on a physical or virtual machine hosted elsewhere, and have GitHub Actions dispatch jobs to that host, managed by the project users. This required project administrators not only to manage the project itself but also to take care of the maintenance and security of the build environment that the projects would use. In CNCF’s case, developers took advantage of credits to Equinix Metal, enabling them to provision bare metal instances and use them as self-hosted runners for projects. But for a code lab whose projects must be made available 24/7/365 to other developers worldwide, the security of self-hosted runners poses a challenge: Anyone could clone the project repository, modify the Actions jobs, and get access to the runner node to run arbitrary jobs, according to this GitHub documentation. Another problem was ensuring consistency between CI runs. With self-hosted runners, if there were side effects of the CI jobs, such as configuration changes or files left behind afterwards, they would still be there for ensuing jobs. This posed a problem — when running a CI job to build or test software, you should have a controlled environment, so that the only thing that changes between runs is the software. In the case of self-hosted runners, the environment can drift over time. In the absence of a cleanup process, it was possible for runs of the same build job on the same host to generate different results over time. One way developers bypassed the need for Arm64 native runners was by running virtual Arm64 environments on x86 servers, using QEMU open-source emulation. Emulated environments add a huge performance overhead for software compilations, which run at a fraction of the pace of compilations on native, non-emulated hardware. Emulation worked well enough for developing small to medium projects. But if developers had to build something big and important for ARM64, the strain would become so great on their virtual environments that builds would completely fail. “In the past, people were doing builds using QEMU,” said Equinix’s Developer Partner Manager Ed Vielmetti. “Say you were building a compiler, where the intermediate steps require large amounts of memory and very deep integration with the processor. That just would not work in an emulated environment.” The Disparity Phenomenon Unlike the typical enterprise, the Cloud Native Computing Foundation has a special obligation to build its cloud-native components for all the world’s major processor architectures. Projects such as the containerd portable container runtime, the etcd key/value data store, the fluentd log data collector, the Falco real-time threat detection tool, and the OpenTelemetry observability and instrumentation toolkit, among dozens of others, are critical dependencies for the cloud-native ecosystem, and as such, must be built for both x86 and Arm64. To build low-level infrastructure components with support for Arm64, CNCF developers need access to native Arm64 infrastructure. This means, ironically, they need the very class of tools they’re trying to create. At first, Ampere and Equinix collaborated with CNCF to address these gaps by donating Ampere Altra-based servers or setting up Altra-based bare metal nodes at Equinix facilities. The granularity of the Arm64-based server resources that Equinix could share was bare metal nodes — a 160-core dual-socket Ampere Altra system. 
Ideally, a server like this would be shared among several projects, but this was, at the time, beyond the capabilities of the CNCF. This is the problem that Ampere and Actuated proposed to solve for CNCF by allowing multiple projects to run on fewer hosts, thus providing easy access to build services for more projects while consuming less hardware. “OpenTelemetry is a full-on, full-time-on, CI/CD system,” said Antoine Toulmé, Senior Engineering Manager for Blockchain and DLT and Splunk Maintainer for OpenTelemetry project. “We were able to leverage [our Ampere server] infrastructure for ourselves, but we weren’t able to share it with open source at large." “We cannot give GitHub runners away,” Toulmé said. “Once we were happy with certifying the downstream distributions to our customers, we opened issues with the OpenTelemetry project saying we would like to see ARM64 support being delivered at the highest level — meaning, it should run for every commit, it should run for main, it should run all the time. And the feedback was, well, great, but there are no ARM64 runners in GitHub. So we’re going to need you to work with what we can do here.” Due to the lack of readily available Arm64 platforms for these projects, developers were unaware if the changes they committed were causing issues on Arm64, as test suites were not run as frequently as for x86. Since container orchestration platforms are among the platforms being developed to support Arm64, this phenomenon became a vicious cycle: Releases were gated on passing integration test suites for x86, but releases were not gated on the same test suites passing for Arm64. The solution CNCF’s developers would discover falls far short of qualifying as radical or revolutionary — in fact, it’s more of a bug fix in practice. It’s so simple to implement that it completely compensates for this disparity, not just for CNCF but for any developer of any platform-level component for any architecture. Breakthrough: Actuated, Plus Editing One Line of Code To take the first step towards platform parity between x86 and Arm64, Ampere enlisted the help of Alex Ellis, the creator of a service called Actuated. It’s a product that runs GitHub Actions jobs in secure, isolated microVMs, instrumented to receive build jobs from GitHub Actions, and offering developers visibility into the performance of their build jobs and the load on the shared build systems. Actuated could run all the CNCF’s existing GitHub Actions runners after altering a single line of their configuration files, plus in some cases the pasting of a few code snippets — changes which took less than five minutes to implement. These changes enabled GitHub-hosted projects to point to Actuated’s microVM-driven environment on Ampere Altra processors for their build jobs. “Falco really needed Arm64 GitHub runners to elevate its support for the architecture and enlarge its user base,” Falco project Senior Open Source Engineer and Sysdig Maintainer Federico Di Pierro said. “[Actuated] was the perfect solution for us because it was easy to leverage and relieved any burden for the maintainers. This way, we as maintainers can focus on what really matters for the project, instead of fighting with maintaining and deploying self-hosted infrastructure. 
Now we are building, testing, and releasing artifacts for ARM64, leveraging Actuated for many of our projects, and it works flawlessly.” Having seen the increase in demand for Arm native build environments in recent years, GitHub announced last June the availability in public beta of Arm64-based hosted runners for GitHub Actions, powered by Ampere compute instances on Microsoft Azure, followed in January 2025 by the release into public preview of free hosted runners for public repositories. For OpenTelemetry, this means the end of network loads as high as 10 times their assigned bandwidth caps, on account of OpenTelemetry builds constantly downloading dependencies from Docker Hub repositories. “Yeah, we were definitely breaking things,” the OpenTelemetry Project’s Antoine Toulmé said. “We got lucky, because the Arm runners for GitHub shipped. We have moved to ARM runners, we are happy as can be, and nothing is breaking anymore.” Now for the first time, project maintainers can pay as close attention to the safety and security of Arm64 builds as they have for x86 builds, knowing that they’re no longer likely to encounter performance degradations or penalties. “[Actuated] gave us great confidence in the CI builds on ARM64,” Principal Software Engineer and AWS Maintainer for the containerd project Phil Estes said. “If the Arm CI breaks now, there’s no way we will merge that [pull request] until we figure out why... We have full confidence now that [build failures] are not an issue with flaky hardware [as they sometimes were before].” For its part, Oracle is continuing its policy of donating $3 million per year in OCI credits for Arm64 instances powered by Ampere to CNCF projects. This generosity, along with the newfound stability of Arm64 platforms catalyzed by Ampere and Equinix, and brought about by Actuated, is enabling prominent cloud infrastructure vendors, including Red Hat, SUSE, Canonical, and Mirantis, to provide full support for their enterprise customers who choose ARM64 infrastructure. Parity makes it possible for enterprises to make sensible choices about their computing infrastructure and platforms without incurring penalties just for choosing an alternative architecture. Large cloud customers are proving that Arm64 can provide organizations with the performance they need and reduced expenses for workloads — all with industry-leading energy efficiency. But organizations can’t experience those benefits until they can deploy their workloads on all infrastructure options on a level playing field with one another and measure the results for themselves. Leveling the Playing Field In early 2023, few options existed for GitHub-hosted projects that wanted to fully integrate Arm64 into their continuous integration processes. Through this initiative, leveraging an innovative software solution from Actuated with Ampere CPUs hosted by Equinix, we lowered the bar for CNCF projects to make a start towards parity of support for ARM64 and x86. Key cloud-native projects, including etcd, containerd, Open Telemetry, Falco, and others, were able to advance their support of Arm64, accelerate their CI runs on native Arm64 infrastructure, and support increasing numbers of their users taking advantage of ARM64 compute in the cloud. By the end of this pilot project, the number of options for developers has grown considerably. 
The CNCF now offers its projects the ability to run GitHub Actions jobs on managed Kubernetes clusters on OCI, using Ampere-powered instances and the GitHub project Actions Runner Controller, and with the addition of hosted Arm64 runners to GitHub, it has never been easier for projects to easily support this fast-growing and exciting architecture for cloud-native applications. Check out the full Ampere article collection here.

By Scott Fulton III
Fixing a Test Involves Much More Than Simply Making It Pass

Software tests are essential mechanisms in the development lifecycle designed to verify that code behaves as expected. They catch bugs, ensuring functionality aligns with requirements, and maintain overall system reliability through automated assertions on inputs, outputs, and states. However, fixing a failing test extends far beyond merely adjusting it to pass, as this superficial approach can mask deeper issues in the underlying code or architecture. True resolution demands a thorough investigation into the root cause. It could be a logic error, an environmental mismatch, a dependency failure, or an outdated assumption in the test itself, followed by targeted corrections that preserve the test's intent as a safeguard against regressions. Blindly modifying assertions to match unexpected outcomes, such as changing an expected value from false to true without validating why the discrepancy occurred, risks introducing false positives, eroding trust in the test suite. This way, we are potentially allowing flawed software to reach production, where it could lead to security vulnerabilities, data inconsistencies, or user-facing failures. After all, a proper fix not only restores the green status but also enhances code quality, observability, documentation, and team knowledge. Test failures should be treated as opportunities for genuine improvement rather than quick band-aids. As developers and testers, we rely on tools like Postman to streamline API testing and ensure our applications behave as expected. Postman’s intuitive interface, powerful scripting capabilities, and recent AI integrations make it a go-to for automating tests in CI/CD pipelines or during exploratory testing. However, one seemingly helpful feature — the "Fix test" option powered by Postbot (Postman’s AI assistant) — can introduce subtle risks if used without caution. In this article, we'll explore how the "Fix test" option works, why it might tempt you to use it, and most importantly, the potential dangers of blindly accepting its suggestions. We'll use an example to illustrate how "fixing" a test to make it pass can mask underlying issues, leading to false confidence in your API's reliability. By the end, you'll have best practices to avoid these pitfalls and maintain robust, meaningful tests. Postman Fundamentals Postman allows you to write tests in JavaScript that run after sending a request, verifying aspects like status codes, response bodies, headers, and more. These tests use the pm.test and pm.expect syntax to assert expected outcomes. For instance, you might check if a response JSON field matches a specific value. Postbot is Postman's AI-powered assistant. I entered the following prompt to Postbot. How can you help me? What can you do? Here is the response that I’ve got: Textile You can call me Postbot :) I'm here to help you with your API workflows in Postman. Here are some of the things I can do: 1. Add or update test scripts for your API requests. 2. Add or update visualization scripts to visualize your API responses. 3. Edit existing test scripts to make changes or fix errors. 4. Add or update descriptions for your requests to enhance documentation. 5. Construct new API requests based on your existing requests. 6. Debug and fix your API requests to ensure they work correctly. If a test fails, Postbot can analyze the failure and offer suggestions, including a "Fix test" option. This feature scans the actual response and proposes updates to your test script. 
It usually involves adjusting assertions to align with the received data. I find Postbot very helpful indeed, as it can save me a great deal of time in writing tests and debugging. But I’ve also learned to always be critical of its results. While convenient, this AI bot may prioritize making the test pass over diagnosing the root cause. Postbot might rewrite an assertion from expecting false to true if that's what the response currently returns, without questioning whether the response itself is correct. How A Boolean Flip Can Hide a Bug Let’s say you're testing an API endpoint for user authentication. Your test collection includes a request to log in with valid credentials, followed by a test script that verifies a boolean field in the response: JavaScript pm.test("Verify authentication status", function () { var jsonData = pm.response.json(); pm.expect(jsonData.isAuthenticated).to.be.false; // Expected: false for guest mode }); You run the test, but it fails because the response shows isAuthenticated: true. Intrigued, you invoke Postbot (via the AI scripting panel or failure suggestions), and it offers the "Fix test" option. Using "Fix test" updates the script to: JavaScript pm.test("Verify authentication status", function () { var jsonData = pm.response.json(); pm.expect(jsonData.isAuthenticated).to.be.true; // Now matches the actual response }); Voilà — the test now passes on rerun. But did you fix the problem? In this scenario, the API might have a bug. Perhaps a configuration error is incorrectly authenticating guests as full users, exposing sensitive data. By "fixing" the test, you've adapted it to the faulty behavior instead of investigating the discrepancy. The test suite turns green, but the underlying issue persists. Are we sure that there is no security vulnerability that we've just masked by turning the test suite green? This example highlights a core issue that goes beyond testing tools. It actually goes beyond testing types and testing levels: Testing is a critical thinking activity. Our tests should exist for a reason. They should be there in order to verify behaviour, functionality, or anything else that we need to verify. If a test passes or fails, we must understand why. Is it a false positive or a false negative? Did the test catch a bug or not? We need to investigate. In the example, instead of investigating, we just let AI change the test so that it passes testing. All we’ve got now is a test that passes testing for the wrong reason. The test suite is green, misleading us into the wrong conclusion that authentication works as expected. The Dangers of Blindly "Fixing" Tests Our authentication example also highlights another important fact. Postbot's fixes are reactive to the current response. They are not proactive in ensuring it aligns with business requirements or specifications. This is where we need to be critical and investigate as testers. While the "Fix test" feature can accelerate testing and development, misusing it can lead to several risks: Masking bugs in the API or backend. Tests exist to catch deviations from expected behavior. If the response is wrong due to a code change, database issue, or external dependency, updating the test to match it sweeps the problem under the rug. This erodes the test's value as a safety net.False positives in CI/CD pipelines. Green tests give teams confidence to deploy. If "fixed" tests ignore real issues, you risk shipping broken features. 
In a fast-paced environment, this can result in production outages, data corruption, or compliance violations.
Loss of test intent and documentation. Original tests often encode business rules or assumptions (e.g., "isAuthenticated should be false for unauthenticated requests"). Altering them without review dilutes this intent and makes the test suite harder to maintain and understand for future team members.
Encouraging lazy debugging practices. Relying on AI to "fix" failures discourages thorough root-cause analysis. What if the failure stems from flaky networks, incorrect request payloads, or environmental differences? Skipping the investigation misses opportunities to improve overall system reliability.
Accumulated technical debt. Repeated quick fixes can bloat your test scripts with outdated or incorrect assertions. When the API evolves legitimately, these band-aid tests will fail en masse, requiring more effort to untangle.

Best Practices: Safely Leveraging Postbot and "Fix Test"

The practices below emphasize a balanced approach where AI assists but does not replace human judgment. The goal is for test fixes to contribute to long-term quality rather than short-term convenience. By integrating them into your workflow, you can mitigate risks while still benefiting from the efficiency of automation.

Always investigate failures first. Use Postbot to generate initial tests or suggestions, but validate them against specs. Before clicking "Fix test," manually review the response and ask: is this the expected output according to the API docs or requirements? Use Postman’s console or external tools to debug, and after a fix, add comments explaining the change.
Version control your collections. Store Postman collections in Git. This way, changes from "Fix test" appear in diffs, allowing peer reviews before merging.
Incorporate environment variables for flexibility. Instead of hardcoding values, use variables like {{expectedAuthStatus}}. This makes tests modular without rewriting assertions.
Run tests in multiple environments. Test against staging, production-like setups, or mocks to confirm whether failures are environment-specific.
Educate your team. Share this awareness in code reviews or team meetings. Encourage a culture where test failures prompt discussion, not automatic fixes.

By treating "Fix test" as a starting point rather than a solution, you maintain control over your test integrity. This is of paramount importance. After all, why do we test if our test results are misleading and meaningless?

Wrapping Up

Software testing is a critical thinking activity. Do we understand how and why our systems work? Can we determine whether something is wrong and why? There is no testing without observability and no observability without testing. We need to wear our inspector's hat and investigate. After all, this is what we do, this is who we are: inspectors. Postman’s AI features like Postbot are game-changers for productivity. However, they're just tools, not oracles. Making a test pass can be tempting, but true fixing requires understanding the "why" behind failures. In our example, flipping a Boolean assertion might greenlight a security flaw, underscoring the need for vigilance. Next time a test fails in Postman, pause before fixing. Dive into the details. Align with requirements. Be confident that your changes enhance reliability, not just silence alerts. By doing so, you'll build a more robust API testing practice that catches issues early and keeps your applications trustworthy.

By Stelios Manioudakis, PhD DZone Core CORE
Efficiently Reading Large Excel Files (Over 1 Million Rows) Using the Open-Source Sjxlsx Java API

If you are a developer, regardless of the technologies you use, at some point you will face the challenge of handling large Excel files. Most commonly, you will see an "out of memory" error when processing them. Here is the requirement: you get a workbook with 100 spreadsheets, and each of them has 1 million records with a minimum of 100 columns. For reference, here is how many rows each Excel version can support:

1,048,576 rows – Excel 365, 2013, 2010, 2007
65,536 rows – Excel 2003 and earlier versions

Cost and Hosting Server Limitations

Third-party license purchase is not feasible; open source is allowed.
Deployment should be on average-tier cloud environments (4 GB disk/4 GB RAM or less) or on an on-premises Windows Server (16/32 GB RAM) already loaded with 10+ running applications.

How would you handle this situation? Let's explore what is currently available. Are there any open-source solutions that meet this requirement? Office "interop" is not an option when the cloud runs on Linux, and it also causes intermittent hang issues in multi-threaded applications. Most developers will point to the popular Apache POI streaming library, so let's proceed with a practical implementation using Apache POI. To get a large Excel dataset, there are plenty of websites, such as awesome-public-datasets, Google Dataset Search, and World Bank Data. Personally, I frequently visit Kaggle. Let's download the Containers dataset as a 1.9 GB CSV file and save it as an Excel file (.xlsx), which comes to roughly 600 MB. The sheet includes 87 columns and 1,048,576 rows. Before we run the solution, let me capture my laptop's resource usage so that we can compare later.

Resource Usage

CPU: 3%
Memory: 54%

IDE

I am using Eclipse IDE (v: 4.36.0).

Apache POI

I am using Apache POI 5.2.5 and the other dependency JARs available from Apache POI.

Source Code

Here, I am just attempting to read the sheet names from the workbook, not the rows.

Java

public static void main(String[] args) throws Exception {
    String filePath = "C:\\POC\\Containers_Dataset.xlsx";
    ReadExcelbyApachePOI(filePath);
}

/* List out sheet names */
static void ReadExcelbyApachePOI(String filePath) throws Exception {
    try (OPCPackage opcPackage = OPCPackage.open(new File(filePath), PackageAccess.READ)) {
        XSSFWorkbook workbook = new XSSFWorkbook(opcPackage);
        XSSFReader xssfReader = new XSSFReader(opcPackage);
        StylesTable styles = xssfReader.getStylesTable();
        XSSFReader.SheetIterator iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
        while (iter.hasNext()) {
            InputStream stream = iter.next();
            String sheetName = iter.getSheetName();
            System.out.println("Sheetname: " + sheetName);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Result

Encountering a "Java heap space" out-of-memory error:

Java

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

Resource Usage

CPU: 58%
Memory: 94%

Now let's try the open-source sjxlsx API. The project was first published on Google Code and appears unmaintained, but the source code is available on GitHub, so anyone can download it and adapt it to their needs. Memory and speed are the primary goals of this API. It provides two modes, "classic" and "stream":

Classic – All records of the sheet are loaded.
Stream – Read one record at a time.

Microsoft XLSX uses XML+zip (OOXML) to store the data, so, to be fast, sjxlsx uses StAX for XML input and output.
Source Code

Java

public static void main(String[] args) throws Exception {
    String filePath = "C:\\POC\\Containers_Dataset.xlsx";
    SimpleXLSXWorkbook workbook = newWorkbook(filePath);
    testLoadALL(workbook);
}

private static SimpleXLSXWorkbook newWorkbook(String filePath) {
    return new SimpleXLSXWorkbook(new File(filePath));
}

/* Print each cell of a row as cellId=value */
private static void printRow(int rowPos, com.incesoft.tools.excel.xlsx.Cell[] row) {
    int cellPos = 0;
    for (com.incesoft.tools.excel.xlsx.Cell cell : row) {
        System.out.println(com.incesoft.tools.excel.xlsx.Sheet.getCellId(rowPos, cellPos) + "=" + cell.getValue());
        cellPos++;
    }
}

/* Load and read the workbook:
 *   getSheet(0, false) => stream mode, read one row at a time
 *   getSheet(0, true)  => classic mode, load all rows at once
 */
public static void testLoadALL(SimpleXLSXWorkbook workbook) {
    com.incesoft.tools.excel.xlsx.Sheet sheetToRead = workbook.getSheet(0, false);
    SheetRowReader rowreader = sheetToRead.newReader();
    int rowPos = 0;
    com.incesoft.tools.excel.xlsx.Cell[] row;
    // readRow() returns null when there are no more rows
    while ((row = rowreader.readRow()) != null) {
        printRow(rowPos, row);
        rowPos++;
    }
}

Resource Usage

CPU: 3% (no change)
Memory: 61% (a ~7% increase, roughly 1 GB used by the job)

Output

Java

BN1048519=40298 BO1048519=0 BP1048519=0 BQ1048519=0 BR1048519=0 BS1048519=610 BT1048519=0 BU1048519=1 BV1048519=64240 BW1048519=923 BX1048519=158 BY1048519=32 BZ1048519=0 CA1048519=0 CB1048519=0 CC1048519=0 CD1048519=0 CE1048519=0 CF1048519=0 CG1048519=0 CH1048519=10000206 CI1048519=0
A1048520=100.64.0.2-10.16.0.9-35919-8080-6 B1048520=100.64.0.2 C1048520=35919 D1048520=10.16.0.9 E1048520=8080 F1048520=6 G1048520=45266.83932053241 H1048520=41626 I1048520=6 J1048520=5 K1048520=515 L1048520=357 M1048520=515 N1048520=0 O1048520=85.8333333333333 P1048520=210.24786958888899 Q1048520=357 R1048520=0 S1048520=71.400000000000006 T1048520=159.65525359348399 U1048520=20948.445682986501 V1048520=264.25791572574798 W1048520=4162.5999999999904 X1048520=12728.124713056101 Y1048520=40374 Z1048520=9 AA1048520=41626 AB1048520=8325.2000000000007 AC1048520=17922.528077813098 AD1048520=40374 AE1048520=29 AF1048520=41594 AG1048520=10398.5 AH1048520=20011.5685292282 AI1048520=40406 AJ1048520=26 AK1048520=1 AL1048520=1 AM1048520=0 AN1048520=0 AO1048520=0 AP1048520=0 AQ1048520=200 AR1048520=168 AS1048520=144.14068130495301 AT1048520=120.11723442079401 AU1048520=0 AV1048520=515 AW1048520=79.272727272727295 AX1048520=179.87445116474399 AY1048520=32354.8181818181 AZ1048520=2 BA1048520=2 BB1048520=0 BC1048520=2 BD1048520=10 BE1048520=0 BF1048520=0 BG1048520=0 BH1048520=0.83333333333333304 BI1048520=79.272727272727195 BJ1048520=85.8333333333333 BK1048520=71.400000000000006 BL1048520=0 BM1048520=0 BN1048520=0 BO1048520=0 BP1048520=0 BQ1048520=0 BR1048520=0 BS1048520=46 BT1048520=0 BU1048520=32 BV1048520=64240 BW1048520=502 BX1048520=1 BY1048520=32 BZ1048520=0 CA1048520=0 CB1048520=0 CC1048520=0 CD1048520=0 CE1048520=0 CF1048520=0 CG1048520=0 CH1048520=41626 CI1048520=0
A1048521=100.64.0.2-10.16.0.9-9479-8080-6 B1048521=100.64.0.2 C1048521=9479 D1048521=10.16.0.9 E1048521=8080 F1048521=6 G1048521=45266.835683206016 H1048521=111205 I1048521=6 J1048521=5 K1048521=537 L1048521=357

Performance Results

Woo-hoo! I can read the records. The winner is "sjxlsx." In this test, the library consumed less than 1 GB of memory, while Apache POI exhausted the heap. It is an excellent open-source Java API for reading large Excel datasets.

Additional Features

In addition to reading, sjxlsx supports writing Excel files and rich data formatting in sheets.
Java

public static void main(String[] args) throws Exception {
    // WRITE - we treat WRITE as a special kind of MODIFY
    SimpleXLSXWorkbook workbook = newWorkbook();
    OutputStream output = ExcelOutput("write");
    WriteExcel(workbook, output);
    output.close();
}

private static SimpleXLSXWorkbook newWorkbook() {
    return new SimpleXLSXWorkbook(new File("/sample.xlsx"));
}

private static OutputStream ExcelOutput(String suffix) throws Exception {
    return new BufferedOutputStream(new FileOutputStream("/sample_" + suffix + ".xlsx"));
}

public static void WriteExcel(SimpleXLSXWorkbook workbook, OutputStream outputStream) throws Exception {
    com.incesoft.tools.excel.xlsx.Sheet sheet = workbook.getSheet(0);
    WriteRecords(workbook, sheet, 0);
    workbook.commit(outputStream);
}

static public void WriteRecords(SimpleXLSXWorkbook wb, com.incesoft.tools.excel.xlsx.Sheet sheet, int rowOffset) {
    int columnCount = 10;
    int rowCount = 10;
    int offset = rowOffset;
    for (int r = offset; r < offset + rowCount; r++) {
        int modfiedRowLength = sheet.getModfiedRowLength();
        for (int c = 0; c < columnCount; c++) {
            sheet.modify(modfiedRowLength, c, r + "," + c, null);
        }
    }
}

Writing a Richly Styled Row

Java

public static void WriteRichStyleRow(SimpleXLSXWorkbook wb, com.incesoft.tools.excel.xlsx.Sheet sheet) throws Exception {
    Font font = wb.createFont();
    font.setColor("FFFF0000");
    Fill fill = wb.createFill();
    fill.setFgColor("FF00FF00");
    CellStyle style = wb.createStyle(font, fill);

    RichText richText = wb.createRichText();
    richText.setText("test_text");
    Font font2 = wb.createFont();
    font2.setColor("FFFF0000");
    richText.applyFont(font2, 1, 2);

    sheet.modify(0, 0, (String) null, style);
    sheet.modify(1, 0, richText, null);
}

Summary

Ultimately, "sjxlsx" provides an efficient, lightweight way to read large Excel files without infrastructure headaches. Thank you!

By Mahendran Chinnaiah
A Fresh Look at Optimizing Apache Spark Programs

I have spent countless hours debugging slow Spark jobs, and it almost always comes down to a handful of common pitfalls. Apache Spark is a powerful distributed processing engine, but getting top performance requires more than just running your code on a cluster. Even with Spark's built-in Catalyst optimizer and Tungsten execution engine, a poorly written or configured Spark job can run slowly or inefficiently. In my years as a software engineer, I have learned that getting top performance from Spark requires moving beyond the defaults and treating performance tuning as a core part of the development process. In this article, I will share the practical lessons I use to optimize Spark programs for speed and resource efficiency.

Overview: The goal is to tackle performance from every angle. We will start at the top with cluster-level configurations like resource allocation and memory, then dive right into the code to cover best practices for writing efficient Spark APIs. From there, we will get into the often overlooked but critical layer of data storage and formatting for faster I/O. To wrap it all up, we will see how monitoring the Spark UI and logs is key to refining performance over time.

Prelude: Understanding Spark's Architecture and Lazy Evaluation

Before we get into performance optimization, it helps to anchor on how Spark runs your program.

Diagram: Spark architecture

Driver: The driver program runs your main Spark application, builds a logical plan (a DAG of transformations), turns it into a physical plan, and schedules tasks across the executors. It tracks job progress and collects results.

Executors: Executors live on worker nodes. They run tasks in parallel, keep partitions of data in memory or on disk, and spill to disk if needed. Executors are where your transformations actually execute.

Cluster manager (allocator): YARN, Kubernetes, or Spark's standalone manager decides how many executors you get and where they run. With the right settings, Spark can scale executors up and down as the workload changes.

Lazy evaluation: Transformations like select, filter, groupBy, and join do not execute immediately. Spark builds a plan and only materializes work when you call an action (count, collect, write, show, etc.). This gives Spark the chance to:

Push filters down to the data source
Prune unused columns
Reorder operations
Choose efficient join/partition strategies
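To make lazy evaluation concrete, here is a minimal PySpark sketch (the path and column names are invented for illustration); nothing is read or computed until the count() action at the end:

Python

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lazy-eval-sketch").getOrCreate()

# Transformations below only build a logical plan; no data is read yet
events = spark.read.parquet("/data/events")                   # hypothetical path
recent = events.filter(F.col("event_date") >= "2025-01-01")   # filter can be pushed to the scan
projected = recent.select("user_id", "event_type")            # unused columns get pruned

# The action triggers Catalyst planning and actual execution
print(projected.count())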
Spark Configuration Tuning

Spark offers many knobs and configuration properties that can significantly impact performance. Proper tuning of these settings ensures that your Spark cluster resources are used optimally. Key areas include dynamic resource allocation, memory management, and setting the right level of parallelism for shuffles.

1. Dynamic Resource Allocation

By default, Spark uses a static number of executors for the entire application. However, in real-world scenarios, data sizes can vary from one job run to another, or even between stages of a single job. Allocating a fixed number of executors can either waste resources or cause bottlenecks. Dynamic resource allocation allows Spark to scale the number of executors up or down based on workload: Spark adds executors when your job has a lot of work and removes them when they are no longer needed, improving efficiency. To enable dynamic allocation, set spark.dynamicAllocation.enabled to true in your Spark configuration.

You should also ensure the cluster is configured to support this by enabling either the external shuffle service or Spark's internal shuffle tracking (via spark.shuffle.service.enabled or spark.dynamicAllocation.shuffleTracking.enabled). Additionally, you can tune related settings such as the minimum and maximum number of executors Spark can allocate (spark.dynamicAllocation.minExecutors and spark.dynamicAllocation.maxExecutors, respectively). For example, you might allow Spark to scale between 2 and 10 executors depending on load:

Python

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.dynamicAllocation.enabled", "true") \
    .config("spark.dynamicAllocation.minExecutors", 2) \
    .config("spark.dynamicAllocation.maxExecutors", 10) \
    .getOrCreate()

With dynamic allocation on, Spark will automatically "stretch" or "shrink" your cluster usage as needed, so you don't have to guess the perfect number of executors for each job. This is especially useful for multi-tenant environments or jobs with varying input sizes.

2. Memory Configuration

Memory is a critical resource in Spark. If executors or the driver run out of memory, you will get the dreaded out-of-memory (OOM) errors, which often crash the job. Proper memory tuning helps avoid these issues. When you allocate memory (spark.executor.memory for executors, spark.driver.memory for the driver), it is important to understand that these two serve different purposes. The driver uses its memory mainly for job coordination (building the DAG, scheduling tasks, and collecting results), not for holding large datasets. The executors, on the other hand, use their memory for actual data processing and caching, and this is where most tuning matters.

Each executor JVM divides its allocated heap into regions for different uses:

Execution memory (for shuffles, sorts, joins, and aggregations)
Storage memory (for caching datasets and broadcast variables)
User and system memory (for internal metadata and user objects)

The parameter spark.memory.fraction controls what portion of the executor's JVM heap (minus a small reserved chunk) is set aside for the combined execution + storage regions, with the remainder left for user objects and internal overhead. In recent Spark releases the default is 0.6 (Spark 1.6 used 0.75), meaning roughly 60% of the usable heap is available for Spark's data operations, with the rest reserved for overhead and bookkeeping. Within that unified region, Spark dynamically manages execution and storage memory, giving preference to computation when needed (e.g., a large join) and evicting cached blocks if space runs short.

For most applications, this default split works well, but certain workloads benefit from fine-tuning. For example, if you cache large datasets, increasing spark.memory.fraction (e.g., toward 0.7 or 0.8) gives Spark's unified memory region more room for storage memory, allowing more data to stay in memory and reducing cache evictions. However, this leaves less JVM heap available for temporary objects, which can increase the chance of garbage collection pauses or spills during computation. Conversely, if your job involves heavy joins, aggregations, or UDFs with little caching, lowering the fraction (e.g., toward 0.5) shrinks the unified region but gives the JVM more free heap outside Spark's managed memory, helping it handle transient objects and preventing out-of-memory errors during shuffle-intensive workloads.
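To show where these knobs live, here is a minimal sketch of a session configured for a cache-heavy job; the values are illustrative assumptions, not recommendations, and in many deployments spark.executor.memory is set at spark-submit time or by the cluster manager rather than in code:

Python

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("memory-tuning-sketch")
    # Executor heap size; must fit within what the cluster manager can grant
    .config("spark.executor.memory", "8g")
    .config("spark.driver.memory", "4g")
    # Nudge the unified execution+storage region up from the 0.6 default
    # because this (hypothetical) job caches a large DataFrame
    .config("spark.memory.fraction", "0.7")
    .getOrCreate()
)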
The key is to avoid excessive spilling (which occurs when Spark runs out of in-memory space for shuffles) while leaving enough headroom to prevent OOM errors. You can monitor this balance in the Spark UI's Executors tab: frequent disk spills or long GC times are signals that the memory configuration needs adjustment. In summary, allocate sufficient memory to each executor (where most of the work happens), and be mindful of how that memory is divided. Keeping the defaults is often safest unless you have specific caching or shuffle-heavy workloads. And of course, never exceed the physical memory available on a node, or the OS may terminate your processes.

3. Shuffle Parallelism (Partitions)

Another important configuration is the degree of parallelism for shuffle operations. Spark splits data into partitions and runs one task per partition; too few partitions means each task handles a lot of data (potentially straining memory and increasing shuffle time), while too many partitions means lots of small tasks with scheduling overhead and possibly too many small output files. By default, Spark uses 200 shuffle partitions for joins and aggregations (spark.sql.shuffle.partitions = 200). This default may not be optimal for your data size or cluster, and tuning the number of partitions can lead to better performance.

For jobs with very large datasets, you might increase the partition count to get more parallelism. For example, if processing a 1 TB dataset, 200 partitions (~5 GB per partition) is likely too few; you might raise it to 1,000 or more so each partition is ~1 GB or less. On the other hand, for smaller jobs or after filtering data down, 200 partitions might be too many (each partition too small). In such cases, reducing the number of partitions cuts down scheduling overhead and small-task inefficiencies. Spark's repartition() and coalesce() transformations, or SQL hints, can also be used in code to increase or decrease partitions for a given DataFrame.

A common rule of thumb is to aim for 100 MB–200 MB of data per partition for efficient processing. You can estimate this by taking your input data size and dividing by a target partition size. For example, if you expect to process ~50 GB of data, using about 250–500 partitions might be reasonable (since 50 GB / 200 MB ≈ 250). Conversely, if you have a small dataset of 1 GB, having 200 partitions (each ~5 MB) is overkill; you could reduce spark.sql.shuffle.partitions to maybe 10 or 20 in that case. The goal is balanced partitions: not so large that they cause memory issues, and not so small that they cause excessive task overhead.

In Spark 3.x and beyond, there is also the Adaptive Query Execution (AQE) feature, which can automatically optimize the number of post-shuffle partitions at runtime. When AQE is enabled (spark.sql.adaptive.enabled=true by default in recent versions), Spark will merge small shuffle partitions or split skewed ones on the fly based on actual data statistics. This helps mitigate suboptimal partition counts. We'll touch on AQE again later, but it is good to be aware that Spark itself can adjust some of this if configured.

Takeaway: Don't just stick with the default 200 partitions for all jobs. Tune it or let AQE handle it, but ensure your tasks are neither overwhelmed by data nor spending more time coordinating than computing.
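As a hedged sketch of both approaches (the numbers are illustrative, not recommendations), you can size spark.sql.shuffle.partitions yourself or lean on AQE, and use repartition()/coalesce() for a specific DataFrame:

Python

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shuffle-partition-sketch")
    # ~50 GB input / ~200 MB per partition ≈ 250 shuffle partitions (example only)
    .config("spark.sql.shuffle.partitions", 250)
    # Or let Adaptive Query Execution coalesce/split post-shuffle partitions at runtime
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .getOrCreate()
)

df = spark.range(1_000_000)
wide = df.repartition(250, "id")   # full shuffle into 250 partitions keyed by "id"
narrow = wide.coalesce(50)         # merge down to 50 partitions without a full shuffle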
Code Level Optimization

Optimizing Spark is not just about configs; how you write your Spark code has a big impact on performance. This section covers best practices for coding your transformations and actions to minimize expensive operations and fully leverage Spark's internal optimizations. The guiding principle is to work with Spark, not against it: use high-level APIs that allow Spark to optimize under the hood, and be mindful of operations that cause a lot of data movement or computation.

1. Prefer DataFrame/Dataset APIs Over RDDs

Apache Spark started with the low-level RDD API, but nowadays you should favor the higher-level DataFrame or Dataset APIs for most tasks. DataFrames and Datasets provide declarative, SQL-like operations (e.g., select, filter, groupBy) that Spark's Catalyst optimizer can analyze and optimize aggressively. In contrast, RDDs (with manual map, reduce, etc.) do not get those optimizations. The result: DataFrame/Dataset operations are generally much faster and more memory efficient than equivalent RDD code, because Spark can optimize the query plan and even generate efficient low-level code (thanks to Tungsten).

For example, if you want to aggregate data by key, doing it as a DataFrame with df.groupBy("key").agg(sum("value")) lets Spark decide the best execution plan. If you did the same with RDDs (e.g., using rdd.groupByKey() and then summing), Spark has less information to optimize, and the operation could be slower or use more memory. In short, leverage Spark's built-in engine by using DataFrames/Datasets and express your computations at a high level. This not only gives performance benefits, but typically leads to cleaner and more concise code as well. The only time you might need RDDs is for very specialized processing or when using certain legacy libraries, and even then you can convert between DataFrames and RDDs as needed. As a rule of thumb, if an operation is supported in the DataFrame API, use it instead of the RDD API. High-level APIs let Spark do the hard work of optimization for you.

2. Filter Early and Minimize Data Shuffles

Minimizing data movement (shuffling) is critical for performance. Shuffles (like those that happen during joins or aggregations) redistribute data across the network and are often the most expensive part of a Spark job. One of the best ways to cut shuffle cost is simply to reduce the data size early in your job. Apply filters and projections as soon as possible in your ETL pipeline so that subsequent operations work on less data. If you only need a subset of data, do not wait until after a big join to filter it; push those filters up front. Removing even 10% of the data upfront can ripple through and speed up downstream shuffles and computations.

For example, if you are processing logs for a specific date range, filter by date as the first step rather than after a bunch of other processing. This "predicate pushdown" approach often happens automatically with DataFrame queries on persistent data (Parquet will only read the needed partitions), but it is also a mindset for your transformations: do the narrow, data-reducing transformations early, and the wide, expensive ones (like joins) later, on the smaller data. In addition, consider partitioning your data before heavy operations to avoid skew and balance the load. If you know you will join two DataFrames on a key that is very skewed, you might repartition the DataFrame by that key first to ensure even distribution, or use techniques like salting (a brief sketch of the filter-early pattern follows).
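Here is a minimal sketch of that pattern with invented paths and column names: filter and project first, then repartition by the join key before the expensive join (salting would be the alternative for a single hot key):

Python

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("filter-early-sketch").getOrCreate()

logs = spark.read.parquet("/data/access_logs")   # hypothetical large fact table
users = spark.read.parquet("/data/users")        # hypothetical smaller dimension table

# 1. Filter and project as early as possible so downstream shuffles move less data
recent_logs = (
    logs
    .filter(F.col("log_date").between("2025-10-01", "2025-10-07"))
    .select("user_id", "url", "response_ms")
)

# 2. Repartition by the join key before the shuffle-heavy join
joined = (
    recent_logs.repartition(200, "user_id")
    .join(users.select("user_id", "country"), on="user_id")
)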
On the other hand, if you have a heavily partitioned dataset but your next stage is an aggregation that does not need so many partitions, you could coalesce it down to avoid unnecessary shuffle partitions. In summary, think about the data volume at each step and try to keep it minimal. Every time you cut down data size or avoid an unnecessary shuffle, you win performance.

3. Use Efficient Join Strategies (Broadcast Small Data)

Joins are a common source of performance problems in Spark, especially when joining a very large dataset with a smaller one. A standard join shuffles both sides across the network to group matching keys together, which can be expensive. However, Spark provides an optimization for the scenario where one dataset is much smaller than the other: broadcast joins. A broadcast join sends a copy of the small dataset to every executor, so Spark can join without shuffling the large dataset at all; each executor already has the small dataset locally in memory. This can dramatically speed up joins where, say, you have a large fact table and a small dimension table (e.g., a few MBs or a few thousand records).

Spark's Catalyst optimizer will automatically broadcast a table in a join if it estimates the table size is below a certain threshold (controlled by spark.sql.autoBroadcastJoinThreshold, default ~10 MB). But sometimes you know better which table is small, or the estimates are off, so you can also hint or specify a broadcast manually. Using the DataFrame API in PySpark/Scala, you can import the broadcast function and apply it to the smaller DataFrame in a join, for example:

Python

from pyspark.sql.functions import broadcast

# large_df: a huge DataFrame
# small_df: a small DataFrame that fits in memory
result_df = large_df.join(broadcast(small_df), on="customer_id")

In the code above, small_df will be broadcast to all executors, and the join can then be performed without a costly shuffle of large_df. Broadcast joins work best when the smaller dataset easily fits in each executor's memory. If the "small" dataset is too large to broadcast (bigger than the threshold or memory limits), a broadcast join can backfire by causing memory pressure, so reserve it for truly small reference data. Another join strategy is to ensure that both sides of a large join are partitioned by the join key (using repartition on the key, or ensuring the data was written partitioned by that key). If both datasets are partitioned on the join key, Spark will still do a shuffle join, but it can make sure the same keys end up in the same partition with potentially less network traffic (this is more advanced and often handled by Catalyst/AQE).

4. Avoid UDFs and Utilize Built-in Functions

Spark allows you to define user-defined functions (UDFs) to run custom code on DataFrame rows, but they come at a cost. UDFs (especially in PySpark) break the optimized flow: they force Spark to serialize data to Python, execute your Python code, and deserialize the results. They are also "black boxes" to the Catalyst optimizer (Spark cannot look inside them to apply further optimizations). All this means UDFs typically run much slower than Spark's native operations. Whenever possible, use Spark's built-in functions (in pyspark.sql.functions or the DataFrame API) instead of writing a UDF. Spark has a rich set of functions for string manipulation, date processing, aggregations, and more, which run in the JVM and can be vectorized and optimized by Spark.
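As a small illustration of the gap (the data and column names are invented), compare a row-at-a-time Python UDF with a built-in that achieves the same result:

Python

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-vs-builtin-sketch").getOrCreate()
df = spark.createDataFrame([("[email protected]",), ("[email protected]",)], ["email"])

# Slow path: a Python UDF is a black box to Catalyst and pays serialization costs
domain_udf = F.udf(lambda e: e.split("@")[-1] if e else None, StringType())
with_udf = df.withColumn("domain", domain_udf("email"))

# Fast path: the built-in runs inside the JVM and stays visible to the optimizer
with_builtin = df.withColumn("domain", F.substring_index("email", "@", -1))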
Another example: to convert a string column to uppercase, the built-in upper(col("name")) is far faster than a Python UDF doing the same, because the built-in version executes within Spark's engine (and even leverages optimized Java code). If you absolutely must use a UDF (for example, you need complex logic not available as a built-in, or you want to apply a specific Python library to your data), consider using Pandas UDFs (vectorized UDFs). Pandas UDFs process a batch of rows at once and use Apache Arrow for data transfer between the JVM and Python, which is much more efficient than the one-row-at-a-time standard UDF. They were introduced in Spark 2.3+ and can often get performance closer to built-ins while still letting you write Python for the custom logic. To summarize: avoid UDFs unless necessary. Check Spark's function library first; chances are there is already a way to do what you need. By using built-in functions, you keep execution within the optimized Spark engine and benefit from Catalyst's ability to rearrange and optimize those operations.

5. Cache Judiciously to Reuse Data

Spark's lazy evaluation means it will recompute transformations each time an action is called, unless it can optimize otherwise. If you have a DataFrame or RDD that you need to use multiple times in your job (say, in multiple actions or multiple joins), it might make sense to cache (persist) that data in memory. Caching can save expensive recomputation in iterative algorithms or complex pipelines. However, caching is not a silver bullet: use it only for datasets that are reused enough to pay off the cost of caching. If you cache something that is only used once, you incur the extra overhead of storing it for no benefit.

Identify the points in your job where caching makes a difference. For example, if you filter and transform a base DataFrame and then use the result in two separate joins, caching the result of that filter-and-transform step may be worthwhile so the step is not redone for each join. When using df.cache() or df.persist(), prefer the MEMORY_ONLY storage level for purely in-memory caching, or MEMORY_AND_DISK if the data is too large (Spark will spill to disk when it cannot fit in RAM). Also remember to unpersist the data once it is no longer needed, to free up executor memory.

A common pattern is to cache the "core" dataset that many downstream operations depend on. For instance, in a machine learning pipeline, you might cache the prepared feature DataFrame that is used to train multiple models. Or in an ETL job, cache the cleaned data that will be aggregated in multiple ways. This can give significant speedups. Spark will automatically avoid recomputation of some intermediate results thanks to DAG optimization, but it is not omniscient; explicit caching is your tool to hint to Spark, "I'm going to use this again, keep it around." Used properly, caching and persistence are powerful optimization tools (a short sketch follows).
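A minimal sketch of that pattern, assuming a hypothetical transactions dataset whose cleaned form feeds two different aggregations:

Python

from pyspark.sql import SparkSession, functions as F
from pyspark import StorageLevel

spark = SparkSession.builder.appName("caching-sketch").getOrCreate()

# An expensive filter/transform step that two downstream aggregations reuse
cleaned = (
    spark.read.parquet("/data/transactions")                  # hypothetical input
    .filter(F.col("status") == "COMPLETED")
    .withColumn("amount_usd", F.col("amount") * F.col("fx_rate"))
)

# Persist once; MEMORY_AND_DISK spills to disk if the data cannot fit in RAM
cleaned.persist(StorageLevel.MEMORY_AND_DISK)

by_country = cleaned.groupBy("country").agg(F.sum("amount_usd").alias("total_usd"))
by_day = cleaned.groupBy("txn_date").agg(F.count("*").alias("txn_count"))

by_country.write.mode("overwrite").parquet("/out/by_country")
by_day.write.mode("overwrite").parquet("/out/by_day")

# Free executor memory once the cached data is no longer needed
cleaned.unpersist()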
Data Storage and Format Optimization

Optimizing how data is stored and accessed can greatly speed up Spark jobs, especially for I/O-heavy workloads. This is an often overlooked aspect of performance. By choosing efficient data formats and organizing data cleverly, you can reduce the amount of data Spark needs to read and write, accelerating your jobs.

Use Columnar, Compressed File Formats

If you have a choice, store your data in a columnar format like Parquet or ORC rather than as plain text (CSV/JSON) or other row-based formats. Columnar formats are optimized for analytic queries: Spark can skip reading columns that are not needed (saving I/O) and apply predicate pushdown (only read data that matches filters). They also integrate well with Spark's off-heap memory management. Additionally, they support compression and encoding that significantly reduce file sizes. For example, Parquet with Snappy compression often gives huge space savings and speeds up reads because less data has to be transferred from disk. The compression is automatic and applied per block, so Spark can still read in parallel. In short, prefer Parquet/ORC with compression for Spark data; your jobs will thank you with faster disk and network operations.

Partition Data on Disk for Pruning

If your data is large and stored in a distributed filesystem (like HDFS or S3), organize it into partitioned directories by key fields that you commonly filter by. For instance, it is very common to partition datasets by date (.../year=2025/month=10/day=11/, etc.). When you load a partitioned table and apply a filter like WHERE year=2025 AND month=10, Spark (through the Hive Metastore or its internal logic) will only read the folders matching those partition values and skip the rest entirely. This partition pruning can cut runtimes dramatically for time-range queries or other partition-key filters. Choose partition columns with enough granularity to split the data, but not such high cardinality that you end up with a million tiny files. Dates, regions, or categorical fields are good candidates.

On a related note, avoid creating too many small files in the output. If you have, say, 1,000 partitions each producing a small file of a few KB, Spark's read overhead increases (and many files can strain the file system). It is often better to coalesce or repartition data before writing so you end up with larger, chunkier files (e.g., 100 MB or more each). The goal is a balance: not one huge monolithic file (which cannot be read in parallel efficiently), but also not tens of thousands of tiny files. File I/O operations have overhead, so fewer, larger files are usually better for throughput.

Bucketing and Sorting

For certain use cases (especially joining on a key or doing aggregations), partitioning might not be enough. Spark offers bucketing: hash-partitioning data into a fixed number of files by a column, optionally sorting within each bucket. If two tables are bucketed by the same key into the same number of buckets, Spark can join or aggregate them with less shuffling. Bucketing is a more advanced technique, adds complexity, and is static once set up, but it can be worth it for repeated queries on huge data. If you control how the data is written, consider bucketing on high-cardinality join keys to optimize those operations.

In summary, store data in a Spark-friendly way: use the right formats (columnar + compressed), partition the data for predicate pushdown, and write outputs in appropriately sized files (see the sketch below). These storage-layer optimizations often make a big difference, because they reduce the amount of work Spark has to do before your actual computation even starts.
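Putting the storage advice together, here is a hedged sketch with invented paths, columns, and sizes: compressed Parquet partitioned by date with a controlled file count, plus a bucketed copy of a frequently joined table:

Python

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-layout-sketch").getOrCreate()

events = spark.read.parquet("/data/events_raw")   # hypothetical input with year/month/day columns
users = spark.read.parquet("/data/users")         # hypothetical dimension table

# Columnar + compressed + partitioned output; repartition first to avoid a flood of tiny files
(
    events
    .repartition(100)
    .write
    .mode("overwrite")
    .option("compression", "snappy")
    .partitionBy("year", "month", "day")
    .parquet("/warehouse/events_clean")
)

# Bucket by a high-cardinality join key; bucketed writes must go through saveAsTable
(
    users
    .write
    .mode("overwrite")
    .bucketBy(64, "user_id")
    .sortBy("user_id")
    .saveAsTable("users_bucketed")
)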
Monitoring

Tuning is an iterative process. Even after applying the above strategies, it is important to monitor your Spark application's performance and identify any remaining bottlenecks. Spark provides excellent tools for this.

The Spark Web UI is your first stop. When you run a Spark application, you can access its UI (typically at http://<driver-node>:4040 locally, or via the tracking URL in cluster mode). Examine the Jobs and Stages tabs to see where time is being spent, and look at metrics like task duration, shuffle read/write, and memory spill. For example, if you see a stage with an enormous shuffle read size, that is a hint that you may need to reduce data shuffling or use a broadcast join. Large shuffle write metrics indicate a lot of data being shuffled, which might signal an opportunity to filter earlier or repartition differently. Similarly, if tasks are slow due to GC, memory tuning might be needed. The Spark UI also helps spot data skew: if one task in a stage runs much longer than the others, one partition probably had far more data, and you might address that with better partitioning, salting, or the spark.sql.adaptive.skewJoin.enabled feature in AQE.

Besides the UI, you can enable Spark event logs and use the Spark History Server to analyze runs after the fact. There are also external APM (application performance management) tools and metrics sinks (Ganglia, Prometheus, etc.) that can collect Spark metrics. Leverage these tools to gain insight into your job's behavior. It is often helpful to test your job on a sample of data or at a smaller scale, profile it, and catch issues before running on the full data.

Finally, remember that optimization is a balancing act and sometimes requires tradeoffs. The "best" settings depend on your specific data sizes, data distributions, and cluster resources. What works for one job (or one day's data) might need adjustment for another. By monitoring and understanding Spark's execution, you will build an intuition for where the bottleneck is (CPU, memory, network, or I/O) and apply the appropriate fix from the toolbox of techniques discussed above.

Key Takeaways

Leverage Spark's optimizer: Use high-level APIs (DataFrames/Datasets) and built-in functions so Spark's Catalyst and Tungsten engines can optimize your code. Avoid RDDs and UDFs unless absolutely necessary, as they bypass many optimizations.

Tune resources and configs: Adjust Spark configurations to your workload. Enable dynamic allocation for variable workloads, set appropriate executor memory to avoid OOM errors, and tune the number of shuffle partitions so tasks are neither too large nor too small.

Minimize data shuffling: Design your computations to reduce data movement. Filter data early in the pipeline and use techniques like broadcast joins for small lookup datasets. Where possible, partition or coalesce data to avoid skew and excessive tiny tasks.

Efficient data storage: Use columnar formats like Parquet/ORC with compression for input and output data. Partition your data files by common keys (e.g., date) so Spark can skip reading unnecessary data. Avoid writing a myriad of small files; aim for fewer, larger files for better I/O throughput.

Cache smartly: Cache/persist intermediate datasets only when they are reused multiple times in your job. This can save recomputation time for iterative algorithms or repeated joins, but do not waste memory caching data that is used just once.

Monitor and iterate: Use the Spark UI and logs to profile your job's performance. Identify stages with heavy shuffles, long tasks, or skew, and address them with the appropriate optimization techniques. Optimization is iterative: observe, tweak, and repeat to home in on the best performance for your Spark application.

By following these best practices and continuously observing your Spark jobs, you can significantly improve the speed and efficiency of your Apache Spark programs.

By Nataraj Mocherla
