A database is a collection of structured data stored in a computer system, hosted either on-premises or in the cloud. Because databases exist to make data easy to access, we have compiled our resources here so you can browse everything you need to know, from database management systems to database languages.
Top 5 Key Features of Apache Iceberg for Modern Data Lakes
AWS Cloud Security: Key Components, Common Vulnerabilities, and Best Practices
Large language models (LLMs) have drastically advanced natural language processing (NLP) by learning complex language patterns from vast datasets. Yet, when these models are combined with structured knowledge graphs — databases designed to represent relationships between entities — challenges arise. Knowledge graphs can be incredibly useful in providing structured knowledge that enhances an LLM's understanding of specific domains. However, as these graphs grow larger, they often become cumbersome, reducing their efficiency when queried. For example, an LLM tasked with answering questions or making decisions based on knowledge from a graph may take longer to retrieve relevant information if the graph is too large or cluttered with unnecessary details. This can increase computation times and limit the model’s scalability. A promising approach to address this issue is pruning, a method of selectively reducing the size of knowledge graphs while preserving their most relevant and important connections. Pruning graph databases can improve the knowledge representation in LLMs by removing irrelevant data, thus enabling faster and more focused knowledge retrieval. This article discusses the benefits and strategies for pruning knowledge graphs and how they can enhance LLM performance, particularly in domain-specific applications.

The Role of Graph Databases in Knowledge Representation

Graph databases are designed to store and query data in graph structures consisting of nodes (representing entities) and edges (representing relationships between entities). Knowledge graphs leverage this structure to represent complex relationships, such as those found in eCommerce systems, healthcare, finance, and many other domains. These graphs allow LLMs to access structured, domain-specific knowledge that supports more accurate predictions and responses. However, as the scope and size of these knowledge graphs grow, retrieving relevant information becomes more difficult. Inefficient traversal of large graphs can slow down model inference and increase the computational resources required. As LLMs scale, integrating knowledge graphs becomes a challenge unless methods are employed to optimize their size and structure. Pruning provides a solution to this challenge by focusing on the most relevant nodes and relationships and discarding the irrelevant ones.

Pruning Strategies for Graph Databases

To improve the efficiency and performance of LLMs that rely on knowledge graphs, several pruning strategies can be applied:

Relevance-Based Pruning

Relevance-based pruning focuses on identifying and retaining only the most important entities and relationships relevant to a specific application. In an eCommerce knowledge graph, for example, entities such as "product," "category," and "customer" might be essential for tasks like recommendation systems, while more generic entities like "region" or "time of day" might be less relevant in certain contexts and can be pruned. Similarly, edges that represent relationships like "has discount" or "related to" may be removed if they don't directly impact key processes like product recommendations or personalized marketing strategies. By pruning less important nodes and edges, the knowledge graph becomes more focused, improving both the efficiency and accuracy of the LLM in handling specific tasks like generating product recommendations or optimizing dynamic pricing.
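To make the idea concrete, the following is a minimal sketch of relevance-based pruning in Python using the networkx library. The nodes, relevance scores, edge weights, and thresholds are illustrative assumptions rather than part of any particular system; in practice, relevance might be derived from query logs, embedding similarity, or domain rules.

Python
import networkx as nx

# Toy eCommerce knowledge graph; each node carries an assumed relevance score.
G = nx.Graph()
G.add_node("product:laptop", relevance=0.9)
G.add_node("category:electronics", relevance=0.8)
G.add_node("customer:42", relevance=0.85)
G.add_node("region:midwest", relevance=0.2)       # weakly relevant for recommendations
G.add_node("time_of_day:evening", relevance=0.1)  # weakly relevant for recommendations

G.add_edge("customer:42", "product:laptop", weight=0.9)
G.add_edge("product:laptop", "category:electronics", weight=0.8)
G.add_edge("customer:42", "region:midwest", weight=0.3)
G.add_edge("product:laptop", "time_of_day:evening", weight=0.1)

def prune(graph, node_threshold=0.5, edge_threshold=0.4):
    """Return a pruned copy: drop low-relevance nodes, weak edges, and isolated leftovers."""
    pruned = graph.copy()
    low_relevance = [n for n, data in pruned.nodes(data=True)
                     if data.get("relevance", 0.0) < node_threshold]
    pruned.remove_nodes_from(low_relevance)
    weak_edges = [(u, v) for u, v, data in pruned.edges(data=True)
                  if data.get("weight", 0.0) < edge_threshold]
    pruned.remove_edges_from(weak_edges)
    pruned.remove_nodes_from(list(nx.isolates(pruned)))  # drop nodes left with no connections
    return pruned

pruned_graph = prune(G)
print(f"{G.number_of_nodes()} nodes -> {pruned_graph.number_of_nodes()} nodes after pruning")

The same thresholding idea carries over to the edge, node, and subgraph pruning strategies described next.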
Edge and Node Pruning

Edge and node pruning involves removing entire nodes or edges based on certain criteria, such as nodes with few connections or edges with minimal relevance to the task at hand. For example, if a node in a graph has low importance — such as a product that rarely receives customer interest — it might be pruned, along with its associated edges. Similarly, edges that connect less important nodes or represent weak relationships may be discarded. This method aims to maintain the essential structure of the graph while simplifying it, removing redundant or irrelevant elements to improve processing speed and reduce computation time.

Subgraph Pruning

Subgraph pruning involves removing entire subgraphs from the knowledge graph if they are not relevant to the task at hand. For instance, in an eCommerce scenario, subgraphs related to "customer support" might be irrelevant for a model tasked with product recommendations, so these can be pruned without affecting the quality of the primary tasks. This targeted pruning helps reduce the size of the graph while ensuring that only pertinent data remains for knowledge retrieval.

Impact on LLM Performance

Speed and Computational Efficiency

One of the most significant advantages of pruning is its impact on the speed and efficiency of LLMs. By reducing the size of the knowledge graph through pruning, the graph becomes easier to traverse and query. This results in faster knowledge retrieval, which directly translates to reduced inference times for LLM-based applications. For example, if a graph contains thousands of irrelevant relationships, pruning those out allows the model to focus on the most relevant data, speeding up decision-making processes in real-time applications like personalized product recommendations.

Accuracy in Domain-Specific Tasks

Pruning irrelevant information from a graph also helps improve the accuracy of LLMs in domain-specific tasks. By focusing on the most pertinent knowledge, LLMs can generate more accurate responses. In an eCommerce setting, this means better product recommendations, more effective search results, and an overall more optimized customer experience. Moreover, pruning ensures that the model’s focus is on high-quality, relevant data, reducing the chances of confusion or misinterpretation of less relevant details.

Conclusion

Pruning techniques offer a practical and effective approach to optimizing the integration of graph databases in large language models. By selectively reducing the complexity and size of knowledge graphs, pruning helps improve the retrieval speed, accuracy, and overall efficiency of LLMs. In domain-specific applications, such as eCommerce, healthcare, or finance, pruning can significantly enhance performance by allowing LLMs to focus on the most relevant data for their tasks. As LLMs continue to evolve, the ability to integrate vast amounts of structured knowledge while maintaining computational efficiency will be crucial. Pruning is a valuable tool in this process, enabling LLMs to scale without sacrificing performance.
Master Database File (MDF) is the primary database file in MS SQL Server that stores all the data, including views, tables, stored procedures, foreign keys, and primary keys. Sometimes, while opening the MDF file, you may face an issue where the SQL Server database cannot be opened and encounter an error like one of those given below:

- SQL Server error 5171: MDF is not a primary database file
- Unable to open the file xxxxx.mdf. Operating system error 5: (Access is denied)
- Event ID 824: SQL Server detected a logical consistency-based I/O error
- Msg 8114, Level 16, State 5, Line 2: Error converting data type varchar to numeric

Such errors usually occur when there is corruption in the MDF file. In this article, we'll explain the possible reasons behind the corruption of MDF files and show how to repair corrupted MDF files and resolve the issue.

Causes of Corruption in the MDF File

There are various reasons that can lead to corruption in the SQL Server database (MDF) file. Some common ones are given below:

MS SQL Server or System Crash

MS SQL Server or the system may crash due to issues with the operating system, hardware, or software. If the server or system crashes while you are working on the MDF file, it can damage or corrupt the file.

Abrupt System Shutdown

Sudden power failure or forced shutdown of the system while the SQL Server is running can corrupt the MDF file stored on your system.

Malware or Virus Attacks

Malware or viruses can also damage MDF files and make them inaccessible. This usually occurs when your system does not have antivirus or anti-malware protection software.

Issues With Storage Drive

Bad sectors on the storage drive or file system errors can corrupt the SQL database files. If the drive is damaged, it can also cause corruption in the MDF files.

Methods to Resolve the 'Database cannot be opened' Issue Due to MDF File Corruption

If your MDF file is corrupted, you can follow the below-mentioned methods to repair and recover the corrupt MDF file, thus resolving the "Database cannot be opened" issue in SQL Server.

Method 1: Restore the MDF File From Backup

If you have created a backup of your MDF file, then you can restore the file from the backup. But before restoring, you need to check that your backup is complete and readable. For this, you can use the RESTORE VERIFYONLY command. If this command shows a success message, then you can proceed to restore the backup. For this, use the below Transact-SQL command:

MS SQL
RESTORE DATABASE [AdventureWorks2019]
FROM DISK = N'C:\backups\AdventureWorks2019.bak'
WITH RECOVERY, STATS = 10
GO

Alternatively, you can use SQL Server Management Studio (SSMS) to restore the backup.

Method 2: Use the DBCC CHECKDB Command

If the backup is not up to date or there is a problem while restoring the backup file, then you can use the DBCC CHECKDB command to repair the corrupt MDF file. To repair the database file, you need to make sure that you have admin rights on the database. Before repairing, set the database to single-user mode by using the below command:

MS SQL
ALTER DATABASE Dbtesting SET SINGLE_USER

If you are unable to set the database to SINGLE_USER mode, then verify that the AUTO_UPDATE_STATISTICS_ASYNC option is set to OFF.
Next, run the DBCC CHECKDB command as given below to repair the database (MDF) file:

MS SQL
DBCC CHECKDB (N'Dbtesting', REPAIR_ALLOW_DATA_LOSS) WITH ALL_ERRORMSGS, NO_INFOMSGS
GO

When the database is repaired, set it to MULTI_USER mode again by using the below command:

MS SQL
ALTER DATABASE Dbtesting SET MULTI_USER

Note: The DBCC CHECKDB command can help you repair the MDF file. However, while repairing the file, it may deallocate pages or rows, which can result in data loss.

Method 3: Use a Professional SQL Repair Tool

Repairing the SQL database using the DBCC CHECKDB command with the REPAIR_ALLOW_DATA_LOSS option can result in data loss. To avoid data loss, you can use a powerful MS SQL repair software, like Stellar Repair for MS SQL. This software can repair corrupt MDF files and restore all the file objects, like primary keys, tables, triggers, views, etc., without changing the original structure. It saves the repaired file data in a new MDF file and various other formats, such as CSV, HTML, and XLS. The software supports SQL database (MDF and NDF) files created on Windows and Linux systems.

To Conclude

The 'Database cannot be opened' issue in MS SQL Server can occur due to corruption in the MDF file. To fix this issue, the easiest way is to restore the MDF file from the last backup. If the backup file is obsolete or not working, you can use the DBCC CHECKDB command to repair the corrupt MDF file. If the DBCC CHECKDB command fails to repair the database, then use a powerful SQL repair tool, like Stellar Repair for MS SQL. It can repair corrupt MDF files and restore all the data to a new MDF file with complete integrity. You can install the free version of Stellar Repair for MS SQL to scan the corrupt MDF file and preview the recoverable data.
When working with a database, optimization is key to application performance and efficiency. In Azure Cosmos DB, optimization is crucial for maximizing efficiency, minimizing costs, and ensuring that your application scales effectively. Below are some of the best practices, with coding examples, to optimize performance in Azure Cosmos DB.

1. Selection of the Right Partition Key

Choosing an appropriate partition key is vital for distributed databases like Cosmos DB. A good partition key ensures that data is evenly distributed across partitions, reducing hot spots and improving performance. Selecting a partition key is simple, but it is very important at design time in Azure Cosmos DB: once the partition key is chosen, it isn't possible to change it in place.

Best Practice
- Select a partition key with high cardinality (many unique values).
- Ensure it distributes reads and writes evenly.
- Keep related data together to minimize cross-partition queries.

Example: Creating a Container With an Optimal Partition Key

C#
var database = await cosmosClient.CreateDatabaseIfNotExistsAsync("YourDatabase");
var containerProperties = new ContainerProperties
{
    Id = "myContainer",
    PartitionKeyPath = "/customerId" // Partition key selected to ensure balanced distribution
};

// Create the container with 400 RU/s provisioned throughput
var container = await database.CreateContainerIfNotExistsAsync(containerProperties, throughput: 400);

2. Properly Use Indexing

In Azure Cosmos DB, indexes are applied to all properties by default, which can be beneficial but may result in increased storage and RU/s costs. To enhance query performance and minimize expenses, consider customizing the indexing policy. Cosmos DB supports three types of indexes: Range Indexes, Spatial Indexes, and Composite Indexes. Use the proper index type wisely.

Best Practice
- Exclude unnecessary fields from indexing.
- Use composite indexes for multi-field queries.

Example: Custom Indexing Policy

JSON
{
  "indexingPolicy": {
    "automatic": true,
    "indexingMode": "consistent", // Can use 'none' or 'lazy' to reduce write costs
    "includedPaths": [
      {
        "path": "/orderDate/?", // Only index specific fields like orderDate
        "indexes": [
          { "kind": "Range", "dataType": "Number" }
        ]
      }
    ],
    "excludedPaths": [
      {
        "path": "/largeDataField/*" // Exclude large fields not used in queries
      }
    ]
  }
}

Example: Adding a Composite Index for Optimized Querying

JSON
{
  "indexingPolicy": {
    "compositeIndexes": [
      [
        { "path": "/lastName", "order": "ascending" },
        { "path": "/firstName", "order": "ascending" }
      ]
    ]
  }
}

You can read more about indexing types here.

3. Optimize Queries

Efficient querying is crucial for minimizing request units (RU/s) and improving performance in Azure Cosmos DB. The RU/s cost depends on the query's complexity and size. Utilizing bulk executors can further reduce costs by decreasing the RUs consumed per operation. This optimization helps manage RU usage effectively and lowers your overall Cosmos DB expenses.

Best Practice
- Keep SELECT projections narrow and retrieve only the necessary properties.
- Avoid cross-partition queries by providing the partition key in your query.
- Use filters on indexed fields to reduce query costs.
Example: Fetch Customer Record C# var query = new QueryDefinition("SELECT c.firstName, c.lastName FROM Customers c WHERE c.customerId = @customerId") .WithParameter("@customerId", "12345"); var iterator = container.GetItemQueryIterator<Customer>(query, requestOptions: new QueryRequestOptions { PartitionKey = new PartitionKey("12345") // Provide partition key to avoid cross-partition query }); while (iterator.HasMoreResults) { var response = await iterator.ReadNextAsync(); foreach (var customer in response) { Console.WriteLine($"{customer.firstName} {customer.lastName}"); } } 4. Consistency Levels Tuning The consistency levels define specific operational modes designed to meet speed-related guarantees. There are five consistency levels (Strong, Bounded Staleness, Session, Consistent Prefix, and Eventual) available in Cosmos DB. Each consistency level impacts latency, availability, and throughput. Best Practice Use Session consistency for most scenarios to balance performance and data consistency.Strong consistency guarantees data consistency but increases RU/s and latency. Example: Setting Consistency Level C# var cosmosClient = new CosmosClient( "<account-endpoint>", "<account-key>", new CosmosClientOptions { // Set consistency to "Session" for balanced performance ConsistencyLevel = ConsistencyLevel.Session }); Read more about the consistency level here. 5. Use Provisioned Throughput (RU/s) and Auto-Scale Wisely Provisioning throughput is a key factor in achieving both cost efficiency and optimal performance in Azure Cosmos DB. The service enables you to configure throughput in two ways: Fixed RU/s: A predefined, constant level of Request Units per second (RU/s), suitable for workloads with consistent performance demands.Auto-Scale: A dynamic option that automatically adjusts the throughput based on workload fluctuations, providing scalability while avoiding overprovisioning during periods of low activity. Choosing the appropriate throughput model helps balance performance needs with cost management effectively. Best Practice For predictable workloads, provision throughput manually.Use auto-scale for unpredictable or bursty workloads. Example: Provisioning Throughput With Auto-Scale C# var throughputProperties = ThroughputProperties.CreateAutoscaleThroughput(maxThroughput: 4000); // Autoscale up to 4000 RU/s var container = await database.CreateContainerIfNotExistsAsync(new ContainerProperties { Id = "autoscaleContainer", PartitionKeyPath = "/userId" }, throughputProperties); Example: Manually Setting Fixed RU/s for Stable Workloads C# var container = await database.CreateContainerIfNotExistsAsync(new ContainerProperties { Id = "manualThroughputContainer", PartitionKeyPath = "/departmentId" }, throughput: 1000); // Fixed 1000 RU/s 6. Leverage Change Feed for Efficient Real-Time Processing The change feed allows for real-time, event-driven processing by automatically capturing changes in the database, eliminating the need for polling. This reduces query overhead and enhances efficiency. Best Practice Use change feed for scenarios where real-time data changes need to be processed (e.g., real-time analytics, notifications, alerts). 
Example: Reading From the Change Feed

C#
var iterator = container.GetChangeFeedIterator<YourDataModel>(
    ChangeFeedStartFrom.Beginning(),
    ChangeFeedMode.Incremental);

while (iterator.HasMoreResults)
{
    var changes = await iterator.ReadNextAsync();
    foreach (var change in changes)
    {
        Console.WriteLine($"Detected change: {change.Id}");
        // Process the change (e.g., trigger event, update cache)
    }
}

7. Utilization of Time-to-Live (TTL) for Automatic Data Expiration

If you have data that is only relevant for a limited time, such as logs or session data, enabling Time-to-Live (TTL) in Azure Cosmos DB can help manage storage costs. TTL automatically deletes expired data after the specified retention period, eliminating the need for manual data cleanup. This approach not only reduces the amount of stored data but also ensures that your database is optimized for cost-efficiency by removing obsolete or unnecessary information.

Best Practice
- Set TTL for containers where data should expire automatically to reduce storage costs.

Example: Setting Time-to-Live (TTL) for Expiring Data

JSON
{
  "id": "sessionDataContainer",
  "partitionKey": { "paths": ["/sessionId"] },
  "defaultTtl": 3600 // 1 hour (3600 seconds)
}

In Cosmos DB, TTL is specified in seconds and is measured from an item's last modification. Setting defaultTtl to -1 keeps items from expiring unless an individual item specifies its own ttl value, while a positive value automatically deletes items that many seconds after they were last modified.

8. Avoid Cross-Partition Queries

Cross-partition queries can significantly increase RU/s and latency. To avoid this:

Best Practice
- Always include the partition key in your queries.
- Design your partition strategy to minimize cross-partition access.

Example: Querying With Partition Key to Avoid Cross-Partition Query

C#
var query = new QueryDefinition("SELECT * FROM Orders o WHERE o.customerId = @customerId")
    .WithParameter("@customerId", "12345");

var resultSetIterator = container.GetItemQueryIterator<Order>(query, requestOptions: new QueryRequestOptions
{
    PartitionKey = new PartitionKey("12345")
});

while (resultSetIterator.HasMoreResults)
{
    var response = await resultSetIterator.ReadNextAsync();
    foreach (var order in response)
    {
        Console.WriteLine($"Order ID: {order.Id}");
    }
}

Conclusion

These tips are most effective when applied during development. By implementing an effective partitioning strategy, customizing indexing policies, optimizing queries, adjusting consistency levels, and selecting the appropriate throughput provisioning models, you can greatly improve the performance and efficiency of your Azure Cosmos DB deployment. These optimizations not only enhance scalability but also help in managing costs while providing a high-performance database experience.
Ensuring database reliability can be difficult. Our goal is to speed up development and minimize rollbacks. We want developers to be able to work efficiently while taking ownership of their databases. Achieving this becomes much simpler when robust database observability is in place. Let’s explore how. Do Not Wait With Checks Teams aim to maintain continuous database reliability, focusing on ensuring their designs perform well in production, scale effectively, and allow for safe code deployments. To achieve this level of quality, they rely on a range of practices, including thorough testing, code reviews, automated CI/CD pipelines, and component monitoring. Despite these efforts, challenges persist. Database-related problems often go undetected during testing. This is because most tests prioritize the accuracy of data operations while overlooking performance considerations. As a result, even though the data may be handled correctly, the solution may perform too slowly for production needs, leading to failures and decreased customer satisfaction. Load testing adds further complications. These tests are complex to create and maintain, expensive to run, and usually occur too late in development. By the time load testing uncovers performance issues, the code has already been reviewed and merged, requiring developers to revisit and revise their designs to address the problems. A straightforward solution exists for addressing these challenges: implementing observability early in the pipeline. Utilizing effective observability tools, we can integrate directly with developers' environments to detect errors during the development phase. This allows us to monitor query performance, identify schema issues, and recognize design flaws — essentially catching anything that might cause problems in production. Addressing these issues early enables us to fix them at the source before they become larger concerns. Let Your Teams Shine Maintaining database reliability can be challenging for developers when they don't have full ownership of the process. It becomes even more difficult when multiple teams are involved, DBAs guard their responsibilities, and ticketing issues create bottlenecks. However, this can be resolved. Developers can significantly increase their speed and effectiveness when they have complete ownership. They excel when they control development, deployment, monitoring, and troubleshooting. To succeed, they need the right tools — observability solutions that offer actionable insights and automated troubleshooting rather than traditional monitoring tools that simply deliver raw data without context or understanding. We need a new approach. Instead of overwhelming developers with countless data points, we need solutions that analyze the entire SDLC and provide actionable insights with automated fixes. These tools should be able to optimize queries and offer suggestions to improve performance. Likewise, they should recommend schema enhancements and indexes and detect anomalies, automatically alerting developers when manual intervention is required for business-critical decisions that can't be resolved automatically. A paradigm shift is necessary as we move away from information overload towards more streamlined solutions that encapsulate the entire Software Development Life Cycle (SDLC). Our needs are twofold: firstly, a solution should autonomously scrutinize and analyze all aspects of our SDLC to provide concise answers. 
This includes optimizing SQL queries for better performance or identifying areas requiring schema enhancements like adding appropriate indexes based on certain parameters such as query patterns — essentially providing automated fixes when possible. Secondly, it needs the capability not only to detect discrepancies that require developer intervention but also to have an alert system in place for these complex issues which cannot be resolved by code changes alone and instead necessitate business decisions or architectural modifications. In essence, we are seeking a holistic solution with automated capabilities where possible; otherwise, it provides the necessary prompts to guide developers toward appropriate actions that ensure the robustness of our system across all deployment stages without overwhelming them with data points. Stop Being Afraid of Databases Modern observability elevates your team's database reliability by ensuring that developers' changes are safe for production, anomalies are detected early, and configurations are optimized for maximum performance. With effective observability tools in place, project management becomes smoother as developers can bypass time-consuming interactions with other teams. This allows you to focus on your core business, boost productivity, and scale your organization efficiently. Embracing cutting-edge monitoring tools brings about a noticeable improvement in our team's database dependability by validating the safety of developers’ code modifications for production environments. These modern observability solutions provide real-time anomaly detection while also optimizing configurations to achieve peak performance levels, thereby facilitating efficient project management without needing constant cross-communication with other teams — a common bottleneck in many organizations. As we shift our focus entirely onto core business tasks and activities, thanks to these tools' efficiency gains, productivity skyrockets, leading us towards effective scaling of the organization as well.
Classic Case 1 Many software professionals lack in-depth knowledge of TCP/IP logic reasoning, which often leads to misidentifying problems as mysterious problems. Some are discouraged by the complexity of TCP/IP networking literature, while others are misled by confusing details in Wireshark. For instance, a DBA facing performance problems might misinterpret packet capture data in Wireshark, erroneously concluding that TCP retransmissions are the cause. Figure 1. Packet capture screenshot provided by DBA suspecting retransmission problems Since retransmission is suspected, it's essential to understand its nature. Retransmission fundamentally involves timeout retransmission. To confirm if retransmission is indeed the cause, time-related information is necessary, which is not provided in the screenshot above. After requesting a new screenshot from the DBA, the timestamp information was included. Figure 2. Packet capture screenshot with time information added When analyzing network packets, timestamp information is crucial for accurate logical reasoning. A time difference in the microsecond range between two duplicate packets suggests either a timeout retransmission or duplicate packet capture. In a typical LAN environment with a Round-trip Time (RTT) of around 100 microseconds, where TCP retransmissions require at least one RTT, a retransmission occurring at just 1/100th of the RTT likely indicates duplicate packet capture rather than an actual timeout retransmission. Classic Case 2 Another classic case illustrates the importance of logical reasoning in network problem analysis. One day, one business developer came rushing over, saying that a scheduled script using the MySQL database middleware had failed in the early morning hours with no response. Upon hearing about the problem, I checked the error logs of the MySQL database middleware but found no valuable clues. So, I asked the developers if they could reproduce the problem, knowing that once reproducible, a problem becomes easier to solve. The developers tried multiple times to reproduce the problem but were unsuccessful. However, they made a new discovery: they found that executing the same SQL queries during the day resulted in different response times compared to the early morning. They suspected that when the SQL response was slow, the MySQL database middleware was blocking the session and not returning results to the client. Based on this insight, the database operations team was asked to modify the script's SQL to simulate a slow SQL response. As a result, the MySQL database middleware returned the results without encountering the hang problem seen in the early morning hours. For a while, the root cause couldn't be identified, and developers discovered a functional problem with the MySQL database middleware. Therefore, developers and DBA operations became more convinced that the MySQL database middleware was delaying responses. In reality, these problems were not related to the response times of the MySQL database middleware. From the events of the first day, the problem did indeed occur. Everyone involved tried to pinpoint the cause, making various guesses, but the true reason remained elusive. The next day, developers reported that the script problem reoccurred in the early morning, yet they couldn't reproduce it during the day. Developers, feeling pressured as the script was soon to be used online, complained about the situation. 
My only suggestion was for them to use the script during the day to avoid problems in the early morning. With all suspicions focused on the MySQL database middleware, it was challenging to analyze the problem from other perspectives. As a developer responsible for the MySQL database middleware, such mysterious problems cannot be easily overlooked. Ignoring them could impact subsequent use of the MySQL database middleware, and there is also pressure from leadership to solve the problem promptly. Finally, it was decided to implement a low-cost packet capture analysis solution: during the execution of the script in the early morning, packet captures would be performed on the server to analyze what was happening at that time. The goal was to determine if the MySQL database middleware either failed to send a response at all or if it did send a response that the client script did not receive. Once it could be confirmed that the MySQL database middleware did send a response, the problem would not be attributed to the MySQL database middleware developers. On the third day, developers reported that the early morning problem did not recur, and packet capture analysis confirmed that the problem did not occur. After careful consideration, it seemed unlikely that the problem was solely with the MySQL database middleware: frequent occurrences in the early morning and rare occurrences during the day were puzzling. The only course of action was to wait for the problem to occur again and analyze it based on the packet captures. On the fourth day, the problem did not surface again. However, on the fifth day, the problem finally reappeared, bringing hope for resolution. The packet capture files are numerous. First, ask the developers to provide the timestamp when the problem occurred, then search through the extensive packet capture data to identify the SQL query that caused the problem. The final result is as follows: Figure 3. Key packet information captured for problem resolution From the packet capture content above (captured from the server), it appears that the SQL query was sent at 3 AM. The MySQL database middleware took 630 seconds (03:10:30.899249-03:00:00.353157) to return the SQL response to the client, indicating that the MySQL database middleware did indeed respond to the SQL query. However, just 238 microseconds later (03:10:30.899487-03:10:30.899249), the server's TCP layer received a reset packet, which was suspiciously quick. It's important to note that this reset packet cannot be immediately assumed to be from the client. Firstly, it is necessary to confirm who sent the reset packet — either it was sent by the client or by an intermediate device along the way. Since packet capture was performed only on the server side, information about the client's packet situation is not available. By analyzing the packet capture files from the server side and applying logical reasoning, the aim is to identify the root cause of the problem. If the assumption is made that the client sent a reset, it would imply that the client's TCP layer no longer recognizes the TCP state of this connection — transitioning from an established state to a nonexistent one. This change in TCP state would notify the client application of a connection problem, causing the client script to immediately error out. However, in reality, the client script is still waiting for the response to come back. Therefore, the assumption that the client sent a reset does not hold true — the client did not send a reset. 
The client's connection is still active, but on the server side, the corresponding connection has been terminated by the reset. Who sent the reset, then? The primary suspect is Amazon's cloud environment. Based on this packet capture analysis, the DBA operations queried Amazon customer service and received the following information: Figure 4. Final response from Amazon customer service Customer service's response aligns with the analysis results, indicating that Amazon's ELB (Elastic Load Balancer, similar to LVS) forcibly terminated the TCP session. According to their feedback, if a response exceeds the 350-second threshold (as observed in the packet capture as 630 seconds), Amazon's ELB device sends a reset to the responding party (in this case, the server). The client scripts deployed by the developers did not receive the reset and mistakenly assumed the server connection was still active. Official recommendations for such problems include using TCP keepalive mechanisms to mitigate these problems. With the official response obtained, the problem was considered fully solved. This specific case illustrates how online problems can be highly complex, requiring the capture of critical information — in this instance, packet capture data — to understand the situation as it occurred. Through logical reasoning and the application of reductio ad absurdum, the root cause was identified.
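As a practical footnote to the keepalive recommendation above, the sketch below shows how a client can enable TCP keepalive at the socket level so that a connection silently reset by an intermediate device (such as a load balancer) is detected instead of waiting forever. The host, port, and timing values are illustrative assumptions, and the TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT options are Linux-specific; most database drivers and middleware expose equivalent settings.

Python
import socket

def open_keepalive_connection(host: str, port: int) -> socket.socket:
    """Open a TCP connection with keepalive probes enabled (Linux socket options)."""
    sock = socket.create_connection((host, port))
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)    # turn keepalive on
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)  # start probing after 60s idle
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30) # probe every 30s
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)    # give up after 5 failed probes
    return sock

# Hypothetical middleware endpoint; replace with the real host and port.
conn = open_keepalive_connection("middleware.example.com", 3306)

The idle interval should be kept well below the intermediary's idle timeout (350 seconds in the case above) so the session is never considered dead while a long-running query is still in flight.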
In web development, optimizing and scaling applications have always been an issue. React.js has had extraordinary success as a front-end development tool, providing a robust way to create user interfaces. But things get complicated as applications grow, especially when they rely on multiple REST API endpoints. Concerns such as overfetching, where the client retrieves more data than it actually needs, can cause performance bottlenecks and a poor user experience. One solution to these challenges is to use GraphQL with React applications. If your backend has multiple REST endpoints, introducing a GraphQL layer that internally calls your REST API endpoints can protect your application from overfetching and streamline your frontend. In this article, you will learn how to use this approach, its advantages and disadvantages, the challenges involved, and how to address them. We will also dive deeper into some practical examples of how GraphQL can help you improve the way you work with your data.

Overfetching in REST APIs

In REST APIs, overfetching occurs when the amount of data that the API delivers to the client is more than what the client requires. This is a common problem with REST APIs, which often return a fixed object or response schema. To better understand this problem, let us consider an example. Consider a user profile page that only needs to show the user’s name and email. With a typical REST API, fetching the user data might look like this:

JavaScript
fetch('/api/users/1')
  .then(response => response.json())
  .then(user => {
    // Use the user's name and email in the UI
  });

The API response will include unnecessary data:

JSON
{
  "id": 1,
  "name": "John Doe",
  "profilePicture": "/images/john.jpg",
  "email": "john@example.com",
  "address": "123 Denver St",
  "phone": "111-555-1234",
  "preferences": {
    "newsletter": true,
    "notifications": true
  },
  // ...more details
}

Although the application only requires the name and email fields of the user, the API returns the whole user object. This additional data often increases the payload size, takes more bandwidth, and can eventually slow down the application when used on a device with limited resources or a slow network connection.

GraphQL as a Solution

GraphQL addresses the overfetching problem by allowing clients to request exactly the data they need. By integrating a GraphQL server into your application, you can create a flexible and efficient data-fetching layer that communicates with your existing REST APIs.

How It Works

1. GraphQL Server Setup: You introduce a GraphQL server that serves as an intermediary between your React frontend and the REST APIs.
2. Schema Definition: You define a GraphQL schema that specifies the data types and queries your frontend requires.
3. Resolvers Implementation: You implement resolvers in the GraphQL server that fetch data from the REST APIs and return only the necessary fields.
4. Front-End Integration: You update your React application to use GraphQL queries instead of direct REST API calls.

This approach allows you to optimize data fetching without overhauling your existing backend infrastructure.

Implementing GraphQL in a React Application

Let’s look at how to set up a GraphQL server and integrate it into a React application.

Install Dependencies

PowerShell
npm install apollo-server graphql axios

Define the Schema

Create a file called schema.js:

JavaScript
const { gql } = require('apollo-server');

const typeDefs = gql`
  type User {
    id: ID!
    name: String
    email: String # ensure this matches the fields requested by the frontend query
  }

  type Query {
    user(id: ID!): User
  }
`;

module.exports = typeDefs;

This schema defines a User type and a user query that fetches a user by ID.

Implement Resolvers

Create a file called resolvers.js:

JavaScript
const resolvers = {
  Query: {
    user: async (_, { id }) => {
      try {
        const response = await fetch(`https://jsonplaceholder.typicode.com/users/${id}`);
        const user = await response.json();
        return {
          id: user.id,
          name: user.name,
          email: user.email, // Return email instead of profilePicture
        };
      } catch (error) {
        throw new Error(`Failed to fetch user: ${error.message}`);
      }
    },
  },
};

module.exports = resolvers;

The resolver for the user query fetches data from the REST API and returns only the required fields. We will use https://jsonplaceholder.typicode.com/ for our fake REST API.

Set Up the Server

Create a server.js file:

JavaScript
const { ApolloServer } = require('apollo-server');
const typeDefs = require('./schema');
const resolvers = require('./resolvers');

const server = new ApolloServer({
  typeDefs,
  resolvers,
});

server.listen({ port: 4000 }).then(({ url }) => {
  console.log(`GraphQL Server ready at ${url}`);
});

Start the server:

PowerShell
node server.js

Your GraphQL server is live at http://localhost:4000/graphql, and if you query your server, it will take you to this page.

Integrating With the React Application

We will now change the React application to use the GraphQL API.

Install Apollo Client

PowerShell
npm install @apollo/client graphql

Configure Apollo Client

JavaScript
import { ApolloClient, InMemoryCache } from '@apollo/client';

const client = new ApolloClient({
  uri: 'http://localhost:4000',
  cache: new InMemoryCache(),
});

Write the GraphQL Query

JavaScript
const GET_USER = gql`
  query GetUser($id: ID!) {
    user(id: $id) {
      id
      name
      email
    }
  }
`;

Now, integrate the above pieces of code with your React app. Here is a simple React app below, which lets a user select the userId and displays the information:

JavaScript
import { useState } from 'react';
import { ApolloClient, InMemoryCache, ApolloProvider, gql, useQuery } from '@apollo/client';
import './App.css'; // Link to the updated CSS

const client = new ApolloClient({
  uri: 'http://localhost:4000', // Ensure this is the correct URL for your GraphQL server
  cache: new InMemoryCache(),
});

const GET_USER = gql`
  query GetUser($id: ID!) {
    user(id: $id) {
      id
      name
      email
    }
  }
`;

const User = ({ userId }) => {
  const { loading, error, data } = useQuery(GET_USER, {
    variables: { id: userId },
  });

  if (loading) return <p>Loading...</p>;
  if (error) return <p>Error: {error.message}</p>;

  return (
    <div className="user-container">
      <h2>{data.user.name}</h2>
      <p>Email: {data.user.email}</p>
    </div>
  );
};

const App = () => {
  const [selectedUserId, setSelectedUserId] = useState("1");

  return (
    <ApolloProvider client={client}>
      <div className="app-container">
        <h1 className="title">GraphQL User Lookup</h1>
        <div className="dropdown-container">
          <label htmlFor="userSelect">Select User ID:</label>
          <select
            id="userSelect"
            value={selectedUserId}
            onChange={(e) => setSelectedUserId(e.target.value)}
          >
            {Array.from({ length: 10 }, (_, index) => (
              <option key={index + 1} value={index + 1}>
                {index + 1}
              </option>
            ))}
          </select>
        </div>
        <User userId={selectedUserId} />
      </div>
    </ApolloProvider>
  );
};

export default App;

Result

You will see simple user details like this: [Github Link].
Working With Multiple Endpoints Imagine a scenario where you need to retrieve a specific user’s posts, along with the individual comments on each post. Instead of making three separate API calls from your frontend React app and dealing with unnecessary data, you can streamline the process with GraphQL. By defining a schema and crafting a GraphQL query, you can request only the exact data your UI requires, which is an efficient request all in one. We need to fetch user data, their posts, and comments for each post from the different endpoints. We’ll use fetch to gather data from the multiple endpoints and return it via GraphQL. Update Resolvers JavaScript const fetch = require('node-fetch'); const resolvers = { Query: { user: async (_, { id }) => { try { // fetch user const userResponse = await fetch(`https://jsonplaceholder.typicode.com/users/${id}`); const user = await userResponse.json(); // fetch posts for a user const postsResponse = await fetch(`https://jsonplaceholder.typicode.com/posts?userId=${id}`); const posts = await postsResponse.json(); // fetch comments for a post const postsWithComments = await Promise.all( posts.map(async (post) => { const commentsResponse = await fetch(`https://jsonplaceholder.typicode.com/comments?postId=${post.id}`); const comments = await commentsResponse.json(); return { ...post, comments }; }) ); return { id: user.id, name: user.name, email: user.email, posts: postsWithComments, }; } catch (error) { throw new Error(`Failed to fetch user data: ${error.message}`); } }, }, }; module.exports = resolvers; Update GraphQL Schema JavaScript const { gql } = require('apollo-server'); const typeDefs = gql` type Comment { id: ID! name: String email: String body: String } type Post { id: ID! title: String body: String comments: [Comment] } type User { id: ID! name: String email: String posts: [Post] } type Query { user(id: ID!): User } `; module.exports = typeDefs; Server setup in server.js remains same. Once we update the React.js code, we get the below output: Result You will see a detailed user like this: [Github Link]. Benefits of This Approach Integrating GraphQL into your React application provides several advantages: Eliminating Overfetching A key feature of GraphQL is that it only fetches exactly what you request. The server only returns the requested fields and ensures that the amount of data transferred over the network is reduced by serving only what the query demands, thus improving performance. Simplifying Front-End Code GraphQL enables you to get the needed information in a single query regardless of its origin. Internally, it could be making 3 API calls to get the information. This helps to simplify your frontend code because now you don’t need to orchestrate different async requests and combine their results. Improving Developer’s Experience A strong typing and schema introspection offer better tooling and error checking than in the traditional API implementation. Furthermore, there are interactive environments where developers can build and test queries, including GraphiQL or Apollo Explorer. Addressing Complexities and Challenges This approach has some advantages but also introduces some challenges that must be managed. Additional Backend Layer The introduction of the GraphQL server creates an extra layer in your backend architecture, and if it is not managed properly, it becomes a single point of failure. Solution Pay attention to error handling and monitoring. 
Containerization and orchestration tools like Docker and Kubernetes can help manage scalability and reliability. Potential Performance Overhead The GraphQL server may make multiple REST API calls to resolve a single query, which can introduce latency and overhead to the system. Solution Cache the results to avoid making several calls to the API. Some tools, such as DataLoader, can handle the process of batching and caching requests. Conclusion "Simplicity is the ultimate sophistication" — Leonardo da Vinci Integrating GraphQL into your React application is more than just a performance optimization — it’s a strategic move toward building more maintainable, scalable, and efficient applications. By addressing overfetching and simplifying data management, you not only enhance the user experience but also empower your development team with better tools and practices. While the introduction of a GraphQL layer comes with its own set of challenges, the benefits often outweigh the complexities. By carefully planning your implementation, optimizing your resolvers, and securing your endpoints, you can mitigate potential drawbacks. Moreover, the flexibility that GraphQL offers can future-proof your application as it grows and evolves. Embracing GraphQL doesn’t mean abandoning your existing REST APIs. Instead, it allows you to leverage their strengths while providing a more efficient and flexible data access layer for your front-end applications. This hybrid approach combines the reliability of REST with the agility of GraphQL, giving you the best of both worlds. If you’re ready to take your React application to the next level, consider integrating GraphQL into your data fetching strategy. The journey might present challenges, but the rewards — a smoother development process, happier developers, and satisfied users — make it a worthwhile endeavor. Full Code Available You can find the full code for this implementation on my GitHub repository.
API testing has gained a lot of momentum these days. As UI is not involved, it is a lot easier and quicker to test. This is the reason why API testing is considered the first choice for performing end-to-end testing of the system. Integrating automated API tests with the CI/CD pipelines allows teams to get faster feedback on the builds. In this blog, we'll discuss and learn about DELETE API requests and how to handle them using Playwright Java for automation testing, covering the following points:

- What is a DELETE request?
- How do you test DELETE APIs using Playwright Java?

Getting Started

It is recommended that you check out the earlier tutorial blog to learn about the details related to prerequisites, setup, and configuration.

Application Under Test

We will be using the free-to-use RESTful e-commerce APIs that offer multiple APIs related to order management functionality, allowing us to create, retrieve, update, and delete orders. This application can be set up locally using Docker or NodeJS.

What Is a DELETE Request?

A DELETE API request deletes the specified resource from the server. Generally, there is no response body in DELETE requests. The resource is specified by a URI, and the server permanently deletes it. DELETE requests are not considered safe, as they cause side effects on the server, like removing data from a database; per the HTTP specification they are idempotent, since repeating the same request leaves the server in the same state, even though the responses may differ. The following are some of the characteristics and limitations of DELETE requests:

- The data deleted using a DELETE request is not reversible, so it should be handled carefully.
- It is not considered a safe method, as it directly deletes the resource from the database, which can cause conflicts in the system.
- Although it is idempotent in terms of server state, calling it multiple times for the same resource may return different status codes. For example, the first DELETE call will return Status Code 204, stating that the resource has been deleted, and if DELETE is called again on the same resource, it may return 404 NOT FOUND as the given resource has already been deleted.

The following is an example of the DELETE API endpoint from the RESTful e-commerce project.

DELETE /deleteOrder/{id} : Deletes an Order By ID

This API requires the order_id to be supplied as a Path Parameter in order to delete the respective order from the system. There is no request body required in this DELETE API request. However, as a security measure, a token is required to be provided as a header to delete the order. Once the API is executed, it deletes the specified order from the system and returns Status Code 204. In case the order is not found, or the token is invalid or missing, it returns one of the following responses:

Status Code | Description
400 | Failed to authenticate the token
404 | No order with the given order_id is found in the system
403 | Token is missing in the request

How to Test DELETE APIs Using Playwright Java

Testing DELETE APIs is an important step in ensuring the stability and reliability of the application. Correct implementation of the DELETE APIs is essential to guard against unintended data loss and inconsistencies, as the DELETE APIs are in charge of removing resources from the system. In this demonstration of testing DELETE APIs using Playwright Java, we'll be using the /deleteOrder/{id} endpoint for deleting an existing order from the system.
Test Scenario 1: Delete a Valid Order

1. Start the RESTful e-commerce service.
2. Using a POST request, create some orders in the system.
3. Delete the order with order_id “1” using a DELETE request.
4. Check that the Status Code 204 is returned in the response.

Test Implementation

The following steps are required to implement the test scenario:

1. Add new orders using the POST request.
2. Hit the /auth API to generate a token.
3. Hit the /deleteOrder/ API endpoint with the token and the order_id to delete the order.
4. Check that the Status Code 204 is returned in the response.

A new test method, testShouldDeleteTheOrder(), is created in the existing test class HappyPathTests. This test method implements the above steps to test the DELETE API.

Java
@Test
public void testShouldDeleteTheOrder() {
    final APIResponse authResponse = this.request.post("/auth", RequestOptions.create().setData(getCredentials()));
    final JSONObject authResponseObject = new JSONObject(authResponse.text());
    final String token = authResponseObject.get("token").toString();

    final int orderId = 1;
    final APIResponse response = this.request.delete("/deleteOrder/" + orderId, RequestOptions.create()
        .setHeader("Authorization", token));
    assertEquals(response.status(), 204);
}

The POST /auth API endpoint will be hit first to generate the token. The token received in the response is stored in the token variable to be used further in the DELETE API request. Next, new orders will be generated using the testShouldCreateNewOrders() method, which was already discussed in the previous tutorial, where we talked about testing POST requests using Playwright Java. After the orders are generated, the next step is to hit the DELETE request with a valid order_id that will delete the specific order. We'll be deleting the order with the order_id “1” using the delete() method provided by the Playwright framework. After the order is deleted, the Status Code 204 is returned in the response. An assertion will be performed on the Status Code to verify that the delete action was successful. Since no response body is returned, this is the only thing that can be verified.

Test Execution

We'll be creating a new testng.xml named testng-restfulecommerce-deleteorders.xml to execute the tests in the order of the steps that we discussed in the test implementation.

XML
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE suite SYSTEM "http://testng.org/testng-1.0.dtd">
<suite name="Restful ECommerce Test Suite">
    <test name="Testing Happy Path Scenarios of Creating and Updating Orders">
        <classes>
            <class name="io.github.mfaisalkhatri.api.restfulecommerce.HappyPathTests">
                <methods>
                    <include name="testShouldCreateNewOrders"/>
                    <include name="testShouldDeleteTheOrder"/>
                </methods>
            </class>
        </classes>
    </test>
</suite>

First, the testShouldCreateNewOrders() test method will be executed, and it will create new orders. Next, the testShouldDeleteTheOrder() test method will be executed to test the delete order API. The following screenshot of the test execution performed using IntelliJ IDE shows that the tests were executed successfully. Now, let’s verify that the order was correctly deleted by writing a new test that will call the GET /getOrder API endpoint with the deleted order_id.

Test Scenario 2: Retrieve the Deleted Order

1. Delete a valid order with order_id “1.”
2. Using the GET /getOrder API, try retrieving the order with order_id “1.”
3. Check that the Status Code 404 is returned with the message “No Order found with the given parameters!” in the response.
Test Implementation

Let’s create a new test method, testShouldNotRetrieveDeletedOrder(), in the existing class HappyPathTests.

Java
@Test
public void testShouldNotRetrieveDeletedOrder() {
    final int orderId = 1;
    final APIResponse response = this.request.get("/getOrder", RequestOptions.create().setQueryParam("id", orderId));
    assertEquals(response.status(), 404);

    final JSONObject jsonObject = new JSONObject(response.text());
    assertEquals(jsonObject.get("message"), "No Order found with the given parameters!");
}

The test implementation of this scenario is pretty simple. We execute the GET /getOrder API to fetch the deleted order with order_id “1.” An assertion is then applied to verify that the GET API returns the Status Code 404 in the response with the message “No Order found with the given parameters!” This test ensures that the delete order API worked fine and the order was deleted from the system.

Test Execution

Let’s update the testng.xml file and add this test scenario at the end, after the delete test.

XML
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE suite SYSTEM "http://testng.org/testng-1.0.dtd">
<suite name="Restful ECommerce Test Suite">
    <test name="Testing Happy Path Scenarios of Creating and Updating Orders">
        <classes>
            <class name="io.github.mfaisalkhatri.api.restfulecommerce.HappyPathTests">
                <methods>
                    <include name="testShouldCreateNewOrders"/>
                    <include name="testShouldDeleteTheOrder"/>
                    <include name="testShouldNotRetrieveDeletedOrder"/>
                </methods>
            </class>
        </classes>
    </test>
</suite>

Now, all three tests should run in sequence. The first one will create orders; the second one will delete the order with order_id “1”; and the last test will hit the GET API to fetch the order with order_id “1,” returning Status Code 404. The screenshot above shows that all three tests were executed successfully, and the DELETE API worked fine as expected.

Summary

DELETE API requests allow the deletion of a resource from the system. As delete is an important CRUD function, it is important to test it and verify that the system is working as expected. However, it should be noted that DELETE is an irreversible process, so it should always be used with caution. In my experience, it is a good approach to hit the GET API after executing the DELETE request to check that the specified resource was deleted from the system successfully. Happy testing!
We’re all familiar with the principles of DevOps: building small, well-tested increments, deploying frequently, and automating pipelines to eliminate the need for manual steps. We monitor our applications closely, set up alerts, roll back problematic changes, and receive notifications when issues arise. However, when it comes to databases, we often lack the same level of control and visibility. Debugging performance issues can be challenging, and we might struggle to understand why databases slow down. Schema migrations and modifications can spiral out of control, leading to significant challenges. Overcoming these obstacles requires strategies that streamline schema migration and adaptation, enabling efficient database structure changes with minimal downtime or performance impact. It’s essential to test all changes cohesively throughout the pipeline. Let’s explore how this can be achieved. Automate Your Tests Databases are prone to many types of failures, yet they often don’t receive the same rigorous testing as applications. While developers typically test whether applications can read and write the correct data, they often overlook how this is achieved. Key aspects like ensuring the proper use of indexes, avoiding unnecessary lazy loading, or verifying query efficiency often go unchecked. For example, we focus on how many rows the database returns but neglect to analyze how many rows it had to read. Similarly, rollback procedures are rarely tested, leaving us vulnerable to potential data loss with every change. To address these gaps, we need comprehensive automated tests that detect issues proactively, minimizing the need for manual intervention. We often rely on load tests to identify performance issues, and while they can reveal whether our queries are fast enough for production, they come with significant drawbacks. First, load tests are expensive to build and maintain, requiring careful handling of GDPR compliance, data anonymization, and stateful applications. Moreover, they occur too late in the development pipeline. When load tests uncover issues, the changes are already implemented, reviewed, and merged, forcing us to go back to the drawing board and potentially start over. Finally, load tests are time-consuming, often requiring hours to fill caches and validate application reliability, making them less practical for catching issues early. Schema migrations often fall outside the scope of our tests. Typically, we only run test suites after migrations are completed, meaning we don’t evaluate how long they took, whether they triggered table rewrites, or whether they caused performance bottlenecks. These issues often go unnoticed during testing and only become apparent when deployed to production. Another challenge is that we test with databases that are too small to uncover performance problems early. This reliance on inadequate testing can lead to wasted time on load tests and leaves critical aspects, like schema migrations, entirely untested. This lack of coverage reduces our development velocity, introduces application-breaking issues, and hinders agility. The solution to these challenges lies in implementing database guardrails. Database guardrails evaluate queries, schema migrations, configurations, and database designs as we write code. Instead of relying on pipeline runs or lengthy load tests, these checks can be performed directly in the IDE or developer environment. 
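As an illustration of the kind of check that can run next to ordinary unit tests, the sketch below inspects a query plan and fails when an expected index is not used. It uses SQLite's EXPLAIN QUERY PLAN purely to keep the example self-contained; the table, index, and query are assumptions, and with PostgreSQL or MySQL the same idea would run EXPLAIN against a projection of the production schema.

Python
import sqlite3

def plan_uses_index(conn: sqlite3.Connection, query: str, index_name: str) -> bool:
    """Return True if the query plan reports a search using the given index."""
    plan_rows = conn.execute(f"EXPLAIN QUERY PLAN {query}").fetchall()
    details = " ".join(str(row[3]) for row in plan_rows)  # the last column holds the plan text
    return index_name in details

def test_orders_lookup_uses_index():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
    conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
    query = "SELECT id, total FROM orders WHERE customer_id = 42"
    # Fail fast in the developer environment if the query would fall back to a full table scan.
    assert plan_uses_index(conn, query, "idx_orders_customer"), "query does not use idx_orders_customer"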
By leveraging observability and projections of the production database, guardrails assess execution plans, statistics, and configurations, ensuring everything will function smoothly post-deployment.

Build Observability Around Databases

When we deploy to production, system dynamics can change over time. CPU load may spike, memory usage might grow, data volumes could expand, and data distribution patterns may shift. Identifying these issues quickly is essential, but it's not enough. Current monitoring tools overwhelm us with raw signals, leaving us to piece together the reasoning. For example, they might indicate an increase in CPU load but fail to explain why it happened. The burden of investigating and identifying root causes falls entirely on us. This approach is outdated and inefficient.

To truly move fast, we need to shift from traditional monitoring to full observability. Instead of being inundated with raw data, we need actionable insights that help us understand the root cause of issues. Database guardrails offer this transformation. They connect the dots, showing how various factors interrelate, pinpointing the problem, and suggesting solutions. Instead of simply observing a spike in CPU usage, guardrails help us understand that a recent deployment altered a query, causing an index to be bypassed, which led to the increased CPU load. With this clarity, we can act decisively, fixing the query or index to resolve the issue. This shift from "seeing" to "understanding" is key to maintaining speed and reliability.

The next evolution in database management is transitioning from automated issue investigation to automated resolution. Many problems can be fixed automatically with well-integrated systems. Observability tools can analyze performance and reliability issues and generate the necessary code or configuration changes to resolve them. These fixes can either be applied automatically or require explicit approval, ensuring that issues are addressed immediately with minimal effort on your part.

Beyond fixing problems quickly, the ultimate goal is to prevent issues from occurring in the first place. Frequent rollbacks or failures hinder progress and agility. True agility is achieved not by rapidly resolving issues but by designing systems where issues rarely arise. While this vision may require incremental steps to reach, it represents the ultimate direction for innovation.

Metis empowers you to overcome these challenges. It evaluates your changes before they're even committed to the repository, analyzing queries, schema migrations, execution plans, performance, and correctness throughout your pipelines. Metis integrates seamlessly with CI/CD workflows, preventing flawed changes from reaching production. But it goes further, offering deep observability into your production database by analyzing metrics and tracking deployments, extensions, and configurations. It automatically fixes issues when possible and alerts you when manual intervention is required. With Metis, you can move faster and automate every aspect of your CI/CD pipeline, ensuring smoother and more reliable database management.

Everyone Needs to Participate

Database observability is about proactively preventing issues, advancing toward automated understanding and resolution, and incorporating database-specific checks throughout the development process. Relying on outdated tools and workflows is no longer sufficient; we need modern solutions that adapt to today's complexities. Database guardrails provide this support.
They help developers avoid creating inefficient code, analyze schemas and configurations, and validate every step of the software development lifecycle within our pipelines. Guardrails also transform raw monitoring data into actionable insights, explaining not just what went wrong but how to fix it. This capability is essential across all industries, as the complexity of systems will only continue to grow. To stay ahead, we must embrace innovative tools and processes that enable us to move faster and more efficiently.
Organizations adopting Infrastructure as Code (IaC) on AWS often struggle with ensuring that their infrastructure is not only correctly provisioned but also functioning as intended once deployed. Even minor misconfigurations can lead to costly downtime, security vulnerabilities, or performance issues. Traditional testing methods, such as manually inspecting resources or relying solely on static code analysis, do not provide sufficient confidence for production environments. There is a pressing need for an automated, reliable way to validate AWS infrastructure changes before they go live.

Solution

Terratest provides an automated testing framework written in Go, designed specifically to test infrastructure code in real-world cloud environments like AWS. By programmatically deploying, verifying, and destroying resources, Terratest bridges the gap between writing IaC (e.g., Terraform) and confidently shipping changes. The guide below shows how to test AWS infrastructure using Terratest with Terraform, along with sample code snippets in Go. This workflow will help you provision AWS resources, run tests against them to ensure they work as intended, and then tear everything down automatically.

Prerequisites

1. Install Terraform: Download and install Terraform from the official site.
2. Install Go: Terratest is written in Go, so you'll need Go installed. Download Go from the official site.
3. Set up AWS credentials: Ensure your AWS credentials are configured (e.g., via ~/.aws/credentials or environment variables like AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY).
4. Initialize a Go module: In your project directory, run:

Shell
go mod init github.com/yourusername/yourproject
go mod tidy

5. Add Terratest to your go.mod: In your project/repo directory, run:

Shell
go get github.com/gruntwork-io/terratest/modules/terraform
go get github.com/stretchr/testify/assert

Sample Terraform Configuration

Create a simple Terraform configuration that launches an AWS EC2 instance. Put the following files in a directory named aws_ec2_example (or any name you prefer). Save the first file as main.tf:

HCL
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.0"
    }
  }
  required_version = ">= 1.3.0"
}

provider "aws" {
  region = var.aws_region
}

resource "aws_instance" "example" {
  ami           = var.ami_id
  instance_type = "t2.micro"

  tags = {
    Name = "Terratest-Example"
  }
}

output "instance_id" {
  value = aws_instance.example.id
}

Next, variables.tf:

HCL
variable "aws_region" {
  type    = string
  default = "us-east-1"
}

variable "ami_id" {
  type    = string
  default = "ami-0c55b159cbfafe1f0" # Example Amazon Linux AMI (update as needed)
}

Terratest Code Snippet

Create a Go test file in a directory named test (or you can name it anything, but test is conventional).
For example, aws_ec2_test.go:

Go
package test

import (
    "testing"

    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestAwsEC2Instance(t *testing.T) {
    // Define Terraform options to point to the Terraform folder
    terraformOptions := &terraform.Options{
        TerraformDir: "../aws_ec2_example",

        // Optional: pass variables if you want to override defaults
        Vars: map[string]interface{}{
            "aws_region": "us-east-1",
            "ami_id":     "ami-0c55b159cbfafe1f0",
        },
    }

    // At the end of the test, destroy the resources
    defer terraform.Destroy(t, terraformOptions)

    // Init and apply the Terraform configuration
    terraform.InitAndApply(t, terraformOptions)

    // Fetch the output variable
    instanceID := terraform.Output(t, terraformOptions, "instance_id")

    // Run a simple assertion to ensure the instance ID is not empty
    assert.NotEmpty(t, instanceID, "Instance ID should not be empty")
}

What This Test Does

- Initializes and applies the Terraform configuration in ../aws_ec2_example.
- Deploys an EC2 instance with the specified AMI in us-east-1.
- Captures the instance_id Terraform output.
- Verifies that the instance ID is not empty using Testify's assert library (an optional variant that also checks the instance's tags is sketched after the conclusion).
- Destroys the resources at the end of the test to avoid incurring ongoing costs.

Running the Tests

Navigate to the directory containing your Go test file (e.g., the test directory) and run the following command:

Shell
go test -v

Observe the output: you'll see Terraform initializing and applying your AWS infrastructure. After the test assertions pass, Terraform will destroy the resources.

Conclusion

By following these steps, you can integrate Terratest into your AWS IaC workflow to:

- Provision AWS resources using Terraform.
- Test them programmatically with Go-based tests.
- Validate that your infrastructure is configured properly and functioning as expected.
- Tear down automatically, ensuring that you're not incurring unnecessary AWS costs and maintaining a clean environment for repeated test runs.
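If you want the test to verify more than the presence of an output, Terratest also ships helper modules for querying deployed resources directly. The following sketch is a variant of the test above, written under the assumption that the github.com/gruntwork-io/terratest/modules/aws package and its GetTagsForEc2Instance helper are available; it checks that the Name tag declared in main.tf actually ended up on the running instance.

Go
package test

import (
    "testing"

    "github.com/gruntwork-io/terratest/modules/aws" // assumed Terratest AWS helper module
    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestAwsEC2InstanceTags(t *testing.T) {
    awsRegion := "us-east-1"

    terraformOptions := &terraform.Options{
        TerraformDir: "../aws_ec2_example",
        Vars: map[string]interface{}{
            "aws_region": awsRegion,
        },
    }

    // Destroy the resources when the test finishes.
    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)

    instanceID := terraform.Output(t, terraformOptions, "instance_id")

    // Read the tags from the real EC2 instance and compare them with
    // what the Terraform configuration was supposed to apply.
    tags := aws.GetTagsForEc2Instance(t, awsRegion, instanceID)
    assert.Equal(t, "Terratest-Example", tags["Name"], "Name tag should match the Terraform configuration")
}

Because every run creates and destroys real AWS resources, it is worth keeping such assertions small and fast; everything is still cleaned up through the deferred terraform.Destroy call.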
The primary inspiration for this article was my feeling that, unfortunately, IT and Cyber too often work in silos, with security constraints poorly addressed or insufficiently shared. It was also inspired by meetings with people working in Cyber, who may recognize themselves here. Indeed, on the IT side, API security is often perceived as a subject that is covered as long as authentication and rights are properly managed and an API gateway is used. Of course, this is necessary. But thinking about API security, in terms of everything it involves, means thinking about a large part of the security of your IT. As I do not come from the world of Cyber, the only aim of this article is to try to bring these two worlds together by addressing all the aspects that API security can cover. This article is also an invitation to get closer to your Cyber teams, and to provide as concise a shopping list as possible (but long enough!) for exchanges between IT and Cyber teams. Hence the very concise format chosen for this article.

To do this, we will first explain the risks I have identified, then look at securing APIs across their entire value chain, from DevSecOps to API WAAP (WAAP stands for Web Application and API Protection, a kind of Web Application Firewall for APIs). We then offer a panorama of technologies and finish with some recommendations. On that note, let's get going!

Why Is API Security Crucial?

- The data exposed is very often sensitive: APIs often return confidential data, making it essential to protect it.
- It is a preferred attack vector: As a single point of entry to data, APIs are prime targets for attack.
- Their complexity is increasing: The evolution of architectures (microservices, cloud, service mesh, etc.) can increase the potential attack surface.
- APIs must comply with the regulatory framework: GDPR, PCI DSS, PSD2, etc. are all regulations that require APIs to be securely exposed.

Does that only happen to other people? Well, it doesn't.

- 2019, Facebook: Data leak involving 540 million users due to unsecured servers accessible via APIs
- 2018, Twitter: Poor management of access authorizations made some users' private messages available.

Now that we've covered the issues, let's look at the risks and solutions.

The Major Risks Associated With API Security

Common API Vulnerabilities

Code Injection

Code injection is one of the best-known threats. It includes:

- SQL injection, for example
- Command injection, with the not-so-distant Log4j flaw as an example

Inadequate Authentication and Authorization

It is vital to have a properly applied authentication and authorization policy in order to block attackers as effectively as possible. The main principles are as follows:

- Sessions must be well managed: Unexpired or incorrectly revoked sessions are a risk.
- Access tokens must be properly secured: Insecure storage or transmission of tokens must be avoided.
- Access controls must be properly configured: Incorrectly configured permissions allow unauthorized access.

Exposure of Sensitive Data

APIs can inadvertently expose sensitive or unnecessary data if they are not properly defined, configured, or secured.
Typical cases are:

- API responses are too verbose: Inclusion of unnecessary data in responses
- API responses are not encrypted: Unencrypted data transmission
- Errors are poorly managed: Error messages reveal sensitive information about the infrastructure

Emerging and Sophisticated Threats

Brute Force Attacks and Credential Stuffing

This is a well-known strategy involving testing combinations of usernames and passwords. These attacks are as easy to carry out as they are dangerous because:

- They can be automated on a large scale.
- They can exploit credentials from data leaks (avoid reusing a single password).

Man-in-the-Middle (MITM) Attacks

An MITM attack consists of the attacker placing themselves between the client and the API gateway to intercept or modify exchanges. Risks include:

- Theft of sensitive data: By intercepting unencrypted data
- Manipulation of queries: By altering data exchanged between the client and the server
- Identity theft: By recovering server certificates to impersonate the legitimate server

DDoS Attacks

These attacks consist of making a very large number of calls in order to make the API unavailable. They can take several forms:

- Volumetric attacks: Saturating the bandwidth
- Application-level attacks: Using API vulnerabilities to drain server resources
- Slow attacks: Keeping connections open to drain server resources

Risks Specific to Modern Architectures

Microservices and Containerization

Containerization and microservices add new security challenges:

- Exponential complexity of access management: Security requirements have to be managed per microservice.
- Container orchestration risks: Orchestration tools can also have their own vulnerabilities.
- Increased exposure of internal APIs: Internal APIs must absolutely not be exposed externally!

APIs in the Cloud

Deploying APIs in cloud environments presents specific risks:

- Incorrect configuration of cloud services: Inadvertent exposure of APIs or data
- Complex identity and access management: You need to integrate the security mechanisms of your cloud provider with those of your API.
- Dependence on the cloud provider's security: You need to understand and complement the security measures, depending on your cloud provider's policy.

Shadow APIs and Zombie APIs

Shadow APIs (undocumented or unmanaged) and zombie APIs (obsolete but still active) represent significant risks:

- Lack of visibility: This makes it difficult to identify and secure these APIs.
- Unpatched vulnerabilities: Obsolete APIs may contain known security flaws.
- Uncontrolled access: There is a risk of attackers exploiting sensitive systems or data.

Strategies and Solutions for Effectively Securing APIs

Global Approach to API Security

Securing the API via DevSecOps

The DevSecOps approach makes it possible to secure an API upstream of its deployment, via:

- Shift-left security: Incorporating security tests from the beginning
- Automated security testing: Using static (SAST) and dynamic (DAST) code analysis tools
- Ongoing management of vulnerabilities: Code, libraries, dependencies, and so on can all fail or contain flaws discovered too late; they must therefore be detected and corrected.

API Governance and Security Policies

What would we be without governance?
This is obviously an essential point, and we will be particularly vigilant about the following aspects:

- Definition of security standards: Via best-practice documents for the development and deployment of secure APIs
- Identity and Access Management (IAM): To cover the definition of strict policies for authentication and authorization
- Regular audits: To continually assess API compliance with security policies

Team Training and Awareness

API security relies to a large extent on the skills and vigilance of all teams, whether they be DevOps, Cyber, or Dev:

- Training programs: Dedicated sessions on API security practices
- Practical exercises: Simulated attacks and incident responses
- A culture of security: Encouraging the reporting and resolution of security problems

API Security Technologies and Tools

API Gateways and Web Application and API Protection (WAAP)

API gateways (and their service mesh and micro-gateway cousins) and WAAPs (WAFs for APIs, if you prefer) represent the first line of defense:

- Filtering traffic: Blocking malicious requests
- Managing authentication: Centralizing and strengthening authentication mechanisms
- Rate limiting: Protecting against DDoS attacks
- Analyzing traffic: Detecting and blocking suspicious behavior

API Management and Protection Solutions

There are other specialized tools with advanced API security features:

- Automatic API discovery solutions: To detect the notorious shadow APIs
- Behavioral analysis solutions: To detect anomalies and suspicious behavior
- Version management solutions: To control and secure different API versions
- Regulatory compliance solutions: To demonstrate compliance with security regulations

API Security Analysis Tools

There are also dedicated tools for identifying specific API vulnerabilities:

- API-specific vulnerability scanners: As the name suggests
- API fuzzing solutions: Fuzzing is a testing technique that sends random and/or malformed data to identify vulnerabilities.
- Static and dynamic analysis tools: SAST and DAST are available for APIs.

Best Practices for Securing APIs

Robust Authentication and Authorization

- Use standard protocols: OAuth 2.0, OpenID Connect
- Fine-tune the management of authorizations: Implement the principle of least privilege, using API scopes.
- Rotate keys and tokens regularly: To limit the impact in the event of a compromise

Centralization and Segmentation of API Gateways

An API gateway should ideally be placed centrally in the architecture so as not to multiply the entry points.
However, you can have two API gateways — one "public" and one "private" — to mitigate the risks as much as possible:

- Single point of entry: Centralization of API traffic for greater visibility and control
- Version management: Easier management of different API versions
- Transformation and mediation: Adapting requests and responses to ensure compatibility and security

Encryption and Data Protection

- Encryption in transit: Systematically use TLS (in a non-deprecated version); possibly also sign the data exchanged, as Stripe does with its API, to guarantee authenticity, integrity, and non-repudiation.
- Encryption of sensitive data: Whether at rest or in transit
- Secure key management: Manage encryption keys throughout their lifecycle. This includes:
  - Using strong, randomly generated keys
  - Rotating keys regularly to limit the impact of potential breaches
  - Storing keys securely in a "vault," separate from the data they protect
  - Implementing access controls
  - Managing digital identities via certificates using a PKI, coupled with an HSM to secure cryptographic keys in a hardware environment

Log Management and Auditing

- Logging: Record all security-related events.
- Retention of logs: Keep logs long enough to allow retrospective analysis.
- Regular log analysis: Set up recurring log analysis processes.
- Protection of logs: Prevent any unauthorized modification to guarantee their integrity in the event of an audit.

Real-Time Monitoring

- Behavioral analysis: To detect anomalies in API traffic
- Real-time alerts: To react quickly to security incidents
- Continuous monitoring: To constantly monitor API availability and quickly detect attacks

Penetration Tests and Security Validation

- Regular testing
- Realistic scenarios based on real cases
- Continuous validation via the CI/CD chain

Conclusion

As we can see, API security requires the skills of various teams, but also a commitment from everyone. IT solutions exist, but they are nothing without a security policy shared by all and for all. Above all, it requires the establishment of best practices defined internally, as we have shared in this article. DevOps, Dev, Cyber — it's your turn!
Abhishek Gupta
Principal Developer Advocate,
AWS
Artem Ervits
Principal Sales Engineer,
Snowflake
Otavio Santana
Award-winning Software Engineer and Architect,
OS Expert
Denis Magda
Head of DevRel,
Yugabyte