Index maintenance is a critical part of database administration: it helps ensure the ongoing efficiency and performance of a SQL Server environment. Over time, as data is added, updated, and deleted, index fragmentation occurs, meaning the logical and physical ordering of index pages becomes misaligned. This fragmentation can lead to increased disk I/O, degraded query performance, and overall system inefficiency. Running index maintenance jobs, such as those provided by the Ola Hallengren SQL Server Maintenance Solution, allows DBAs to proactively address this fragmentation and keep indexes optimized.

By regularly monitoring fragmentation levels and executing maintenance operations such as index reorganizations and rebuilds, DBAs can keep their databases running at peak efficiency. This is especially important for large, mission-critical databases, where any degradation in performance can have a significant business impact. Maintaining optimal index health helps ensure fast, reliable data access, reduced resource consumption, and a better user experience. Consequently, implementing a well-designed index maintenance strategy is a crucial responsibility for any DBA managing a complex SQL Server environment.

Ola Hallengren's SQL Server Maintenance Solution

The SQL Server Maintenance Solution, developed by Ola Hallengren, is a widely adopted and trusted set of scripts used by database administrators worldwide. This comprehensive solution automates various maintenance tasks, including index optimization, database integrity checks, and statistics updates, and has become a de facto industry standard for proactive database maintenance. The IndexOptimize procedure in the Maintenance Solution provides extensive customization and configuration options to tailor the index maintenance process to specific environments and requirements. Many database administrators rely on these scripts as the foundation of their index management strategy because they offer a robust and efficient way to keep indexes in an optimal state. You can download the latest version of the SQL Server Maintenance Solution from Ola Hallengren's website. The scripts are released under the MIT License, allowing users to freely use, modify, and distribute them as needed.

Core IndexOptimize Parameters and Their Impact

The `IndexOptimize` stored procedure provides extensive customization through numerous parameters.
Understanding these is critical for effective implementation.

Essential Parameters

| Parameter | Description | Impact |
|---|---|---|
| `@Databases` | Target databases | Controls scope of operation |
| `@FragmentationLow` | Action for low fragmentation | Typically NULL (no action) |
| `@FragmentationMedium` | Action for medium fragmentation | Usually REORGANIZE |
| `@FragmentationHigh` | Action for high fragmentation | REBUILD or REORGANIZE |
| `@FragmentationLevel1` | Low/medium threshold (%) | Typically 5-15% |
| `@FragmentationLevel2` | Medium/high threshold (%) | Typically 30-40% |
| `@PageCountLevel` | Minimum index size to process | Excludes small indexes |
| `@SortInTempdb` | Use tempdb for sorting | Reduces production database I/O |
| `@MaxDOP` | Degree of parallelism | Controls CPU utilization |
| `@FillFactor` | Index fill factor | Controls free space in pages |
| `@PadIndex` | Apply fill factor to non-leaf levels | Affects overall index size |
| `@LOBCompaction` | Compact LOB data | Reduces storage for LOB columns |
| `@UpdateStatistics` | Update statistics after rebuild | 'ALL', 'COLUMNS', 'INDEX', NULL |
| `@OnlyModifiedStatistics` | Only update changed statistics | Reduces unnecessary updates |
| `@TimeLimit` | Maximum execution time (seconds) | Prevents runaway jobs |
| `@Delay` | Pause between operations (seconds) | Reduces continuous resource pressure |
| `@Indexes` | Specific indexes to maintain | Allows targeted maintenance |
| `@MinNumberOfPages` | Minimum size threshold | Alternative to PageCountLevel |
| `@MaxNumberOfPages` | Maximum size threshold | Limits operation to smaller indexes |
| `@LockTimeout` | Lock timeout (seconds) | Prevents blocking |
| `@LogToTable` | Log operations to table | Enables tracking/troubleshooting |

Availability Group-Specific Parameters

| Parameter | Description | Recommended Setting |
|---|---|---|
| `@AvailabilityGroups` | Target specific AGs | Limit scope when needed |
| `@AvailabilityGroupReplicas` | Target specific replicas | 'PRIMARY' to limit AG impact |
| `@AvailabilityGroupDatabases` | Target specific databases | Focus on critical databases |

Implementation Strategies by Index Size

Large Indexes (>10GB)

EXECUTE dbo.IndexOptimize
    @Databases = 'PRODUCTION_DB',
    @FragmentationLow = NULL,
    @FragmentationMedium = 'INDEX_REORGANIZE',
    @FragmentationHigh = 'INDEX_REORGANIZE,INDEX_REBUILD_ONLINE',
    @FragmentationLevel1 = 15,
    @FragmentationLevel2 = 40,
    @PageCountLevel = 10000,      -- Only process substantial indexes
    @MaxDOP = 4,                  -- Limit CPU utilization
    @TimeLimit = 7200,            -- 2-hour limit per operation
    @Delay = '00:00:45',          -- 45-second pause between operations
    @SortInTempdb = 'Y',          -- Reduce database file I/O
    @MaxNumberOfPages = NULL,     -- No upper limit
    @MinNumberOfPages = 10000,
    @LockTimeout = 300,           -- 5-minute lock timeout
    @LogToTable = 'Y',
    @Execute = 'Y';

Special considerations:

- Prefer REORGANIZE for large indexes to minimize transaction log growth
- Use REBUILD selectively when reorganizing is insufficient
- Implement a larger `@Delay` to allow transaction log processing
- Schedule during low-activity periods
- Consider smaller batches using the `@Indexes` parameter

Medium Indexes (1GB-10GB)

EXECUTE dbo.IndexOptimize
    @Databases = 'PRODUCTION_DB',
    @FragmentationLow = NULL,
    @FragmentationMedium = 'INDEX_REORGANIZE',
    @FragmentationHigh = 'INDEX_REBUILD_ONLINE',
    @FragmentationLevel1 = 10,
    @FragmentationLevel2 = 30,
    @PageCountLevel = 1000,
    @MaxDOP = 2,
    @TimeLimit = 3600,            -- 1-hour limit
    @Delay = '00:00:20',          -- 20-second pause
    @SortInTempdb = 'Y',
    @MinNumberOfPages = 1000,
    @MaxNumberOfPages = 10000,
    @LockTimeout = 180,           -- 3-minute lock timeout
    @LogToTable = 'Y',
    @Execute = 'Y';

Special considerations:

- Balance REORGANIZE and REBUILD operations
- Use a moderate `@Delay` value to manage resource impact
- Can run more frequently than large index maintenance

Small Indexes (<1GB)

EXECUTE dbo.IndexOptimize
    @Databases = 'PRODUCTION_DB',
    @FragmentationLow = NULL,
    @FragmentationMedium = 'INDEX_REORGANIZE',
    @FragmentationHigh = 'INDEX_REBUILD_ONLINE',
    @FragmentationLevel1 = 5,
    @FragmentationLevel2 = 30,
    @PageCountLevel = 100,
    @MaxDOP = 0,                  -- Use server default
    @TimeLimit = 1800,            -- 30-minute limit
    @Delay = '00:00:05',          -- 5-second pause
    @SortInTempdb = 'Y',
    @MaxNumberOfPages = 1000,
    @MinNumberOfPages = 100,
    @LockTimeout = 60,            -- 1-minute lock timeout
    @LogToTable = 'Y',
    @Execute = 'Y';

Special considerations:

- Can be more aggressive with rebuild operations
- Minimal `@Delay` needed between operations
- Can run during regular business hours with minimal impact

Availability Group-Specific Configurations

Environment: Large, mission-critical OLTP database with multiple replicas in an Availability Group (AG) configured for synchronous commit.

Maintenance objectives:

- Minimize impact on the production workload and log shipping
- Avoid exhausting storage resources due to log growth
- Ensure high availability and minimal downtime

Synchronous AG Environment

EXECUTE dbo.IndexOptimize
    @Databases = 'PRODUCTION_DB',
    @FragmentationLow = NULL,
    @FragmentationMedium = 'INDEX_REORGANIZE',
    @FragmentationHigh = 'INDEX_REORGANIZE',  -- Avoid rebuilds in sync AGs
    @FragmentationLevel1 = 15,
    @FragmentationLevel2 = 40,
    @PageCountLevel = 5000,
    @MaxDOP = 2,
    @TimeLimit = 3600,
    @Delay = '00:01:00',                      -- Longer delay for sync replicas
    @AvailabilityGroupReplicas = 'PRIMARY',
    @LockTimeout = 300,
    @LogToTable = 'Y',
    @Execute = 'Y';

Synchronous AG considerations:

- Minimize rebuilds; transaction logs must be synchronized before the operation completes
- Implement longer delays between operations to allow synchronization
- Monitor replica lag and suspend jobs if lag exceeds thresholds
- Increase log backup frequency during maintenance windows
- Split maintenance across multiple days for very large environments

Asynchronous AG Environment

Environment: Large, multi-terabyte data warehouse database with asynchronous AG replicas.

Maintenance objectives:

- Perform comprehensive index and statistics maintenance
- Minimize the impact on the reporting workload during the maintenance window
- Ensure optimal performance for the upcoming quarter

EXECUTE dbo.IndexOptimize
    @Databases = 'PRODUCTION_DB',
    @FragmentationLow = NULL,
    @FragmentationMedium = 'INDEX_REORGANIZE',
    @FragmentationHigh = 'INDEX_REBUILD_ONLINE',  -- Rebuilds more acceptable
    @FragmentationLevel1 = 10,
    @FragmentationLevel2 = 30,
    @PageCountLevel = 2000,
    @MaxDOP = 4,
    @TimeLimit = 5400,
    @Delay = '00:00:30',                          -- Moderate delay
    @AvailabilityGroupReplicas = 'PRIMARY',
    @LockTimeout = 240,
    @LogToTable = 'Y',
    @Execute = 'Y';

Asynchronous AG considerations:

- More liberal use of rebuilds; operations don't wait for secondary synchronization
- Still monitor the send queue to prevent overwhelming secondaries
- Consider network bandwidth and adjust `@Delay` accordingly
- Implement send queue size alerts during maintenance
Preventing Storage and IOPS Pressure Pre-Maintenance Preparation Expand transaction log files proactively: ALTER DATABASE [YourDatabase] MODIFY FILE (NAME = LogFileName, SIZE = ExpandedSizeInMB); Configure TempDB properly: -- Verify TempDB configuration SELECT name, size/128.0 AS [Size_MB] FROM tempdb.sys.database_files; Implement pre-maintenance checks: -- Create helper procedure to validate environment readiness CREATE PROCEDURE dbo.ValidateMaintenanceReadiness AS BEGIN DECLARE @IssuesFound BIT = 0; -- Check log space IF EXISTS ( SELECT 1 FROM sys.databases d CROSS APPLY sys.dm_db_log_space_usage() l WHERE d.database_id = DB_ID() AND l.log_space_used_percent > 30 ) BEGIN RAISERROR('Log usage exceeds 30%. Backup logs before proceeding.', 16, 1); SET @IssuesFound = 1; END -- Check AG health IF EXISTS ( SELECT 1 FROM sys.dm_hadr_availability_replica_states ars JOIN sys.availability_replicas ar ON ars.replica_id = ar.replica_id WHERE ars.is_local = 0 AND ars.synchronization_health <> 2 -- Not HEALTHY ) BEGIN RAISERROR('Availability Group replicas not in healthy state.', 16, 1); SET @IssuesFound = 1; END RETURN @IssuesFound; END; GO Operational Techniques Implement dynamic index selection based on business impact: -- Create index priority categories CREATE TABLE dbo.IndexMaintenancePriority ( SchemaName NVARCHAR(128), TableName NVARCHAR(128), IndexName NVARCHAR(128), Priority INT, -- 1=High, 2=Medium, 3=Low MaintenanceDay TINYINT -- Day of week (1-7) ); -- Use with dynamic execution DECLARE @IndexList NVARCHAR(MAX); SELECT @IndexList = STRING_AGG(CONCAT(DB_NAME(), '.', SchemaName, '.', TableName, '.', IndexName), ',') FROM dbo.IndexMaintenancePriority WHERE Priority = 1 AND MaintenanceDay = DATEPART(WEEKDAY, GETDATE()); EXEC dbo.IndexOptimize @Databases = 'PRODUCTION_DB', @Indexes = @IndexList, -- other parameters Implement I/O throttling techniques: Use Resource Governor to limit I/O (SQL Server Enterprise).Set lower `@MaxDOP` values during business hours.Implement longer `@Delay` values during peak periods. 
Database-level I/O tuning: -- Consider trace flag 1117 for uniform file growth DBCC TRACEON(1117, -1); -- Consider trace flag 1118 for reducing SGAM contention DBCC TRACEON(1118, -1); -- For SQL Server 2016+, use proper tempdb configuration ALTER DATABASE [tempdb] MODIFY FILE (NAME = 'tempdev', SIZE = 8GB); Advanced Scheduling Strategies Workload-Aware Batching -- Create helper procedure for smart batching CREATE PROCEDURE dbo.ExecuteIndexMaintenanceBatch @BatchSize INT = 5, @MaxRuntime INT = 7200 -- 2 hours in seconds AS BEGIN DECLARE @StartTime DATETIME = GETDATE(); DECLARE @EndTime DATETIME = DATEADD(SECOND, @MaxRuntime, @StartTime); DECLARE @CurrentTime DATETIME; DECLARE @IndexBatch NVARCHAR(MAX); WHILE (1=1) BEGIN SET @CurrentTime = GETDATE(); IF @CurrentTime > @EndTime BREAK; -- Get next batch of indexes based on priority and fragmentation SELECT TOP (@BatchSize) @IndexBatch = STRING_AGG(CONCAT(DB_NAME(), '.', s.name, '.', t.name, '.', i.name), ',') FROM sys.indexes i JOIN sys.tables t ON i.object_id = t.object_id JOIN sys.schemas s ON t.schema_id = s.schema_id JOIN sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') ps ON ps.object_id = i.object_id AND ps.index_id = i.index_id WHERE i.type_desc = 'NONCLUSTERED' AND ps.avg_fragmentation_in_percent > 30 AND ps.page_count > 1000 AND NOT EXISTS ( -- Skip indexes we've already processed SELECT 1 FROM dbo.CommandLog WHERE DatabaseName = DB_NAME() AND SchemaName = s.name AND ObjectName = t.name AND IndexName = i.name AND StartTime > DATEADD(DAY, -7, GETDATE()) ) ORDER BY ps.avg_fragmentation_in_percent DESC; IF @IndexBatch IS NULL BREAK; -- No more work to do -- Execute maintenance for this batch EXEC dbo.IndexOptimize @Databases = DB_NAME(), @Indexes = @IndexBatch, @FragmentationLow = NULL, @FragmentationMedium = 'INDEX_REORGANIZE', @FragmentationHigh = 'INDEX_REORGANIZE', @FragmentationLevel1 = 10, @FragmentationLevel2 = 30, @MaxDOP = 2, @TimeLimit = 1800, -- 30 minutes per batch @Delay = '00:00:30', @LogToTable = 'Y', @Execute = 'Y'; -- Pause between batches WAITFOR DELAY '00:01:00'; END END; GO Monitoring Framework -- Create monitoring stored procedure CREATE PROCEDURE dbo.MonitorIndexMaintenance AS BEGIN -- Check transaction log usage SELECT DB_NAME(database_id) AS DatabaseName, log_space_in_use_percentage FROM sys.dm_db_log_space_usage WHERE log_space_in_use_percentage > 50; -- Check AG send queue size SELECT ar.replica_server_name, drs.database_name, drs.log_send_queue_size, drs.log_send_rate, drs.redo_queue_size, drs.redo_rate FROM sys.dm_hadr_database_replica_states drs JOIN sys.availability_replicas ar ON drs.replica_id = ar.replica_id WHERE drs.log_send_queue_size > 10000 OR drs.redo_queue_size > 10000; -- Check ongoing index operations SELECT r.session_id, r.command, r.status, r.wait_type, r.wait_time, OBJECT_NAME(p.object_id) AS ObjectName, p.index_id, i.name AS IndexName FROM sys.dm_exec_requests r CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t LEFT JOIN sys.partitions p ON p.hobt_id = r.statement_id LEFT JOIN sys.indexes i ON i.object_id = p.object_id AND i.index_id = p.index_id WHERE t.text LIKE '%INDEX_REBUILD%' OR t.text LIKE '%INDEX_REORGANIZE%'; END; GO Best Practices Summary For synchronous AG environments: Prioritize REORGANIZE over REBUILD, especially for large indexes.Implement longer delays between operations (45-90 seconds). Schedule maintenance during the least active periods.Consider partitioning very large tables for incremental maintenance. 
For asynchronous AG environments: More liberal use of REBUILD for critical indexes. Implement moderate delays (15-45 seconds). Monitor send queue and redo queue sizes closely. General IOPS reduction techniques: Leverage `@SortInTempdb = 'Y'` to spread I/O load. Use `@MaxDOP` to control parallelism (lower values reduce I/O).Implement `@Delay` parameters appropriate to your environment. Use `@TimeLimit` to prevent runaway operations. Storage pressure mitigation: Pre-allocate transaction log space before maintenance.Increase log backup frequency during maintenance (every 5-15 minutes). Use Resource Governor to limit I/O impact. Implement batched approaches with appropriate pauses. Comprehensive maintenance approach: Different strategies for different index sizes. Business-hour vs. off-hour configurations. Prioritization based on business impact.Regular verification of fragmentation levels post-maintenance. By implementing these guidelines and adapting the provided scripts to your specific environment, you can maintain optimal SQL Server index performance while minimizing production impact, even in complex Availability Group configurations.
The Go programming language is a great fit for building serverless applications. Go applications can be easily compiled to a single, statically linked binary, making deployment simple and reducing external dependencies. They start up quickly, which is ideal for serverless environments where functions are frequently invoked from a cold start. Go applications also tend to use less memory compared to other languages, helping optimize resource usage and reduce costs in serverless scenarios. Azure Functions supports Go using custom handlers, and you can use triggers and input and output bindings via extension bundles. Azure Functions is tightly integrated with Azure Cosmos DB using bindings (input, output) and triggers. This blog post will walk you through how to build Azure Functions with Go that make use of these Azure Cosmos DB integrations. Bindings allow you to easily read and write data to Cosmos DB, while triggers are useful for building event-driven applications that respond to changes in your data in Cosmos DB. Part 1 of this blog starts off with a function that gets triggered by changes in a Cosmos DB container and simply logs the raw Azure Functions event payload and the Cosmos DB document. You will learn how to run the function and also test it with Cosmos DB locally, thanks to the Cosmos DB emulator and Azure Functions Core Tools. If this is your first time working with Go and Azure Functions, you should find it helpful to get up and running quickly. Although you can deploy it to Azure, we will save that for the next part of this blog. Part 2 dives into another function that generates embeddings for the documents in the Cosmos DB container. This example will use an Azure OpenAI embedding model to generate embeddings for the documents in the container and then store the embeddings back in the container. This is useful for building applications that require semantic search or other generative AI applications. Check out the GitHub repository for the complete code. Part 1: Build a Simple Cosmos DB Trigger-Based Function and Run It Locally Just as the Cosmos DB emulator lets you run Cosmos DB locally, Azure Functions Core Tools lets you develop and test your functions locally. Start by installing the Azure Functions Core Tools; refer to the documentation for instructions specific to your OS. For example, on Linux, you can: Shell sudo apt-get update sudo apt-get install azure-functions-core-tools-4 Next, start the Cosmos DB emulator. The commands below are for Linux and use the Docker container-based approach. Refer to the documentation for other options. You need to have Docker installed and running on your machine. If you don't have it installed, please refer to the Docker installation guide. YAML docker pull mcr.microsoft.com/cosmosdb/linux/azure-cosmos-emulator:latest docker run \ --publish 8081:8081 \ --name linux-emulator \ -e AZURE_COSMOS_EMULATOR_PARTITION_COUNT=1 \ mcr.microsoft.com/cosmosdb/linux/azure-cosmos-emulator:latest Make sure to configure the emulator SSL certificate as well. For example, for the Linux system I was using, I ran the following command to download the certificate and regenerate the certificate bundle: Shell curl --insecure https://localhost:8081/_explorer/emulator.pem > ~/emulatorcert.crt sudo update-ca-certificates Use the following URL to navigate to the Cosmos DB Data Explorer using your browser: http://localhost:8081/_explorer/index.html. 
Create the following resources: A databaseA container with a partition key /id – this is the source containerA lease container with the name leases and partition key /id – it is used by the trigger to keep track of the changes in the source container. Clone the GitHub repository with the code for the function: Shell git clone https://github.com/abhirockzz/golang_cosmosdb_azure_functions.git cd golang_cosmosdb_azure_functions/getting_started_guide Create a local.settings.json file with the Cosmos DB-related info. Use the same database and container names as you created in the previous step. The local.settings.json file is used to store the configuration settings for your function app when running locally: JSON { "IsEncrypted": false, "Values": { "AzureWebJobsStorage": "", "FUNCTIONS_WORKER_RUNTIME": "custom", "COSMOS_CONNECTION": "AccountEndpoint=https://localhost:8081/;AccountKey=C2y6yDjf5/R+ob0N8A7Cgv30VRDJIWEHLM+4QDU5DE2nQ9nDuVTqobD4b8mGGyPMbIZnqyMsEcaGQy67XIw/Jw==;", "COSMOS_DATABASE_NAME": "test", "COSMOS_CONTAINER_NAME": "tasks" } } COSMOS_CONNECTION has a static value for the connection string for the Cosmos DB emulator — do not change it. Build the Go function binary using the following command. This will create a binary file named main in the current directory: Go go build -o main main.go Start the function locally: Go func start This will start the function app and listen for incoming requests. You should see output similar to this: Plain Text [2025-04-25T07:44:53.921Z] Worker process started and initialized. Functions: processor: cosmosDBTrigger For detailed output, run func with --verbose flag. [2025-04-25T07:44:58.809Z] Host lock lease acquired by instance ID '0000000000000000000000006ADD8D3E'. //... Add data to the source container in Cosmos DB. You can do this by navigating to Data Explorer in the emulator. For example, add a document with the following JSON: JSON { "id": "42", "description": "test" } The function should be triggered automatically when the document is added to the container. You can check the logs of the function app to see if it was triggered successfully: Plain Text [2025-04-25T07:48:10.559Z] Executing 'Functions.processor' (Reason='New changes on container tasks at 2025-04-25T07:48:10.5593689Z', Id=7b62f8cf-683b-4a5b-9db0-83d049bc4c86) [2025-04-25T07:48:10.565Z] processor function invoked... [2025-04-25T07:48:10.565Z] Raw event payload: {{"[{\"id\":\"42\",\"description\":\"test\",\"_rid\":\"AxI2AL1rrFoDAAAAAAAAAA==\",\"_self\":\"dbs/AxI2AA==/colls/AxI2AL1rrFo=/docs/AxI2AL1rrFoDAAAAAAAAAA==/\",\"_etag\":\"\\\"00000000-0000-0000-b5b6-6123f4d401db\\\"\",\"_attachments\":\"attachments/\",\"_ts\":1745567285,\"_lsn\":4}]"} {{processor 2025-04-25T07:48:10.560243Z 4f29b3f3-ba95-4043-9b67-2856a43b4734}} [2025-04-25T07:48:10.566Z] Cosmos DB document: {42 AxI2AL1rrFoDAAAAAAAAAA== dbs/AxI2AA==/colls/AxI2AL1rrFo=/docs/AxI2AL1rrFoDAAAAAAAAAA==/ "00000000-0000-0000-b5b6-6123f4d401db" attachments/ 1745567285 4} [2025-04-25T07:48:10.566Z] Executed 'Functions.processor' (Succeeded, Id=7b62f8cf-683b-4a5b-9db0-83d049bc4c86, Duration=6ms) //..... How It Works Here is a very high-level overview of the code: main.go: Implements an HTTP server with a processor endpoint. When triggered, it reads a Cosmos DB trigger payload from the request, parses the nested documents, logs information, and returns a structured JSON response. 
It uses types and helpers from the commonpackage.common package: Contains shared types and utilities for Cosmos DB trigger processing: payload.go: Defines data structures for the trigger payload, documents, and response.parse.go: Provides a Parse function to extract and unmarshal documents from the trigger payload’s nested JSON structure. Part 2: Use Azure OpenAI to Generate Embeddings for the Documents in the Cosmos DB Container In addition to its low-latency, high-performance, and scalability characteristics, its support for vector (semantic/similarity), full-text, and hybrid search makes Azure Cosmos DB a great fit for generative AI applications. Consider a use case for managing a product catalog for an e-commerce platform. Each time a new product is added to the system (with a short description like “Bluetooth headphones with noise cancellation”), we want to immediately make that item searchable semantically. As soon as the product document is written to Cosmos DB, an Azure Function is triggered. It extracts the product description, generates a vector embedding using Azure OpenAI, and writes the embedding back to the same document using an output binding. With the embedding in place, the product is now indexed and ready for semantic and hybrid search queries, without any additional effort. Prerequisites You will run this example in Azure, so you need to have an Azure account. If you don't have one, you can create a free account. Create an Azure Cosmos DB for NoSQL account. Enable the vector indexing and search feature – this is a one-time operation. Just like before, you will need to create the following resources: A databaseA container with partition key /id – this is the source containerA lease container with the name leases and partition key /id – it is used by the trigger to keep track of the changes in the source container. The lease container needs to be created in advance since we have configured Azure Functions to use managed identity to access the Cosmos DB account – you don't need to use keys or connection strings. Create an Azure OpenAI Service resource. Azure OpenAI Service provides access to OpenAI's models, including GPT-4o, GPT-4o mini (and more), as well as embedding models. Deploy an embedding model of your choice using the Azure AI Foundry portal (for example, I used the text-embedding-3-small model). Just like the Cosmos DB account, the Azure Function app uses a managed identity to access the Azure OpenAI Service resource. Deploy Resources Move into the right directory: Shell cd ../embeddings_generator To simplify the deployment of the function app along with the required resources and configuration, you can use the deploy.sh script. At a high level, it: Sets up environment variables for Azure resources.Creates an Azure resource group, storage account, and function app plan.Deploys a custom Go-based Azure Function App.Builds the Go binary for Windows.Publishes the function app to Azure.Enables the function app system identity and provides it the required roles for Cosmos DB and Azure OpenAI resource access. Before you deploy the solution, update the local.settings.json. 
Use the same database and container names as you created in the previous step: JSON { "IsEncrypted": false, "Values": { "AzureWebJobsStorage": "", "FUNCTIONS_WORKER_RUNTIME": "custom", "COSMOS_CONNECTION__accountEndpoint": "https://ENTER_COSMOSDB_ACCOUNT_NAME.documents.azure.com:443/", "COSMOS_DATABASE_NAME": "name of the database", "COSMOS_CONTAINER_NAME": "name of the container", "COSMOS_HASH_PROPERTY": "hash", "COSMOS_VECTOR_PROPERTY": "embedding", "COSMOS_PROPERTY_TO_EMBED": "description", "OPENAI_DEPLOYMENT_NAME": "enter the embedding model deployment name e.g. text-embedding-3-small", "OPENAI_DIMENSIONS": "enter the dimensions e.g. 1536", "OPENAI_ENDPOINT": "https://ENTER_OPENAI_RESOURCE_NAME.openai.azure.com/" } } COSMOS_CONNECTION_accountEndpoint: Endpoint URL for the Azure Cosmos DB account.COSMOS_DATABASE_NAME: Name of the Cosmos DB database to use.COSMOS_CONTAINER_NAME: Name of the Cosmos DB container to use.COSMOS_HASH_PROPERTY: Name of the property used as a hash in Cosmos DB documents (no need to modify this).COSMOS_VECTOR_PROPERTY: Name of the property storing vector embeddings in Cosmos DB.COSMOS_PROPERTY_TO_EMBED: Name of the property whose value will be embedded. Change this based on your document structure.OPENAI_DEPLOYMENT_NAME: Name of the Azure OpenAI model deployment to use for embeddings.OPENAI_DIMENSIONS: Number of dimensions for the embedding vectors.OPENAI_ENDPOINT: Endpoint URL for the Azure OpenAI resource. Run the deploy.sh script: Shell chmod +x deploy.sh ./deploy.sh As part of the azure functionapp publish command that's used in the script, you will be prompted to overwrite the value of the existing AzureWebJobsStorage setting in the local.settings.json file to Azure – choose "no". Run the End-to-End Example Add data to the source container in Cosmos DB. For example, add a document with the following JSON: JSON { "id": "de001c6d-4efe-4a65-a59a-39a0580bfa2a", "description": "Research new technology" } The function should be triggered automatically when the document is added to the container. You can check the logs of the function app to see if it was triggered successfully: Shell func azure functionapp logstream <FUNCTION_APP_NAME> You should see logs similar to this (the payload will be different depending on the data you add): Plain Text 2025-04-23T05:34:41Z [Information] function invoked 2025-04-23T05:34:41Z [Information] cosmosVectorPropertyName: embedding 2025-04-23T05:34:41Z [Information] cosmosVectorPropertyToEmbedName: description 2025-04-23T05:34:41Z [Information] cosmosHashPropertyName: hash 2025-04-23T05:34:41Z [Information] Processing 1 documents 2025-04-23T05:34:41Z [Information] Processing document ID: de001c6d-4efe-4a65-a59a-39a0580bfa2a 2025-04-23T05:34:41Z [Information] Document data: Research new technology 2025-04-23T05:34:41Z [Information] New document detected, generated hash: 5bb57053273563e2fbd4202c666373ccd48f86eaf9198d7927a93a555aa200aa 2025-04-23T05:34:41Z [Information] Document modification status: true, hash: 5bb57053273563e2fbd4202c666373ccd48f86eaf9198d7927a93a555aa200aa 2025-04-23T05:34:41Z [Information] Created embedding for document: map[description:Research new technology id:de001c6d-4efe-4a65-a59a-39a0580bfa2a] 2025-04-23T05:34:41Z [Information] Adding 1 document with embeddings 2025-04-23T05:34:41Z [Information] Added enriched documents to binding output 2025-04-23T05:34:41Z [Information] Executed 'Functions.cosmosdbprocessor' (Succeeded, Id=91f4760f-047a-4867-9030-46a6602ab179, Duration=128ms) //.... 
Verify the data in Cosmos DB. You should see an embedding for the description property of the document stored in the embedding property. It should look something like this: JSON { "id": "de001c6d-4efe-4a65-a59a-39a0580bfa2a", "description": "Research new technology", "embedding": [ 0.028226057, -0.00958694 //.... ], "hash": "5bb57053273563e2fbd4202c666373ccd48f86eaf9198d7927a93a555aa200aa" } Once the embeddings are generated, you can integrate them with generative AI applications. For example, you can use the vector search feature of Azure Cosmos DB to perform similarity searches based on the embeddings. How It Works Here is a very high-level overview of the code: main.go: Implements an HTTP server with a cosmosdbprocessorendpoint. When triggered, it reads a Cosmos DB trigger payload from the request, parses the nested documents, generates embeddings using Azure OpenAI, and writes the enriched documents back to the Cosmos DB container. Exposes the cosmosdbprocessor endpoint, which processes incoming Cosmos DB documents.For each document, checks if it is new or modified (using a hash), generates an embedding (vector) using Azure OpenAI, and prepares enriched documents for output.Handles logging and error reporting for the function execution.common package: Contains shared utilities and types for processing Cosmos DB documents embedding.go: Handles creation of embeddings using Azure OpenAI.parse.go: Parses and extracts documents from the Cosmos DB trigger payload.payload.go: Defines data structures for payloads and responses used across the project. The function uses a hash property to check if the document has already been processed. If the hash value is different from the one stored in Cosmos DB, it means that the document has been modified and needs to be reprocessed. In this case, the function will generate a new embedding and update the document with the new hash value. This ensures that the function does not get stuck in an infinite loop. If the hash value is the same, it means that the document has not been modified and does not need to be re-processed. In this case, the function will log that the document is unchanged and will not generate a new embedding. You should see logs similar to this: Plain Text 2025-04-23T05:34:42Z [Information] function invoked 2025-04-23T05:34:42Z [Information] cosmosVectorPropertyName: embedding 2025-04-23T05:34:42Z [Information] cosmosVectorPropertyToEmbedName: description 2025-04-23T05:34:42Z [Information] cosmosHashPropertyName: hash 2025-04-23T05:34:42Z [Information] Processing 1 document 2025-04-23T05:34:42Z [Information] Processing document ID: de001c6d-4efe-4a65-a59a-39a0580bfa2a 2025-04-23T05:34:42Z [Information] Document data: Research new technology 2025-04-23T05:34:42Z [Information] Document unchanged, hash: 5bb57053273563e2fbd4202c666373ccd48f86eaf9198d7927a93a555aa200aa 2025-04-23T05:34:42Z [Information] Document modification status: false, hash: 2025-04-23T05:34:42Z [Information] Executed 'Functions.cosmosdbprocessor' (Succeeded, Id=f0cf039a-5de5-4cc1-b29d-928ce32b294e, Duration=6ms) //.... Delete Resources Be sure to clean up the resources you created in Azure. You can do this using the Azure portal or the Azure CLI. For example, to delete the resource group and all its resources, run: Shell az group delete --name <resource-group-name> This will delete the resource group and all its resources, including the Cosmos DB account, function app, and storage account. 
Conclusion In this blog post, you learned how to build Azure Functions with Go that use Cosmos DB triggers and bindings. You started with a simple function that logs the raw event payload and the Cosmos DB document, and then moved on to a more complex function that generates embeddings for the documents in the Cosmos DB container using Azure OpenAI. You also learned how to run the functions locally using the Cosmos DB emulator and Azure Functions Core Tools, and how to deploy them to Azure. You can use these examples as a starting point for building your own serverless applications with Go and Azure Functions. The combination of Go's performance and simplicity, along with Azure Functions' scalability and integration with Cosmos DB, makes it a powerful platform for building modern applications.
In today's post, I would like to dive deeper into one of the newest—relatively speaking—topics in the distributed systems domain. As you may have guessed already, the spotlight is on Conflict-free Replicated Data Types, or CRDTs for short. I will explain what they are and what role they play in the larger landscape of distributed systems. Let's start our journey by explaining what Strong Eventual Consistency (SEC) means in this context.

Why SEC Matters

Consistency is one of the most important—if not the most important—traits of any system. However, the classic strong consistency model takes a significant toll on performance and limits the scalability and availability of our systems. As a result, "weaker" consistency models have become more and more popular and widely adopted. Eventual consistency promises to solve some of the issues created by strong consistency models, but it also introduces entirely new kinds of problems, conflict resolution being one of them. SEC aims to tackle this particular issue. It is a consistency model built atop eventual consistency that aims to provide a conflict-free environment and ensure availability in the face of failure. It also reduces the cognitive load placed on system architects by removing the need to implement complex conflict-resolution and rollback logic. The theoretical basis for SEC is simple mathematical properties such as monotonicity, commutativity, and associativity. As such, it is only valid for very specific data types and operations. These data types are commonly denoted as CRDTs, which is not surprising, considering that SEC was introduced in the original (as far as I know) CRDT paper.

What Are CRDTs?

CRDTs are data structures designed to ensure that data on different computers (replicas) will eventually converge—and be merged—into a consistent state, no matter what modifications were made and without any special conflict-resolution code or user intervention. Additionally, CRDTs are decentralized, so they do not need any coordination between replicas: individual replicas exchange data directly with each other. This trait makes them quite interesting and different from the algorithms used in most online gaming and Distributed File Systems (DFS). We can differentiate two basic types of CRDTs: operation-based and state-based. There is also a delta-based type, which is an extension of the state-based CRDT family.

Convergent Replicated Data Types (CvRDTs)

CvRDTs are state-based CRDTs. They rely on a continuous exchange of current states between replicas (a classic use case for the gossip protocol, by the way). When a replica receives a new version of the state, it applies a predefined merge function to update its own state. In such a setting, when updates stop coming, all the replicas will reach the same, consistent state. Keep in mind that the replicas exchange their total state each time, so the size of messages may become quite large.

Commutative Replicated Data Types (CmRDTs)

CmRDTs (also called operation-based CRDTs) are an alternative to state-based types. Unlike state-based types, they do not have a merge method. Instead, they split update operations into two steps: prepare-update and effect-update. The first phase is executed locally at a particular replica, and it is directly followed by the second phase, which executes across all other replicas, effectively equalizing the state across the whole deployment.
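To make the state-based idea concrete, here is a minimal sketch of a grow-only counter (GCounter) in Java. It is an illustration under simple assumptions (string replica IDs, in-memory state), not a reference to any particular CRDT library.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a state-based grow-only counter (CvRDT).
// Assumes replica IDs are plain strings and state lives in memory.
public class GCounter {
    private final String replicaId;
    private final Map<String, Long> counts = new HashMap<>();

    public GCounter(String replicaId) {
        this.replicaId = replicaId;
    }

    // Local update: each replica only ever increments its own slot.
    public void increment() {
        counts.merge(replicaId, 1L, Long::sum);
    }

    // The counter's value is the sum of all per-replica slots.
    public long value() {
        return counts.values().stream().mapToLong(Long::longValue).sum();
    }

    // Merge function: element-wise maximum. It is commutative, associative,
    // and idempotent, so replicas converge regardless of message order.
    public void merge(GCounter other) {
        other.counts.forEach((id, count) -> counts.merge(id, count, Math::max));
    }
}
```

An operation-based (CmRDT-style) version of the same counter would drop the merge function and instead broadcast each increment as an operation to be applied at every replica, which is where the delivery requirements discussed next come in.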
However, for the second phase to work correctly, CmRDTs require a reliable communication protocol that provides causal ordering of messages. This is not a particularly hard requirement to satisfy nowadays, as such tools are very common, but it does add another layer of complexity.

Equivalence

There is one interesting fact regarding CvRDTs and CmRDTs: they are equivalent to each other, at least from a mathematical perspective. In the previously linked paper, there is an entire subsection (3.2) explaining in great detail why this statement holds true. I will not be copy-pasting the same text here—the TL;DR is that it is based on emulating one type with the other.

Delta-State Conflict-Free Replicated Data Types (δ-CRDT)

Paulo Sérgio Almeida et al., in their paper Efficient State-based CRDTs by Delta-Mutation, proposed the δ-CRDT, an extension of classic state-based CRDTs that addresses their biggest weakness: the continuous exchange of messages containing the full state of the object. There are two key concepts used to achieve this: the delta-mutator and the delta-state. The δ-state is a representation of the changes applied by the mutator to the current state. This delta is later sent to other replicas, effectively reducing the size of messages. Additionally, to reduce the number of messages exchanged between replicas, we can group multiple deltas into a delta-group. I do not want to get too deep into the details of the different types; there is much more here to uncover. If you are interested in all the math behind CRDTs, you can find all of these details here.

CRDT Fault Tolerance

In terms of classic availability and fault tolerance, CRDTs are quite an interesting case. Their base consistency model—SEC—promises very high resilience. This is possible mostly thanks to the eventual consistency of SEC itself, but also to the resilient nature of the CRDT algorithms. State-based CRDTs exchange their full state with each other, so short of a total failure, the replicas will sooner or later converge to a consistent state. In the case of operation-based (op-based) CRDTs, the update effect is cumulative, so no matter the order in which messages spread through the replicas, they will also converge on an equivalent state. With delta CRDTs, the situation is similar, as they build upon both op- and state-based types.

There are three traits of CRDTs that make them especially resilient:

- Decentralization - CRDTs operate without a central coordinator, eliminating single points of failure. They naturally handle network partitions: updates are applied locally and propagate when communication is restored.
- Asynchronous communication - CRDTs use only asynchronous communication, either via a gossip-based protocol or some broadcast protocol. Nodes do not need to wait for acknowledgments, nor do they use any consensus algorithm. CRDT-wide state convergence happens asynchronously.
- Node failures and recovery - Nodes continue to store and process their local state even during a network failure. Upon recovery, they can synchronize with other replicas to merge any missed updates.

Byzantine Fault Tolerance

Despite all the traits above and their generally very high fault tolerance, CRDTs are not fully invincible. There is one particular type of failure that CRDTs cannot easily recover from: Byzantine faults.
Ironically, the exact same thing that makes CRDTs so highly available—decentralization—is also the main reason they are susceptible to Byzantine faults. Byzantine faults occur when nodes in a distributed system behave maliciously or send malformed states, potentially leading to inconsistencies. In such a situation, reaching a consistent state across all the replicas through a gossip-based protocol or broadcast can be highly problematic, and unfortunately, CRDTs rely heavily on exactly these approaches. Making CRDTs Byzantine fault-tolerant is a relatively new and hot topic among researchers focused on distributed systems, with Martin Kleppmann's paper Making CRDTs Byzantine Fault Tolerant being one of the most cited CRDT papers ever.

CRDTs vs. CAP

The CAP theorem describes the spectrum of availability and consistency, stating that, in the presence of a network partition, having both at the same time is not possible. CRDTs put this claim into question to some extent, as CAP is more nuanced than a simple choice between consistency and availability. CRDTs promise very high availability and eventual consistency: CRDT replicas are always available for reads and writes no matter the network partitions or failures, and, what is more, any subset of communicating replicas will eventually be consistent. While this is not the same as the linearizability required by CAP, it still gives strong guarantees about eventual state consistency. CRDTs show that CAP is more of a spectrum than a binary choice, and that we can balance availability and consistency throughout our system.

Types of CRDT

The full list of existing CRDTs is very, very long and would require multiple pages to list, not to mention describe. Here I will cover only some basic types, which can later be used to build more complex structures. Let's start with a simple register.

Register

The register is the simplest CRDT structure. It is responsible for holding a single value, like a variable. There are two basic semantics for building CRDT registers, depending on how they resolve concurrent writes:

- Multi-value Register - Stores and returns all concurrently written values, effectively returning a multi-set. Requires a conflict-resolution mechanism at a higher level.
- Last-write-wins Register (LWW) - As the name suggests, only the newest value is stored in the register.

Counter

A counter is similar to a register in that it also stores a single value, specifically a numeric one. Here, too, we can differentiate two basic types:

- Grow-only counter (GCounter) - The simplest counter, supporting only an increment operation. Each replica holds its own state, and the global state is the sum of all local counters.
- Positive-Negative Counter (PN-Counter) - A somewhat more complex counter that supports both increment and decrement operations. It tracks increments and decrements as two counters (GCounters, in particular), and the result is the difference between them. The global state, as with the GCounter, is the total sum of all counters across the nodes.

Set

Surprising as it may be, this is just a normal set, but distributed in a CRDT manner. We have multiple different set-like CRDTs:

- The Grow-only set (GSet) is one of the most basic ones. It works almost the same way as the GCounter, so I will not spend too much time on it.
- Another one is the USet, which works in a similar fashion to the PN-Counter, using GSets to handle adds and removes.
  The USet returns the set difference between them.
- We also have Add-wins sets, which favor the add operation when resolving conflicts between the addition and removal of a particular element.
- The Remove-wins set works in the directly opposite manner to the Add-wins set and favors removal operations during conflict resolution.
- Beyond these, there are even more set CRDTs, such as the Last-write-wins set, the ORSet (observed-remove set), and many more.

Sequence

Sequence CRDTs are a very specialized type of structure. They are used extensively in the field of collaborative editing, such as documents shared and edited in Google Docs. There are multiple open-source implementations of this type of CRDT, with Yjs probably being the most popular one (over 17k stars on GitHub), followed by Automerge (4k stars on GitHub), and many, many more.

Map

The case of the Map is very similar to the Set CRDTs. We have the Add-wins Map, Remove-wins Map, and Last-write-wins Map. All of these structures behave like their set counterparts, with one difference: conflict resolution is handled on a per-key basis. An interesting case is the Multi-value Map, similar to the Multi-value Register, where the result of each concurrent put operation is stored under the same key and conflict resolution needs to be handled at a higher level. A more advanced case of a Map-based structure is a Map of CRDTs, for example a PN-Counter Map that holds PN-Counters as entry values. There is some more nuanced behavior when we want to update such entries, but in the end, composing CRDTs is a relatively easy task.

This is just a simplified and shortened list of the basic CRDTs; it probably does not even cover all the high-level types. For example, we also have graph-based CRDTs, which can be implemented using a GSet of GSets. As for a full list, I am not sure one even exists; however, the list available here is somewhat lengthier. Moreover, as you saw with the PN-Counter, you can build more complex CRDTs from simpler ones used as building blocks.

Summary

You have just read a fairly comprehensive introduction to the subject of CRDTs. You now know what CvRDTs are and how they differ from delta CRDTs. You also have some insight into how they behave when put in unfavorable situations. Moreover, you know some of the basic CRDT types, what they are, and where you can use them. If you would like to read more about CRDTs, here is a very good page. It is run by Martin Kleppmann and aggregates a lot of material around CRDTs, such as white papers and actual implementations. Thank you for your time.
Microservices architecture has gained significant popularity due to its scalability, flexibility, and modular nature. However, with multiple independent services communicating over a network, failures are inevitable. A robust failure-handling strategy is crucial to ensure reliability, resilience, and a seamless user experience. In this article, we will explore different failure-handling mechanisms in microservices and understand their importance in building resilient applications. Why Failure Handling Matters in Microservices? Without proper failure-handling mechanisms, these failures can lead to system-wide disruptions, degraded performance, or even complete downtime. Failure scenarios commonly occur due to: Network failures (e.g., DNS issues, latency spikes)Service unavailability (e.g., dependent services down)Database outages (e.g., connection pool exhaustion)Traffic spikes (e.g., unexpected high load) In Netflix, if the recommendation service is down, it shouldn’t prevent users from streaming videos. Instead, Netflix degrades gracefully by displaying generic recommendations. Key Failure Handling Mechanisms in Microservices 1. Retry Mechanism Sometimes, failures are temporary (e.g., network fluctuations, brief server downtime). Instead of immediately failing, a retry mechanism allows the system to automatically reattempt the request after a short delay. Use cases: Database connection timeoutsTransient network failuresAPI rate limits (e.g., retrying failed API calls after a cooldown period) For example, Amazon’s order service retries fetching inventory from a database before marking an item as out of stock. Best practice: Use Exponential Backoff and Jitter to prevent thundering herds. Using Resilience4j Retry: Java @Retry(name = "backendService", fallbackMethod = "fallbackResponse") public String callBackendService() { return restTemplate.getForObject("http://backend-service/api/data", String.class); } public String fallbackResponse(Exception e) { return "Service is currently unavailable. Please try again later."; } 2. Circuit Breaker Pattern If a microservice is consistently failing, retrying too many times can worsen the issue by overloading the system. A circuit breaker prevents this by blocking further requests to the failing service for a cooldown period. Use cases: Preventing cascading failures in third-party services (e.g., payment gateways)Handling database connection failuresAvoiding overloading during traffic spikes For example, Netflix uses circuit breakers to prevent overloading failing microservices and reroutes requests to backup services. States used: Closed → Calls allowed as normal.Open → Requests are blocked after multiple failures.Half-Open → Test limited requests to check recovery. Below is an example using Circuit Breaker in Spring Boot (Resilience4j). Java @CircuitBreaker(name = "paymentService", fallbackMethod = "fallbackPayment") public String processPayment() { return restTemplate.getForObject("http://payment-service/pay", String.class); } public String fallbackPayment(Exception e) { return "Payment service is currently unavailable. Please try again later."; } 3. Timeout Handling Slow service can block resources, causing cascading failures. Setting timeouts ensures a failing service doesn’t hold up other processes. 
Use cases: Preventing slow services from blocking threads in high-traffic applicationsHandling third-party API delaysAvoiding deadlocks in distributed systems For example, Uber’s trip service times out requests if a response isn’t received within 2 seconds, ensuring riders don’t wait indefinitely. Below is an example of how to set timeouts in Spring Boot (RestTemplate and WebClient). Java @Bean public RestTemplate restTemplate() { var factory = new SimpleClientHttpRequestFactory(); factory.setConnectTimeout(3000); // 3 seconds factory.setReadTimeout(3000); return new RestTemplate(factory); } 4. Fallback Strategies When a service is down, fallback mechanisms provide alternative responses instead of failing completely. Use cases: Showing cached data when a service is downReturning default recommendations in an e-commerce app Providing a static response when an API is slow For example, YouTube provides trending videos when personalized recommendations fail. Below is an example for implementing Fallback in Resilience4j. Java @Retry(name = "recommendationService") @CircuitBreaker(name = "recommendationService", fallbackMethod = "defaultRecommendations") public List<String> getRecommendations() { return restTemplate.getForObject("http://recommendation-service/api", List.class); } public List<String> defaultRecommendations(Exception e) { return List.of("Popular Movie 1", "Popular Movie 2"); // Generic fallback } 5. Bulkhead Pattern Bulkhead pattern isolates failures by restricting resource consumption per service. This prevents failures from spreading across the system. Use cases: Preventing one failing service from consuming all resourcesIsolating failures in multi-tenant systemsAvoiding memory leaks due to excessive load For example, Airbnb’s booking system ensures that reservation services don’t consume all resources, keeping user authentication operational. Java @Bulkhead(name = "inventoryService", type = Bulkhead.Type.THREADPOOL) public String checkInventory() { return restTemplate.getForObject("http://inventory-service/stock", String.class); } 6. Message Queue for Asynchronous Processing Instead of direct service calls, use message queues (Kafka, RabbitMQ) to decouple microservices, ensuring failures don’t impact real-time operations. Use cases: Decoupling microservices (Order Service → Payment Service)Ensuring reliable event-driven processing Handling traffic spikes gracefully For example, Amazon queues order processing requests in Kafka to avoid failures affecting checkout. Below is an example of using Kafka for order processing. Java @Autowired private KafkaTemplate<String, String> kafkaTemplate; public void placeOrder(Order order) { kafkaTemplate.send("orders", order.toString()); // Send order details to Kafka } 7. Event Sourcing and Saga Pattern When a distributed transaction fails, event sourcing ensures that each step can be rolled back. Banking applications use Saga to prevent money from being deducted if a transfer fails. Below is an example of a Saga pattern for distributed transactions. Java @SagaOrchestrator public void processOrder(Order order) { sagaStep1(); // Reserve inventory sagaStep2(); // Deduct balance sagaStep3(); // Confirm order } 8. Centralized Logging and Monitoring Microservices are highly distributed, without proper logging and monitoring, failures remain undetected until they become critical. In a microservices environment, logs are distributed across multiple services, containers, and hosts. 
A log aggregation tool collects logs from all microservices into a single dashboard, enabling faster failure detection and resolution. Instead of storing logs separately for each service, a log aggregator collects and centralizes logs, helping teams analyze failures in one place. Below is an example of logging in microservices using the ELK stack (Elasticsearch, Logstash, Kibana). YAML logging: level: root: INFO org.springframework.web: DEBUG Best Practices for Failure Handling in Microservices Design for Failure Failures in microservices are inevitable. Instead of trying to eliminate failures completely, anticipate them and build resilience into the system. This means designing microservices to recover automatically and minimize user impact when failures occur. Test Failure Scenarios Most systems are only tested for success cases, but real-world failures happen in unexpected ways. Chaos engineering helps simulate failures to test how microservices handle them. Graceful Degradation In high-traffic scenarios or service failures, the system should prioritize critical features and gracefully degrade less essential functionalities. Prioritize essential services over non-critical ones. Idempotency Ensure retries don’t duplicate transactions. If a microservice retries a request due to a network failure or timeout, it can accidentally create duplicate transactions (e.g., charging a customer twice). Idempotency ensures that repeated requests have the same effect as a single request. Conclusion Failure handling in microservices is not optional — it’s a necessity. By implementing retries, circuit breakers, timeouts, bulkheads, and fallback strategies, you can build resilient and fault-tolerant microservices.
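To make the idempotency best practice above concrete, here is a minimal sketch of an idempotency-key check in a Spring-style payment endpoint. The class, endpoint, and header names are hypothetical, and an in-memory ConcurrentHashMap stands in for what would normally be a shared store such as a database or Redis, so treat it as an illustration of the pattern rather than a production implementation.

Java

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestHeader;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class PaymentController {

    // Maps a client-supplied idempotency key to the result of the first successful call.
    // In production this mapping would live in a shared store so that retries landing on
    // another instance are deduplicated as well.
    private final Map<String, String> processedRequests = new ConcurrentHashMap<>();

    @PostMapping("/payments")
    public ResponseEntity<String> charge(@RequestHeader("Idempotency-Key") String idempotencyKey,
                                         @RequestBody String paymentRequest) {
        // computeIfAbsent runs the charge logic only the first time a key is seen;
        // any retry with the same key receives the stored result instead of a second charge.
        String result = processedRequests.computeIfAbsent(idempotencyKey,
                key -> "charged:" + paymentRequest); // placeholder for the real payment logic
        return ResponseEntity.ok(result);
    }
}

With this in place, a client that retries the same POST after a timeout sends the same Idempotency-Key header and gets the original outcome back, which is exactly the "repeated requests have the same effect as a single request" guarantee described above.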
In the world of software engineering, we’re constantly racing against the clock—deadlines, deployments, and decisions. In this rush, testing often gets sidelined. Some developers see it as optional, or something they’ll “get to later.” But that’s a costly mistake. Because just like documentation, testing is a long-term investment—one that pays off in quality, safety, and peace of mind. Testing is crucial. It’s about ensuring quality, guaranteeing expected behavior, and enabling safe refactoring. Without tests, every change becomes a risk. With tests, change becomes an opportunity to improve. Testing doesn’t just prevent bugs. It shapes the way we build software. It enables confident change, unlocks collaboration, and acts as a form of executable documentation. Tests Are a Guarantee of Behavior At its core, a test is a contract. It tells the system—and anyone reading the code—what should happen when given specific inputs. This contract helps ensure that as the software evolves, its expected behavior remains intact. A system without tests is like a building without smoke detectors. Sure, it might stand fine for now, but the moment something catches fire, there’s no safety mechanism to contain the damage. Testing Supports Safe Refactoring Over time, all code becomes legacy. Business requirements shift, architectures evolve, and what once worked becomes outdated. That’s why refactoring is not a luxury—it’s a necessity. But refactoring without tests? That’s walking blindfolded through a minefield. With a reliable test suite, engineers can reshape and improve their code with confidence. Tests confirm that behavior hasn’t changed—even as the internal structure is optimized. This is why tests are essential not just for correctness, but for sustainable growth. Tests Help Teams Move Faster There’s a common myth: tests slow you down. But seasoned engineers know the opposite is true. Tests speed up development by reducing time spent debugging, catching regressions early, and removing the need for manual verification after every change. They also allow teams to work independently, since tests define and validate interfaces between components. The ROI of testing becomes especially clear over time. It’s a long-term bet that pays exponential dividends. When to Use Mocks (and When Not To) Not every test has to touch a database or external service. That’s where mocks come in. A mock is a lightweight substitute for a real dependency—useful when you want to isolate logic, simulate failures, or verify interactions without relying on complete integration. Use mocks when: You want to test business logic in isolationYou need to simulate rare or hard-to-reproduce scenariosYou want fast, deterministic tests that don’t rely on external state But be cautious: mocking too much can lead to fragile tests that don’t reflect reality. Always complement unit tests with integration tests that use real components to validate your system holistically. A Practical Stack for Java Testing If you're working in Java, here's a battle-tested stack that combines readability, power, and simplicity: JUnit Jupiter JUnit is the foundation for writing structured unit and integration tests. It supports lifecycle hooks, parameterized tests, and extensions with ease. AssertJ This is a fluent assertion library that makes your tests expressive and readable. Instead of writing assertEquals(expected, actual), you write assertThat(actual).isEqualTo(expected)—much more human-friendly. Testcontainers These are perfect for integration tests.
With Testcontainers, you can spin up real databases, message brokers, or services in Docker containers as part of your test lifecycle—no mocks, no fakes—just the real thing, isolated and reproducible. Here’s a simple example of combining all three: Java @Test void shouldPersistGuestInDatabase() { Guest guest = new Guest("Ada Lovelace"); guestRepository.save(guest); List<Guest> guests = guestRepository.findAll(); assertThat(guests).hasSize(1).extracting(Guest::getName).contains("Ada Lovelace"); } This kind of test, when paired with Testcontainers and a real database, gives you confidence that your system works, not just in theory, but in practice. Learn More: Testing Java Microservices For a deeper dive into testing strategies—including contract testing, service virtualization, and containerized tests—check out Testing Java Microservices. It’s an excellent resource that aligns with modern practices and real-world challenges. Understanding the Value of Metrics in Testing Once tests are written and passing, a natural follow-up question arises: how do we know they're doing their job? In other words, how can we be certain that our tests are identifying genuine problems, rather than merely giving us a false sense of security? This is where testing metrics come into play—not as final verdicts, but as tools for better judgment. Two of the most common and impactful metrics in this space are code coverage and mutation testing. Code coverage measures how much of your source code is executed when your tests run. It’s often visualized as a percentage and can be broken down by lines, branches, methods, or even conditions. The appeal is obvious: it gives a quick sense of how thoroughly the system is being exercised. But while coverage is easy to track, it’s just as easy to misunderstand. The key limitation of code coverage is that it indicates where the code executes, but not how effectively it is being executed. A line of code can be executed without a single meaningful assertion. This means a project with high coverage might still be fragile underneath—false confidence is a real risk. That’s where mutation testing comes in. This approach works by introducing small changes—known as mutants—into the code, such as flipping a conditional or changing an arithmetic operator. The test suite is then rerun to see whether it detects the change. If the tests fail, the mutant is considered “killed,” indicating that the test is practical. If they pass, the mutant “survives,” exposing a weakness in the test suite. Mutation testing digs into test quality in a way coverage cannot. It challenges the resilience of your tests and asks: Would this test catch a bug if the logic were to break slightly? Of course, this comes with a cost. Mutation testing is slower and more computationally intensive. On large codebases, it can take considerable time to run, and depending on the granularity and mutation strategy, the results can be noisy or overwhelming. That’s why it’s best applied selectively—used on complex business logic or critical paths where the risk of undetected bugs is high. Now here’s where things get powerful: coverage and mutation testing aren’t competing metrics—they’re complementary. Coverage helps you identify what parts of your code aren't being tested at all. Mutation testing indicates how well the tested parts are protected. Used together, they offer a fuller picture: breadth from coverage, and depth from mutation. But even combined, they should not become the ultimate goal. 
Metrics exist to serve understanding, not to replace it. Chasing a 100% mutation score or full coverage can lead to unrealistic expectations or, worse, wasted effort on tests that don’t matter. What truly matters is having enough coverage and confidence in the parts of the system that are hard to change or essential to your business. In the end, the most valuable metric is trust: trust that your system behaves as expected, trust that changes won’t break things silently, and trust that your test suite is more than a checkbox—it’s a safety net that allows you to move fast without fear. Coverage and mutation testing, when used wisely, help you build and maintain that trust. Final Thoughts: Test Like a Professional Testing is more than a safety net; it’s a form of engineering craftsmanship. It’s how we communicate, refactor, scale, and collaborate without fear. So, treat tests like you treat production code—because they are. They’re your guarantee that what works today still works tomorrow. And in the ever-changing world of software, that’s one of the most valuable guarantees you can have.
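To make the coverage-versus-mutation distinction from the previous section concrete, here is a small, hypothetical JUnit 5 and AssertJ example (the class names are invented for illustration). Both tests execute every line of isAdult, so a coverage report treats them identically; only the second test would kill a conditional-boundary mutant (for example, >= mutated to >) of the kind a mutation testing tool such as PIT generates.

Java

import static org.assertj.core.api.Assertions.assertThat;

import org.junit.jupiter.api.Test;

class AgePolicy {
    // Returns true for ages 18 and above.
    boolean isAdult(int age) {
        return age >= 18;
    }
}

class AgePolicyTest {

    private final AgePolicy policy = new AgePolicy();

    @Test
    void weakTestOnlyExercisesTheCode() {
        // Executes the line (full coverage), but if >= were mutated to >,
        // isAdult(30) would still return true and this test would still pass.
        assertThat(policy.isAdult(30)).isTrue();
    }

    @Test
    void boundaryTestKillsTheMutant() {
        // Pins the boundary: under the >= to > mutation, isAdult(18) becomes false,
        // so this test fails and the mutant is killed.
        assertThat(policy.isAdult(18)).isTrue();
        assertThat(policy.isAdult(17)).isFalse();
    }
}

Both tests look the same to a coverage report, but only the second actually protects the behavior, which is the breadth-versus-depth point made above.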
Since the software industry is evolving at an incredibly fast speed, there have been many developments in terms of frameworks and different component libraries to enhance the performance, functionalities, and features of applications. And the name Node.js has come among the top ones as it has taken the app development world by storm. Just to clarify, Node.js is not a framework, but it is an open-source JavaScript runtime environment that can be operated on Linux, Unix, macOS, and Windows as well. After its initial release in 2009, Node.js has dominated the software industry. There are many benefits of using Node.js, such as: It can help you create lightning-fast applications.Node.js enables developers to build server-side and frontend JavaScript code more easily.Node.js is a flexible platform, meaning that developers can effortlessly add and remove modules as per their requirements. Further, you can make the most of Node.js by using it with React. It enables you to build server-side rendering applications, thus boosting the performance, reducing page load times, and building better user interfaces. On the other hand, building an app with Node.js is not a piece of cake since there can be many complications in the process. Therefore, it is necessary that you take care of those complications with the right development strategies to build scalable and powerful Node.js applications. 1. Using Node.js Module System Node.js has an effective module system. These modules are the chunks of encapsulated code. Developers can use this system as it offers reusability and the ability to split those complex blocks of code into more manageable and smaller parts. Moreover, you can organize or manage your code into modules with simplicity. Thus, splitting up those apps into smaller modules can also help developers test and maintain them easily. There are three types of modules: core modules, third-party modules, and custom modules. Core Modules In Node.js, core modules are built-in modules and are an essential part of the Node.js platform, providing various important functionalities and features. You don’t need to install any external packages for them since they are readily available. HTTP server (HTTP), file system operations (fs), utilities (util), Path (path), Event Emitter (events), and URL (URL) are some of the famous core modules. Other than these, Node.js also offers plenty more core modules. For example, 'crypto' can be used for cryptography functionality or features, and when it comes to streaming data, the ‘stream’ core module is going to work wonders. So, it is obvious that these core modules enable developers to use different strong features and abilities in order to make flexible and scalable apps. Third-Party Modules The Node.js community has developed these third-party modules. Third-party modules are easily available on package registries like npm (Node Package Manager). Express, Mongoose, Async, Helmet, etc., are some of the most famous third-party modules. You can easily install these modules and add them to your application by using require (). Custom Modules There is no doubt that the built-in core modules in Node.js offer various benefits. But in terms of providing flexibility as per the project’s particular requirements, custom modules are the best choice for developers. Custom modules work by encapsulating particular functionality or features in an application. 
These modules make it easy for developers to maintain code, enabling smooth reusability; therefore, also strengthening the maintainability of your application. Developers also build and use custom modules in order to make sure that the code is modular; thus, streamlining the process of easily understanding, testing, and refactoring. What Are the Advantages of Modules? Modules enable the encapsulation of code, so as a result, developers are able to conceal the specifics of the execution process and display the vital features, interfaces, and functionalities. Using modules streamlines the process of organizing your code easily into smaller and feasible units; hence, ultimately scaling apps that are more complicated or complex in nature.Moreover, it is easy to maintain and refactor the modular code because if there is any change, update, or any kind of modification in the implementation of the modules, this will not impact the app as a whole. Further, these modules enable code reusability since developers can take advantage of reusing modules in various sections of an app. So, this proves that modules are very useful to implement in the development process. They make it easy for Node.js developers to maintain or organize code. Further, modules allow for maximum reusability of code; therefore, enhancing the performance of applications. 2. Error Handling in Node.js In Node.js, errors can be split into two main categories: programmer errors and operational errors. So, first, let’s talk about programmer errors: Programmer Errors Often, while programming, programmers make errors, and most of the time, programmers are not able to handle these errors completely. If they want to fix these errors, then they can only be corrected by fixing the codebase. Here you can take a look at the most common programming errors: Syntax errors: These types of errors happen if you do not close the curly braces in the process of defining a JavaScript function. Array index out of bounds: This common error happens when you want to have a 7th element of the array, while on the other hand, there are only six available. Reference errors: This is the most common error that happens only when you access functions that are not well-defined. Operational Errors It does not matter if your program is correct, however it will have operational errors. These issues can happen during runtime, and not to mention, external factors can contribute to interrupting the regular flow of your program. However, developers can better understand and handle operational errors than programmer errors. Here are the most common operational errors: Socket hang-upRequest timeoutFile not found Unable to connect to a server Developers can get rid of these common errors to prevent the sudden ending or closing of their program if they know the best practices to handle these errors. Now, let’s discuss the best practices that will significantly help developers handle these errors: Try-Catch Block You can use try-catch blocks, which are a simple method to handle those errors in your Node.js app development. This technique is very useful since you can try coding in the try block. And in case you find any error in the catch block, then the catch block will eliminate those errors. Try-catch blocks are a very constructive technique for synchronous functions as well. While using the Try-Catch Block, the try block wraps the code where there are possibilities of code errors. 
Putting it into simple words, you just have to surround the piece of code for which you want to check the errors. Error-First Callbacks In order to build strong and reliable applications, it is a must for Node.js developers to do complete error handling. The Error-first callback in Node.js is a functionality that works by returning an error object while there is any successful data that is carried back by the function. Moreover, Asynchronous programming is a vital feature of Node.js development as it enables Node.js developers to conduct non-blocking I/O operations; thus, ultimately boosting applications’ performance. But, when it comes to handling errors in asynchronous programming, this is where the real challenges lie. Because you don’t know where an error might happen in the procedure, it is difficult for you to find and fix it. In situations like these, you can use Error-first callbacks as it is the regular technique that helps in error handling for mainly asynchronous code as well as in the callback pattern. The Error-first callback in Node.js is a functionality that works by returning an error object, even if there is any successful data carried back by the function. Using this best practice, developers can find errors as they happen and solve them in an effective and well-organized way. For example, the error object includes details of the error, like a description, as well as a stack trace. As a result, it streamlines the process of debugging and fixing code issues. On top of this, you can write defensive code by using error-first callbacks. So, developers can further enhance the performance and stability of their apps, as they can identify potential errors and solve them accordingly. Thus, in the end, it helps to minimize the possible scenarios of crashes. Promises Promises are a further evolution of callbacks in Node.js. This is one of the best practices in the process of error handling in Node.js since it offers a very organized process to handle those asynchronous codes. And it is considered even better when we compare it with the traditional callback functions. The promise constructor enables developers to easily build a promise and takes a function as its argument, which is later known as the executor functionality. Then, further in the process, the executor functionality is passed into two functions, such as resolve and reject. These two features or functionalities are utilized to indicate that the Promise is fulfilled or rejected. 3. Using Asynchronous Programming Asynchronous programming allows developers to operate various tasks simultaneously. So, there is no need to wait for the completion of the task in order to start another one. Most of the I/O operations, such as reading from a database, making HTTP requests, etc, are asynchronous in the Node.js platform. As a Node.js developer, you can use Asynchronous programming that enables improved performance, scalability, as well as interactive user interfaces. You can take full advantage by using its callback function. The main functionality of a callback is to run code in response to an event. Callbacks are a great system in Node.js that assists in handling asynchronous programming. A callback is a functionality that passes an argument to another function. And when there is a completion of an operation, then it is executed further. Thus, there will be full continuity of executing code without any blocking. 4. 
Use NPM Package Manager Being a default package manager for Node.js, NPM is best for its amazing integration with Node.js. By using this package manager, developers are able to streamline the process of installing, managing, and sharing code dependencies. Java developers can take advantage of its extensive repository that has more than 2 million packages. Benefits of Using NPM Package Manager Now, let’s talk about the different benefits NPM can offer to developers: Wide Range of Package Repositories Available Developers prefer to use NPM as you can’t compare its extraordinary repository housing over 2 million packages. Further, these packages provide extensive varieties of functions as well as use cases. Developers can also speed up the process of development since they can easily access a wide range of open-source libraries as well as modules. NPM CLI As NPM is referred to as a Node Package Manager, the CLI is referred to as a Command Line Interface. Using the command line means accessing Node.js packages as well as dependencies. CLI has various commands to manage scripts, configurations, and packages. Moreover, by using CLI, you as a developer will be able to do the tasks effectively and easily, which can include installation, updates, as well as scripting. Custom Scripts By utilizing the NPM run command, developers can easily define custom scripts in the “package.json” file. Moreover, this function allows the automation of different development procedures that can range from building, testing, and deploying to simplifying the development workflow. Well-Established Ecosystem Since NPM has a strong infrastructure as well as great community support, developers can take advantage of its well-organized and well-established environment. The community and the NPM ecosystem have been utilized for a long time and have seen significant improvements over the years. Therefore, they are considered steady, secure, and authentic tools to smoothly manage project dependencies. NPM is a powerful package manager that enables developers to solve the problem while building Node.js apps. From its extensive range of package repositories to a well-established ecosystem, developers prefer to use the NPM package manager. Not to mention, some developers try to use both Yarn and NPM in their projects, but it is not an effective thing to do. Because NPM and Yarn both work in their own way when it comes to handling dependencies, as a result of this, you will find lots of volatility as well as inaccuracies in your project. Thus, it is best to use only one package manager while working on a Node.js application. 5. Securing Node.js Application As a developer you want to secure your application, therefore you must implement secure coding. Nevertheless, you can’t be 100% sure of the security of your code if you are utilizing open-source packages. Moreover, hackers are always looking for any badly handled data in order to get access to your codebase. Use a Robust Authentication System If your Node.js has a bad and imperfect authentication system in place, then your app is more vulnerable to attacks by hackers. Therefore, it becomes more than important that you implement a robust authentication mechanism. For example, while implementing native Node.js authentication, you must use Bcrypt or Scrypt instead of using the Node.js integrated crypto library. Further, you must limit unsuccessful login attempts and don’t show that your username or password is wrong; instead, show a common error like a failed login or attempt. 
If you want to significantly improve the security and safety of your Node.js application, then 2FA authentication is the best solution to use with modules like node-2fa or speakeasy. Server-Side Logging Developers can use effective features and functions for debugging and tracking other tasks by using a quality logging library. If there are many logs, then it will negatively affect the performance of your application. For example, when deploying an app, make sure to use logical logging. On the other hand, if you log indistinct messages, it will create misconceptions. Therefore, make sure to structure the log messages and format them in a way that both humans and machines can easily read and recognize them. In addition to this, using reasonable logging mechanisms will include IP address, username as well as activities carried out by a user. However, storing sensitive details within the logs means you are not complying with PCI and GDPR. But, if it is important to store sensitive information within app logs, then concealing such details before capturing and writing them into the application logs will be the best solution for you as a Node.js developer. You can use these best practices and authentication mechanisms to secure your Node.js application. Not to mention, you should also work on upgrading and managing those dependencies so that you are able to avoid the risk of any security threats to your Node.js application. 5 Key Takeaways Node.js Module System offers chunks of encapsulated code, enabling developers to easily divide complicated code into tiny parts. Thus, it is easy to manage those small parts of your code. While programming, programmers tend to make mistakes most of the time. Moreover, your program can also have operation errors such as Socket hang-up, Request Timeout, File not Found, etc. Programmers can solve such errors by using Try-Catch Block, Error-First Callbacks, and Promises. Using Asynchronous programming enables developers to do multiple tasks at the same time. Furthermore, you can take advantage of its callback function, which can help you a lot while using asynchronous programming. Further, NPM Package Manager offers an extensive range of package repositories to boost the development process. On top of this, developers can also use NPM CLI to streamline the process of installation, updation, and scripting. In order to improve the security of your Node.js app, you can use a strong authentication system like 2FA authentication. This authentication system works best with modules such as node-2fa or speakeasy. In addition to this, you can also implement reasonable logging mechanisms that will help store the IP address, username, and other activities done by the end user. Therefore, using these advanced, well-tested practices such as using Node.js Module System, prioritizing error handling, implementing asynchronous programming language, updating dependencies, and securing Node.js applications, you will be well on your way to developing faster, responsive, robust, and better-performing Node.js applications.
Welcome back to the third — and final — installment in our series on how to work with the curses library in Python to draw with text. If you missed the first two parts of this programming tutorial series — or if you wish to reference the code contained within them — you can read them here: "Python curses, Part 1: Drawing With Text" "Python curses, Part 2: How to Create a Python curses-Enabled Application" Once reviewed, let’s move on to the next portion: how to decorate windows with borders and boxes using Python’s curses module. Decorating Windows With Borders and Boxes Windows can be decorated using custom values as well as a default “box” adornment in Python. This can be accomplished using the window.box() and window.border(…) functions. The Python code example below creates a red 5×5 window and then alternates displaying and clearing the border on each key press: Python # demo-window-border.py import curses import math import sys def main(argv): # BEGIN ncurses startup/initialization... # Initialize the curses object. stdscr = curses.initscr() # Do not echo keys back to the client. curses.noecho() # Non-blocking or cbreak mode... do not wait for Enter key to be pressed. curses.cbreak() # Turn off blinking cursor curses.curs_set(False) # Enable color if we can... if curses.has_colors(): curses.start_color() # Optional - Enable the keypad. This also decodes multi-byte key sequences # stdscr.keypad(True) # END ncurses startup/initialization... caughtExceptions = "" try: # Create a 5x5 window in the center of the terminal window, and then # alternate displaying a border and not on each key press. # We don't need to know where the approximate center of the terminal # is, but we do need to use the curses terminal size constants to # calculate the X, Y coordinates of where we can place the window in # order for it to be roughly centered. topMostY = math.floor((curses.LINES - 5)/2) leftMostX = math.floor((curses.COLS - 5)/2) # Place a caption at the bottom left of the terminal indicating # action keys. stdscr.addstr (curses.LINES-1, 0, "Press Q to quit, any other key to alternate.") stdscr.refresh() # We're just using white on red for the window here: curses.init_pair(1, curses.COLOR_WHITE, curses.COLOR_RED) index = 0 done = False while False == done: # If we're on the first iteration, let's skip straight to creating the window. if 0 != index: # Grabs a value from the keyboard without Enter having to be pressed. ch = stdscr.getch() # Need to match on both upper-case or lower-case Q: if ch == ord('Q') or ch == ord('q'): done = True mainWindow = curses.newwin(5, 5, topMostY, leftMostX) mainWindow.bkgd(' ', curses.color_pair(1)) if 0 == index % 2: mainWindow.box() else: # There's no way to "unbox," so blank out the border instead. mainWindow.border(' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ') mainWindow.refresh() stdscr.addstr(0, 0, "Iteration [" + str(index) + "]") stdscr.refresh() index = 1 + index except Exception as err: # Just printing from here will not work, as the program is still set to # use ncurses. # print ("Some error [" + str(err) + "] occurred.") caughtExceptions = str(err) # BEGIN ncurses shutdown/deinitialization... # Turn off cbreak mode... curses.nocbreak() # Turn echo back on. curses.echo() # Restore cursor blinking. curses.curs_set(True) # Turn off the keypad... # stdscr.keypad(False) # Restore Terminal to original state. curses.endwin() # END ncurses shutdown/deinitialization... 
# Display Errors if any happened: if "" != caughtExceptions: print ("Got error(s) [" + caughtExceptions + "]") if __name__ == "__main__": main(sys.argv[1:]) This code was run over an SSH connection, so there is an automatic clearing of the screen upon its completion. The border “crops” the inside of the window, and any text that is placed within the window must be adjusted accordingly. And as the call to the window.border(…) function suggests, any character can be used for the border. The code works by waiting for a key to be pressed. If either Q or Shift+Q is pressed, the termination condition of the loop will be activated and the program will quit. Note that, pressing the arrow keys may return key presses and skip iterations. How to Update Content in “Windows” With Python curses Just as is the case with traditional graphical windowed programs, the text content of a curses window can be changed. And just as is the case with graphical windowed programs, the old content of the window must be “blanked out” before any new content can be placed in the window. The Python code example below demonstrates a digital clock that is centered on the screen. It makes use of Python lists to store sets of characters which when displayed, look like large versions of digits. A brief note: The code below is not intended to be the most efficient means of displaying a clock; rather, it is intended to be a more portable demonstration of how curses windows are updated. Python # demo-clock.py # These list assignments can be done on single lines, but it's much easier to see what # these values represent by doing it this way. space = [ " ", " ", " ", " ", " ", " ", " ", " ", " ", " "] colon = [ " ", " ", " ::: ", " ::: ", " ", " ", " ::: ", " ::: ", " ", " "] forwardSlash = [ " ", " //", " // ", " // ", " // ", " // ", " // ", " // ", "// ", " "] number0 = [ " 000000 ", " 00 00 ", " 00 00 ", " 00 00 ", " 00 00 ", " 00 00 ", " 00 00 ", " 00 00 ", " 00 00 ", " 000000 "] number1 = [ " 11 ", " 111 ", " 1111 ", " 11 ", " 11 ", " 11 ", " 11 ", " 11 ", " 11 ", " 111111 "] number2 = [ " 222222 ", " 22 22 ", " 22 22 ", " 22 ", " 22 ", " 22 ", " 22 ", " 22 ", " 22 ", " 22222222 "] number3 = [ " 333333 ", " 33 33 ", " 33 33 ", " 33 ", " 3333 ", " 33 ", " 33 ", " 33 33 ", " 33 33 ", " 333333 "] number4 = [ " 44 ", " 444 ", " 4444 ", " 44 44 ", " 44 44 ", "444444444 ", " 44 ", " 44 ", " 44 ", " 44 "] number5 = [ " 55555555 ", " 55 ", " 55 ", " 55 ", " 55555555 ", " 55 ", " 55 ", " 55 ", " 55 ", " 55555555 "] number6 = [ " 666666 ", " 66 66 ", " 66 ", " 66 ", " 6666666 ", " 66 66 ", " 66 66 ", " 66 66 ", " 66 66 ", " 666666 "] number7 = [ " 77777777 ", " 77 ", " 77 ", " 77 ", " 77 ", " 77 ", " 77 ", " 77 ", " 77 ", " 77 "] number8 = [ " 888888 ", " 88 88 ", " 88 88 ", " 88 88 ", " 888888 ", " 88 88 ", " 88 88 ", " 88 88 ", " 88 88 ", " 888888 "] number9 = [ " 999999 ", " 99 99 ", " 99 99 ", " 99 99 ", " 999999 ", " 99 ", " 99 ", " 99 ", " 99 99 ", " 999999 "] import curses import math import sys import datetime def putChar(windowObj, inChar, inAttr = 0): #windowObj.box() #windowObj.addstr(inChar) # The logic below maps the normal character input to a list which contains a "big" # representation of that character. 
charToPut = "" if '0' == inChar: charToPut = number0 elif '1' == inChar: charToPut = number1 elif '2' == inChar: charToPut = number2 elif '3' == inChar: charToPut = number3 elif '4' == inChar: charToPut = number4 elif '5' == inChar: charToPut = number5 elif '6' == inChar: charToPut = number6 elif '7' == inChar: charToPut = number7 elif '8' == inChar: charToPut = number8 elif '9' == inChar: charToPut = number9 elif ':' == inChar: charToPut = colon elif '/' == inChar: charToPut = forwardSlash elif ' ' == inChar: charToPut = space lineCount = 0 # This loop will iterate each line in the window to display a "line" of the digit # to be displayed. for line in charToPut: # Attributes, or the bitwise combinations of multiple attributes, are passed as-is # into addstr. Note that not all attributes, or combinations of attributes, will # work with every terminal. windowObj.addstr(lineCount, 0, charToPut[lineCount], inAttr) lineCount = 1 + lineCount windowObj.refresh() def main(argv): # Initialize the curses object. stdscr = curses.initscr() # Do not echo keys back to the client. curses.noecho() # Non-blocking or cbreak mode... do not wait for Enter key to be pressed. curses.cbreak() # Turn off blinking cursor curses.curs_set(False) # Enable color if we can... if curses.has_colors(): curses.start_color() # Optional - Enable the keypad. This also decodes multi-byte key sequences # stdscr.keypad(True) caughtExceptions = "" try: # First things first, make sure we have enough room! if curses.COLS <= 88 or curses.LINES <= 11: raise Exception ("This terminal window is too small.rn") currentDT = datetime.datetime.now() hour = currentDT.strftime("%H") min = currentDT.strftime("%M") sec = currentDT.strftime("%S") # Depending on how the floor values are calculated, an extra character for each # window may be needed. This code crashed when the windows were set to exactly # 10x10 topMostY = math.floor((curses.LINES - 11)/2) leftMostX = math.floor((curses.COLS - 88)/2) # Note that print statements do not work when using ncurses. If you want to write # to the terminal outside of a window, use the stdscr.addstr method and specify # where the text will go. Then use the stdscr.refresh method to refresh the # display. stdscr.addstr(curses.LINES-1, 0, "Press a key to quit.") stdscr.refresh() # Boxes - Each box must be 1 char bigger than stuff put into it. hoursLeftWindow = curses.newwin(11, 11, topMostY,leftMostX) putChar(hoursLeftWindow, hour[0:1]) hoursRightWindow = curses.newwin(11, 11, topMostY,leftMostX+11) putChar(hoursRightWindow, hour[-1]) leftColonWindow = curses.newwin(11, 11, topMostY,leftMostX+22) putChar(leftColonWindow, ':', curses.A_BLINK | curses.A_BOLD) minutesLeftWindow = curses.newwin(11, 11, topMostY, leftMostX+33) putChar(minutesLeftWindow, min[0:1]) minutesRightWindow = curses.newwin(11, 11, topMostY, leftMostX+44) putChar(minutesRightWindow, min[-1]) rightColonWindow = curses.newwin(11, 11, topMostY, leftMostX+55) putChar(rightColonWindow, ':', curses.A_BLINK | curses.A_BOLD) leftSecondWindow = curses.newwin(11, 11, topMostY, leftMostX+66) putChar(leftSecondWindow, sec[0:1]) rightSecondWindow = curses.newwin(11, 11, topMostY, leftMostX+77) putChar(rightSecondWindow, sec[-1]) # One of the boxes must be non-blocking or we can never quit. hoursLeftWindow.nodelay(True) while True: c = hoursLeftWindow.getch() # In non-blocking mode, the getch method returns -1 except when any key is pressed. 
if -1 != c: break currentDT = datetime.datetime.now() currentDTUsec = currentDT.microsecond # Refreshing the clock "4ish" times a second may be overkill, but doing # on every single loop iteration shoots active CPU usage up significantly. # Unfortunately, if we only refresh once a second it is possible to # skip a second. # However, this type of restriction breaks functionality in Windows, so # for that environment, this has to run on Every. Single. Iteration. if 0 == currentDTUsec % 250000 or sys.platform.startswith("win"): hour = currentDT.strftime("%H") min = currentDT.strftime("%M") sec = currentDT.strftime("%S") putChar(hoursLeftWindow, hour[0:1], curses.A_BOLD) putChar(hoursRightWindow, hour[-1], curses.A_BOLD) putChar(minutesLeftWindow, min[0:1], curses.A_BOLD) putChar(minutesRightWindow, min[-1], curses.A_BOLD) putChar(leftSecondWindow, sec[0:1], curses.A_BOLD) putChar(rightSecondWindow, sec[-1], curses.A_BOLD) # After breaking out of the loop, we need to clean up the display before quitting. # The code below blanks out the subwindows. putChar(hoursLeftWindow, ' ') putChar(hoursRightWindow, ' ') putChar(leftColonWindow, ' ') putChar(minutesLeftWindow, ' ') putChar(minutesRightWindow, ' ') putChar(rightColonWindow, ' ') putChar(leftSecondWindow, ' ') putChar(rightSecondWindow, ' ') # De-initialize the window objects. hoursLeftWindow = None hoursRightWindow = None leftColonWindow = None minutesLeftWindow = None minutesRightWindow = None rightColonWindow = None leftSecondWindow = None rightSecondWindow = None except Exception as err: # Just printing from here will not work, as the program is still set to # use ncurses. # print ("Some error [" + str(err) + "] occurred.") caughtExceptions = str(err) # End of Program... # Turn off cbreak mode... curses.nocbreak() # Turn echo back on. curses.echo() # Restore cursor blinking. curses.curs_set(True) # Turn off the keypad... # stdscr.keypad(False) # Restore Terminal to original state. curses.endwin() # Display Errors if any happened: if "" != caughtExceptions: print ("Got error(s) [" + caughtExceptions + "]") if __name__ == "__main__": main(sys.argv[1:]) Checking Window Size Note how the first line within the try block in the main function checks the size of the terminal window and raises an exception should it not be sufficiently large enough to display the clock. This is a demonstration of “preemptive” error handling, as if the individual window objects are written to a screen which is too small, a very uninformative exception will be raised. Cleaning Up Windows With curses The example above forces a cleanup of the screen for all 3 operating environments. This is done using the putChar(…) function to print a blank space character to each window object upon breaking out of the while loop. The objects are then set to None. Cleaning up window objects in this manner can be a good practice when it is not possible to know all the different terminal configurations that the code could be running on, and having a blank screen on exit gives these kinds of applications a cleaner look overall. CPU Usage Like the previous code example, this too works as an “infinite” loop in the sense that it is broken by a condition that is generated by pressing any key. Showing two different ways to break the loop is intentional, as some developers may lean towards one method or another. Note that this code results in extremely high CPU usage because, when run within a loop, Python will consume as much CPU time as it possibly can. 
Normally, the sleep(…) function is used to pause execution, but in the case of implementing a clock, this may not be the best way to reduce overall CPU usage. Interestingly enough, the CPU usage reported by the Windows Task Manager for this process is only about 25%, compared to 100% in Linux. Another interesting observation about CPU usage in Linux: even when simulating significant CPU load by way of the stress utility, as per the command below: Shell $ stress -t 30 -c 16 The demo-clock.py script was still able to run without losing the proper time. Going Further With Python curses This three-part introduction only barely scratches the surface of the Python curses module, but with this foundation, the task of creating robust user interfaces for text-based Python applications becomes quite doable, even for a novice developer. The only downsides are having to account for how individual terminal emulation implementations can impact the code (rarely a significant impediment) and, of course, having to deal with the math involved in keeping window objects properly sized and positioned. The Python curses module does provide mechanisms for “moving” windows (albeit not very well natively, but this can be mitigated), as well as resizing windows and even compensating for changes in the terminal window size! Even complex text-based games can be (and have been) implemented using the Python curses module, or its underlying ncurses C/C++ libraries. The complete documentation for the Python curses module can be found in the "curses — Terminal handling for character-cell displays" section of the Python documentation. As the Python curses module uses syntax that is “close enough” to the underlying ncurses C/C++ libraries, the manual pages and reference resources for those libraries can also be consulted for more information. Happy “faux” Windowed Programming!
Monitoring containerized applications in Kubernetes environments is essential for ensuring reliability and performance. Azure Monitor Application Insights provides powerful application performance monitoring capabilities that can be integrated seamlessly with Azure Kubernetes Service (AKS). This article focuses on auto-instrumentation, which allows you to collect telemetry from your applications running in AKS without modifying your code. We'll explore a practical implementation using the monitoring-demo-azure repository as our guide. What Is Auto-Instrumentation? Auto-instrumentation is a feature that enables Application Insights to automatically collect telemetry, such as metrics, requests, and dependencies, from your applications. As described in Microsoft documentation, "Auto-instrumentation automatically injects the Azure Monitor OpenTelemetry Distro into your application pods to generate application monitoring telemetry" [1]. The key benefits include: No code changes requiredConsistent telemetry collection across servicesEnhanced visibility with Kubernetes-specific contextSimplified monitoring setup Currently, AKS auto-instrumentation supports (this is currently in preview as of Apr 2025) JavaNode.js How Auto-Instrumentation Works The auto-instrumentation process in AKS involves: Creating a custom resource of type Instrumentation in your Kubernetes clusterThe resource defines which language platforms to instrument and where to send telemetryAKS automatically injects the necessary components into application podsTelemetry is collected and sent to your Application Insights resource Demo Implementation Using monitoring-demo-azure The monitoring-demo-azure repository provides a straightforward example of setting up auto-instrumentation in AKS. The repository contains a k8s directory with the essential files needed to demonstrate this capability. Setting Up Your Environment Before applying the example files, ensure you have: An AKS cluster running in AzureA workspace-based Application Insights resourceAzure CLI version 2.60.0 or greater Run the following commands to prepare your environment: Shell # Install the aks-preview extension az extension add --name aks-preview # Register the auto instrumentation feature az feature register --namespace "Microsoft.ContainerService" --name "AzureMonitorAppMonitoringPreview" # Check registration status az feature show --namespace "Microsoft.ContainerService" --name "AzureMonitorAppMonitoringPreview" # Refresh the registration az provider register --namespace Microsoft.ContainerService # Enable Application Monitoring on your cluster az aks update --resource-group <resource_group> --name <cluster_name> --enable-azure-monitor-app-monitoring Key Files in the Demo Repository The demo repository contains three main Kubernetes manifest files in the k8s directory: 1. namespace.yaml Creates a dedicated namespace for the demonstration: YAML apiVersion: v1 kind: Namespace metadata: name: demo-namespace 2. 
auto.yaml This is the core file that configures auto-instrumentation: YAML apiVersion: monitor.azure.com/v1 kind: Instrumentation metadata: name: default namespace: demo-namespace spec: settings: autoInstrumentationPlatforms: - Java - NodeJs destination: applicationInsightsConnectionString: "InstrumentationKey=your-key;IngestionEndpoint=https://your-location.in.applicationinsights.azure.com/" The key components of this configuration are: autoInstrumentationPlatforms: Specifies which languages to instrument (Java and Node.js in this case)destination: Defines where to send the telemetry (your Application Insights resource) 3. The Deployment Manifests The three services can be deployed using the three YAML files in the k8s folder. In this case, I used Automated Deployments to create the images and deploy them into the AKS cluster. Notice that the deployment files don't contain any explicit instrumentation configuration. The auto-instrumentation is entirely handled by the Instrumentation custom resource. Deploying the Demo Deploy the demo resources in the following order: Shell # Apply the namespace first kubectl apply -f namespace.yaml # Apply the instrumentation configuration kubectl apply -f auto.yaml # Deploy the application # Optional: Restart any existing deployments to apply instrumentation kubectl rollout restart deployment/<deployment-name> -n demo-namespace Verifying Auto-Instrumentation After deployment, you can verify that auto-instrumentation is working by: Generating some traffic to your applicationNavigating to your Application Insights resource in the Azure portalLooking for telemetry with Kubernetes-specific metadata Key Visualizations in Application Insights Once your application is sending telemetry, Application Insights provides several powerful visualizations: Application Map The Application Map shows the relationships between your services and their dependencies. For Kubernetes applications, this visualization displays how your microservices interact within the cluster and with external dependencies. The map shows: Service relationships with connection linesHealth status for each componentPerformance metrics like latency and call volumesKubernetes-specific context (like pod names and namespaces) Performance View The Performance view breaks down response times and identifies bottlenecks in your application. For containerized applications, this helps pinpoint which services might be causing performance issues. You can: See operation durations across servicesIdentify slow dependenciesAnalyze performance by Kubernetes workloadCorrelate performance with deployment events Failures View The Failures view aggregates exceptions and failed requests across your application. For Kubernetes deployments, this helps diagnose issues that might be related to the container environment. The view shows: Failed operations grouped by typeException patterns and trendsDependency failuresContainer-related issues (like resource constraints) Live Metrics Stream Live Metrics Stream provides real-time monitoring with near-zero latency. This is particularly useful for: Monitoring deployments as they happenTroubleshooting production issues in real timeObserving the impact of scaling operationsValidating configuration changes Conclusion Auto-instrumentation in AKS with Application Insights provides a streamlined way to monitor containerized applications without modifying your code.
The monitoring-demo-azure repository offers a minimal, practical example that demonstrates: How to configure auto-instrumentation in AKSThe pattern for separating instrumentation configuration from application deploymentThe simplicity of adding monitoring to existing applications By leveraging this approach, you can quickly add comprehensive monitoring to your Kubernetes applications and gain deeper insights into their performance and behavior. References [1] Azure Monitor Application Insights Documentation [2] Auto-Instrumentation Overview [3] GitHub: monitoring-demo-azure
In the field of big data analytics, Apache Doris and Elasticsearch (ES) are frequently utilized for real-time analytics and retrieval tasks. However, their design philosophies and technical focuses differ significantly. This article offers a detailed comparison across six dimensions: core architecture, query language, real-time capabilities, application scenarios, performance, and enterprise practices. 1. Core Design Philosophy: MPP Architecture vs. Search Engine Architecture Apache Doris employs a typical MPP (Massively Parallel Processing) distributed architecture, tailored for high-concurrency, low-latency real-time online analytical processing (OLAP) scenarios. It comprises front-end and back-end components, leveraging multi-node parallel computing and columnar storage to efficiently manage massive datasets. This design enables Doris to deliver query results in sub-seconds, making it ideal for complex aggregations and analytical queries on large datasets. In contrast, Elasticsearch is based on a full-text search engine architecture, utilizing a sharding and inverted index design that prioritizes rapid text retrieval and filtering. ES stores data as documents, with each field indexed via an inverted index, excelling in keyword searches and log queries. However, it struggles with complex analytics and large-scale aggregation computations. The core architectural differences are summarized below: Architectural Philosophy Apache Doris (MPP Analytical Database) Elasticsearch (Distributed Search Engine) Design Intent Geared toward real-time data warehousing/BI, supporting high-throughput parallel computing OLAP engine; emphasizes high-concurrency aggregation queries and low latency Focused on full-text search/log retrieval, built on Lucene’s inverted index; excels at keyword search and filtering, primarily a search engine despite structured query support Data Storage Columnar storage with column-encoded compression, achieving high compression ratios (5-10×) to save space; supports multiple table models (Duplicate, Aggregate, Unique) with pre-aggregation during writes Document storage , with inverted indexes per field (low compression ratio, ~1.5×); schema changes are challenging post-index creation, requiring reindexing for field additions or modifications Scalability and Elasticity Shared-nothing node design for easy linear scaling; supports strict read-write separation and multi-tenant isolation; version 3.0 introduces storage-compute separation for elastic scaling Scales via shard replicas but is constrained by single-node memory and JVM GC limits, risking memory shortages during large queries; thread pool model offers limited isolation Typical Features Fully open-source (Apache 2.0), MySQL protocol compatible; no external dependencies, offers materialized views and rich SQL functions for enhanced analytics Core developed by Elastic (license changes over time), natively supports full-text search and near-real-time indexing; rich ecosystem (Kibana, Logstash), with some advanced features requiring paid plugins Analysis: Doris’s MPP architecture provides a natural edge in big data aggregation analytics, leveraging columnar storage and vectorized execution to optimize IO and CPU usage. Features like pre-aggregation, materialized views, and a scalable design make it outperform ES in large-scale data analytics. Conversely, Elasticsearch’s search engine roots make it superior for instant searches and basic metrics, but it falters in complex SQL analytics and joins. 
Doris also offers greater schema flexibility, allowing real-time column/index modifications, while ES’s fixed schemas often necessitate costly reindexing. Overall, Doris emphasizes analytical power and usability, while ES prioritizes retrieval, giving Doris an advantage in complex enterprise analytics. 2. Query Language: SQL vs. DSL Ease of Use and Expressiveness Doris and ES diverge sharply in query interfaces: Doris natively supports standard SQL, while Elasticsearch uses JSON DSL (Domain Specific Language). Doris aligns with the MySQL protocol, offering robust SQL 92 features such as SELECT, WHERE, GROUP BY, ORDER BY, multi-table JOINs, subqueries, window functions, UDFs/UDAFs, and materialized views. This comprehensive SQL support allows analysts and engineers to perform complex queries using familiar syntax without learning a new language. Elasticsearch, however, employs a proprietary JSON-based DSL, distinct from SQL, requiring nested structures for filtering and aggregation. This presents a steep learning curve for new users and complicates integration with traditional BI tools. The comparison is detailed below: Query Language Apache Doris (SQL Interface) Elasticsearch (JSON DSL) Syntax Style Standard SQL (MySQL-like), intuitive and readable Proprietary DSL (JSON), nested and less intuitive Expressiveness Supports multi-table JOINs, subqueries, views, UDFs for complex logic; enables direct associative analytics Limited to single-index queries, no native JOINs or subqueries; complex analytics require pre-processed data models Learning Cost SQL is widely known, low entry barrier; mature debugging tools available DSL is custom, high learning threshold; error troubleshooting is challenging Ecosystem Integration MySQL protocol compatible, integrates seamlessly with BI tools (e.g., Tableau, Grafana) Closed ecosystem, difficult to integrate with BI tools without plugins; Kibana offers basic visualization Analysis: Doris’s SQL interface excels in usability and efficiency, lowering the entry threshold by leveraging familiar syntax. For instance, aggregating log data by multiple dimensions in Doris requires a simple SQL GROUP BY, while ES demands complex, nested DSL aggregations, reducing development efficiency. Doris’s support for JOINs and subqueries also suits data warehouse modeling (e.g., star schemas), whereas ES’s lack of JOINs necessitates pre-denormalized data or application-layer processing. Thus, Doris outperforms in query ease and power, enhancing integration with analytics ecosystems. 3. Real-Time Data Processing Mechanisms: Write Architecture and Data Updates Doris and ES adopt distinct approaches to real-time data ingestion and querying. Elasticsearch prioritizes near-real-time search with document-by-document writes and frequent index refreshes. Data is ingested via REST APIs (e.g., Bulk), tokenized, and indexed, becoming searchable after periodic refreshes (default: 1 second). This ensures rapid log retrieval but incurs high write overhead, with CPU-intensive indexing limiting single-core throughput to ~2 MB/s, often causing bottlenecks during peaks. Apache Doris, conversely, uses a high-throughput batch write architecture. Data is imported in small batches (via Stream Load or Routine Load from queues like Kafka), written efficiently in columnar format across multiple replicas. Avoiding per-field indexing, Doris achieves write speeds 5 times higher than ES per ES Rally benchmarks, and supports direct queue integration, simplifying pipelines. 
Key differences in updates and real-time capabilities include: Storage mechanism: Doris’s columnar storage achieves 5:1 to 10:1 compression, using ~20% of ES’s space for the same data, enhancing IO efficiency. ES’s inverted indexes yield a ~1.5:1 compression ratio, inflating storage. Data updates: Doris’s Unique Key model supports primary key updates with minimal performance loss (<10%), while ES’s document updates require costly reindexing (up to a 3x performance hit). Doris’s Aggregate Key model ensures consistent aggregations during imports, unlike ES’s less flexible, eventually consistent rollups. Query visibility: ES offers second-level visibility post-refresh, ideal for instant log retrieval. Doris achieves sub-minute visibility via batch imports, sufficient for most real-time analytics, with memory-buffered data ensuring timely query access. Analysis: Doris excels in high-throughput, consistent analysis, while ES focuses on millisecond writes and near-real-time retrieval. Doris’s batch writes and compression outperform ES in write performance (5x), query speed (2.3x), and storage efficiency (1/5th), making it ideal for high-frequency writes and fast analytics, with flexible schema evolution further enhancing its real-time capabilities. 4. Typical Application Scenario Comparison: Log Analysis, BI Reporting, etc. Doris and ES shine in different scenarios due to their architectural strengths: Scenario Apache Doris Elasticsearch Log Analysis Excels in storage and multi-dimensional analysis of large logs; supports long-term retention and fast aggregations/JOINs. Enterprises report 10x faster analytics and 60% cost savings, integrating search and analysis with inverted index support Ideal for real-time log search and simple stats; fast keyword retrieval suits monitoring and troubleshooting (e.g., ELK). Struggles with complex aggregations and long-term analysis due to cost and performance limits BI Reporting Perfect for interactive reporting and ad-hoc analysis; full SQL and JOINs support data warehousing and dashboards. A logistics firm saw 5-10x faster queries and 2x concurrency Rarely used for BI; lacks JOINs and robust SQL, limiting complex reporting. Best for simple metrics in monitoring, not rich BI logic Analysis: In log analysis, Doris and ES complement each other: ES handles real-time searches, while Doris manages long-term, complex analytics. For BI, Doris’s SQL and performance make it far superior, directly supporting enterprise data warehouses and reporting. 5. Performance Benchmark Comparison ES Rally benchmarks (see "Log analysis: Elasticsearch vs Apache Doris" on the Apache Doris site for the full comparison of write throughput, storage, and query response time) highlight Doris’s edge: Doris achieves 550 MB/s write speed (5x ES), uses 1/5th the storage, and offers 2.3x faster queries (e.g., 1s vs. 6-7s for 40M log aggregations). Its MPP architecture ensures stability under high concurrency, unlike ES, which struggles with memory limits. 6. Enterprise Practice Cases 360 Security Browser: Replaced ES with Doris, improving analytics speed by 10x and cutting storage costs by 60%. Tencent Music: Reduced storage by 80% (697GB to 195GB) and boosted writes 4x with Doris. A large bank: Enhanced log analysis efficiency, eliminating redundancy. A payment firm: Achieved 4x write speed, 3x query performance, and 50% storage savings. These cases underscore Doris’s superiority in large-scale writes and complex queries, often supplementing ES’s search strengths.
Summary Doris excels in complex analytics, SQL usability, and efficiency, ideal for unified real-time platforms, while ES dominates in full-text search and real-time queries. Enterprises can combine them — Doris for analysis, ES for retrieval — to maximize value, with Doris poised to expand in analytics and ES in intelligent search.
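One practical consequence of the MySQL protocol compatibility noted throughout the comparison is that ordinary MySQL tooling can query Doris with no Doris-specific client. The sketch below, with hypothetical host, credentials, and table names, runs a multi-dimensional log aggregation from Java using the standard MySQL JDBC driver; the equivalent query against Elasticsearch would require a nested aggregation DSL.

Java

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DorisQueryExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical frontend endpoint and database; the Doris FE typically exposes
        // the MySQL protocol on port 9030.
        String url = "jdbc:mysql://doris-fe.example.com:9030/log_db";

        try (Connection conn = DriverManager.getConnection(url, "analytics_user", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT level, COUNT(*) AS cnt " +
                     "FROM app_logs " +
                     "WHERE event_time >= DATE_SUB(NOW(), INTERVAL 1 DAY) " +
                     "GROUP BY level ORDER BY cnt DESC")) {
            while (rs.next()) {
                // Print one row per log level with its count over the last day.
                System.out.println(rs.getString("level") + ": " + rs.getLong("cnt"));
            }
        }
    }
}

Because the connection speaks plain MySQL protocol, the same endpoint also works from BI tools such as Tableau or Grafana, which is the ecosystem-integration advantage highlighted in the query language comparison.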