Testing, Deployment, and Maintenance Resources

DZone's Featured Testing, Deployment, and Maintenance Resources

Chaos Engineering With Litmus: A CNCF Incubating Project

By Sai Sandeep Ogety


Problem statement: Ensuring the resilience of a microservices-based e-commerce platform.

System resilience is a key requirement for e-commerce platforms as they scale: services must stay operational and perform well for users. Our platform is built on a microservices architecture and suffers sporadic failures during heavy-traffic events. Kubernetes pod crashes, resource exhaustion, and network disruptions during peak shopping seasons degrade service availability and cost revenue.

The organization plans to use Litmus, a CNCF incubating project, to assess and improve the platform's resilience. Litmus lets us trigger real-world failure situations such as pod termination, network delays, and resource limits, and these simulated failures make the system's weak points visible. The experiments let us validate autoscaling, test disaster recovery procedures, and tune Kubernetes settings for reliability. The result is a foundation that can withstand failures and absorb busy traffic periods without degrading the user experience. Applying chaos engineering proactively reduces risk, improves observability, and helps us build automated recovery methods that keep the platform resilient under varied operating conditions.

Set Up the Chaos Experiment Environment

Install LitmusChaos in your Kubernetes cluster:

Shell
    helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
    helm repo update
    # Install into the litmus namespace so the verification step below finds the pods
    helm install litmus litmuschaos/litmus --namespace litmus --create-namespace

Verify installation:

Shell
    kubectl get pods -n litmus

Note: Ensure your cluster is ready for chaos experiments.

Define the Chaos Experiment

Create a ChaosExperiment YAML file to simulate a Pod Delete scenario.
Example (pod-delete.yaml):

YAML
    apiVersion: litmuschaos.io/v1alpha1
    kind: ChaosExperiment
    metadata:
      name: pod-delete
      namespace: litmus
    spec:
      definition:
        scope: Namespaced
        permissions:
          - apiGroups: ["*"]
            resources: ["*"]
            verbs: ["*"]
        image: "litmuschaos/go-runner:latest"
        args:
          - -c
          - ./experiments/generic/pod_delete/pod_delete.test
        command:
          - /bin/bash

Install ChaosOperator and Configure Service Account

Deploy ChaosOperator to manage experiments:

Shell
    kubectl apply -f https://raw.githubusercontent.com/litmuschaos/litmus/master/litmus-operator/cluster-k8s.yml

Note: Create a ServiceAccount to grant necessary permissions.

Inject Chaos into the Target Application

Label the application namespace for chaos:

Shell
    kubectl label namespace <target-namespace> litmuschaos.io/chaos=enabled

Deploy a ChaosEngine to trigger the experiment.

Example (chaosengine.yaml):

YAML
    apiVersion: litmuschaos.io/v1alpha1
    kind: ChaosEngine
    metadata:
      name: pod-delete-engine
      namespace: <target-namespace>
    spec:
      appinfo:
        appns: '<target-namespace>'
        applabel: 'app=<your-app-label>'
        appkind: 'deployment'
      chaosServiceAccount: litmus-admin
      monitoring: false
      experiments:
        - name: pod-delete

Apply the ChaosEngine:

Shell
    kubectl apply -f chaosengine.yaml

Monitor the Experiment

View the progress:

Shell
    kubectl describe chaosengine pod-delete-engine -n <target-namespace>

Check the status of the chaos pods:

Shell
    kubectl get pods -n <target-namespace>

Analyze the Results

Post-experiment, review logs and metrics to determine if the application recovered automatically or failed under stress. Here are some metrics to monitor:

- Application response time
- Error rates during and after the experiment
- Time taken for pods to recover

Solution

Root cause identified: During high traffic, pods failed due to an insufficient number of replicas in the deployment and improper resource limits.
Fixes applied:

- Increased the number of replicas in the deployment to handle higher traffic
- Configured proper resource requests and limits for CPU and memory in the pod specification
- Implemented a Horizontal Pod Autoscaler (HPA) to handle traffic spikes dynamically

Conclusion

By using LitmusChaos to simulate pod failures, we identified key weaknesses in the e-commerce platform’s Kubernetes deployment. The chaos experiment demonstrated that resilience can be significantly improved with scaling and resource allocation adjustments. Chaos engineering enabled proactive system hardening, leading to better uptime and customer satisfaction.
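The Horizontal Pod Autoscaler mentioned in the fixes sizes a deployment from the ratio of the observed metric to its target. As a rough illustration, the sketch below reproduces the scaling rule described in the Kubernetes documentation, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), including the default 10% tolerance band. The function name, the replica bounds, and the sample numbers are assumptions made for illustration; they are not values from this experiment.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA scaling rule (hypothetical helper, not a Kubernetes API)."""
    ratio = current_metric / target_metric
    # Inside the tolerance band the autoscaler leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the bounds configured on the autoscaler.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # Example: 3 replicas averaging 90% CPU against a 60% target.
    print(desired_replicas(3, 90, 60))
```

Running the numbers before a chaos experiment gives a baseline for how many pods the autoscaler should converge on, which can then be compared with what the cluster actually does while pods are being deleted.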

Docker Performance Optimization: Real-World Strategies

By Anil Kumar Moka

After optimizing containerized applications processing petabytes of data in fintech environments, I've learned that Docker performance isn't just about speed; it's about reliability, resource efficiency, and cost optimization. Let's dive into strategies that actually work in production.

The Performance Journey: Common Scenarios and Solutions

Scenario 1: The CPU-Hungry Container

Have you ever seen your container CPU usage spike to 100% for no apparent reason? The script below helps diagnose it:

Shell
    #!/bin/bash
    # Quick diagnosis script
    container_id=$1
    echo "CPU Usage Analysis"
    docker stats --no-stream "$container_id"
    echo "Top Processes Inside Container"
    docker exec "$container_id" top -bn1
    echo "Hot CPU Functions"
    # Requires perf to be installed inside the container
    docker exec "$container_id" perf top -a

This script provides three levels of CPU analysis:

- docker stats: shows a snapshot of CPU usage percentage and other resource metrics
- top -bn1: lists all processes running inside the container, sorted by CPU usage
- perf top -a: identifies specific functions consuming CPU cycles

After identifying CPU bottlenecks, here's how to implement resource constraints and optimizations:

YAML
    services:
      cpu-optimized:
        deploy:
          resources:
            limits:
              cpus: '2'
            reservations:
              cpus: '1'
        environment:
          # JVM optimization (if using Java)
          JAVA_OPTS: >
            -XX:+UseG1GC
            -XX:MaxGCPauseMillis=200
            -XX:ParallelGCThreads=4
            -XX:ConcGCThreads=2

This configuration:

- Limits the container to a maximum of 2 CPU cores
- Reserves 1 CPU core for the container
- Optimizes Java applications by:
  - Using the G1 garbage collector
  - Setting a pause-time target of 200 ms for garbage collection
  - Configuring parallel and concurrent GC threads

Scenario 2: The Memory Leak Detective

If you have a container with growing memory usage, here is your debugging toolkit:

Shell
    #!/bin/bash
    # memory-debug.sh
    container_name=$1
    echo "Memory Trend Analysis"
    while true; do
      # --format prints only the memory column, so no header row ends up in the log
      docker stats --no-stream --format '{{.MemUsage}}' "$container_name" | \
        awk '{print strftime("%H:%M:%S"), $1}' >> memory_trend.log
      sleep 10
    done

This script:

- Takes a container name as input
- Records memory usage every 10 seconds
- Logs timestamp and memory usage to memory_trend.log
- Uses awk to format the output with timestamps

Memory optimization results:

Plain Text
    Before Optimization:
    - Base Memory: 750MB
    - Peak Memory: 2.1GB
    - Memory Growth Rate: +100MB/hour

    After Optimization:
    - Base Memory: 256MB
    - Peak Memory: 512MB
    - Memory Growth Rate: +5MB/hour
    - Memory Use Pattern: Stable with regular GC

Scenario 3: The Slow Startup Syndrome

If your container is taking ages to start, the Dockerfile changes below can help:

Dockerfile
    # Before: 45s startup time
    FROM openjdk:11
    COPY . .
    RUN ./gradlew build

    # After: 12s startup time
    # The build stage needs a JDK; only the runtime stage uses the slim JRE
    FROM openjdk:11-jdk-slim AS builder
    WORKDIR /app
    COPY gradlew build.gradle settings.gradle ./
    COPY gradle ./gradle
    COPY src ./src
    RUN ./gradlew build --parallel --daemon

    FROM openjdk:11-jre-slim
    COPY --from=builder /app/build/libs/*.jar app.jar
    # Enable JVM tiered compilation for faster startup
    ENTRYPOINT ["java", "-XX:+TieredCompilation", "-XX:TieredStopAtLevel=1", "-jar", "app.jar"]

Key optimizations explained:

- Multi-stage build reduces final image size
- Using the slim JRE for the runtime image instead of the full JDK
- Copying only necessary files for building
- Enabling parallel builds with Gradle daemon
- JVM tiered compilation optimizations:
  - -XX:+TieredCompilation: enables tiered compilation
  - -XX:TieredStopAtLevel=1: stops at first tier for faster startup

Real-World Performance Metrics Dashboard

Here's a Prometheus scrape configuration that feeds a Grafana dashboard and gives you the full picture:

YAML
    # prometheus.yml
    scrape_configs:
      - job_name: 'docker-metrics'
        static_configs:
          - targets: ['localhost:9323']
        metrics_path: /metrics
        metric_relabel_configs:
          - source_labels: [container_name]
            # Capture everything after the leading slash so $1 has a value
            regex: '^/(.+)'
            target_label: container_name
            replacement: '$1'

This configuration:

- Sets up a scrape job named 'docker-metrics'
- Targets the Docker metrics endpoint on localhost:9323
- Configures metric relabeling to clean up container names
- Collects all Docker engine and container metrics
Performance metrics we track:

Plain Text
    Container Health Metrics:
      Response Time (p95): < 200ms
      CPU Usage: < 80%
      Memory Usage: < 70%
      Container Restarts: 0 in 24h
      Network Latency: < 50ms

    Warning Signals:
      Response Time > 500ms
      CPU Usage > 85%
      Memory Usage > 80%
      Container Restarts > 2 in 24h
      Network Latency > 100ms

The Docker Performance Toolkit

Here's my go-to performance investigation toolkit:

Shell
    #!/bin/bash
    # docker-performance-toolkit.sh
    container_name=$1
    echo "Container Performance Analysis"

    # Check base stats
    docker stats --no-stream "$container_name"

    # Network connections
    echo "Network Connections"
    docker exec "$container_name" netstat -tan

    # File system usage
    echo "File System Usage"
    docker exec "$container_name" df -h

    # Process tree
    echo "Process Tree"
    docker exec "$container_name" pstree -p

    # I/O stats
    echo "I/O Statistics"
    docker exec "$container_name" iostat

This toolkit provides:

- Container resource usage statistics
- Network connection status and statistics
- File system usage and available space
- Process hierarchy within the container
- I/O statistics for disk operations

Benchmark Results From the Field

Here are some real numbers from a recent optimization project:

Plain Text
    API Service Performance:   Before → After
    - Requests/sec:            1,200 → 3,500
    - Latency (p95):           250ms → 85ms
    - CPU Usage:               85% → 45%
    - Memory:                  1.8GB → 512MB

    Database Container:        Before → After
    - Query Response:          180ms → 45ms
    - Connection Pool Usage:   95% → 60%
    - I/O Wait:                15% → 3%
    - Cache Hit Ratio:         75% → 95%

The Performance Troubleshooting Playbook

1. Container Startup Issues

Shell
    # Quick startup analysis
    docker events --filter 'type=container' --filter 'event=start'
    docker logs --since 5m container_name

What This Does

- The first command (docker events) monitors real-time container events, specifically filtered for:
  - type=container: only show container-related events
  - event=start: focus on container startup events
- The second command (docker logs) retrieves logs from the last 5 minutes for the specified container

When to Use

- Container fails to start or starts slowly
- Investigating container startup dependencies
- Debugging initialization scripts
- Identifying startup-time configuration issues

2. Network Performance Issues

Shell
    # Network debugging toolkit
    docker run --rm \
      --net container:target_container \
      nicolaka/netshoot \
      iperf -c iperf-server

Understanding the commands:

- --rm: automatically remove the container when it exits
- --net container:target_container: share the network namespace with the target container
- nicolaka/netshoot: a specialized networking troubleshooting container image
- iperf -c iperf-server: network performance testing tool
  - -c: run in client mode
  - iperf-server: target server to test against

3. Resource Contention

Shell
    # Resource monitoring
    docker run --rm \
      --pid container:target_container \
      --net container:target_container \
      nicolaka/netshoot \
      htop

Breakdown of the commands:

- --pid container:target_container: share the process namespace with the target container
- --net container:target_container: share the network namespace
- htop: interactive process viewer and system monitor

Tips From Experience

1. Instant Performance Boost

Use tmpfs for high I/O workloads:

YAML
    services:
      app:
        tmpfs:
          - /tmp:rw,noexec,nosuid,size=1g

This configuration:

- Mounts a tmpfs (in-memory filesystem) at /tmp
- Allows up to 1GB of RAM for temporary storage
- Improves I/O performance for temporary files
- Options explained:
  - rw: read-write access
  - noexec: prevents execution of binaries
  - nosuid: disables SUID/SGID bits

2. Network Optimization

Enable TCP BBR for better throughput:

Shell
    echo "net.core.default_qdisc=fq" >> /etc/sysctl.conf
    echo "net.ipv4.tcp_congestion_control=bbr" >> /etc/sysctl.conf
    # Reload sysctl.conf so the settings take effect without a reboot
    sysctl -p

These settings:

- Enable Fair Queuing scheduler for better latency
- Activate BBR congestion control algorithm
- Improve network throughput and latency

3. Image Size Reduction

Use multi-stage builds with distroless:

Dockerfile
    FROM golang:1.17 AS builder
    WORKDIR /app
    COPY . .
    RUN CGO_ENABLED=0 go build -o server

    FROM gcr.io/distroless/static
    COPY --from=builder /app/server /
    CMD ["/server"]

This Dockerfile demonstrates:

- Multi-stage build pattern
- Static compilation of Go binary
- Distroless base image for minimal attack surface
- Significant reduction in final image size

Conclusion

Remember, Docker performance optimization is a gradual process. Start with these metrics and tools, but always measure and adapt based on your specific needs. These strategies have helped me handle millions of transactions in production environments, and I'm confident they'll help you, too!
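The memory figures quoted in Scenario 2 (for example, a growth rate of +100MB/hour) can be derived from the memory_trend.log file that the memory-debug script writes. The sketch below is one way to do that, assuming each useful log line looks like "HH:MM:SS 750MiB", that docker stats reports KiB, MiB, or GiB units, and that all samples fall on the same day. The function names are hypothetical, and lines that do not match (such as a logged header row) are skipped.

```python
import re
from datetime import datetime

# Conversion factors to MiB for the units assumed to appear in the log.
UNIT_TO_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2})\s+([\d.]+)(KiB|MiB|GiB)\b")


def parse_samples(lines):
    """Turn log lines into (seconds_since_midnight, memory_in_mib) tuples."""
    samples = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip headers or malformed lines
        stamp = datetime.strptime(match.group(1), "%H:%M:%S")
        seconds = stamp.hour * 3600 + stamp.minute * 60 + stamp.second
        mib = float(match.group(2)) * UNIT_TO_MIB[match.group(3)]
        samples.append((seconds, mib))
    return samples


def growth_mib_per_hour(samples):
    """Estimate growth from the first and last sample; 0.0 if it cannot be computed."""
    if len(samples) < 2:
        return 0.0
    (t0, m0), (t1, m1) = samples[0], samples[-1]
    if t1 <= t0:
        return 0.0
    return (m1 - m0) / (t1 - t0) * 3600


if __name__ == "__main__":
    demo = ["CONTAINER ID NAME", "10:00:00 750MiB", "10:30:00 800MiB"]
    print(growth_mib_per_hour(parse_samples(demo)))
```

Using only the first and last sample keeps the sketch short; a least-squares fit over all samples would be less sensitive to a single garbage-collection dip.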

Shared vs Shielded Context: Testers and Devs Writing Tests Together

By Natalia Poliakova

Productivity and Organization Tips for Software Engineers

By Tyler Hawkins


Exploring the Purpose of Pytest Fixtures: A Practical Guide

By Sidharth Shukla

A Guide to Using Amazon Bedrock Prompts for LLM Integration

As generative AI revolutionizes various industries, developers increasingly seek efficient ways to integrate large language models (LLMs) into their applications. Amazon Bedrock is a powerful solution. It offers a fully managed service that provides access to a wide range of foundation models through a unified API. This guide will explore key benefits of Amazon Bedrock, how to integrate different LLM models into your projects, how to simplify the management of the various LLM prompts your application uses, and best practices to consider for production usage.

Key Benefits of Amazon Bedrock

Amazon Bedrock simplifies the initial integration of LLMs into any application by providing all the foundational capabilities needed to get started.

Simplified Access to Leading Models

Bedrock provides access to a diverse selection of high-performing foundation models from industry leaders such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon. This variety allows developers to choose the most suitable model for their use case and switch models as needed without managing multiple vendor relationships or APIs.

Fully Managed and Serverless

As a fully managed service, Bedrock eliminates the need for infrastructure management. This allows developers to focus on building applications rather than worrying about the underlying complexities of infrastructure setup, model deployment, and scaling.

Enterprise-Grade Security and Privacy

Bedrock offers built-in security features, ensuring that data never leaves your AWS environments and is encrypted in transit and at rest. It also supports compliance with various standards, including ISO, SOC, and HIPAA.

Stay Up-to-Date With the Latest Infrastructure Improvements

Bedrock regularly releases new features that push the boundaries of LLM applications and require little to no setup. For example, it recently released an optimized inference mode that improves LLM inference latency without compromising accuracy.
Getting Started With Bedrock

In this section, we’ll use the AWS SDK for Python to build a small application on your local machine, providing a hands-on guide to getting started with Amazon Bedrock. This will help you understand the practical aspects of using Bedrock and how to integrate it into your projects.

Prerequisites

- You have an AWS account.
- You have Python installed. If not installed, get it by following this guide.
- You have the Python AWS SDK (Boto3) installed and configured correctly. It's recommended to create an AWS IAM user that Boto3 can use. Instructions are available in the Boto3 Quickstart guide.
- If using an IAM user, ensure you add the AmazonBedrockFullAccess policy to it. You can attach policies using the AWS console.
- Request access to one or more models on Bedrock by following this guide.

1. Creating the Bedrock Client

Bedrock has multiple clients available within the AWS SDK. The Bedrock client lets you interact with the service to create and manage models, while the BedrockRuntime client enables you to invoke existing models. We will use one of the existing off-the-shelf foundation models for our tutorial, so we’ll just work with the BedrockRuntime client.

Python
    import boto3
    import json

    # Create a Bedrock client
    bedrock = boto3.client(service_name='bedrock-runtime', region_name='us-east-1')

2. Invoking the Model

In this example, I’ve used the Amazon Nova Micro model (with modelId amazon.nova-micro-v1:0), one of Bedrock's cheapest models. We’ll provide a simple prompt to ask the model to write us a poem and set parameters to control the length of the output and the level of creativity (called “temperature”) the model should provide. Feel free to play with different prompts and tune parameters to see how they impact the output.
Python
    import boto3
    import json

    # Create a Bedrock client
    bedrock = boto3.client(service_name='bedrock-runtime', region_name='us-east-1')

    # Select a model (Feel free to play around with different models)
    modelId = 'amazon.nova-micro-v1:0'

    # Configure the request with the prompt and inference parameters
    body = json.dumps({
        "schemaVersion": "messages-v1",
        "messages": [{"role": "user", "content": [{"text": "Write a short poem about a software development hero."}]}],
        "inferenceConfig": {
            "max_new_tokens": 200,  # Adjust for shorter or longer outputs.
            "temperature": 0.7      # Increase for more creativity, decrease for more predictability
        }
    })

    # Make the request to Bedrock
    response = bedrock.invoke_model(body=body, modelId=modelId)

    # Process the response
    response_body = json.loads(response.get('body').read())
    print(response_body)

We can also try this with another model like Anthropic’s Haiku, as shown below.

Python
    import boto3
    import json

    # Create a Bedrock client
    bedrock = boto3.client(service_name='bedrock-runtime', region_name='us-east-1')

    # Select a model (Feel free to play around with different models)
    modelId = 'anthropic.claude-3-haiku-20240307-v1:0'

    # Configure the request with the prompt and inference parameters
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "messages": [{"role": "user", "content": [{"type": "text", "text": "Write a short poem about a software development hero."}]}],
        "max_tokens": 200,   # Adjust for shorter or longer outputs.
        "temperature": 0.7   # Increase for more creativity, decrease for more predictability
    })

    # Make the request to Bedrock
    response = bedrock.invoke_model(body=body, modelId=modelId)

    # Process the response
    response_body = json.loads(response.get('body').read())
    print(response_body)

Note that the request/response structures vary slightly between models. This is a drawback that we will address by using predefined prompt templates in the next section.
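Because the two models return differently shaped response bodies, application code that only wants the generated text has to know where to look in each. The helper below is a minimal sketch of one way to hide that difference. It assumes the Nova response nests its text under output.message.content and the Anthropic response under a top-level content list; those field names are my understanding of the current response formats and should be checked against each model's documentation. The function name extract_text is hypothetical.

```python
def extract_text(response_body):
    """Return the generated text from a parsed Bedrock response body (sketch)."""
    if "output" in response_body:
        # Assumed Amazon Nova shape: {"output": {"message": {"content": [{"text": ...}]}}}
        blocks = response_body["output"]["message"]["content"]
    elif "content" in response_body:
        # Assumed Anthropic Messages shape: {"content": [{"type": "text", "text": ...}]}
        blocks = response_body["content"]
    else:
        raise ValueError("Unrecognized response shape")
    # Join all text blocks; blocks without a "text" field contribute nothing.
    return "".join(block.get("text", "") for block in blocks)


if __name__ == "__main__":
    nova_like = {"output": {"message": {"role": "assistant",
                                        "content": [{"text": "A poem."}]}}}
    print(extract_text(nova_like))
```

In the examples above, this would replace print(response_body) with print(extract_text(response_body)).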
To experiment with other models, you can look up the modelId and sample API requests for each model from the “Model Catalog” page in the Bedrock console and tune your code accordingly. Some models also have detailed guides written by AWS, which you can find here.

3. Using Prompt Management

Bedrock provides a nifty tool to create and experiment with predefined prompt templates. Instead of defining prompts and specific parameters such as token lengths or temperature in your code every time you need them, you can create predefined templates in the Prompt Management console. You specify input variables that will be injected during runtime, set up all the required inference parameters, and publish a version of your prompt. Once done, your application code can invoke the desired version of your prompt template.

Key advantages of using predefined prompts:

- It helps your application stay organized as it grows and uses different prompts, parameters, and models for various use cases.
- It helps with prompt reuse if the same prompt is used in multiple places.
- It abstracts away the details of LLM inference from your application code.
- It allows prompt engineers to work on prompt optimization in the console without touching your actual application code.
- It allows for easy experimentation, leveraging different versions of prompts. You can tweak the prompt input, parameters like temperature, or even the model itself.

Let’s try this out now:

- Head to the Bedrock console and click “Prompt Management” on the left panel.
- Click on “Create Prompt” and give your new prompt a name.
- Input the text that we want to send to the LLM, along with a placeholder variable. I used Write a short poem about a {{topic}}.
- In the Configuration section, specify which model you want to use and set the values of the same parameters we used earlier, such as “Temperature” and “Max Tokens.” If you prefer, you can leave the defaults as-is.
- It's time to test! At the bottom of the page, provide a value for your test variable. I used “Software Development Hero.” Then, click “Run” on the right to see if you’re happy with the output.

For reference, here is my configuration and the results.

We need to publish a new Prompt Version to use this Prompt in your application. To do so, click the “Create Version” button at the top. This creates a snapshot of your current configuration. If you want to play around with it, you can continue editing and creating more versions.

Once published, we need to find the ARN (Amazon Resource Name) of the Prompt Version by navigating to the page for your Prompt and clicking on the newly created version. Copy the ARN of this specific prompt version to use in your code.

Once we have the ARN, we can update our code to invoke this predefined prompt. We only need the prompt version's ARN and the values for any variables we inject into it.

Python
    import boto3
    import json

    # Create a Bedrock client
    bedrock = boto3.client(service_name='bedrock-runtime', region_name='us-east-1')

    # Select your prompt identifier and version
    promptArn = "<ARN from the specific prompt version>"

    # Define any required prompt variables
    body = json.dumps({
        "promptVariables": {
            "topic": {"text": "software development hero"}
        }
    })

    # Make the request to Bedrock
    response = bedrock.invoke_model(modelId=promptArn, body=body)

    # Process the response
    response_body = json.loads(response.get('body').read())
    print(response_body)

As you can see, this simplifies our application code by abstracting away the details of LLM inference and promoting reusability. Feel free to play around with parameters within your prompt, create different versions, and use them in your application. You could extend this into a simple command line application that takes user input and writes a short poem on that topic.
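The command line extension suggested above could look roughly like the following. This is a sketch only: it reuses the invoke_model call and promptVariables body from the previous example, while the argument names (topic, --prompt-arn, --region) and the helper build_prompt_body are hypothetical choices. Boto3 is imported inside main so the body-building helper can be exercised without AWS credentials.

```python
import argparse
import json
import sys


def build_prompt_body(topic):
    """Build the request body that carries the prompt variables."""
    return json.dumps({"promptVariables": {"topic": {"text": topic}}})


def main(argv=None):
    parser = argparse.ArgumentParser(description="Write a short poem about a topic.")
    parser.add_argument("topic", help="value injected into the {{topic}} variable")
    parser.add_argument("--prompt-arn", required=True,
                        help="ARN of the published prompt version")
    parser.add_argument("--region", default="us-east-1")
    args = parser.parse_args(argv)

    import boto3  # imported here so build_prompt_body works without the SDK

    bedrock = boto3.client(service_name="bedrock-runtime", region_name=args.region)
    response = bedrock.invoke_model(modelId=args.prompt_arn,
                                    body=build_prompt_body(args.topic))
    print(json.loads(response.get("body").read()))


if __name__ == "__main__":
    if len(sys.argv) > 1:
        main()
    else:
        print("usage: poem.py TOPIC --prompt-arn ARN [--region REGION]")
```

Saved under a name such as poem.py (also an assumption), it would be invoked with a topic and the ARN copied from the console.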
Next Steps and Best Practices

Once you're comfortable with using Bedrock to integrate an LLM into your application, explore some practical considerations and best practices to get your application ready for production usage.

Prompt Engineering

The prompt you use to invoke the model can make or break your application. Prompt engineering is the process of creating and optimizing instructions to get the desired output from an LLM. With the predefined prompt templates explored above, skilled prompt engineers can get started with prompt engineering without interfering with the software development process of your application. You may need to tailor your prompt to be specific to the model you would like to use. Familiarize yourself with prompt techniques specific to each model provider. Bedrock provides some guidelines for commonly used large models.

Model Selection

Making the right model choice is a balance between the needs of your application and the cost incurred. More capable models tend to be more expensive. Not all use cases require the most powerful model, while the cheapest models may not always provide the performance you need. Use the Model Evaluation feature to quickly evaluate and compare the outputs of different models to determine which one best meets your needs. Bedrock offers multiple options to upload test datasets and configure how model accuracy should be evaluated for individual use cases.

Fine-Tune and Extend Your Model With RAG and Agents

If an off-the-shelf model doesn't work well enough for you, Bedrock offers options to tune your model to your specific use case. Create your training data, upload it to S3, and use the Bedrock console to initiate a fine-tuning job. You can also extend your models using techniques such as retrieval-augmented generation (RAG) to improve performance for specific use cases. Connect existing data sources, which Bedrock will make available to the model to enhance its knowledge.
Bedrock also offers the ability to create agents to plan and execute complex multi-step tasks using your existing company systems and data sources.

Security and Guardrails

With Guardrails, you can ensure that your generative application gracefully avoids sensitive topics (e.g., racism, sexual content, and profanity) and that the generated content is grounded to prevent hallucinations. This feature is crucial for maintaining your applications' ethical and professional standards. Leverage Bedrock's built-in security features and integrate them with your existing AWS security controls.

Cost Optimization

Before widely releasing your application or feature, consider the cost that Bedrock inference and extensions such as RAG will incur.

- If you can predict your traffic patterns, consider using Provisioned Throughput for more efficient and cost-effective model inference.
- If your application consists of multiple features, you can use different models and prompts for every feature to optimize costs on an individual basis.
- Revisit your choice of model as well as the size of the prompt you provide for each inference. Bedrock generally prices on a "per-token" basis, so longer prompts and larger outputs will incur more costs.

Conclusion

Amazon Bedrock is a powerful and flexible platform for integrating LLMs into applications. It provides access to many models, simplifies development, and delivers robust customization and security features. Thus, developers can harness the power of generative AI while focusing on creating value for their users. This article shows how to get started with a basic Bedrock integration and keep prompts organized. As AI evolves, developers should stay updated with the latest features and best practices in Amazon Bedrock to build their AI applications.
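Since pricing is generally per token, a back-of-the-envelope estimate is simple arithmetic: cost = (input tokens / 1000) × input price + (output tokens / 1000) × output price. The sketch below shows that calculation. The per-1,000-token prices and the request volume are placeholder numbers chosen for easy arithmetic, not actual Bedrock rates; look up the current price for your model and region before relying on any figure.

```python
def estimate_cost(input_tokens, output_tokens, price_in_per_1k, price_out_per_1k):
    """Estimate inference cost from token counts and per-1,000-token prices."""
    input_cost = input_tokens / 1000 * price_in_per_1k
    output_cost = output_tokens / 1000 * price_out_per_1k
    return input_cost + output_cost


if __name__ == "__main__":
    # Placeholder prices (NOT real Bedrock rates): $0.50 in, $1.50 out per 1,000 tokens.
    per_request = estimate_cost(2000, 1000, 0.5, 1.5)
    requests_per_month = 10_000  # hypothetical traffic
    print(f"per request: ${per_request:.2f}")
    print(f"per month:   ${per_request * requests_per_month:,.2f}")
```

Running the same function with a shorter prompt or a lower output limit makes the effect of prompt size on the monthly bill easy to compare across candidate models.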

By Adit Jamdar

Processing Cloud Data With DuckDB And AWS S3

DuckDB is a powerful in-process database that can run entirely in memory and executes queries in parallel, which makes it a good choice for reading and transforming cloud storage data, in this case, AWS S3. I've had a lot of success using it, and I will walk you through the steps of implementing it. I will also include some learnings and best practices for you. Using DuckDB, the httpfs extension, and pyarrow, we can efficiently process Parquet files stored in S3 buckets. Let's dive in.

Before starting the installation of DuckDB, make sure you have these prerequisites:

- Python 3.9 or higher installed
- Prior knowledge of setting up Python projects and virtual environments or conda environments

Installing Dependencies

First, let's establish the necessary environment:

Shell
    # Install required packages for cloud integration
    pip install "duckdb>=0.8.0" pyarrow pandas boto3 requests

The dependencies explained:

- duckdb>=0.8.0: The core database engine that provides SQL functionality and in-memory processing
- pyarrow: Handles Parquet file operations efficiently with columnar storage support
- pandas: Enables powerful data manipulation and analysis capabilities
- boto3: AWS SDK for Python, providing interfaces to AWS services
- requests: Manages HTTP communications for cloud interactions

Configuring Secure Cloud Access

Python
    import duckdb
    import os

    # Initialize DuckDB with cloud support
    conn = duckdb.connect(':memory:')
    conn.execute("INSTALL httpfs;")
    conn.execute("LOAD httpfs;")

    # Secure AWS configuration
    conn.execute("""
        SET s3_region='your-region';
        SET s3_access_key_id='your-access-key';
        SET s3_secret_access_key='your-secret-key';
    """)

This initialization code does several important things:

- Creates a new DuckDB connection in memory using :memory:
- Installs and loads the HTTP filesystem extension (httpfs), which enables cloud storage access
- Configures AWS credentials with your specific region and access keys
- Prepares DuckDB to reach S3 with those credentials

Processing AWS S3 Parquet Files

Let's examine a comprehensive example of processing Parquet files with sensitive data masking:

Python
    import duckdb
    import pandas as pd

    # Create sample data to demonstrate parquet processing
    sample_data = pd.DataFrame({
        'name': ['John Smith', 'Jane Doe', 'Bob Wilson', 'Alice Brown'],
        'email': ['john.smith@email.com', 'jane.doe@company.com', 'bob@email.net', 'alice.b@org.com'],
        'phone': ['123-456-7890', '234-567-8901', '345-678-9012', '456-789-0123'],
        'ssn': ['123-45-6789', '234-56-7890', '345-67-8901', '456-78-9012'],
        'address': ['123 Main St', '456 Oak Ave', '789 Pine Rd', '321 Elm Dr'],
        'salary': [75000, 85000, 65000, 95000]  # Non-sensitive data
    })

This sample data creation helps us demonstrate data masking techniques. We include various types of information commonly found in real-world datasets:

- Personal identifiers (name, SSN)
- Contact information (email, phone, address)
- Financial data (salary), which this example leaves unmasked

Now, let's look at the processing function:

Python
    def demonstrate_parquet_processing():
        # Create a DuckDB connection
        conn = duckdb.connect(':memory:')

        # Save sample data as parquet
        sample_data.to_parquet('sample_data.parquet')

        # Define sensitive columns to mask (listed for reference; the query masks them explicitly)
        sensitive_cols = ['email', 'phone', 'ssn']

        # Process the parquet file with masking.
        # A raw string keeps \1, \2 and {3} intact; in an f-string they would be
        # turned into control characters and format fields before reaching DuckDB.
        query = r"""
        CREATE TABLE masked_data AS
        SELECT
            -- Mask name: keep first letter of first and last name
            regexp_replace(name, '([A-Z])[a-z]+ ([A-Z])[a-z]+', '\1*** \2***') as name,
            -- Mask email: hide everything before @
            regexp_replace(email, '([a-zA-Z0-9._%+-]+)(@.*)', '****\2') as email,
            -- Mask phone: show only last 4 digits
            regexp_replace(phone, '[0-9]{3}-[0-9]{3}-', '***-***-') as phone,
            -- Mask SSN: show only last 4 digits
            regexp_replace(ssn, '[0-9]{3}-[0-9]{2}-', '***-**-') as ssn,
            -- Mask address: show only street type
            regexp_replace(address, '[0-9]+ [A-Za-z]+ ', '*** ') as address,
            -- Keep non-sensitive data as is
            salary
        FROM read_parquet('sample_data.parquet');
        """
        conn.execute(query)

        # Return the masked rows as a DataFrame for inspection
        masked = conn.execute("SELECT * FROM masked_data").df()
        conn.close()
        return masked

Let's break down this processing function:

- We create a new DuckDB connection
- Convert our sample DataFrame to a Parquet file
- Define which columns contain sensitive information
- Create and run a SQL query that applies different masking patterns:
  - Names: Preserves initials (e.g., "John Smith" → "J*** S***")
  - Emails: Hides local part while keeping domain (e.g., "john.smith@email.com" → "****@email.com")
  - Phone numbers: Shows only the last four digits
  - SSNs: Displays only the last four digits
  - Addresses: Keeps only street type
  - Salary: Remains unmasked as non-sensitive data

The output should look like:

Plain Text
    Original Data:
    =============
              name                 email         phone          ssn      address  salary
    0   John Smith  john.smith@email.com  123-456-7890  123-45-6789  123 Main St   75000
    1     Jane Doe  jane.doe@company.com  234-567-8901  234-56-7890  456 Oak Ave   85000
    2   Bob Wilson         bob@email.net  345-678-9012  345-67-8901  789 Pine Rd   65000
    3  Alice Brown       alice.b@org.com  456-789-0123  456-78-9012   321 Elm Dr   95000

    Masked Data:
    ===========
            name             email         phone          ssn  address  salary
    0  J*** S***    ****@email.com  ***-***-7890  ***-**-6789   *** St   75000
    1  J*** D***  ****@company.com  ***-***-8901  ***-**-7890  *** Ave   85000
    2  B*** W***    ****@email.net  ***-***-9012  ***-**-8901   *** Rd   65000
    3  A*** B***      ****@org.com  ***-***-0123  ***-**-9012   *** Dr   95000

Now, let's explore different masking patterns with explanations in the comments of the Python code snippets:

Email Masking Variations

Python
    # Show first letter only
    "john.smith@email.com" → "j***@email.com"
    # Show domain only
    "john.smith@email.com" → "****@email.com"
    # Show first and last letter
    "john.smith@email.com" → "j*********h@email.com"

Phone Number Masking

Python
    # Last 4 digits only
    "123-456-7890" → "***-***-7890"
    # First 3 digits only
    "123-456-7890" → "123-***-****"
    # Middle digits only
    "123-456-7890" → "***-456-****"

Name Masking

Python
    # Initials only
    "John Smith" → "J.S."
    # First letter of each word
    "John Smith" → "J*** S***"
    # Fixed length masking
    "John Smith" → "XXXX XXXXX"

Efficient Partitioned Data Processing

When dealing with large datasets, partitioning becomes crucial.
Here's how to handle partitioned data efficiently:

Python
def process_partitioned_data(base_path, partition_column, sensitive_columns):
    """
    Process partitioned data efficiently

    Parameters:
    - base_path: Base path to partitioned data
    - partition_column: Column used for partitioning (e.g., 'date')
    - sensitive_columns: List of columns to mask
    """
    conn = duckdb.connect(':memory:')
    try:
        # 1. List all partitions
        query = f"""
        WITH partitions AS (
            SELECT DISTINCT {partition_column}
            FROM read_parquet('{base_path}/*/*.parquet')
        )
        SELECT * FROM partitions;
        """
        partitions = conn.execute(query).fetchall()
        # The remaining steps (masking and writing each partition) are not shown in this excerpt
        return partitions
    finally:
        # Always release the connection, even if the query fails
        conn.close()

This function demonstrates several important concepts:

Dynamic partition discovery
Memory-efficient processing
Error handling with proper cleanup
Masked data output generation

The partition structure typically looks like:

Partition Structure

Plain Text
sample_data/
├── date=2024-01-01/
│   └── data.parquet
├── date=2024-01-02/
│   └── data.parquet
└── date=2024-01-03/
    └── data.parquet

Sample Data

Plain Text
Original Data:
date        customer_id  email            phone         amount
2024-01-01  1            user1@email.com  123-456-0001  500.00
2024-01-01  2            user2@email.com  123-456-0002  750.25
...

Masked Data:
date        customer_id  email  phone  amount
2024-01-01  1            ****   ****   500.00
2024-01-01  2            ****   ****   750.25

Below are some benefits of partitioned processing:

Reduced memory footprint
Parallel processing capability
Improved performance
Scalable data handling

Performance Optimization Techniques

1. Configuring Parallel Processing

Python
# Optimize for performance
conn.execute("""
    SET partial_streaming=true;
    SET threads=4;
    SET memory_limit='4GB';
""")

These settings:

Enable partial streaming for better memory management
Set parallel processing threads
Define memory limits to prevent overflow

2. Robust Error Handling

Python
import time

def robust_s3_read(s3_path, max_retries=3):
    """
    Implement reliable S3 data reading with retries.

    Parameters:
    - s3_path: Path to S3 data
    - max_retries: Maximum retry attempts
    """
    for attempt in range(max_retries):
        try:
            # conn is the DuckDB connection created earlier
            return conn.execute(f"SELECT * FROM read_parquet('{s3_path}')")
        except Exception:
            if attempt == max_retries - 1:
                raise  # Final attempt failed: surface the error to the caller
            time.sleep(2 ** attempt)  # Exponential backoff

This code block retries a failed read with exponential backoff and re-raises the exception once the final attempt fails, so the caller can react to it.

3. Storage Optimization

Python
# Efficient data storage with compression
conn.execute("""
    COPY (SELECT * FROM masked_data)
    TO 's3://output-bucket/masked_data.parquet'
    (FORMAT 'parquet', COMPRESSION 'ZSTD');
""")

This code block writes the masked data as Parquet with ZSTD compression to reduce storage size.

Best Practices and Recommendations

Security Best Practices

Security is crucial when handling data, especially in cloud environments. Following these practices helps protect sensitive information and maintain compliance:

IAM roles. Use AWS Identity and Access Management roles instead of direct access keys when possible
Key rotation. Implement regular rotation of access keys
Least privilege. Grant minimum necessary permissions
Access monitoring. Regularly review and audit access patterns

Why it's important: Security breaches can lead to data leaks, compliance violations, and financial losses. Proper security measures protect both your organization and your users' data.

Performance Optimization

Optimizing performance ensures efficient resource utilization and faster data processing:

Partition sizing. Choose appropriate partition sizes based on data volume and processing patterns
Parallel processing. Utilize multiple threads for faster processing
Memory management. Monitor and optimize memory usage
Query optimization. Structure queries for maximum efficiency

Why it's important: Efficient performance reduces processing time, saves computational resources, and improves overall system reliability.
Error Handling

Robust error handling ensures reliable data processing:

Retry mechanisms. Implement exponential backoff for failed operations
Comprehensive logging. Maintain detailed logs for debugging
Status monitoring. Track processing progress
Edge cases. Handle unexpected data scenarios

Why it's important: Proper error handling prevents data loss, ensures processing completeness, and makes troubleshooting easier.

Conclusion

Cloud data processing with DuckDB and AWS S3 offers a powerful combination of performance and security. Let me know how your DuckDB implementation goes!
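As a closing illustration of the retry and logging recommendations above, here is a minimal sketch that uses only the Python standard library. The helper name `retry_with_backoff` and its `base_delay` and `sleep` parameters are assumptions made for this example; they are not part of DuckDB or any AWS SDK. The `sleep` argument is injectable so the backoff schedule can be checked without actually waiting.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("retry_example")


def retry_with_backoff(
    operation: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,  # injectable for testing
) -> T:
    """Call operation(), retrying on any exception with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return operation()
        except Exception as exc:
            if attempt == max_retries - 1:
                # Out of attempts: log and re-raise the original error
                logger.error("giving up after %d attempts: %s", max_retries, exc)
                raise
            delay = base_delay * 2 ** attempt  # base, 2x base, 4x base, ...
            logger.warning("attempt %d failed (%s); retrying in %.1fs",
                           attempt + 1, exc, delay)
            sleep(delay)
    # Only reachable when the loop body never ran
    raise ValueError("max_retries must be at least 1")
```

Any zero-argument callable can be wrapped this way, for example a lambda that runs a `read_parquet` query on an existing connection.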

By Anil Kumar Moka

Keycloak and Docker Integration: A Step-by-Step Tutorial

Keycloak is a powerful authentication and authorization solution that provides plenty of useful features, such as roles and subgroups, an advanced password policy, and single sign-on. It’s also very easy to integrate with other solutions. We’ve already shown you how to connect Keycloak to your Angular app, but there’s more you can do. For example, by integrating this technology with Cypress, you can enable the simulation of real-user login scenarios, including multi-factor authentication and social logins, ensuring that security protocols are correctly implemented and functioning as expected. Most importantly, you can also use Docker containers to provide a portable and consistent environment across different platforms (possibly with container image scanning, for increased security). This integration ensures easy deployment, scalability, and efficient dependency management, streamlining the process of securing applications and services. Additionally, Docker Compose can be used to orchestrate multiple containers, simplifying complex configurations and enhancing the overall management of Keycloak instances. This guide will show you precisely how to set all of this up. Let’s get started! Prerequisites The article is based on the contents of a GitHub repository consisting of several elements: Frontend application written in AngularKeycloak configurationE2E tests written in CypressDocker configuration for the whole stack The point of this tech stack is to allow users to work with Angular/Keycloak/Cypress locally and also in Docker containers. Keycloak Configuration We’ll start by setting up Keycloak, which is a crucial part of both configurations. The idea is to run it inside a Docker container and expose it at http://localhost:8080. Keycloak has predefined configurations, including users, realm, and client ID, so setting it up for this project requires minimum effort. 
Normal User

Your normal user in the Keycloak panel should be configured using the following details:

User: test
Password: sIjKqg73MTf9uTU

Keycloak Administrator

Here’s the default configuration for the admin user (of course, you probably shouldn’t use default settings for the admin account in real-world scenarios).

User: admin
Password: admin

Local Configuration

This configuration allows you to work locally with an Angular application in dev mode along with E2E tests. It requires Keycloak to be running and available at http://localhost:8080. This is set in the Docker configuration, which is partially used here. To run the configuration locally, use the following commands in the command line.

First, in the main project directory:

Shell
npm install

In the /e2e directory:

Shell
npm install

In the main directory, for frontend application development:

Shell
npm run start

In the /e2e directory:

Shell
npm run cy:run

In the main project directory:

Shell
docker-compose up -d keycloak

Docker Configuration

Installing and configuring Docker is a relatively simple matter; the solution provides detailed documentation you can use if you run into any problems.
In the context of our project, the Docker configuration does several key things:

Running Keycloak and importing the predefined realm along with users
Building and exposing the Angular application on http://localhost:4200 via nginx in a separate Docker container
Running the e2e container to allow you to run tests via Cypress

To run the dockerized configuration, type the following in the command line in the main project directory:

Shell
docker-compose up -d

To run Cypress tests inside the container, use the following command:

Shell
docker container exec -ti e2e bash

Then, inside the container, run:

Shell
npm run cy:run

Test artifacts are shared with the host machine via a volume, so test reports, screenshots, and videos will be available immediately under /e2e/cypress/ in the following folders: reports, screenshots, and videos.

Conclusion

And that’s about it. As you can see, integrating Keycloak (or rather an Angular app that uses Keycloak), Docker, and Cypress is a relatively straightforward process. There are only a couple of steps you must take to get a consistent, containerized environment for easy deployment, scaling, and efficient dependency management, with the added benefit of real-user login scenario simulation thanks to Cypress.

By Michał Zięba

The Role of DQ Checks in Data Pipelines

Overview One of the key principles of writing a good data pipeline is ensuring accurate data is loaded into the target table. We have no control over the quality of the upstream data we read from, but we can have a few data quality (DQ) checks in our pipeline to ensure any bad data would be caught early on without letting it propagate downstream. DQ checks are critical in making sure the data that gets processed every day is reliable, and that downstream tables can query them safely. This will save a lot of time and resources, as we will be able to halt the data flow, giving us some time to do RCA and fix the issue rather than pass incorrect data. The biggest challenge with large data warehouses with multiple interdependent pipelines is that we would have no idea about the data issue if bad data gets introduced in one of the pipelines, and sometimes, it could take days, even before it's detected. Even though DQ check failures could cause some temporary delay in landing the data, it's much better than customers or users reporting data quality issues and then having to backfill all the impacted tables. Some of the common data quality issues that could occur are: Duplicate rows – a table at user grain (which means there can only be one row per user), having duplicates0 or null values – you expect certain critical columns not to have any null or 0 values, e.g., SSN, age, country columnsAbnormal row count – the overall row count of the table suddenly increases or drops compared to the historical valuesAbnormal metric value – a specific metric, say '# of daily user logins' suddenly spikes or drops compared to historical values Note: The operators we will be referencing below are part of the Dataswarm and Presto tech stack, which are a proprietary data pipeline building tool and an SQL query engine, respectively, developed at Facebook. Importance of Signal Tables It's a good practice to publish signal tables, which should serve as the source for downstream pipelines. 
These are essentially linked views that can be created on top of any table. Since they are views, they don’t take up any storage, so there is no reason not to build them. These should be created only after the DQ checks pass, and downstream pipelines should wait for these signal tables to be available rather than waiting for the source table directly, as these would have been vetted for any data anomalies.

Building the Right DAG

In the data lineage flow below, if bad data gets loaded into table1, then without DQ checks, it would get passed on to table2 and table3, as there is no way for pipeline2 and pipeline3 to know of any data issues; all they do is check whether the table1 data has landed. But if DQ checks had been implemented, the job/pipeline would fail and table1_signal wouldn’t be created; the downstream WaitForOperators would keep waiting, stopping the propagation of bad data.

Types of DQ Failures to Enforce

Hard failure. If these DQ checks fail, the job will fail and notify the oncall or table owner, so the signal table will not be created. These could potentially cause downstream pipelines to be delayed and could be an issue if they have tighter Service Level Agreements (SLAs). But for critical pipelines, this might be worth it, as sending bad data could have catastrophic ripple effects.
Soft failure. If these fail, the oncall and table owner would be notified, but the job won't fail, so the signal table would still get published, and the data would get loaded and propagated downstream. This can be used for cases where the data quality loss is tolerable.

Setting Up DQ Checks

We will go over some examples of how we can set up the different DQ checks and some simplified trigger logic behind each of the DQ operators.
Some things to know beforehand:

'<DATEID>' is a macro that will resolve to the date the Dataswarm pipeline is scheduled to run (e.g., when the job runs on Oct 1, 2020, it will resolve to ds = '2020-10-01').
The output of presto_api will be an array of dictionaries, e.g., [{'ds': '2020-10-01', 'userID': 123, 'country': 'US'}, {'ds': '2020-10-01', 'userID': 124, 'country': 'CA'}, {...}], where each key is a column name and each value is the corresponding row value of the table being queried.

The table representation of that example data would be:

ds          userID  country
2020-10-01  123     US
2020-10-01  124     CA

Duplicate Rows

We can aggregate by the key column (e.g., userID) specified by the user and check whether any duplicate rows are present by performing a simple GROUP BY with a HAVING clause, limiting the result to just 1 row. The presto_results variable should be empty ([]); if not, there are duplicates in the table.

Python
# output will be an array of dicts representing each row in the table
# e.g. [{'ds': '2020-10-01', 'userID': 123}, {...}]
presto_results = presto_api(
    namespace = 'namespace_name',
    sql = '''
        SELECT
            userID
        FROM table
        WHERE ds = '<DATEID>'
        GROUP BY 1
        HAVING SUM(1) > 1
        LIMIT 1
    '''
)

if len(presto_results) > 0:
    # NOTIFY oncall/owner
    # JOB KILLED
    pass
else:
    # JOB SUCCESS
    pass

0 or Null Values

We can check whether any of the specified columns have invalid values by leveraging the count_if Presto function. Here, if there are no invalid values, the output should be [{'userid_null_count': 0}].
Python
presto_results = presto_api(
    namespace = 'namespace_name',
    sql = '''
        SELECT
            COUNT_IF(
                userid IS NULL OR userid = 0
            ) AS userid_null_count
        FROM table
        WHERE ds = '<DATEID>'
    '''
)

if presto_results[0]['userid_null_count'] > 0:
    # NOTIFY oncall/owner
    # JOB KILLED
    pass
else:
    # JOB SUCCESS
    pass

Abnormal Row Count

To get a sense of what the normal/expected row count is for a table on a daily basis, we can take a simple average of the previous 7 days, and if today's value deviates too much from that, we can trigger the alert. The thresholds can be either:

Static – a fixed upper and lower threshold. Every day, the operator checks whether today’s row count is above the upper threshold or below the lower one.
Dynamic – use a +x% and -x% threshold value (you can start with, say, 15%, and adjust as needed), and if today's value is greater than the 7d avg + x% or lower than the 7d avg - x%, then trigger the alert.

Python
dq_insert_operator = PrestoInsertOperator(
    input_data = {"in": "source_table"},
    output_data = {"out": "dq_check_output_table"},
    select = """
        SELECT
            SUM(1) AS row_count
        FROM source_table
        WHERE ds = '<DATEID>'
    """,
)

dq_row_check_result = presto_api(
    namespace = 'namespace_name',
    sql = '''
        SELECT
            ds,
            row_count
        FROM dq_check_output_table
        WHERE ds >= '<DATEID-7>'
        ORDER BY 1
    '''
)

# dq_row_check_result will have 8 values, ordered by ds:
# we average DATEID-7 through DATEID-1 and compare against DATEID
x = .15  # threshold
prev_7d_list = dq_row_check_result[0:7]
prev_7d_sum = sum([prev_data['row_count'] for prev_data in prev_7d_list])
prev_7d_avg = prev_7d_sum / 7
today_value = dq_row_check_result[-1]['row_count']
upper_threshold = prev_7d_avg * (1 + x)
lower_threshold = prev_7d_avg * (1 - x)

if today_value > upper_threshold or today_value < lower_threshold:
    # NOTIFY oncall/owner
    # JOB KILLED
    pass
else:
    # JOB SUCCESS
    pass

So, every day, we calculate the total row count and load it into dq_check_output_table (a temporary intermediate table that is specially used for storing DQ aggregated results). Then, we query the last 7 days and today's data from that table and store the values in an object, from which we calculate the upper and lower thresholds and check whether today's value violates either of them.

Abnormal Metric Value

If there are specific metrics that you want to track for anomalies, you can set them up similarly to the 'abnormal row count' check above.

Python
dq_insert_operator = PrestoInsertOperator(
    input_data={"in": "source_table"},
    output_data={"out": "dq_check_output_table"},
    select="""
        SELECT
            APPROX_DISTINCT(userid) AS distinct_user_count,
            SUM(cost) AS total_cost,
            COUNT_IF(has_login = True) AS total_logins
        FROM source_table
        WHERE ds = '<DATEID>'
    """,
)

dq_row_check_result = presto_api(
    namespace='namespace_name',
    sql='''
        SELECT
            ds,
            distinct_user_count,
            total_cost,
            total_logins
        FROM dq_check_output_table
        WHERE ds >= '<DATEID-7>'
        ORDER BY 1
    '''
)

Here, we calculate the distinct_user_count, total_cost, and total_logins metrics and load them into the dq_check_output_table table, which we then query to find anomalies.

Takeaways

You can extend this to any kind of custom checks/alerts like month-over-month value changes, year-over-year changes, etc. You can also specify GROUP BY clauses, for example, to track the metric value at the interface or country level over a period of time.
You can set up a DQ check tracking dashboard, especially for important metrics, to see how they have been behaving over time. In the screenshot below, you can see that there have been DQ failures for two of the dates in the past, while for other days, the value has been within the predefined range. This can also be used to get a sense of how stable the upstream data quality is.
DQ checks can save a lot of time, as developers are able to catch issues early on and also figure out where in the lineage the issue is occurring.
Sometimes, the alerts could be false positives (FP): alerts generated not by bad or incorrect data, but by a genuine volume increase or decrease due to, say, seasonality or a new product launch. We need to ensure such edge cases are handled correctly to avoid noisy alerts. There is nothing worse than oncall being bombarded with FP alerts, so we want to be mindful of the thresholds we set and tune them periodically as needed.
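To make the threshold tuning discussed above easier to experiment with, the dynamic-threshold check and the hard/soft failure distinction can be sketched as plain Python functions that need no Dataswarm or Presto access. The function names, the returned status strings, and the sample numbers are assumptions made for this illustration; they are not part of either tool.

```python
from statistics import mean
from typing import Sequence, Tuple


def trailing_average_bounds(history: Sequence[float], pct: float = 0.15) -> Tuple[float, float]:
    """Return (lower, upper) thresholds as avg -/+ pct of the trailing values."""
    if not history:
        raise ValueError("history must contain at least one value")
    avg = mean(history)
    return avg * (1 - pct), avg * (1 + pct)


def evaluate_dq_check(history: Sequence[float], today_value: float,
                      pct: float = 0.15, hard: bool = True) -> str:
    """Classify today's value against the trailing average.

    Returns "success" when the value is inside the band. Otherwise returns
    "killed" for a hard failure (job stops, no signal table) or "warned" for
    a soft failure (owner notified, data still published).
    """
    lower, upper = trailing_average_bounds(history, pct)
    if lower <= today_value <= upper:
        return "success"
    return "killed" if hard else "warned"


# Example: seven previous daily row counts, then three candidate values for today
previous_7_days = [100, 100, 100, 100, 100, 100, 100]
print(evaluate_dq_check(previous_7_days, 110))              # inside the +/-15% band
print(evaluate_dq_check(previous_7_days, 120))              # outside, hard failure
print(evaluate_dq_check(previous_7_days, 120, hard=False))  # outside, soft failure
```

Replaying historical values through a function like this with different `pct` settings is one way to estimate how many false-positive alerts a given threshold would have produced.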

By Ajay Krishnan Prabhakaran

The Quest for HA and DR in Loki

According to the 2016 Ponemon Institute research, the average downtime cost is nearly $9,000 per minute. These downtimes not only cost money, but also hurt the competitive edge and brand reputation. The organization can prepare for downtime by identifying the root causes. For that, they need information on how the software and infrastructure is running. Many software programs help aggregate this information, and one of the popular and most used tools is Loki. However, keeping Loki active under pressure is another problem. Recently, our team ran the single monolith instance of Loki as a private logging solution for our application microservices rather than for observing Kubernetes clusters. The logs were stored in the EBS filesystem. We wanted our system to be more robust and resilient, so we implemented High Availability (HA) and Disaster Recovery (DR) for our microservice application. But it was difficult due to the following reasons: Running clustered Loki is not possible with the file system store unless the file system is shared in some fashion (NFS, for example)Using shared file systems with Loki can lead to instabilityShared file systems are prone to several issues, including inconsistent performance, locking problems, and increased risk of data corruption, especially under high loadDurability of the data depends solely on the file system’s reliability, which can be unpredictable Our team decided to use object stores like S3 or GCS. Object stores are specifically engineered for high durability and provide advanced behind-the-scenes mechanisms — such as automatic replication, versioning, and redundancy — to ensure your data remains safe and consistent, even in the face of failures or surges. In this blog post, we will share how we achieved high availability (HA) and configured disaster recovery (DR) for Loki with AWS S3 as our object store. This ensures we can prevent or minimize data loss and business disruption from catastrophic events. 
First, let’s briefly discuss Loki and see what makes it different. What Is Loki, and How Does It Help With Observability? Loki is a horizontally-scalable, highly-available, multi-tenant log aggregation system inspired by Prometheus. Loki differs from Prometheus by focusing on logs instead of metrics, and collecting logs via push, instead of pull. It is designed to be very cost-effective and highly scalable. Unlike other logging systems, Loki does not index the contents of the logs but only indexes metadata about your logs as a set of labels for each log stream. A log stream is a set of logs that share the same labels. Labels help Loki to find a log stream within your data store, so having a quality set of labels is key to efficient query execution. Log data is then compressed and stored in chunks in an object store such as Amazon Simple Storage Service (S3) or Google Cloud Storage (GCS) or, for development or proof of concept, on the file system. A small index and highly compressed chunks simplify the operation and significantly lower Loki’s cost. Now, we can understand the Loki deployment modes. Loki Deployment Modes Loki is a distributed system composed of multiple microservices, each responsible for specific tasks. These microservices can be deployed independently or together in a unique build mode where all services coexist within the same binary. Understanding the available deployment modes helps you decide how to structure these microservices to achieve optimal performance, scalability, and resilience in your environment. Different modes will impact how Loki’s components — like the Distributor, Ingester, Querier, and others — interact and how efficiently they manage logs. 
The list of Loki microservices includes: Cache Generation LoaderCompactorDistributorIndex-GatewayIngesterIngester-QuerierOverrides ExporterQuerierQuery-FrontendQuery-SchedulerRulerTable Manager (deprecated) Different Deployment Modes Loki offers different deployment modes, which allow us to build a highly available logging system. We need to choose the modes considering our log reads/writes rate, maintenance overhead, and complexity. Loki can be deployed in three modes, each suited for varying scales and complexity. Monolithic Mode The monolithic mode is the simplest option, where all Loki’s microservices run within a single binary or Docker image under all targets. The target flag is used to specify which microservices will run on startup. This mode is ideal for getting started with Loki, as it can handle log volumes of up to approximately 20 GB/day. High availability can be achieved by running multiple instances of the monolithic setup. Simple Scalable Deployment (SSD) Mode The Simple Scalable Deployment (SSD) mode is the preferred mode for most installations and is the default configuration when installing Loki via Helm charts. This mode balances simplicity and scalability by separating the execution paths into distinct targets: READ, WRITE, and BACKEND. These targets can be scaled independently based on business needs, allowing this deployment to handle up to a few terabytes of logs per day. The SSD mode requires a reverse proxy, such as Nginx, to route client API requests to the appropriate read or write nodes, and this setup is included by default in the Loki Helm chart. Microservices Deployment Mode The microservices deployment mode is the most granular and scalable option, where each Loki component runs as a separate process specified by individual targets. While this mode offers the highest control over scaling and cluster management, it is also the most complex to configure and maintain. 
Therefore, microservices mode is recommended only for huge Loki clusters or operators requiring precise control over the infrastructure. Achieving High Availability (HA) in Loki To achieve HA in Loki, we would: Configure multiple Loki instances using the memberlist_config configurationUse a shared object store for logs, such as: AWS S3Google Cloud StorageAny self-hosted storageSet the replication_factor to 3 These steps help ensure your logging service remains resilient and responsive. Memberlist Config memberlist_config is a key configuration element for achieving high availability in distributed systems like Loki. It enables the discovery and communication between multiple Loki instances, allowing them to form a cluster. This configuration is essential for synchronizing the state of the ingesters and ensuring they can share information about data writes, which helps maintain consistency across your logging system. In a high-availability setup, memberlist_config facilitates the dynamic management of instances, allowing the system to respond to failures and maintain service continuity. Other factors contributing to high availability include quorum, Write-Ahead Log (WAL), and replication factor. Replication Factor, Quorum, and Write-Ahead Log (WAL) 1. Replication Factor Typically set to 3, the replication factor ensures that data is written to multiple ingesters (servers), preventing data loss during restarts or failures. Having multiple copies of the same data increases redundancy and reliability in your logging system. 2. Quorum With a replication factor of 3, at least 2 out of 3 writes must succeed to avoid errors. This means the system can tolerate the loss of one ingester without losing any data. If two ingesters fail, however, the system will not be able to process writes successfully, thus emphasizing the importance of having a sufficient number of active ingesters to maintain availability. 3. 
Write-Ahead Log (WAL) The Write-Ahead Log provides an additional layer of protection against data loss by logging incoming writes to disk. This mechanism is enabled by default and ensures that even if an ingester crashes, the data can be recovered from the WAL. The combination of replication and WAL is crucial for maintaining data integrity, as it ensures that your data remains consistent and retrievable, even in the face of component failures. We chose the Simple Scalable Deployment (SSD) mode as the default deployment method for running Loki instead of using multiple instances in monolithic mode for high availability. The SSD mode strikes a balance between ease of use and the ability to scale independently, making it an ideal choice for our needs. Additionally, we opted to use AWS S3 as the object store while running our application and Loki in AWS EKS services, which provides a robust and reliable infrastructure for our logging needs. To streamline the setup process, refer to the Terraform example code snippet to create the required AWS resources, such as IAM roles, policies, and an S3 bucket with appropriate bucket policies. This code helps automate the provisioning of the necessary infrastructure, ensuring that you have a consistent and repeatable environment for running Loki with high availability. Guide to Installing Loki Following the guide, you can install Loki in Simple Scalable mode with AWS S3 as the object store. Below are the Helm chart values for reference, which you can customize based on your requirements. 
YAML
# https://github.com/grafana/loki/blob/main/production/helm/loki/values.yaml
# Grafana loki parameters: https://grafana.com/docs/loki/latest/configure/
loki:
  storage_config:
    # using tsdb instead of boltdb
    tsdb_shipper:
      active_index_directory: /var/loki/tsdb-shipper-active
      cache_location: /var/loki/tsdb-shipper-cache
      cache_ttl: 72h # Can be increased for faster performance over longer query periods, uses more disk space
      shared_store: s3
  schemaConfig:
    configs:
      - from: 2020-10-24
        store: tsdb
        object_store: s3
        schema: v12
        index:
          prefix: index_
          period: 24h
  commonConfig:
    path_prefix: /var/loki
    replication_factor: 3
    ring:
      kvstore:
        store: memberlist
  storage:
    bucketNames:
      chunks: aws-s3-bucket-name
      ruler: aws-s3-bucket-name
    type: s3
    s3:
      # endpoint is required if we are using aws IAM user secret access id and key to connect to s3
      # endpoint: "s3.amazonaws.com"
      # Region of the bucket

To ensure the Loki pods are in the Running state, use the command kubectl get pods -n loki. In this setup, we are running multiple replicas of the Loki read, write, and backend pods. With a replication_factor of 3, it is imperative to ensure that both the write and backend targets are operating with three replicas; otherwise, the quorum will fail, and Loki will be unavailable.

The following image illustrates Loki’s integration with Amazon S3 for log storage in a single-tenant environment. In this configuration, logs are organized into two primary folders within the S3 bucket: index and fake.

Index folder. This folder contains the index files that allow Loki to efficiently query and retrieve log data. The index serves as a mapping of log entries, enabling fast search operations and optimizing the performance of log retrieval.
Fake folder. This folder is used to store the actual log data. In a single-tenant setup, it may be labeled as "fake," but it holds the important logs generated by your applications.

Now Loki is running with HA.
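The quorum requirement mentioned above comes down to simple arithmetic. The sketch below assumes the usual majority rule, floor(replication_factor / 2) + 1, which matches the "2 out of 3 writes must succeed" behaviour described earlier; the function names are hypothetical and not part of Loki.

```python
def quorum(replication_factor: int) -> int:
    """Minimum number of successful writes, assuming a simple majority rule."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """How many replicas can be unavailable while writes still reach quorum."""
    return replication_factor - quorum(replication_factor)


for rf in (1, 3, 5):
    print(f"replication_factor={rf}: quorum={quorum(rf)}, "
          f"tolerated failures={tolerated_failures(rf)}")
```

With a replication factor of 3, the quorum is 2 and only one unavailable replica is tolerated, which is why running fewer than three write replicas leaves no headroom.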
Using logcli, we should also be able to verify the logs by querying against Loki instances. Exploring Approaches for Disaster Recovery Loki is a critical component of our application stack, responsible for aggregating logs from multiple microservices and displaying them in the web application console for end-user access. These logs need to be retained for an extended period — up to 90 days. As part of our disaster recovery (DR) strategy for the application stack, ensuring the availability and accessibility of logs during a disaster is crucial. If Region-1 becomes unavailable, the applications must continue to run and access logs seamlessly. To address this, we decided to implement high availability for Loki by running two instances in separate regions. If one Loki instance fails, the instance in the other region should continue to handle both read and write operations for the logs. We explored three different approaches to setting up DR for Loki, intending to enable read and write capabilities across both Region-1 and Region-2, ensuring fault tolerance and uninterrupted log management. Approach 1: Implementing S3 Cross-Region Replication AWS S3 Cross-Region Replication (CRR) is a feature that allows you to automatically replicate objects from one S3 bucket to another bucket in a different AWS region. This is particularly useful for enhancing data durability, availability, and compliance by ensuring that your data is stored in multiple geographic locations. With CRR enabled, any new objects added to your source bucket are automatically replicated to the destination bucket, providing a backup in case of regional failures or disasters. In Loki, setting up S3 CRR means that logs written to a single S3 bucket are automatically duplicated to another region. This setup ensures that logs are accessible even if one region encounters issues. 
However, when using multiple cross-region instances of Loki pointing to the same S3 bucket, there can be delays in log accessibility due to how Loki handles log flushing. Flushing Logs and Configuration Parameters When logs are generated, Loki stores them in chunks, which are temporary data structures that hold log entries before they are flushed to the object store (in this case, S3). The flushing process is controlled by two critical parameters: max_chunk_age and chunk_idle_period. Max Chunk Age: The max_chunk_age parameter defines the maximum time a log stream can be buffered in memory before it is flushed to the object store. When this value is set to a lower threshold (less than 2 hours), Loki flushes chunks more frequently. This leads to higher storage input/output (I/O) activity but reduces memory usage because logs are stored in S3 more often. Conversely, if max_chunk_age is set to a higher value (greater than 2 hours), it results in less frequent flushing, which can lead to higher memory consumption. In this case, there is also an increased risk of data loss if an ingester (the component that processes and writes logs) fails before the buffered data is flushed. Chunk Idle Period: The chunk_idle_period parameter determines how long Loki waits for new log entries in a stream before considering that stream idle and flushing the chunk. A lower value (less than 2 hours) can lead to the creation of too many small chunks, increasing the storage I/O demands. On the other hand, setting a higher value (greater than 2 hours) allows inactive streams to retain logs in memory longer, which can enhance retention but may lead to potential memory inefficiency if many streams become idle. This example shows querying logs from one Loki instance, which is pointed to a CRR-enabled S3 bucket. Here, we are querying the logs from another Loki instance, which is also reading logs from the same CRR-enabled S3 bucket. You can observe the delay of ~2 hours in the logs retrieved. 
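The interaction of the two flush parameters described above can be illustrated with a deliberately simplified model. This is not Loki's implementation: it ignores size-based flushing and every other ingester detail, and the function name and the default values (2 hours for the maximum age, 30 minutes for the idle period) are assumptions chosen for the example.

```python
from datetime import datetime, timedelta


def should_flush(now: datetime,
                 first_entry: datetime,
                 last_entry: datetime,
                 max_chunk_age: timedelta = timedelta(hours=2),
                 chunk_idle_period: timedelta = timedelta(minutes=30)) -> bool:
    """Simplified flush decision for one in-memory chunk.

    A chunk is flushed when it has been open longer than max_chunk_age,
    or when its stream has received no entries for chunk_idle_period.
    """
    too_old = now - first_entry >= max_chunk_age
    idle = now - last_entry >= chunk_idle_period
    return too_old or idle


opened = datetime(2024, 1, 1, 0, 0)

# Busy stream, chunk open for 1 hour: stays in memory, invisible to other readers of the bucket
print(should_flush(opened + timedelta(hours=1), opened, opened + timedelta(minutes=55)))

# Busy stream, chunk open for 2 hours: flushed because of max_chunk_age
print(should_flush(opened + timedelta(hours=2), opened, opened + timedelta(minutes=119)))

# Quiet stream, no entries for 45 minutes: flushed because of chunk_idle_period
print(should_flush(opened + timedelta(minutes=50), opened, opened + timedelta(minutes=5)))
```

Under this model, a continuously written stream can sit in the ingester's memory for up to `max_chunk_age` before it reaches S3, which is consistent with the roughly 2-hour lag seen by a second Loki instance that only reads from the bucket.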
With this approach, in the event of a disaster or failover in one region, there is a risk of losing up to 2 hours of log data. This potential data loss occurs because logs that have not yet been flushed from memory to the S3 bucket during that time frame may not be recoverable if the ingester fails.

In addition, Cross-Region Replication is an asynchronous process: objects are eventually replicated, but not immediately. Most objects replicate within 15 minutes, but sometimes replication can take a couple of hours or more. Several factors affect replication time, including:

The size of the objects to replicate
The number of objects to replicate

For example, if Amazon S3 is replicating more than 3,500 objects per second, then there might be latency while the destination bucket scales up for the request rate.

Because we wanted real-time logs to be accessible from both instances of Loki running in different regions, we decided against using AWS S3 Cross-Region Replication (CRR). This choice was made to minimize delays and ensure that logs could be retrieved promptly from both instances without the 2-hour latency associated with chunk flushing when using CRR. Instead, we focused on optimizing our setup to enable immediate log access across regions.

Approach 2: Utilizing the S3 Multi-Region Access Point

Amazon S3 Multi-Region Access Points (MRAP) offer a global endpoint for routing S3 request traffic across multiple AWS Regions, simplifying the architecture by eliminating complex networking setups. While Loki does not directly support MRAP endpoints, this feature can still enhance your logging solution. MRAP allows for centralized log management, improving performance by routing requests to the nearest S3 bucket, which reduces latency. It also boosts redundancy and reliability by rerouting traffic during regional outages, ensuring logs remain accessible. Additionally, MRAP can help minimize cross-region data transfer fees, making it a cost-effective option.
However, at the time of this writing, there is a known bug that prevents Loki from effectively using this endpoint. Understanding MRAP can still be beneficial for future scalability and efficiency in your logging infrastructure.

Approach 3: Employing Vector as a Sidecar

We decided to use Vector, a lightweight and ultra-fast tool for building observability pipelines. With Vector, we could collect, transform, and route logs to Loki, which stores them in AWS S3.

Our infrastructure is one S3 bucket and one Loki instance per region.
Vector runs as a sidecar with the application pods.
Since the EKS clusters are connected via a transit gateway, we configured a private endpoint for both Loki instances. We don’t want to expose them to the public, as they contain application logs.
We configured Vector sources to read the application logs, a transform to parse them, and sinks to write to both Loki instances.

This way, all logs are ingested and available in both Loki instances, and there is no need for cross-region replication or for sharing the same bucket across many regions.

Vector Configuration

Vector Remap Language (VRL) is an expression-oriented language designed for transforming observability data (logs and metrics) in a safe and performant manner. A Vector configuration is built from three kinds of components:

Sources collect or receive data from observability data sources into Vector.
Transforms manipulate or change that observability data as it passes through your topology.
Sinks send data onward from Vector to external services or destinations.
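To make the three roles concrete, here is a minimal Python sketch of a source, transform, and sink flow in which one transform feeds two sinks. It is not Vector code: `parse_message`, `run_pipeline`, and the two in-memory lists are illustrative stand-ins for the file source, the remap transform, and the two Loki endpoints.

```python
import json
from typing import Callable, Iterable

Event = dict
Sink = Callable[[Event], None]


def parse_message(event: Event) -> Event:
    """Transform: replace the event with the JSON object in its `message` field."""
    return json.loads(event["message"])


def run_pipeline(source: Iterable[Event],
                 transform: Callable[[Event], Event],
                 sinks: list[Sink]) -> None:
    """Read each event from the source, transform it once, fan it out to every sink."""
    for raw in source:
        event = transform(raw)
        for sink in sinks:
            sink(event)


# Two in-memory lists stand in for the local and cross-region Loki instances.
local_loki: list[Event] = []
remote_loki: list[Event] = []

run_pipeline(
    source=[{"message": '{"level": "info", "msg": "order placed"}'}],
    transform=parse_message,
    sinks=[local_loki.append, remote_loki.append],
)
```

The point of the design is that every event is written to both destinations at ingestion time, so neither region depends on the other region's object store to see a log line.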
YAML

    data_dir: /vector-data-dir

    sinks:
      # Write events to Loki in the same cluster
      loki_write:
        encoding:
          codec: json
        endpoint: http://loki-write.loki:3100
        inputs:
          - my_transform_id
        type: loki
      # Write events to Loki in the cross-region cluster
      loki_cross:
        encoding:
          codec: json
        endpoint: https://loki-write.aws-us-west-2.loki
        inputs:
          - my_transform_id
        type: loki

    # Define the source to read log files
    sources:
      my_source_id:
        type: file
        include:
          - /var/log/**/*.log

    # Define the transform to parse JSON log messages
    transforms:
      my_transform_id:
        type: remap
        inputs:
          - my_source_id
        # parse_json is fallible, so VRL requires its error to be handled;
        # the "!" variant aborts the program for messages that are not valid JSON
        source: |
          . = parse_json!(.message)

In this setup, Vector collects logs from files under the /var/log/ directory. It parses each message as JSON, replaces the entire event with the parsed JSON object, and sends the result to two Loki destinations (local and cross-region). The sinks encode events as JSON, and the transform uses parse_json! so that a message that is not valid JSON produces an explicit error instead of an unhandled one.

Conclusion

The journey to achieving high availability (HA) and disaster recovery (DR) for Loki has been challenging and enlightening. Through exploring various deployment modes and approaches, we’ve gained a deeper understanding of ensuring our logging system can withstand and recover from potential disruptions. The successful implementation of a Simple Scalable Mode with an S3 backend and the innovative use of Vector as a sidecar has fortified our system’s resilience and underscored the importance of proactive planning and continuous improvement in our infrastructure.

By Pavan N G

Understanding the Two Schools of Unit Testing

Unit testing is an essential part of software development. Unit tests help to check the correctness of newly written logic and protect a system from regressions by re-testing existing logic on every run (preferably with every build). However, there are two different approaches (or schools) to writing unit tests: the Classical (a.k.a. Detroit) school and the Mockist (or London) school. In this article, we’ll explore these two schools, compare their methodologies, and analyze their pros and cons. By the end, you should have a clearer understanding of which approach might work best for your needs.

What Is a Unit Test?

A unit test checks whether a small piece of code in an application works as expected. It isolates the tested block of code from other code and executes quickly to identify bugs early. The primary difference between the Classical school and the London school is in their definition of isolation.

The London school defines isolation as the isolation of a system under test (SUT) from its dependencies. Any external dependencies, such as other classes, are replaced with test doubles (e.g., mocks or stubs) to ensure the SUT’s behavior is unaffected by external factors.

The Classical school focuses on isolating tests from one another, enabling them to run independently and in parallel. Dependencies are tested together, provided they don’t rely on shared state, such as a database, which could cause interference.

Another important difference between the two approaches lies in the definition of what a unit is. In the London approach, a unit is usually a single class or a method of a class, since all the other dependencies are mocked. The Classical school can test a piece of logic consisting of several classes because it checks a unit of behavior rather than a unit of code. A unit of behavior here is something meaningful for the domain, for example, an API for making a purchase, tested as a whole rather than broken down into smaller actions such as withdrawal and deposit that are tested separately.
A Comparison of the Two Schools

Here is the same test written in both styles, first in the Classical style:

Java

    // Assumes static imports from JUnit and AssertJ (assertThat)
    @Test
    public void withdrawal_success() {
        final BankAccount account = new BankAccount(100);
        final Client client = new Client();

        client.withdraw(20, account);

        assertThat(account.getBalance()).isEqualTo(80);
    }

And here is the same test written in the London style:

Java

    // Assumes static imports from JUnit and Mockito (mock, when, verify, times)
    @Test
    public void withdrawal_success() {
        final Client client = new Client();
        final BankAccount accountMock = mock(BankAccount.class);
        when(accountMock.isSufficient(20)).thenReturn(true);

        // Pass the mock, not a real account, to the system under test
        client.withdraw(20, accountMock);

        verify(accountMock, times(1)).withdraw(20);
    }

The difference between the two examples is in the BankAccount object. In the Classical approach, the test uses a real BankAccount object and validates the final state of the object (the updated balance). In the London approach, we had to define the exact behavior of the mock to satisfy our test. In the end, we verified that a certain method was called instead of checking the real state of the object.

Key Differences

Testing of Implementation vs. Testing of Abstractions

London School

The London approach leads to highly detailed tests, because each test hardcodes implementation details and expects them to stay exactly as described in the test. This makes tests vulnerable to refactoring: any time one changes some inner logic, tests fail. This happens even when the change does not alter the observable output (e.g., splitting a class into two). After that, one has to fix the broken tests, and this exercise neither leads to higher product quality nor highlights a bug. It is just an overhead one has to deal with because the tests are vulnerable.

Classical School

The classical approach doesn’t have this problem, as it checks only the correctness of the contract.
It doesn’t check whether some intermediate dependencies were called and how many times they were called. As a result, if a change was made in the code that didn’t cause a different output, tests will not fail. Bugs Localization London School If you made a bug, you would be able to quickly identify the problem with the London testing approach, as usually, only relevant tests would fail. Classical School On the other hand, in Classical style one would see more failed tests because they may check the callers of a faulty class. This makes it harder to detect the problem and requires extra time for debugging. This is not a big problem, however. If you run tests periodically, you will always know what caused a bug. In addition, one doesn’t have to check all the failed tests. Fixing a bug in one place usually leads to fixing the rest of the tests. Handling Complex Dependency Graphs London School If you have a large graph of dependencies, mocks in the London approach are very helpful to reduce the complexity of preparing tests. One can mock only the first level of dependencies without going deeper into the graph. Classical School In the Classical approach, you have to implement all the dependencies, which may be time-consuming and take effort. On the other hand, a deep graph of dependencies can be a good marker of the poor design of an application. In this case, tests can only help identify flaws in the design. Integration Tests The definition of an integration test varies between the two schools. London School In the London style of testing, any test with implemented (not mocked) dependency is an integration test. Classical School The majority of unit tests in the Classical style would be considered integration tests in the London approach. Conclusion Both schools have their pros and cons. In my opinion, Classical school is preferable because it does not have a problem with test vulnerability as in London school. 
However, the London or mockist style is actively used and very popular, likely due to tools that set up certain ways of testing, for example, JUnit + Mockito for Java apps. Ultimately, the choice depends on your project’s needs and the trade-offs you’re willing to make.
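The trade-off between the two styles can also be illustrated outside Java. The following is a minimal sketch in Python using the standard library's unittest.mock; the BankAccount and Client classes are assumed implementations written for this illustration, since the article does not show them.

```python
from unittest.mock import Mock


class BankAccount:
    """Assumed implementation, written only for this illustration."""

    def __init__(self, balance: int) -> None:
        self.balance = balance

    def is_sufficient(self, amount: int) -> bool:
        return self.balance >= amount

    def withdraw(self, amount: int) -> None:
        self.balance -= amount


class Client:
    def withdraw(self, amount: int, account: BankAccount) -> None:
        if account.is_sufficient(amount):
            account.withdraw(amount)


def test_withdrawal_classical() -> None:
    # Classical style: real collaborator, assert on the resulting state.
    account = BankAccount(100)
    Client().withdraw(20, account)
    assert account.balance == 80


def test_withdrawal_london() -> None:
    # London style: mocked collaborator, assert on the interaction.
    account_mock = Mock(spec=BankAccount)
    account_mock.is_sufficient.return_value = True
    Client().withdraw(20, account_mock)
    account_mock.withdraw.assert_called_once_with(20)
```

If Client were later refactored to compare against account.balance directly instead of calling is_sufficient, the classical test would keep passing, while the London test would break because the mock was set up around the old interaction.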

By Alexander Rumyantsev

Structured Logging in Grails 6.2.3

Traditionally, logging has been unstructured, relying on plain text messages written to a file. This approach is not suitable for large-scale distributed systems emitting tons of events, and parsing unstructured logs to extract meaningful insights is cumbersome. Structured logging addresses this problem by capturing logs in a machine-readable format such as JSON, which makes log data easier to query and analyze when logs are aggregated into centralized platforms like ELK (Elasticsearch, Logstash, Kibana).

Traditional Logging in Grails

Before structured logging, Grails applications wrote unstructured logs using methods like log.info, log.debug, and log.error. These logs lacked a consistent format, were difficult to integrate with log management systems, and required manual effort to correlate across different parts of the application. For example, consider a simple controller method for listing books:

Groovy

    import grails.converters.JSON

    class BookController {

        def list() {
            log.info("Fetching list of books")
            try {
                def books = Book.list()
                log.debug("Books retrieved: ${books}")
                render books as JSON
            } catch (Exception e) {
                log.error("Error fetching books: ${e.message}")
                render status: 500, text: "Internal Server Error"
            }
        }
    }

In the above approach, logs are plain text, and analyzing them for patterns or errors requires manual parsing or the use of complex tools. Additionally, important context information such as user IDs or timestamps often needs to be added explicitly.

Structured Logging in Grails

Grails 6.1.x introduced structured logging capabilities, and Grails 6.2.x enhanced them further with default support for JSON-encoded logs, which makes it easier to adopt structured logging practices. The new approach allows developers to log information in a structured format, with no need to specify metadata such as timestamps, log levels, and contextual data manually.
With structured logging, the same controller method can be written as follows:

Groovy

    import grails.converters.JSON
    import org.slf4j.Logger
    import org.slf4j.LoggerFactory

    class BookController {

        private static final Logger log = LoggerFactory.getLogger(BookController.class)

        def list() {
            log.info("action=list_books", [userId: session.userId, timestamp: System.currentTimeMillis()])
            try {
                def books = Book.list()
                log.debug("action=books_retrieved", [count: books.size()])
                render books as JSON
            } catch (Exception e) {
                log.error("action=error_fetching_books", [message: e.message, timestamp: System.currentTimeMillis()])
                render status: 500, text: "Internal Server Error"
            }
        }
    }

Here, logs are structured key-value pairs that are machine-readable and easy to analyze with log aggregation tools.

Simplified Configuration

With Grails 6.2.x, enabling structured logging is straightforward using the configuration options in application.yml, and developers can define JSON-based logging patterns without extensive manual setup. The configuration looks like this:

YAML

    logging:
      level:
        grails.app: DEBUG
      appenders:
        console:
          name: STDOUT
          target: SYSTEM_OUT
          encoder:
            pattern: '{"timestamp": "%d{yyyy-MM-dd HH:mm:ss}", "level": "%-5level", "logger": "%logger{36}", "message": "%msg", "thread": "%thread"}%n'

These logs are output in a structured format that makes them suitable for integrating with tools like Elasticsearch or Splunk.

Comparison of Traditional and Structured Logging

A traditional log might look like this:

    INFO: Fetching list of books
    DEBUG: Books retrieved: [Book1, Book2, Book3]
    ERROR: Error fetching books: NullPointerException

A similar entry in structured logging looks like this:

JSON

    {
      "timestamp": "2025-01-20 14:23:45",
      "level": "INFO",
      "logger": "BookController",
      "message": "action=list_books",
      "userId": "12345",
      "thread": "http-nio-8080-exec-3"
    }

Conclusion

By transitioning from traditional logging to structured logging, developers can take advantage of modern log analysis tools.
With Grails 6.1.x and 6.2.x, structured logging has become more accessible, and developers can easily use it in their applications.
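The underlying idea is independent of Grails. As a language-neutral illustration, here is a minimal Python sketch that emits one JSON object per log line using only the standard library; `JsonFormatter`, `build_logger`, and the `context` attribute are names invented for this example, and the field names simply mirror the JSON sample above.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%d %H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "thread": record.threadName,
        }
        # Fields passed via extra={"context": {...}} become top-level keys.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Create a logger whose console handler writes JSON lines."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    return logger


log = build_logger("BookController")
log.info("action=list_books", extra={"context": {"userId": "12345"}})
```

Because every line is valid JSON, an aggregator can filter on fields such as level or userId without regular expressions, which is the practical benefit the comparison above describes.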

By Karthik Kamarapu

Build Modern Data Architectures With Azure Data Services

Modern data architecture is necessary for organizations trying to remain competitive. It is not a choice. Organizations are finding it difficult to use the exponentially expanding amounts of data effectively.

Importance of Modern Data Architectures

Modern data architectures matter because they give businesses a systematic way of dealing with large quantities of data and, in return, let them make faster decisions. Modern businesses rely on these architectures because they provide real-time processing, powerful analytics, and support for numerous data sources.

Understanding Modern Data Architectures

Modern data architectures are frameworks that enable large-scale data collection, processing, and analysis. Usually, they comprise elements including data lakes, data warehouses, real-time processing, and analytics tools. Important components include:

Scalability. The capability to handle increasing volumes of data over time while remaining efficient.
Flexibility. The ability to work with different data types irrespective of their formats.
Security. Measures that protect data and keep it confidential.

Modern data architectures provide better data integration, more analytics power, and lower operational costs. Common use cases are predictive analytics, real-time data processing, and solutions tailored to each client.

Key Features of Azure for Data Architecture

Microsoft Azure offers data services tailored for modern data architectures. These services empower organizations to store, maintain, process, and analyze data in a safe, scalable, and efficient manner. The following is a description of some of the important Azure tools for modern data architecture:

1. Azure Data Factory

Azure Data Factory is a cloud-based ETL and data integration service oriented towards building data-centric processes.
It allows users to build workflows that schedule and control data movement and transformation. It supports data integration by letting organizations centralize data from various sources in one location.

2. Azure Synapse Analytics

Azure Synapse Analytics is an analytics service that combines big data analytics and data warehousing. It allows enterprises to perform large-scale analytics on data and offers a unified approach to the ingestion, preparation, governance, and serving of data.

3. Azure Data Lake Storage

Azure Data Lake Storage is designed for secure, scale-out cloud storage. It offers low-cost storage with high throughput, which suits big data workloads.

4. Azure Databricks

Azure Databricks is a collaborative, fast, and simple Apache Spark-based analytics platform. It's a good choice for creating scalable data pipelines, machine learning models, and data-driven apps since it integrates closely with other Azure services.

Designing a Modern Data Architecture

Designing a modern data architecture requires a deliberate strategy for combining analytics tools, processing frameworks, and many data sources. Organizations can develop scalable, safe, and efficient architectures supporting their data-driven objectives using a disciplined design approach.

Steps to Design: Assess, Plan, Design, Implement, and Manage

Step 1. Assess
Determine the current state of the data implementation and where it needs improvement.

Step 2. Plan
Provide a blueprint that describes how compliance requirements, capacity needs, and data governance will be addressed.

Step 3. Design
Model an architecture consisting of analytics applications, processing systems, and databases.

Step 4. Implement
Build the architecture using the Azure services appropriate to your specific requirements.

Step 5. Manage
Monitor and optimize security, compute, availability, and performance across the entire environment.

Best Practices for Scalability, Performance, and Security

Building the architecture on the platform described above improves operational performance and service availability. Recommended security practices include regular audits, restricted user access, and data encryption.

Implementation Steps

Implementing a modern data architecture requires systematic planning of data scope, structural design, processing, and analysis. Organizations can use Azure's tools to streamline these processes and develop an organized and efficient data ecosystem.

1. Data Ingestion Strategies

Data ingestion is the process of bringing data from multiple sources into one system. Azure Data Factory and Azure Event Hubs support both batch and real-time ingestion.

2. Data Transformation and Processing

Use Azure Databricks and Azure Synapse Analytics to transform and process the data. These tools assist in cleaning, transforming, and preparing data for analytics.

3. Management and Data Storage

Azure Cosmos DB and Azure Data Lake Storage provide abundant, efficient, and secure storage options. They offer high availability and performance and support multiple data types.

4. Visualization and Data Analysis

The augmented analytics and visualizations offered by Azure Machine Learning, Power BI, and Azure Synapse Analytics allow decision-makers to act on real-time insights.

Challenges and Solutions

Modern data architecture addresses current needs, but it also brings integration, security, and scalability problems. Microsoft Azure provides capabilities that help organizations overcome these challenges and get more from their data plans.
Common Challenges in Building Data Architectures

Cleaning data, integrating various data sources, and ensuring data security are complex tasks. In addition, there’s the issue of scaling designs as data volumes grow.

How Azure Addresses These Challenges

To address these problems, Azure provides built-in security features and automated data validation. Its data services are flexible and can grow with the needs of the business.

Data Architecture Future Trends

In the future, data architecture will likely be characterized by edge computing, artificial intelligence-based analytics, and the use of blockchain technology for protecting data assets. Looking ahead, the steady stream of improvements in Azure positions it well with respect to these trends and for providing firms with the resources they need to compete.

Conclusion

Organizations trying to maximize the value of data depend on modern data architectures. Microsoft Azure offers thorough, scalable solutions for every aspect of data management. These technologies allow companies to create strong data systems that stimulate innovation and expansion.

By Aravind Nuthalapati

Data Governance Essentials: Policies and Procedures (Part 6)

What Is Data Governance, and How Do Data Quality, Policies, and Procedures Strengthen It? Data governance refers to the overall management of data availability, usability, integrity, and security in an organization. It encompasses people, processes, policies, standards, and roles that ensure the effective use of information. Data quality is a foundational aspect of data governance, ensuring that data is reliable, accurate, and fit for purpose. High-quality data is accurate, complete, consistent, and timely, which is essential for informed decision-making. Additionally, well-defined policies and procedures play a crucial role in data governance. They provide clear guidelines for data management, ensuring that data is handled properly and complies with relevant regulations. Data Governance Pillars Together, data quality, policies, and procedures strengthen data governance by promoting accountability, fostering trust in data, and enabling organizations to make better data-driven decisions. What Is Data Quality? Data quality is the extent to which data meets a company's standards for accuracy, validity, completeness, and consistency. It is a crucial element of data management, ensuring that the information used for analysis, reporting, and decision-making is reliable and trustworthy. Data Quality Dimensions 1. Why Is Data Quality Important? Data quality is crucial for several key reasons, and below are some of them: Improved Decision-Making High-quality data supports more accurate and informed decision-making. Enhanced Operational Efficiency Clean and reliable data helps streamline processes and reduce errors. Increased Customer Satisfaction Quality data leads to better products and services, ultimately enhancing customer satisfaction. Reduced Costs Poor data quality can result in significant financial losses. Regulatory Compliance Adhering to data quality standards is essential for meeting regulatory requirements. 2. What Are the Key Dimensions of Data Quality? 
The essential dimensions of data quality are described as follows:

Accuracy. Data must be correct and free from errors.
Completeness. Data should be whole and entire, without any missing parts.
Consistency. Data must be uniform and adhere to established standards.
Timeliness. Data should be current and up-to-date.
Validity. Data must conform to defined business rules and constraints.
Uniqueness. Data should be distinct and free from duplicates.

3. How to Implement Data Quality

The following steps will assist in implementing data quality in the organization.

Data profiling. Analyze the data to identify inconsistencies, anomalies, and missing values.
Data cleansing. Correct errors, fill in missing values, and standardize data formats.
Data validation. Implement rules and checks to ensure the integrity of data.
Data standardization. Enforce consistent definitions and formats for the data.
Master data management (MDM). Centralize and manage critical data to ensure consistency across the organization.
Data quality monitoring. Continuously monitor data quality metrics to identify and address any issues.
Data governance. Establish policies, procedures, and roles to oversee data quality.

By prioritizing data quality, organizations can unlock the full potential of their data assets and drive innovation.

Policies

Data policies are the rules and guidelines that govern how data is managed and used across the organization. They align with legal and regulatory requirements such as CCPA and GDPR and serve as the foundation for safeguarding data throughout its life cycle.

Data Protection Policies

Below are examples of key policies, including those specific to compliance frameworks like the California Consumer Privacy Act (CCPA) and General Data Protection Regulation (GDPR):

1. Protection During Data Extraction and Transformation

Data Validation Policies
Define rules to check the accuracy, completeness, and consistency of data during extraction and transformation.
Require adherence to data standards such as format, naming conventions, and mandatory fields.

Source System Quality Assurance Policies
Mandate profiling and quality checks on source systems before data extraction to minimize errors.

Error Handling and Logging Policies
Define protocols for detecting, logging, and addressing data quality issues during ETL processes.

Data Access Policies
Define role-based access controls (RBAC) to restrict who can view or modify data during extraction and transformation processes.

Audit and Logging Policies
Require logging of all extraction, transformation, and loading (ETL) activities to monitor and detect unauthorized changes.

Encryption Policies
Mandate encryption for data in transit and during transformation to protect sensitive information.

Data Minimization
Define policies to ensure only necessary data is extracted and used for specific purposes, aligning with GDPR principles.

2. Protection for Data at Rest and Data in Motion

Data Profiling Policies
Establish periodic profiling of data at rest to assess and maintain its quality.

Data Quality Metrics
Define specific metrics (e.g., accuracy rate, completeness percentage, duplication rate) that data at rest must meet.

Real-Time Monitoring Policies
For data in motion, implement policies requiring real-time validation of data against predefined quality thresholds.

Encryption Policies
Data at rest. Require AES-256 encryption for stored data across structured, semi-structured, and unstructured data formats.
Data in motion. Enforce TLS (Transport Layer Security) encryption for data transmitted over networks.

Data Classification Policies
Define levels of sensitivity (e.g., Public, Confidential, Restricted) and the required protections for each category.

Backup and Recovery Policies
Ensure periodic backups and the use of secure storage locations with restricted access.

Key Management Policies
Establish secure processes for generating, distributing, and storing encryption keys.

3. Protection for Different Data Types

Structured Data
Define rules for maintaining referential integrity in relational databases.
Mandate the use of unique identifiers to prevent duplication.
Implement database security policies, including fine-grained access controls, masking of sensitive fields, and regular integrity checks.

Semi-Structured Data
Ensure compliance with schema definitions to validate data structure and consistency.
Enforce policies requiring metadata tags to document the origin and context of the data.
Enforce security measures like XML/JSON encryption, validation against schemas, and access rules specific to APIs.

Unstructured Data
Mandate tools for text analysis, image recognition, or video tagging to assess data quality.
Define procedures to detect and address file corruption or incomplete uploads.
Define policies for protecting documents, emails, videos, and other formats using tools like digital rights management (DRM) and file integrity monitoring.

4. CCPA and GDPR Compliance Policies

Accuracy Policies
Align with GDPR Article 5(1)(d), which requires that personal data be accurate and up-to-date. Define periodic reviews and mechanisms to correct inaccuracies.

Consumer Data Quality Policies
Under CCPA, ensure that data provided to consumers upon request is accurate, complete, and up-to-date.

Retention Quality Checks
Require quality validation of data before deletion or anonymization to ensure compliance.

Data Subject Access Rights (DSAR) Policies
Define procedures to allow users to access, correct, or delete their data upon request.

Third-Party Vendor Policies
Require vendors to comply with CCPA and GDPR standards when handling organizational data.

Retention and Disposal Policies
Align with legal requirements to retain data only as long as necessary and securely delete it after the retention period.

Key aspects of data policies include:

Access control. Defining who can access specific data sets.
Data classification. Categorizing data based on sensitivity and usage.
Retention policies. Outlining how long data should be stored.
Compliance mandates. Ensuring alignment with legal and regulatory requirements.

Clear and enforceable policies provide the foundation for accountability and help mitigate risks associated with data breaches or misuse.

Procedures

Procedures bring policies to life: they are step-by-step instructions that ensure policies are effectively implemented and followed. Below are expanded examples of procedures for protecting data during extraction, transformation, storage, and transit, as well as for structured, semi-structured, and unstructured data:

1. Data Extraction and Transformation Procedures

Data Quality Checklists
Implement checklists to validate extracted data against quality metrics (e.g., no missing values, correct formats). Compare transformed data with expected outputs to identify errors.

Automated Data Cleansing
Use automated tools to detect and correct quality issues, such as missing or inconsistent data, during transformation.

Validation Testing
Perform unit and system tests on ETL workflows to ensure data quality is maintained.

ETL Workflow Monitoring
Regularly review ETL logs and audit trails to detect anomalies or unauthorized activities.

Validation Procedures
Use checksum or hash validation to ensure data integrity during extraction and transformation.

Access Authorization
Implement multi-factor authentication (MFA) for accessing ETL tools and systems.

2. Data at Rest and Data in Motion Procedures

Data Quality Dashboards
Create dashboards to visualize quality metrics for data at rest and in motion. Set alerts for anomalies such as sudden spikes in missing or duplicate records.

Real-Time Data Validation
Integrate validation rules into data streams to catch errors immediately during transmission.

Periodic Data Audits
Schedule regular audits to evaluate and improve the quality of data stored in systems.
Encryption Key Rotation
Schedule periodic rotation of encryption keys to reduce the risk of compromise.
Secure Transfer Protocols
Standardize the use of SFTP (SSH File Transfer Protocol) for moving files and ensure APIs use OAuth 2.0 for authorization.
Data Storage Segmentation
Separate sensitive data from non-sensitive data in storage systems to enhance security.

3. Structured, Semi-Structured, and Unstructured Data Procedures
Structured Data
- Run data consistency checks on relational databases, such as ensuring referential integrity and no orphan records.
- Schedule regular updates of master data to maintain consistency.
- Conduct regular database vulnerability scans.
- Implement query logging to monitor access patterns and detect potential misuse.
Semi-Structured Data
- Use tools like JSON or XML schema validators to ensure semi-structured data adheres to expected formats.
- Implement automated tagging and metadata extraction to enrich the data and improve its usability.
- Validate data against predefined schemas before ingestion into systems.
- Use API gateways with rate limiting to prevent abuse.
Unstructured Data
- Deploy machine learning tools to assess and improve the quality of text, image, or video data.
- Regularly scan unstructured data repositories for incomplete or corrupt files.
- Use file scanning tools to detect and classify sensitive information in documents or media files.
- Apply automatic watermarking for files containing sensitive data.

4. CCPA and GDPR Compliance Procedures
Consumer Request Validation
Before responding to consumer requests under CCPA or GDPR, validate the quality of the data to ensure it is accurate and complete. Implement error-handling procedures to address any discrepancies in consumer data.
Data Update Procedures
Establish workflows for correcting inaccurate data identified during regular reviews or consumer requests.
Deletion and Retention Quality Validation
Before data is deleted or retained for compliance, perform quality checks to confirm its integrity and relevance.
Right to Access/Deletion Requests
Establish a ticketing system for processing data subject requests and verifying user identity before fulfilling the request.
Breach Notification Procedures
Define steps to notify regulators and affected individuals within the time frames mandated by GDPR (72 hours for notifying the supervisory authority) and CCPA.
Data Anonymization
Apply masking or tokenization techniques to de-identify personal data used in analytics.

Roles and Responsibilities in Defining Policies and Procedures
The following are generalized roles and their responsibilities in defining policies and procedures. They may vary depending on the size and policies of the organization.

Data Governance Policy Makers
1. Data Governance Council (DGC)
Role
A strategic decision-making body comprising senior executives and stakeholders from across the organization.
Responsibilities
- Establish the overall data governance framework.
- Approve and prioritize data governance policies and procedures.
- Align policies with business objectives and regulatory compliance requirements (e.g., CCPA, GDPR).
- Monitor compliance and resolve escalated issues.

2. Chief Data Officer (CDO)
Role
Oversees the entire data governance initiative and ensures policies align with the organization’s strategic goals.
Responsibilities
- Lead the development of data governance policies and ensure buy-in from leadership.
- Define data governance metrics and success criteria.
- Ensure the integration of policies across structured, semi-structured, and unstructured data systems.
- Advocate for resource allocation to support governance initiatives.

3. Data Governance Lead/Manager
Role
Operationally manages the implementation of data governance policies and procedures.
Responsibilities
- Collaborate with data stewards and owners to draft policies.
- Ensure policies address data extraction, transformation, storage, and movement.
- Develop and document procedures based on approved policies.
- Facilitate training and communication to ensure stakeholders understand and adhere to policies.

4. Data Stewards
Role
Serve as subject matter experts for specific datasets, ensuring data quality, compliance, and governance.
Responsibilities
- Enforce policies for data accuracy, consistency, and protection.
- Monitor the quality of structured, semi-structured, and unstructured data.
- Implement specific procedures such as data masking, encryption, and validation during ETL processes.
- Ensure compliance with policies related to CCPA and GDPR (e.g., data classification and access controls).

5. Data Owners
Role
Typically business leaders or domain experts responsible for specific datasets within their area of expertise.
Responsibilities
- Define access levels and assign user permissions.
- Approve policies and procedures related to their datasets.
- Ensure data handling aligns with regulatory and internal standards.
- Resolve data-related disputes or issues escalated by stewards.

6. Legal and Compliance Teams
Role
Ensure policies meet regulatory and contractual obligations.
Responsibilities
- Advise on compliance requirements, such as GDPR, CCPA, and industry-specific mandates.
- Review and approve policies related to data privacy, retention, and breach response.
- Support the organization in audits and regulatory inspections.

7. IT and Security Teams
Role
Provide technical expertise to secure and implement policies at a systems level.
Responsibilities
- Implement encryption, data masking, and access control mechanisms.
- Define secure protocols for data in transit and at rest.
- Monitor and log activities to enforce data policies (e.g., audit trails).
- Respond to and mitigate data breaches, ensuring adherence to policies and procedures.

8. Business Units and Data Consumers
Role
Act as end users of the data governance framework.
Responsibilities
- Adhere to the defined policies and procedures in their day-to-day operations.
- Provide feedback to improve policies based on practical challenges.
- Participate in training sessions to understand data governance expectations.

Workflow for Defining Policies and Procedures
Steps in the Policy and Procedure Workflow
1. Policy Development
- Initiation. The CDO and Data Governance Lead identify the need for specific policies based on organizational goals and regulatory requirements.
- Drafting. Data stewards, legal teams, and IT collaborate to draft comprehensive policies addressing technical, legal, and operational concerns.
- Approval. The Data Governance Council reviews and approves the policies.
2. Procedure Design
- Operational input. IT and data stewards define step-by-step procedures to enforce the approved policies.
- Documentation. Procedures are formalized and stored in a central repository for easy access.
- Testing. Procedures are tested to ensure feasibility and effectiveness.
3. Implementation and Enforcement
- Training programs are conducted for employees across roles.
- Monitoring tools are deployed to track adherence and flag deviations.
4. Continuous Improvement
Policies and procedures are periodically reviewed to accommodate evolving regulations, technologies, and business needs.

By involving the right stakeholders and clearly defining roles and responsibilities, organizations can ensure their data governance policies and procedures are robust, enforceable, and adaptable to changing requirements.
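To make the testing of procedures more concrete, here is a minimal Python sketch of two procedures described earlier: checksum (hash) validation during extraction and transformation, and a data quality checklist that flags missing values. The function names `sha256_of` and `quality_issues` and the sample records are hypothetical, chosen only for illustration. A real ETL pipeline would likely hash whole files or batches and apply many more checks.

```python
import hashlib


def sha256_of(records):
    """Return a SHA-256 hex digest computed over an iterable of text records."""
    digest = hashlib.sha256()
    for record in records:
        digest.update(record.encode("utf-8"))
        digest.update(b"\n")  # delimiter so record boundaries affect the hash
    return digest.hexdigest()


def quality_issues(rows, required_fields):
    """Return (row_index, field) pairs where a required field is missing or empty."""
    issues = []
    for index, row in enumerate(rows):
        for field in required_fields:
            value = row.get(field)
            if value is None or str(value).strip() == "":
                issues.append((index, field))
    return issues


if __name__ == "__main__":
    # Hypothetical extracted batch and the copy that arrived downstream.
    extracted = ["1,alice@example.com", "2,bob@example.com"]
    loaded = list(extracted)
    print("checksums match:", sha256_of(extracted) == sha256_of(loaded))

    # Hypothetical rows checked against a "no missing values" checklist item.
    rows = [
        {"id": 1, "email": "alice@example.com"},
        {"id": 2, "email": ""},
    ]
    print("issues:", quality_issues(rows, ["id", "email"]))
```

Comparing a digest taken at the source with one taken after loading detects any change to the records in between. The checklist function returns positions rather than raising, so the findings can feed a dashboard or an alert.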
Popular Tools
The following table lists the top 10 most popular companies that support data governance, data quality, policies, and procedures:

| # | Tool | Best For | Key Features | Use Cases |
|---|------|----------|--------------|-----------|
| 1 | Ataccama | Data Quality, MDM, Governance | Automated data profiling, cleansing, and enrichment; AI-driven data discovery and anomaly detection | Ensuring data accuracy during ETL processes; Automating compliance checks (e.g., GDPR, CCPA) |
| 2 | Collibra | Enterprise Data Governance, Cataloging | Data catalog for structured, semi-structured, and unstructured data; Workflow management; Data lineage tracking | Cross-functional collaboration on governance; Automating compliance documentation and audits |
| 3 | Oracle EDM | Comprehensive Data Management | Data security and lifecycle management; Real-time quality checks; Integration with Oracle Analytics | Managing policies in complex ecosystems; Monitoring real-time data quality |
| 4 | IBM InfoSphere | Enterprise-Grade Governance, Quality | Automated data profiling; Metadata management; AI-powered recommendations for data quality | Governing structured and semi-structured data; Monitoring and enforcing real-time quality rules |
| 5 | OvalEdge | Unified Governance and Collaboration | Data catalog and glossary; Automated lineage mapping; Data masking capabilities | Developing and communicating governance policies; Tracking and mitigating policy violations |
| 6 | Manta | Data Lineage and Impact Analysis | Visual data lineage; Integration with quality and governance platforms | Enhancing policy enforcement for data in motion; Strengthening data flow visibility |
| 7 | Talend Data Fabric | End-to-End Data Integration, Governance | Data cleansing and validation; Real-time quality monitoring; Compliance tools | Maintaining data quality in ETL processes; Automating privacy policy enforcement |
| 8 | Informatica Axon | Enterprise Governance Frameworks | Integrated quality and governance; Automated workflows; Collaboration tools | Coordinating governance across global teams; Establishing scalable data policies and procedures |
| 9 | Microsoft Purview | Cloud-First Governance and Compliance | Automated discovery for hybrid environments; Policy-driven access controls; Compliance reporting | Governing hybrid cloud data; Monitoring data access and quality policies |
| 10 | DataRobot | AI-Driven Quality and Governance | Automated profiling and anomaly detection; Governance for AI models; Real-time quality monitoring | Governing data in AI workflows; Ensuring compliance of AI-generated insights |

Conclusion
Together, data quality, policies, and procedures form a robust foundation for an effective data governance framework. They not only help organizations manage their data efficiently but also ensure that data remains a strategic asset driving growth and innovation. By implementing these policies and procedures, organizations can ensure compliance with legal mandates, protect data integrity and privacy, and enable secure and effective data governance practices. This layered approach safeguards data assets while supporting the organization’s operational and strategic objectives.

References
- Ataccama
- Collibra
- What Is a Data Catalog?, Oracle
- What is a data catalog?, IBM
- 5 Core Benefits of Data Lineage, OvalEdge

By Sukanya Konatam

Build a Stateless Microservice With GitHub Copilot in VSCode

Microsoft CEO Satya Nadella recently announced that GitHub Copilot is now free for all developers in VSCode. This is a game-changer in the software development industry. GitHub Copilot is an AI code assistant that helps developers finish their coding tasks easily and quickly by suggesting code snippets and autocompleting functions. In this article, we will learn, step by step, how to use GitHub Copilot in VSCode to create a first stateless Flask microservice. This is a beginner-friendly guide showcasing how Copilot helps reduce development time and simplify the process.

Setting Up the Environment Locally
As our primary focus will be on GitHub Copilot, I will cover the required software installation only at a high level. If you run into installation issues, you will need to resolve them locally or comment on this article, where I can try to help.
1. Install Visual Studio Code on Mac or Windows from the VSCode website (my examples use a Mac).
2. Install the GitHub Copilot extension in VSCode: Open VSCode and navigate to the Extensions view on the left. Search for "copilot," and GitHub Copilot will appear. Click Install. With this step, the Copilot extension is added to VSCode.
3. Activate Copilot: If you do not have a GitHub account, please create one on GitHub. Back in VSCode, after installing Copilot, the Welcome tab prompts you to sign up. Sign up using a GitHub account. Click "Chat with Copilot," and the Copilot chat appears in the panel on the right-hand side of VSCode.
4. Install Python on your system from the Python website, choosing the installer for Windows or Mac. Note that we are not installing Flask now; we will do it in a later step, before running the application.

Writing the Microservice Using Copilot
1. In VSCode, in the Copilot panel on the right, under "Ask Copilot," type: Create a Flask app. There are two ways to ask Copilot. One is to create the Flask project folder and files yourself and ask Copilot to add the code. The other is to start from nothing and ask it to create a Flask app. With the second approach, Copilot creates a workspace along with all the files, and the project is ready with the required files within a few seconds. Click Create Workspace and choose a location to save the project. The project will appear in VSCode.
2. The generated project includes routes.py, where a few default APIs are already generated. Now we will create two APIs using Copilot. The first API is simple and greets a person: it takes the person's name as input and returns "Hello, {name}." Open the routes.py file and add a comment describing the endpoint. As soon as we hit Enter, Copilot suggests the code; press Tab to accept it, and the API code is in place. That's the advantage of using Copilot. Similarly, let's create another simple API that takes two integer values as input and returns their product. This time we will ask in the Copilot chat panel on the right of VSCode rather than in the routes.py file.
Python
# Create an endpoint to multiply two numbers.
@main.route('/multiply')
def multiply():
    try:
        num1 = float(request.args.get('num1'))
        num2 = float(request.args.get('num2'))
        result = num1 * num2
        return f'The result of {num1} * {num2} is {result}'
    except (TypeError, ValueError):
        return 'Invalid input. Please provide two numbers as query parameters.'
However, different code was generated when I asked Copilot to write the API inside the routes.py file. See below:
Python
# Create an endpoint to multiply two numbers.
@main.route('/multiply/<int:num1>/<int:num2>')
def multiply(num1, num2):
    return f'{num1} * {num2} = {num1 * num2}'
The reason is that Copilot generates code based on the preceding context.
When we were in the routes.py file and asked Copilot to generate the API code, it worked from the context that the API should take two inputs and return the output. But when we made the request in the chat panel on the right, it worked from the previous question, with the context that this is a Flask app and the input will come from request parameters. So we can safely conclude that Copilot generates its next output based on the previous context.
Now both our APIs are ready, so let's deploy the app and test it. But we have not installed Flask yet, so let's do that.
1. Activate the virtual environment and install Flask.
Plain Text
source venv/bin/activate # On Linux/Mac
venv\Scripts\activate # On Windows
pip install flask
When we run the application, startup fails because of the generated code. Below is the error:
Plain Text
  File "/Users/sibasispadhi/Documents/coding/my-flask-app/venv/lib/python3.12/site-packages/flask/cli.py", line 72, in find_best_app
    app = app_factory()
          ^^^^^^^^^^^^^
  File "/Users/sibasispadhi/Documents/coding/my-flask-app/app/__init__.py", line 14, in create_app
    app.register_blueprint(routes.bp)
                           ^^^^^^^^^
AttributeError: module 'app.routes' has no attribute 'bp'
(venv) sibasispadhi@Sibasiss-Air my-flask-app %
The create_app function in our project's app/__init__.py file calls app.register_blueprint(routes.bp), but the routes.py file doesn’t define bp (a Blueprint object); the blueprint there is named main. Below is the change that fixes the problem (the commented-out line is the autogenerated one).
Python
# Register blueprints
from . import routes
# app.register_blueprint(routes.bp)
app.register_blueprint(routes.main)
Re-running the application now starts it successfully, and we are ready to test the functionality. The APIs can be tested using Postman.
2. Testing through Postman returns the expected results.
Conclusion
GitHub Copilot generates the project and the boilerplate code seamlessly, saving development time and effort.
It's always advisable to review the generated code so that it matches the developer's expectations. Whenever there is an error, we must debug it or ask Copilot for further suggestions to solve the problem. In this project, Copilot helped us create and run a stateless Flask microservice in no time. We faced some initial hiccups, which were solved after debugging, but overall, development was faster. I would suggest all readers start exploring Copilot today and enhance their day-to-day productivity. Stay tuned for my next set of articles on Copilot, where we will dive deep into more real-world scenarios and see how it solves our day-to-day tasks in a smooth manner.
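As a recap, the pieces discussed in this article can be sketched in a single file and exercised with Flask's built-in test client instead of Postman. This is a minimal sketch, not the code Copilot generated: the blueprint name `main` follows the article, while the `/greet/<name>` route, the JSON response shape, and the 400 status code for bad input are assumptions made for illustration.

```python
from flask import Blueprint, Flask, jsonify, request

# Blueprint named `main`, matching the name registered in the fix above.
main = Blueprint("main", __name__)


# Assumed route shape for the greeting API; the article does not show it.
@main.route("/greet/<name>")
def greet(name):
    return f"Hello, {name}."


@main.route("/multiply")
def multiply():
    # Read both operands from query parameters; a missing value raises
    # TypeError and a non-numeric value raises ValueError.
    try:
        num1 = float(request.args.get("num1"))
        num2 = float(request.args.get("num2"))
    except (TypeError, ValueError):
        return jsonify(error="Provide two numbers as query parameters."), 400
    return jsonify(result=num1 * num2)


def create_app():
    """Application factory that registers the `main` blueprint."""
    app = Flask(__name__)
    app.register_blueprint(main)
    return app


if __name__ == "__main__":
    # Exercise the endpoints in-process, without starting a server.
    client = create_app().test_client()
    print(client.get("/greet/Ada").get_data(as_text=True))
    print(client.get("/multiply?num1=3&num2=4").get_json())
    print(client.get("/multiply?num1=abc").status_code)
```

The service stays stateless because each handler computes its response only from the incoming request, so any instance can serve any call.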

By Sibasis Padhi