Part 1: The Challenge and the Workload

FinOps is an evolving practice for delivering maximum value from cloud investments. Most organizations adopting FinOps focus on highly tactical, highly visible activities: they analyze applications after deployment to understand and then optimize their cloud usage and cost. While this approach clearly demonstrates benefits, it falls short of FinOps' potential because workloads effectively have to be built multiple times. Shifting FinOps left means you build once; this not only reduces your cloud bill but also increases innovation within your business by freeing up your resources.

In this series, we walk through an example solution and show how to implement a shift-left approach to FinOps, demonstrating techniques to discover and validate cost optimizations throughout a typical cloud software development lifecycle.

Part 1: The Challenge and the Workload
Part 2: Creating and Implementing the Cost Model
Part 3: Cost Optimization Techniques for Infrastructure
Part 4: Cost Optimization Techniques for Applications
Part 5: Cost Optimization Techniques for Data
Part 6: Implementation/Case Study Results

The Challenge

In its current form, this evolving discipline has three iterative phases: Inform, Optimize, and Operate. The Inform phase provides the visibility needed to create shared accountability. The Optimize phase identifies efficiency opportunities and determines their value. The Operate phase defines and implements processes that achieve the goals of technology, finance, and business.

FinOps Phases

However, with modern cloud pricing calculators and workload planning tools, it is possible to get visibility into your complete cloud cost well before anything is built, without having to go through the development process. The cost of development, deployment, and operations can be determined from the architecture, services, and technical components.

The current architecture method involves understanding the scope and requirements. The personas involved and the functional requirements are captured as use cases. The non-functional requirements are captured as qualities (security, performance, scalability, and availability) and constraints. Based on the functional and non-functional requirements, a candidate architecture is proposed.

Existing architecture method and FinOps activities

As soon as a candidate architecture is proposed, we add a phase to build a FinOps model for it. In this step, we shift some of the FinOps activities left into the architecture phase itself. The candidate architecture is reviewed through the lens of FinOps for optimizations. This goes through iterations and refinement of the architecture to arrive at an optimal solution cost without compromising on any of the functional or non-functional aspects.

Shift-left FinOps model for creating a working architecture

Building a FinOps cost model is very similar to shifting security left in a DevOps pipeline by creating a threat model up front. Creating a FinOps model for the solution is an iterative process. It starts with establishing an initial baseline cost for the candidate architecture. The solution components are then reviewed for cost optimization. In some cases, teams may need to perform a proof of engineering to obtain cost estimates or projections.
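To make the idea of a baseline cost model concrete, here is a minimal sketch in Python. The component names and unit prices are illustrative assumptions, not actual cloud list prices; in practice you would pull rates from your provider's pricing calculator or API and iterate on the model during each FinOps review.

```python
from dataclasses import dataclass

@dataclass
class Component:
    name: str             # e.g., "ingestion-queue" (illustrative names only)
    unit_cost_usd: float  # assumed monthly unit price, not a real list price
    units: float          # expected quantity (instances, GB, millions of requests)

def baseline_cost(components: list[Component]) -> float:
    """Sum the estimated monthly cost of a candidate architecture."""
    return sum(c.unit_cost_usd * c.units for c in components)

# Hypothetical candidate architecture for the example workload
candidate = [
    Component("ingestion-queue", unit_cost_usd=0.40, units=50),     # per million messages
    Component("processing-cluster", unit_cost_usd=70.0, units=6),   # per node-month
    Component("object-storage", unit_cost_usd=0.023, units=5000),   # per GB-month
    Component("reporting-db", unit_cost_usd=250.0, units=1),        # per instance-month
]

print(f"Baseline monthly estimate: ${baseline_cost(candidate):,.2f}")
# Each review iteration adjusts components or their attributes (instance sizes,
# storage tiers, data formats) and recomputes the baseline for comparison.
```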
The cost optimization techniques need to be applied at various levels, or layers, to arrive at a working architecture. They can be divided as follows:

Cost Optimization Techniques for Infrastructure
Cost Optimization Techniques for Applications
Cost Optimization Techniques for Data

The Workload

Workloads can be very different; however, when they are viewed as functional components, you can apply similar optimization approaches and techniques to maximize efficiency and value. Most workloads involve some form of data input, processing, and output, so the example we will use is a cloud-native application that ingests data, processes it to enrich and analyze it, and then outputs the data along with reports and insights for a user. We take a cloud-agnostic approach and break the workload and optimization techniques into the following components:

Infrastructure: The compute, storage, and networking, including the resources, services, and their associated attributes.
Application: The application design and architecture, covering how the application behaves and functions on the infrastructure.
Data: The data itself, along with how it is formatted and handled throughout the workload.

These methods and techniques for each component and layer are discussed in detail with the help of an example. The workload for this example is a cloud-native application that performs domain-specific data ingestion, processing, and analysis. The structured and semi-structured data is enriched and analyzed further to create reports and insights for the end user. The application is architected as a deployment that can leverage services in multiple clouds, for instance, AWS and GCP.

Candidate architecture for the representative cloud-native application

Conclusion

FinOps is a practice that gives the enterprise a better way to manage its cloud spending. Shifting FinOps left creates more opportunities to save costs earlier in the software development lifecycle. It involves introducing a few simple steps before the solution architecture is pushed into detailed design and implementation: creating a FinOps cost model and iterating through it to ensure that you have applied the cost optimization techniques at the infrastructure, application, and data layers and components. By shifting FinOps left, you can optimize your overall cloud expenses. In Part 2 of this series, we will create and implement the cost model.
In today's fast-evolving technology landscape, the integration of Artificial Intelligence (AI) into Internet of Things (IoT) systems has become increasingly prevalent. AI-enhanced IoT systems have the potential to revolutionize industries such as healthcare, manufacturing, and smart cities. However, deploying and maintaining these systems can be challenging due to the complexity of the AI models and the need for seamless updates and deployments. This article is tailored for software engineers and explores best practices for implementing Continuous Integration and Continuous Deployment (CI/CD) pipelines for AI-enabled IoT systems, ensuring smooth and efficient operations.

Introduction to CI/CD in IoT Systems

CI/CD is a software development practice that emphasizes the automated building, testing, and deployment of code changes. While CI/CD has traditionally been associated with web and mobile applications, its principles can be applied effectively to AI-enabled IoT systems. These systems often consist of multiple components, including edge devices, cloud services, and AI models, making CI/CD essential for maintaining reliability and agility.

Challenges in AI-Enabled IoT Deployments

AI-enabled IoT systems face several unique challenges:

Resource Constraints: IoT edge devices often have limited computational resources, making it challenging to deploy resource-intensive AI models.
Data Management: IoT systems generate massive amounts of data, and managing this data efficiently is crucial for AI model training and deployment.
Model Updates: AI models require periodic updates to improve accuracy or adapt to changing conditions. Deploying these updates seamlessly to edge devices is challenging.
Latency Requirements: Some IoT applications demand low-latency processing, necessitating efficient model inference at the edge.

Best Practices for CI/CD in AI-Enabled IoT Systems

Version Control: Implement version control for all components of your IoT system, including AI models, firmware, and cloud services. Use tools like Git to track changes and collaborate effectively. Create separate repositories for each component, allowing for independent development and testing.
Automated Testing: Implement a comprehensive automated testing strategy that covers all aspects of your IoT system. This includes unit tests for firmware, integration tests for AI models, and end-to-end tests for the entire system. Automation ensures that regressions are caught early in the development process.
Containerization: Use containerization technologies like Docker to package AI models and application code. Containers provide a consistent environment for deployment across various edge devices and cloud services, simplifying the deployment process.
Orchestration: Leverage container orchestration tools like Kubernetes to manage the deployment and scaling of containers across edge devices and cloud infrastructure. Kubernetes ensures high availability and efficient resource utilization.
Continuous Integration for AI Models: Set up CI pipelines specifically for AI models. Automate model training, evaluation, and validation. This ensures that updated models are thoroughly tested before deployment, reducing the risk of model-related issues.
Edge Device Simulation: Simulate edge devices in your CI/CD environment to validate deployments at scale. This allows you to identify potential issues related to device heterogeneity and resource constraints early in the development cycle.
Edge Device Management: Implement device management solutions that facilitate over-the-air (OTA) updates. These solutions should enable remote deployment of firmware updates and AI model updates to edge devices securely and efficiently.
Monitoring and Telemetry: Incorporate comprehensive monitoring and telemetry into your IoT system. Use tools like Prometheus and Grafana to collect and visualize performance metrics from edge devices, AI models, and cloud services. This helps detect issues and optimize system performance.
Rollback Strategies: Prepare rollback strategies in case a deployment introduces critical issues. Automate the rollback process to quickly revert to a stable version in case of failures, minimizing downtime.
Security: Security is paramount in IoT systems. Implement security best practices, including encryption, authentication, and access control, at both the device and cloud levels. Regularly update and patch security vulnerabilities.

CI/CD Workflow for AI-Enabled IoT Systems

Let's illustrate a CI/CD workflow for AI-enabled IoT systems:

Version Control: Developers commit changes to their respective repositories for firmware, AI models, and cloud services.
Automated Testing: Automated tests are triggered upon code commits. Unit tests, integration tests, and end-to-end tests are executed to ensure code quality.
Containerization: AI models and firmware are containerized using Docker, ensuring consistency across edge devices.
Continuous Integration for AI Models: AI models undergo automated training and evaluation. Models that pass predefined criteria are considered for deployment (a minimal sketch of such a validation gate appears at the end of this article).
Device Simulation: Simulated edge devices are used to validate the deployment of containerized applications and AI models.
Orchestration: Kubernetes orchestrates the deployment of containers to edge devices and cloud infrastructure based on predefined scaling rules.
Monitoring and Telemetry: Performance metrics, logs, and telemetry data are continuously collected and analyzed to identify issues and optimize system performance.
Rollback: In case of deployment failures or issues, an automated rollback process is triggered to revert to the previous stable version.
Security: Security measures, such as encryption, authentication, and access control, are enforced throughout the system.

Case Study: Smart Surveillance System

Consider a smart surveillance system that uses AI-enabled cameras for real-time object detection in a smart city. Here's how CI/CD principles can be applied:

Version Control: Separate repositories for camera firmware, AI models, and cloud services enable independent development and versioning.
Automated Testing: Automated tests ensure that camera firmware, AI models, and cloud services are thoroughly tested before deployment.
Containerization: Docker containers package the camera firmware and AI models, allowing for consistent deployment across various camera models.
Continuous Integration for AI Models: CI pipelines automate AI model training and evaluation. Models meeting accuracy thresholds are considered for deployment.
Device Simulation: Simulated camera devices validate the deployment of containers and models at scale.
Orchestration: Kubernetes manages container deployment on cameras and cloud servers, ensuring high availability and efficient resource utilization.
Monitoring and Telemetry: Metrics on camera performance, model accuracy, and system health are continuously collected and analyzed.
Rollback: Automated rollback mechanisms quickly revert to the previous firmware and model versions in case of deployment issues.
Security: Strong encryption and authentication mechanisms protect camera data and communication with the cloud.

Conclusion

Implementing CI/CD pipelines for AI-enabled IoT systems is essential for ensuring the reliability, scalability, and agility of these complex systems. Software engineers must embrace version control, automated testing, containerization, and orchestration to streamline development and deployment processes. Continuous monitoring, rollback strategies, and robust security measures are critical for maintaining the integrity and security of AI-enabled IoT systems. By adopting these best practices, software engineers can confidently deliver AI-powered IoT solutions that drive innovation across various industries.
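As a concrete illustration of the model-validation gate described in the workflow above, here is a minimal, hedged sketch. The threshold value, the metrics file path, and the evaluate_model helper are hypothetical placeholders; a real pipeline would plug in its own evaluation harness and CI tooling, using the script's exit code to allow or block the deployment stage.

```python
import json
import sys

ACCURACY_THRESHOLD = 0.92  # assumed acceptance criterion, not a standard value

def evaluate_model(metrics_path: str) -> float:
    """Read an accuracy score produced by the training job (hypothetical file format)."""
    with open(metrics_path) as f:
        return float(json.load(f)["accuracy"])

def main() -> int:
    accuracy = evaluate_model("artifacts/metrics.json")  # placeholder path
    if accuracy < ACCURACY_THRESHOLD:
        print(f"Model rejected: accuracy {accuracy:.3f} below {ACCURACY_THRESHOLD}")
        return 1  # non-zero exit fails the CI stage, blocking deployment
    print(f"Model accepted: accuracy {accuracy:.3f}")
    return 0  # the pipeline proceeds to containerize and deploy the model

if __name__ == "__main__":
    sys.exit(main())
```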
IT teams have been observing applications for their health and performance since the beginning. They observe the telemetry data (logs, metrics, traces) emitted by an application or microservice using various observability tools and make informed decisions about scaling, maintaining, or troubleshooting applications in the production environment. If observability is nothing new and there is a plethora of monitoring and observability tools available in the market, why bother with OpenTelemetry? What makes it special, such that it is being so widely adopted? And most importantly, what is in it for developers, DevOps, and SRE folks? Well, let us find out.

What Is OpenTelemetry?

OpenTelemetry provides open-source standards and formats for collecting and exporting telemetry data from microservices for observability purposes. This standardized way of collecting data helps DevOps and SRE engineers use any compatible observability backend of their choice to observe services and infrastructure, without being locked into a vendor.

OpenTelemetry diagram for microservices deployed in a Kubernetes cluster

OpenTelemetry is both a set of standards and an open-source project that provides components, such as collectors and agents, for implementing them. Besides that, OpenTelemetry offers APIs, SDKs, and data specifications that application developers can use to standardize the instrumentation of their application code. (Instrumentation is the process of adding observability libraries/dependencies to application code so that it emits logs, traces, and metrics.)

Why Is OpenTelemetry Good News for DevOps and SREs?

The whole observability process starts with application developers. Typically, they instrument application code with the proprietary library or agent provided by the observability backend the IT team plans to use. For example, say the IT team wants to use Dynatrace as its observability tool. Application developers then use code/SDKs from Dynatrace to instrument (i.e., to generate and export telemetry data from) all the applications in the system. This fetches and feeds data in the format Dynatrace is compatible with.

But this is where the problem lies. The observability requirements of DevOps and SREs seldom stay the same. As their needs evolve, they will have to switch between observability vendors or may want to use more than one tool. But since all the applications are instrumented with the current vendor's proprietary code, switching becomes a nightmare:

The new vendor may prefer collecting telemetry data in a format (a tracing format, for example) that is not compatible with the existing vendor's. That means developers will have to rewrite the instrumentation code for all applications, with severe overhead in terms of cost, developer effort, and potential service disruptions, depending on the deployments and infrastructure.
Non-compatible formats also cause problems with historical data when switching vendors. That is, it becomes hard for DevOps and SREs to analyze performance before and after the migration.

This is where OpenTelemetry proves helpful, and this is the reason it is being widely adopted. OpenTelemetry prevents such vendor lock-in by standardizing how telemetry data is collected and exported. With OpenTelemetry, developers can send the data to one or more observability backends, open-source or proprietary, as it supports most of the leading observability tools.
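As a taste of what this looks like for developers, here is a minimal sketch of manual instrumentation using the OpenTelemetry Python SDK (the opentelemetry-api and opentelemetry-sdk packages). The service name, function, and attribute are illustrative; the console exporter simply stands in for an OTLP exporter pointed at a collector or a vendor backend.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up a tracer provider; in production the ConsoleSpanExporter would
# typically be swapped for an OTLP exporter that ships spans to a collector.
provider = TracerProvider(resource=Resource.create({"service.name": "orders-api"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def get_order(order_id: str) -> dict:
    # Each call produces a span with attributes that any compatible backend can query.
    with tracer.start_as_current_span("get_order") as span:
        span.set_attribute("order.id", order_id)  # illustrative attribute
        return {"id": order_id, "status": "shipped"}

print(get_order("42"))
```

Because the instrumentation depends only on the OpenTelemetry API, swapping the backend later means changing the exporter configuration, not the application code.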
OpenTelemetry Components and Workflow

OpenTelemetry provides vendor-agnostic components that work together to fetch, process, and export telemetry data to various backends. There are three major components: the instrumentation library, the OpenTelemetry Collector, and exporters.

Instrumentation Library

OpenTelemetry provides SDKs and libraries for application developers to instrument their code manually or automatically. They support many popular programming languages, such as Java, Python, Ruby, Rust, JavaScript, and more. The instrumentation library is evolving, and developers should check the status of each telemetry data component in the instrumentation library for the programming language they use; the OpenTelemetry docs update this frequently. The status at the time of writing is shown below:

Status of programming language-specific telemetry data support in OpenTelemetry

For Kubernetes workloads, the OpenTelemetry Operator for Kubernetes can be used to inject auto-instrumentation libraries.

OpenTelemetry Collector (OTC)

The collector has receiver, processor, and exporter components, which gather, process, and export telemetry data from instrumented applications or infrastructure to observability backends for visualization (refer to the image below). It can receive and export data in various formats, such as its native format (the OpenTelemetry Protocol, or OTLP), Prometheus, Jaeger, and more.

OpenTelemetry Collector components and workflow

The OTC can be deployed as an agent, either as a sidecar container that runs alongside the application container or as a DaemonSet that runs on each node, and it can be scaled in or out depending on data throughput. The OpenTelemetry Collector is not mandatory, since OpenTelemetry is designed to be modular and flexible: IT teams can pick the receivers, processors, and exporters of their choice or even add custom ones.

Exporters

Exporters allow developers to configure any compatible backend they want to send the processed telemetry data to. Both open-source and vendor-specific exporters are available; some of them, such as the Apache SkyWalking, Prometheus, Datadog, and Dynatrace exporters, are part of the contrib projects. You can see the complete list of vendors who provide exporters here.

The Difference Between Trace Data Collected by OpenTelemetry and Istio

In a distributed system, tracing is the process of monitoring and recording the lifecycle of a request as it goes through different services in the system. It helps DevOps and SREs visualize the interactions between services and troubleshoot issues like latency. Istio is one of the most popular service meshes providing distributed tracing for observability purposes. In Istio, application containers are accompanied by sidecar containers, i.e., Envoy proxies. The proxy intercepts traffic between services and provides telemetry data for observability (refer to the image below).

Istio sidecar architecture and observability

Although both OpenTelemetry and Istio provide tracing data, there is a slight difference between them. Istio focuses on the lifecycle of a request as it traverses multiple services in the system (the networking layer), while OpenTelemetry, provided the application is instrumented with the OpenTelemetry library, focuses on the lifecycle of a request as it flows through an application (the application layer), interacting with various functions and modules. For example, let us say service A is talking to service B, and the communication has latency issues.
Istio can show you which service causes the latency and by how much. While this information is enough for DevOps and SREs, it will not help developers debug the part of the application that is causing the problem. This is where OpenTelemetry tracing helps. Since the application is instrumented with the OpenTelemetry library, OpenTelemetry tracing can provide details about the specific function of the application that is causing the latency. To put it another way, Istio gives traces from outside the application, while OpenTelemetry tracing provides traces from within the application. Istio tracing is good for troubleshooting problems at the networking layer, while OpenTelemetry tracing helps troubleshoot problems at the application level (see the sketch at the end of this article for an illustration).

OpenTelemetry for Microservices Observability and Vendor Neutrality

Enterprises adopting a microservices architecture have applications distributed across the cloud, with respective IT teams maintaining them. By instrumenting applications with OpenTelemetry libraries and SDKs, those IT teams are free to choose any compatible observability backend, and that choice does not affect the Ops/SRE teams' ability to have central visibility into all the services in the system. OpenTelemetry supports a variety of data formats and integrates seamlessly with most open-source and vendor-specific monitoring and observability tools, which also makes switching between vendors painless.

Get Started With OpenTelemetry for Istio Service Mesh

Watch the following video to learn how to get started with OpenTelemetry for Istio service mesh to achieve observability-in-depth. Additionally, you can go through the blog post, "Integrate Istio and Apache Skywalking for Kubernetes Observability," where the OpenTelemetry collector is used to scrape Prometheus endpoints.
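To illustrate the application-layer view described above, here is a minimal sketch, again assuming the opentelemetry-sdk Python package; the service, function names, and sleep are illustrative. The nested child span is what lets you see that a specific function inside service B, rather than service B as a whole, is where the time goes.

```python
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("service-b")  # illustrative service name

def enrich_order(order_id: str) -> None:
    # Child span: pinpoints this function if it is the source of latency.
    with tracer.start_as_current_span("enrich_order"):
        time.sleep(0.4)  # stand-in for a slow lookup

def handle_request(order_id: str) -> None:
    # Parent span: roughly the request-level view a mesh proxy sees from outside.
    with tracer.start_as_current_span("handle_request"):
        enrich_order(order_id)

handle_request("order-123")
```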
Tools and platforms form the backbone of seamless software delivery in the ever-evolving world of Continuous Integration and Continuous Deployment (CI/CD). For years, Jenkins has been the stalwart, powering countless deployment pipelines and standing as the go-to solution for many DevOps professionals. But as the tech landscape shifts toward cloud-native solutions, AWS CodePipeline emerges as a formidable contender. Offering deep integration with the expansive AWS ecosystem and the agility of a cloud-based platform, CodePipeline is redefining the standards of modern deployment processes. This article dives into the transformative power of AWS CodePipeline, exploring its advantages over Jenkins and showing why many are switching to this cloud-native tool.

Brief Background About CodePipeline and Jenkins

At its core, AWS CodePipeline is Amazon Web Services' cloud-native continuous integration and continuous delivery service, allowing users to automate the build, test, and deployment phases of their release process. Tailored to the vast AWS ecosystem, CodePipeline leverages other AWS services, making it a seamless choice for teams already integrated with AWS cloud infrastructure. It promises scalability, ease of maintenance, and enhanced security, characteristics inherent to many managed AWS services. On the other side of the spectrum is Jenkins, an open-source automation server with a storied history. Known for its flexibility, Jenkins has garnered immense popularity thanks to its extensive plugin system. It is a tool that has grown with the CI/CD movement, evolving from a humble continuous integration tool into a comprehensive automation platform that can handle everything from build to deployment and more. Together, these two tools represent two distinct eras and philosophies in the CI/CD domain.

Advantages of AWS CodePipeline Over Jenkins

1. Integration with AWS Services
AWS CodePipeline: Offers native, out-of-the-box integration with a plethora of AWS services, such as Lambda, EC2, S3, and CloudFormation. This facilitates smooth, cohesive workflows, especially for organizations already using AWS infrastructure.
Jenkins: While integration with cloud services is possible, it usually requires third-party plugins and additional setup, potentially introducing more points of failure or compatibility issues.

2. Scalability
AWS CodePipeline: As part of the AWS suite, it natively scales with the demands of the deployment pipeline. There is no need for manual intervention, ensuring consistent performance even during peak loads.
Jenkins: Scaling requires manual adjustments, such as adding agent nodes or reallocating resources, which can be both time-consuming and resource-intensive.

3. Maintenance
AWS CodePipeline: As a managed service, AWS handles all updates, patches, and backups. This ensures that the latest features and security patches are always in place without user intervention.
Jenkins: Requires periodic manual updates, backups, and patching. Additionally, plugins can introduce compatibility issues or security vulnerabilities, demanding regular monitoring and adjustments.

4. Security
AWS CodePipeline: Benefits from AWS's comprehensive security model. Features like IAM roles, secret management with AWS Secrets Manager, and fine-grained access controls ensure robust security standards.
Jenkins: Achieving a similar security level necessitates additional configurations, plugins, and tools, which can sometimes introduce more vulnerabilities or complexities.
5. Pricing and Long-Term Value
AWS CodePipeline: Operates on a pay-as-you-go model, ensuring you only pay for what you use. This can be cost-effective, especially for variable workloads.
Jenkins: While the software itself is open-source, maintaining a Jenkins infrastructure (servers, electricity, backups, etc.) incurs steady costs, which can add up in the long run, especially for larger setups.

When Might Jenkins Be a Better Choice?

Extensive Customization Needs: With its rich plugin ecosystem, Jenkins provides a wide variety of customization options. For unique CI/CD workflows or specialized integration needs, including integration with non-AWS services, Jenkins' vast array of plugins can be invaluable.
On-Premise Solutions: Organizations with stringent data residency or regulatory requirements might prefer on-premise solutions. Jenkins offers the flexibility to be hosted on local servers, providing complete control over data and processes.
Existing Infrastructure and Expertise: Organizations with an established Jenkins infrastructure and a team well-versed in its intricacies might find transitioning to another tool costly and time-consuming. The learning curve associated with a new platform and the migration effort can be daunting, so the team needs to weigh the transition against the other items on its roadmap.

Final Takeaways

In the ever-evolving world of CI/CD, selecting the right tool can be the difference between seamless deployments and daunting processes. Both AWS CodePipeline and Jenkins have carved out their specific roles in this space, yet as the industry shifts toward cloud-native solutions, AWS CodePipeline emerges at the forefront. With its seamless integration within the AWS ecosystem, innate scalability, and reduced maintenance overhead, it represents a future-facing approach to CI/CD. While Jenkins has served many organizations admirably and offers vast customization, the modern tech landscape is ushering in a preference for streamlined, cloud-centric solutions like AWS CodePipeline. The path from development to production is critical, and while the choice of tools will vary based on organizational needs, AWS CodePipeline's advantages are compelling for those looking toward a cloud-first future. As we navigate the challenges and opportunities of modern software delivery, AWS CodePipeline offers an efficient, scalable, and secure solution that is well worth considering.
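As a small illustration of the AWS-native integration discussed above, here is a hedged sketch that drives CodePipeline programmatically with boto3. The pipeline name is a placeholder, and it assumes AWS credentials and IAM permissions are already configured; the status values shown in the comment are a subset of those the API can return.

```python
import time

import boto3

PIPELINE_NAME = "my-app-pipeline"  # placeholder; substitute your pipeline's name

codepipeline = boto3.client("codepipeline")

# Trigger a new run of the pipeline (e.g., from a script or chat-ops command).
execution_id = codepipeline.start_pipeline_execution(name=PIPELINE_NAME)["pipelineExecutionId"]

# Poll until the run finishes; statuses include InProgress, Succeeded, and Failed.
while True:
    execution = codepipeline.get_pipeline_execution(
        pipelineName=PIPELINE_NAME, pipelineExecutionId=execution_id
    )["pipelineExecution"]
    if execution["status"] != "InProgress":
        print(f"Pipeline finished with status: {execution['status']}")
        break
    time.sleep(15)
```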
Delivering new features and updates to users without causing disruptions or downtime is a crucial challenge in the fast-paced world of software development, and rolling out changes always carries the risk of introducing bugs or causing downtime. The blue-green deployment strategy, popular in the DevOps movement, is an answer to this problem: by utilizing parallel environments and careful traffic routing, it enables uninterrupted software delivery with minimal disruption. In this article, we will explore the concept, principles, benefits, and best practices of blue-green deployment, shedding light on how it can empower organizations to release software with confidence.

Understanding Blue-Green Deployment

Blue-green deployment is a software deployment strategy for reducing risk and downtime when releasing new versions or updates of an application. It entails running two parallel instances of the same production environment: the "blue" environment represents the current stable version, while the "green" environment hosts the new version. With this configuration, you can switch between the two environments without disrupting end users.

The fundamental idea behind blue-green deployment is to keep user traffic routed to the blue environment, protecting the production system's stability and dependability, while the green environment is set up and thoroughly tested. Developers and QA teams can validate the new version before it is made available to end users. The deployment process typically involves the following steps:

Initial Deployment: The blue environment is the initial production environment running the stable version of the application. Users access the application through this environment, and it serves as the baseline for comparison with the updated version.
Update Deployment: The updated version of the application is deployed to the green environment, which mirrors the blue environment in terms of infrastructure, configuration, and data. The green environment remains isolated from user traffic initially.
Testing and Validation: The green environment is thoroughly tested to ensure that the updated version functions correctly and meets the desired quality standards. This includes running automated tests, performing integration tests, and potentially conducting user acceptance testing or canary releases.
Traffic Switching: Once the green environment passes all the necessary tests and validations, the traffic routing mechanism is adjusted to start directing user traffic from the blue environment to the green environment. This switch can be accomplished using techniques such as DNS changes, load balancer configuration updates, or reverse proxy settings.
Monitoring and Verification: Throughout the deployment process, both the blue and green environments are monitored to detect any issues or anomalies.
Monitoring tools and observability practices help identify performance problems, errors, or inconsistencies in real time, ensuring the health and stability of the application in the green environment.
Rollback and Cleanup: In the event of unexpected issues or unsatisfactory results, a rollback strategy can be employed to switch traffic back to the blue environment, reverting to the stable version. Additionally, any resources or changes made in the green environment during the deployment process may need to be cleaned up or reverted.

The advantages of blue-green deployment are numerous. By maintaining parallel environments, organizations can significantly reduce downtime during deployments. They can also mitigate risk by thoroughly testing the updated version before exposing it to users, allowing for quick rollbacks if issues arise. Blue-green deployment also supports scalability testing, continuous delivery practices, and experimentation with new features. Overall, it is a valuable approach for organizations seeking seamless software updates, minimal user disruption, and a reliable, efficient deployment process.

Benefits of Blue-Green Deployment

Blue-green deployment offers several significant benefits for organizations looking to deploy software updates with confidence and minimize the impact on users:

Minimized Downtime: Blue-green deployment significantly reduces downtime during the deployment process. By maintaining parallel environments, organizations can prepare and test the updated version (green environment) alongside the existing stable version (blue environment). Once the green environment is deemed stable and ready, the switch from blue to green can be accomplished seamlessly, resulting in minimal or no downtime for end users.
Rollback Capability: Blue-green deployment provides the ability to roll back quickly to the previous version (blue environment) if issues arise after the deployment. In the event of unforeseen problems or performance degradation in the green environment, organizations can redirect traffic back to the blue environment, ensuring a swift return to a stable state without impacting users.
Risk Mitigation: With blue-green deployment, organizations can mitigate the risk of introducing bugs, errors, or performance issues to end users. By maintaining two identical environments, the green environment can undergo thorough testing, validation, and user acceptance testing before live traffic is directed to it. This reduces the risk of impacting users with faulty or unstable software and increases overall confidence in the deployment process.
Scalability and Load Testing: Blue-green deployment facilitates load testing and scalability validation in the green environment without affecting production users. Organizations can simulate real-world traffic and user loads in the green environment to evaluate the performance, scalability, and capacity of the updated version. This helps identify potential bottlenecks or scalability issues before exposing them to the entire user base, ensuring a smoother user experience.
Continuous Delivery and Continuous Integration: Blue-green deployment aligns well with continuous delivery and continuous integration (CI/CD) practices. By automating the deployment pipeline and integrating it with version control and automated testing, organizations can achieve a seamless and streamlined delivery process.
CI/CD practices enable faster and more frequent releases, reducing time-to-market for new features and updates.
Flexibility for Testing and Experimentation: Blue-green deployment provides a controlled environment for testing and experimentation. Organizations can use the green environment to test new features, conduct A/B testing, or gather user feedback before fully rolling out changes. This allows for data-driven decision-making and the ability to iterate and improve software based on user input.
Improved Reliability and Fault Tolerance: By maintaining two separate environments, blue-green deployment enhances reliability and fault tolerance. If the infrastructure fails in one of the environments, the other environment can continue to handle user traffic seamlessly. This redundancy ensures that the overall system remains available and minimizes the impact of failures on users.

Implementing Blue-Green Deployment

To implement blue-green deployment successfully, organizations need to follow a series of steps and considerations: setting up parallel environments, managing infrastructure, automating deployment pipelines, and establishing efficient traffic routing mechanisms. Here is a step-by-step guide:

Duplicate Infrastructure: Duplicate the infrastructure required to support the application in both the blue and green environments. This includes servers, databases, storage, and any other components necessary for the application's functionality. Ensure that the environments are identical to minimize compatibility issues.
Automate Deployment: Implement automated deployment pipelines to ensure consistent and repeatable deployments. Automation tools such as Jenkins, Travis CI, or GitLab CI/CD can help automate the deployment process. Create a pipeline that includes steps for building, testing, and deploying the application to both the blue and green environments.
Version Control and Tagging: Adopt proper version control practices to manage different releases effectively. Use a version control system like Git to track changes and create clear tags or branches for each environment. This helps in identifying and managing the blue and green versions of the software.
Automated Testing: Implement comprehensive automated testing to validate the functionality and stability of the green environment before routing traffic to it. Include unit tests, integration tests, and end-to-end tests in your testing suite. Automated tests help catch issues early in the deployment process and provide a higher level of confidence in the green environment.
Traffic Routing Mechanisms: Choose appropriate traffic routing mechanisms to direct user traffic between the blue and green environments. Popular options include DNS switching, reverse proxies, and load balancers. Configure the routing mechanism to gradually shift traffic from the blue environment to the green environment, allowing for a controlled transition.
Monitoring and Observability: Implement robust monitoring and observability practices to gain visibility into the performance and health of both environments. Monitor key metrics, logs, and user feedback to detect any anomalies or issues. Utilize monitoring tools like Prometheus, Grafana, or the ELK Stack to ensure real-time visibility into the system.
Incremental Rollout: Adopt an incremental rollout approach to minimize risks and ensure a smoother transition.
Gradually increase the percentage of traffic routed to the green environment while monitoring the impact and collecting feedback. This allows for early detection of issues and a quick response before the entire user base is affected.
Rollback Strategy: Have a well-defined rollback strategy in place to revert to the stable blue environment if issues arise in the green environment. This includes updating the traffic routing mechanism to redirect traffic back to the blue environment. Ensure that the rollback process is well documented and can be executed quickly to minimize downtime.
Continuous Improvement: Regularly review and improve your blue-green deployment process. Collect feedback from the deployment team, users, and stakeholders to identify areas for enhancement. Analyze metrics and data to optimize the deployment pipeline, automate more processes, and enhance the overall efficiency and reliability of the blue-green deployment strategy.

By following these implementation steps and considering key aspects such as infrastructure duplication, automation, version control, testing, traffic routing, monitoring, and continuous improvement, organizations can successfully implement blue-green deployment. This approach allows for seamless software updates, minimized downtime, and the ability to roll back if necessary, providing a robust and efficient deployment strategy.

Best Practices for Blue-Green Deployment

Blue-green deployment is a powerful strategy for seamless software delivery and for minimizing risks during the deployment process. To make the most of this approach, consider the following best practices:

Version Control and Tagging: Implement proper version control practices to manage different releases effectively. Clearly label and tag the blue and green environments to ensure easy identification and tracking of each version. This helps maintain a clear distinction between the stable and updated versions of the software.
Automated Deployment and Testing: Leverage automation for deployment pipelines to ensure consistent and repeatable deployments. Automation helps streamline the process and reduces the chance of human error. Implement automated testing at different levels, including unit tests, integration tests, and end-to-end tests, to verify the functionality and stability of the green environment before routing traffic to it.
Infrastructure Duplication: Duplicate the infrastructure and set up identical environments for blue and green. This includes replicating servers, databases, and any other dependencies required for the application. Keeping the environments as similar as possible ensures a smooth transition without compatibility issues.
Traffic Routing Mechanisms: Choose appropriate traffic routing mechanisms to direct user traffic from the blue environment to the green environment seamlessly. Popular techniques include DNS switching, reverse proxies, and load balancers. Carefully configure and test these mechanisms to ensure they handle traffic routing accurately and efficiently (a sketch of one such mechanism appears at the end of this article).
Incremental Rollout: Consider adopting an incremental rollout approach rather than switching all traffic from blue to green at once. Gradually increase the percentage of traffic routed to the green environment while closely monitoring the impact. This allows for real-time feedback and a rapid response to any issues that may arise, minimizing the impact on users.
Canary Releases: Implement canary releases by deploying the new version to a subset of users or a specific geographic region before rolling it out to the entire user base. Canary releases allow you to collect valuable feedback and perform additional validation in a controlled environment. This approach helps mitigate risk and ensures a smoother transition to the updated version.
Rollback Strategy: Always have a well-defined rollback strategy in place. Despite thorough testing and validation, issues may still occur after the deployment. Having a rollback plan ready allows you to quickly revert to the stable blue environment if necessary, ensuring minimal disruption to users and maintaining continuity of service.
Monitoring and Observability: Implement comprehensive monitoring and observability practices to gain visibility into the performance and health of both the blue and green environments. Monitor key metrics, logs, and user feedback to identify any anomalies or issues. This allows for proactive detection and resolution of problems, enhancing the overall reliability of the deployment process.

By following these best practices, organizations can effectively leverage blue-green deployment to achieve rapid and reliable software delivery. The careful implementation of version control, automation, traffic routing, and monitoring ensures a seamless transition between versions while minimizing the impact on users and mitigating risk.

Conclusion

Blue-green deployment is a potent method for ensuring smooth and dependable releases. By maintaining two parallel environments and shifting user traffic gradually, organizations can minimize risk, cut down on downtime, and boost confidence in their new releases. The approach enables thorough testing, validation, and scalability evaluation, and it aligns naturally with continuous delivery principles and CI/CD practices. Its advantages include decreased downtime, rollback capability, risk reduction, scalability testing, flexibility for testing and experimentation, and increased reliability. With the appropriate infrastructure in place, automated deployment pipelines, effective traffic routing mechanisms, and the best practices discussed in this article, organizations can deliver software updates with confidence, reduce the risk of deployment-related disruptions, and offer their users a reliable, high-quality experience throughout the deployment process.
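To make the incremental traffic-shift idea concrete, here is a hedged sketch using boto3 and an AWS Application Load Balancer with weighted target groups. The listener and target group ARNs are placeholders, the stage percentages and pauses are illustrative, and other platforms (DNS weighting, service meshes, reverse proxies) achieve the same effect with different APIs.

```python
import time

import boto3

elbv2 = boto3.client("elbv2")

LISTENER_ARN = "arn:aws:elasticloadbalancing:...:listener/app/example"      # placeholder
BLUE_TG_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/blue/example"   # placeholder
GREEN_TG_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/green/example" # placeholder

def set_weights(green_pct: int) -> None:
    """Route green_pct% of traffic to the green environment, the rest to blue."""
    elbv2.modify_listener(
        ListenerArn=LISTENER_ARN,
        DefaultActions=[{
            "Type": "forward",
            "ForwardConfig": {
                "TargetGroups": [
                    {"TargetGroupArn": BLUE_TG_ARN, "Weight": 100 - green_pct},
                    {"TargetGroupArn": GREEN_TG_ARN, "Weight": green_pct},
                ]
            },
        }],
    )

# Incremental rollout: shift traffic in stages, pausing to watch metrics.
for pct in (10, 25, 50, 100):
    set_weights(pct)
    print(f"{pct}% of traffic now routed to green")
    time.sleep(300)  # in practice, gate each step on monitoring and alerts

# Rollback is the same call in reverse: set_weights(0) sends all traffic back to blue.
```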
Have you ever wondered which cloud service provider can elevate your software product engineering to new heights with its DevOps offerings? If you haven't, get ready, because we're about to explore every nook and cranny of the two leading cloud service providers: Azure and AWS. But this is not just another tech comparison. We'll dive deep into how each DevOps service aligns with your team's skills, how it complements your existing infrastructure, and, most importantly, how it enhances your business strategy. So, if you're eager to discover how Azure DevOps and AWS DevOps can elevate your software product development, sit back, grab your favorite cup of coffee, and let's set out on this comparison.

Azure DevOps Services: An Overview

Azure DevOps is a comprehensive suite of cloud-based services provided by Microsoft. It provides a wide range of tools and services that support individuals and companies with everything from planning and coding to building, testing, and deploying software products. The core components of Azure DevOps are:

1. Azure Boards: Azure Boards enables you to track and manage work items, backlogs, and project progress. This gives the team visibility and lets everyone know how the project is moving along.
2. Azure Repos: A Git repository hosting service for version control of your software projects. It also offers a convenient way to collaborate with everyone working on the project.
3. Azure Pipelines: The pipeline is where the automation happens and from where the software product is deployed. Azure Pipelines is a continuous integration and continuous deployment (CI/CD) service that automates build and release pipelines.
4. Azure Test Plans: A testing service that enables Azure DevOps engineers to test software products either through automation or a manual approach. It also helps them manage and track tests in order to keep a clear record of errors and successful tests.
5. Azure Artifacts: A package management solution that enables developers to share their code efficiently and manage all their packages in one place. It also lets developers publish packages to their feeds and share them within the team, across organizations, and even publicly.

AWS DevOps Services: An Overview

AWS DevOps is an ecosystem of cloud-based services provided by Amazon Web Services (AWS). It offers various flexible services designed to let organizations streamline and automate software product development and delivery with agility and efficiency. AWS DevOps has similar features to Azure DevOps, but it distinguishes itself with its own set of unique capabilities. For instance, AWS DevOps integrates seamlessly with other AWS services, enabling DevOps engineers to set up infrastructure, monitor it, and efficiently manage features and services on the AWS cloud platform. The core components of AWS DevOps are:

1. AWS CodeCommit: Similar to Azure Repos, AWS CodeCommit is a managed, Git-based source control service for securely storing and versioning code.
2. AWS CodeBuild: A fully managed build service that compiles source code, runs tests, and produces software packages.
3. AWS CodePipeline: An automated CI/CD service that allows applications to be updated rapidly.
It integrates with other AWS services and gives organizations full flexibility to deliver software products end to end.
4. AWS CodeDeploy: A deployment service that automates application deployments to various environments.
5. AWS CodeStar: A unified development service to quickly develop, build, and deploy applications on AWS.

Azure DevOps vs. AWS DevOps – The Battle for DevOps Dominance!

Both Azure DevOps and AWS DevOps are powerful and widely used platforms for managing the software product engineering lifecycle and DevOps practices. Here is a detailed comparison highlighting some of the key differences between the two:

Popularity and Adoption
Azure DevOps: Widely used by organizations of all sizes and especially popular among enterprises invested in the Microsoft ecosystem.
AWS DevOps: Also widely adopted, especially by companies using Amazon Web Services as their cloud platform.

Ease of Use
Azure DevOps: Offers a user-friendly web interface with well-integrated tools, making it easy for teams familiar with Microsoft products to get started.
AWS DevOps: Has a learning curve, especially for teams new to AWS, but offers comprehensive documentation and resources to simplify adoption.

Scalability and Flexibility
Azure DevOps: Can scale up to 10,000 concurrent users and 2 GB of storage per user.
AWS DevOps: Can scale up to millions of users and offers unlimited storage.

Infrastructure
Azure DevOps: Azure, which initially started as a PaaS, streamlines the process for developers to build and scale applications effortlessly, eliminating concerns about the underlying infrastructure.
AWS DevOps: Offers services that IT operations can easily understand and use to support their on-demand computing and storage requirements.

Service Integration
Azure DevOps: Enables users to integrate Azure services like Azure App Service, SQL databases, and Azure VMs. It also streamlines the SDLC by integrating with Jenkins and other third-party tools.
AWS DevOps: Allows users to integrate various AWS services such as S3, EC2, and Elastic Beanstalk.

Managing Packages in Software
Azure DevOps: Azure has a package manager, Azure Artifacts, to manage packages like Maven and NuGet.
AWS DevOps: You need to integrate third-party tools like Artifactory to manage packages.

Version Control System
Azure DevOps: Git and Team Foundation Version Control (TFVC).
AWS DevOps: Git only.

CI/CD Pipeline Capabilities
Azure DevOps: Supports both build and release pipelines and also offers YAML-based pipeline configuration.
AWS DevOps: Provides CI/CD pipelines with customizable configurations using YAML or a visual designer.

Build Agents
Azure DevOps: Hosted and self-hosted agents are available for build and deployment tasks.
AWS DevOps: AWS CodeBuild uses managed build servers, and you can also bring your own custom build environment.

Deployment Capabilities
Azure DevOps: Deployment to Azure, on-premises infrastructure, and third-party cloud providers.
AWS DevOps: Deployment to AWS services and on-premises infrastructure.

Compliance
Azure DevOps: Complies with various industry standards, including SOC 1, SOC 2, ISO 27001, and HIPAA.
AWS DevOps: Also adheres to numerous security and compliance standards.

Security
Azure DevOps: Role-based access control, granular permissions, and secure pipelines.
AWS DevOps: AWS Identity and Access Management (IAM) for access control and secure resource management.

Best Features
Azure DevOps: Offers features like Kanban workflows, boards, and a massive extension ecosystem.
AWS DevOps: By utilizing AWS services, AWS DevOps can easily automate the deployment of all code.

Pricing
Azure DevOps: Offers various plans based on the number of users and build minutes.
AWS DevOps: Pay-as-you-go pricing based on the usage of AWS Developer Tools.
Hey everyone, this is by far my most-read article, and given the changing landscape of DevOps tools, I thought that nearly three years later, heading into 2023, it was worth a refresh! For those of you who read the original article, I have updated it with a few small changes to my categorization of the tools list since last time and added new key players to most of those categories (along with a few I missed the first time). There are so many test tools across all the various tech stacks and DevOps stacks that I can't possibly put them all in here, but this time I tried to add some tools used in different worlds (JavaScript, Kubernetes, Java, front end, etc.).

Before getting to the updated article, here is a summary of some key trends we have seen across our 30+ projects and from our friends at Rhythmic Technologies in the industry:

We have seen an explosion of tools around microservices, particularly in the Kubernetes space. Tools like Envoy provide a layer 7 proxy, and Istio and Linkerd provide a layer 7 service mesh. At this point I consider these more purely operational tools, so I have excluded them from my list, but this may change in the next update!
For well-run shops we have seen a lot of "less is more" lately: simpler pipelines, smaller test suites, fewer stages, fewer environments. A lot more Docker, too; not necessarily a unified approach to Docker, but more of it!
For mismanaged shops we have seen a significant growth in complexity, mostly around microservices; instead of DevOps being used to reduce and manage cloud spend, it seems to have the opposite effect in these shops.
Simplified CI/CD approaches are on the rise: more GitHub Actions and Bitbucket Pipelines, and DevOps/IaC code moving into the repositories it supports (yay!).
Code coverage for infrastructure as code has gone up dramatically in the last two years. Coverage of low-level infrastructure (VPCs, related networking, security config, IAM) remains low, but it has improved a bunch.

Last week a few of my very senior colleagues and I were remarking how many new DevOps tools are emerging and how it's getting harder and harder every day to keep track of them and where they fit into the world. I asked several of them where these tools (Ansible, Terraform, Salt, Chef, Bamboo, CloudFormation) fit in. Why would I use one vs. the other? Are they even the same thing? Am I missing a major player? I got back the same blank stares and questions that I had. So, I thought I would do some research, read, and try to make sense of it for all of us so we could classify products into categories or uses with which we are all familiar.

Before we start to talk about DevOps tools and categories, let's take a step back and discuss a few basic (but often overloaded) terms and what they mean.

Computer/Server: A physical device that features a Central Processing Unit (CPU), has memory (RAM) and local storage (disk), and runs an operating system.
Virtual Machine: An emulation of a computer system running on a host computer; it can typically be isolated from other operating systems in terms of CPU, memory, and disk usage.
Containers: A packaging of software and all its dependencies so that it can run uniformly and consistently on any infrastructure. Docker containers are the most popular. They allow you to package up a bunch of stuff (your software, configurations, and other software) for easy deployment and shipping. You can think of containers as the next evolution of virtualization (after virtual machines).
Network Device: A piece of hardware that routes network traffic between devices. Examples include routers, load balancers, and firewalls.

Software: Code that is written and runs on an operating system.

DevOps: Traditionally there was "development" (you build it) and there was "operations" (we will run it), and everything in between the two was subject to how a shop worked. Starting around 2010 and achieving near ubiquity around 2018, the DevOps idea is "a set of practices intended to reduce the time between committing a change to a system and the change being placed into normal production, while ensuring high quality."

When you think about what goes into building and running a non-trivial system, there is actually a lot involved. Here is a list of the traditional items to consider:

1. Obtaining the computer/server hardware
2. Configuring the computer/server hardware (operating systems, network wiring, etc.)
3. Monitoring the computer/server hardware
4. Obtaining the network devices (load balancers, firewalls, routers, etc.)
5. Configuring the network devices
6. Monitoring the network devices
7. Constructing the software
8. Building the software
9. Testing the software
10. Packaging the software
11. Deploying/releasing the software
12. Monitoring the software

Before DevOps, we used to have four distinct teams doing this work:

- Developer: performed #7, #8, and sometimes #10
- QA: performed #9 and sometimes #11
- System Administrator: performed #1, #2, #3, and #12
- Network Administrator: performed #4, #5, and #6

For the configuration of the hardware, network devices, and software, each team would likely use its own set of scripts and tools and, in many cases, would do things manually to make a "software release" happen. With the advent of DevOps, for me the key idea was breaking down these walls and making everyone part of one team, bringing consistency to the way all things are configured, deployed, and managed.

Cloud: Defining the most overloaded term in the history of information technology is tough, but I like the t-shirt that says, "There is no cloud, it's just someone else's computer." Initially, when cloud services started, they really were just someone else's computer (or a VM running on their computer), or storage. Over time, they have evolved into this plus many, many value-added services. The hardware has been mostly abstracted away; you can't buy a hardware device in most cloud services these days, but you can buy a service provided by the hardware devices.

Infrastructure as Code (IAC): A concept that allows us to define a complete setup of all the items in our data center, including VMs, containers, and network devices, through definition or configuration files. The idea is that I can create some configurations and some scripts, run them using one of the tools we are about to discuss, and they will automatically provision all of our services in the data center. CI/CD was a precursor to IAC: for years we have been automating our build/test/integration/deploy cycle, and doing the same for our cloud infrastructure was a natural extension. This brought cost reduction, faster time to market, and less risk of human error. With the advent of IAC, many traditional development tools could now be used for managing infrastructure. Categories of tools (listed below) like software repositories, build tools, CI/CD, code analyzers, and testing tools that were traditionally used by software developers could now be used by DevOps engineers to build and maintain infrastructure.
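To make the "definition or configuration files" idea concrete, here is a minimal sketch of an AWS CloudFormation template in YAML that declares a single S3 bucket; the logical resource name is chosen purely for illustration. Handing this file to CloudFormation provisions the bucket with no manual console steps, and running it again leaves the bucket unchanged.

YAML
# A minimal infrastructure-as-code example: one declarative file, one provisioned resource
AWSTemplateFormatVersion: '2010-09-09'
Description: Declares a single versioned S3 bucket
Resources:
  ExampleArtifactBucket:        # logical resource name, chosen for this illustration
    Type: AWS::S3::Bucket
    Properties:
      VersioningConfiguration:
        Status: Enabled

The same idea scales up to whole environments: networks, load balancers, clusters, and permissions all described in files that live in source control alongside the application code.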
So now that we have some basic vocabulary defined, let me get back to the task of categorizing DevOps tools in a way that makes it easier to determine what can be used for what.

Monitoring tools: These allow the monitoring of hardware and software. They typically include watchers that watch processes and log files to ensure the health of systems. There is a lot of activity around Grafana/Prometheus as a replacement for Nagios. SaaS monitoring platforms have also proliferated, even though many are based on Prometheus/ELK under the hood.

Containerization and orchestration tools: These configure, coordinate, and manage computer systems and software, and frequently include "automation" and "workflow" as part of their services. Kubernetes is a very popular orchestration tool focused on containers. Terraform is a very popular orchestration tool with a broader focus, including cloud orchestration. Each cloud provider also offers its own tooling, including CloudFormation, GCP Deployment Manager, and ARM templates.

Configuration management: Configuration management tools and databases typically store all the information about your hardware and software items and provide a scripting and/or templating system for automating common tasks. There are many players in this space; the traditional ones are Chef, Puppet, and Salt.

Deployment tools: These help with the deployment of software. Many CI tools are also CD (continuous deployment) tools that assist with deployment. Traditionally, Capistrano has been widely used in Ruby and Maven in Java. All of the orchestration tools also support some sort of deployment.

Continuous integration tools: These are configured so that each time you check code into a repository, the software is built, deployed, and tested. This usually improves quality and time to market. The most popular tools in this market are GitHub Actions, CircleCI, Bitbucket Pipelines, Jenkins, Travis, and TeamCity (a minimal workflow sketch follows below).

Testing tools: These are used to manage tests and test automation, including performance and load testing. Test tools specifically for JavaScript and Kubernetes have also exploded onto the market.

Code analyzer/review tools: These look for errors in code, code formatting and quality issues, and test coverage. They vary from language to language. SonarQube is a popular tool in this space, as are various "linting" tools.

Build tools: Some software requires compiling before it can be packaged or used; traditional build tools include Make, Ant, Maven, and MSBuild.

Version control systems: Tools to manage versions of software. Git is the most widely used tool today, and there are many cloud-hosted options for it; GitHub, GitLab, Bitbucket, and Azure DevOps dominate the market.

Of course, like any other set of products, the categories are not necessarily clean. Many tools cross categories and provide features from two or more of them. Below is my attempt to show most of the very popular tools and visualize where they fit in terms of these categories. As you can see, several players like Ansible, Terraform, and the cloud tools (AWS, GCP, and Azure) are trying to span the deployment, configuration management, and orchestration categories with their offerings. The older toolset of Puppet, Chef, and SaltStack is focused on configuration management and automation but has expanded into orchestration and deployment.
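To make the continuous integration category concrete, here is a minimal sketch of a GitHub Actions workflow that builds and tests a project on every push and pull request. The build and test commands are placeholders; a real pipeline would use whatever the tech stack calls for.

YAML
# .github/workflows/ci.yml (illustrative; commands are placeholders)
name: CI
on:
  push:
    branches: [main]
  pull_request:
jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4   # fetch the repository
      - name: Build
        run: make build             # placeholder build step
      - name: Test
        run: make test              # placeholder test step

The same shape, a trigger plus a list of steps, carries over to Bitbucket Pipelines, CircleCI, and the other tools in this category.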
There are also tools like GitLab, GitHub, Bitbucket, and Azure DevOps that are trying to span nearly every category of DevOps. I hope this overview helps you understand the basics of DevOps, the categories of tools available, and how the various products on the market today help in one or more of these categories. At Solution Street we have used many of these tools over the years; for us there is no single "go-to" tool we use in all cases. What we use depends on the technologies involved, where the system is hosted (and where it may be hosted in the future), as well as the talents and makeup of the team.

Further Reading

- Why we use Terraform
- Best cloud infrastructure automation tools
- What is DevOps
In my previous posting, I explained how to run Ansible scripts using a Linux virtual machine on Windows Hyper-V. This article aims to ease novices into Ansible IAC by way of an example: booting one's own out-of-cloud Kubernetes cluster. As such, the intricacies of the steps required to boot a local K8s cluster are beyond the scope of this article. The steps can, however, be studied at the GitHub repo where the Ansible scripts are checked in. The scripts were tested on Ubuntu 20, running virtually on Windows Hyper-V. Network connectivity was established via an external virtual network switch on an ethernet adaptor shared between virtual machines but not with Windows. Dynamic memory was switched off from the Hyper-V UI. An SSH service daemon was pre-installed to give Ansible a tty terminal from which to run commands.

Bootstrapping the Ansible User

Repeatability through automation is a large part of DevOps; it cuts down on human error, after all. Ansible therefore requires a standard way to establish a terminal on the various machines under its control. This can be achieved using a public/private key pair for SSH authentication. The keys can be generated for an elliptic curve algorithm as follows:

ssh-keygen -f ansible -t ecdsa -b 521

The Ansible script to create an account and match it to the keys is:

YAML
---
- name: Bootstrap ansible
  hosts: all
  become: true
  tasks:
    - name: Add ansible user
      ansible.builtin.user:
        name: ansible
        shell: /bin/bash
      become: true

    - name: Add SSH key for ansible
      ansible.posix.authorized_key:
        user: ansible
        key: "{{ lookup('file', 'ansible.pub') }}"
        state: present
        exclusive: true # to allow revocation
        # Join the key options with comma (no space) to lock down the account:
        key_options: "{{ ','.join([ 'no-agent-forwarding', 'no-port-forwarding', 'no-user-rc', 'no-x11-forwarding' ]) }}" # noqa jinja[spacing]
      become: true

    - name: Configure sudoers
      community.general.sudoers:
        name: ansible
        user: ansible
        state: present
        commands: ALL
        nopassword: true
        runas: ALL # ansible user should be able to impersonate someone else
      become: true

Ansible is declarative, and this snippet depicts a series of tasks that ensure that:

- the ansible user exists;
- the keys are added for SSH authentication; and
- the ansible user can execute with elevated privilege using sudo.

Towards the top is something very important, and it might go unnoticed under a cursory gaze:

hosts: all

What does this mean? The answer to this puzzle can be easily explained with the help of the Ansible inventory file:

YAML
masters:
  hosts:
    host1:
      ansible_host: "192.168.68.116"
      ansible_connection: ssh
      ansible_user: atmin
      ansible_ssh_common_args: "-o ControlMaster=no -o ControlPath=none"
      ansible_ssh_private_key_file: ./bootstrap/ansible
comasters:
  hosts:
    co-master_vivobook:
      ansible_connection: ssh
      ansible_host: "192.168.68.109"
      ansible_user: atmin
      ansible_ssh_common_args: "-o ControlMaster=no -o ControlPath=none"
      ansible_ssh_private_key_file: ./bootstrap/ansible
workers:
  hosts:
    client1:
      ansible_connection: ssh
      ansible_host: "192.168.68.115"
      ansible_user: atmin
      ansible_ssh_common_args: "-o ControlMaster=no -o ControlPath=none"
      ansible_ssh_private_key_file: ./bootstrap/ansible
    client2:
      ansible_connection: ssh
      ansible_host: "192.168.68.130"
      ansible_user: atmin
      ansible_ssh_common_args: "-o ControlMaster=no -o ControlPath=none"
      ansible_ssh_private_key_file: ./bootstrap/ansible

It is the register of all machines the Ansible project is responsible for.
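Before running the bootstrap play against this inventory, it can help to confirm that Ansible can actually reach every machine listed in it. A minimal connectivity-check play, as a sketch (the file name is hypothetical), could look like this:

YAML
# ping_check.yml (hypothetical) - confirms SSH connectivity to every inventory host
---
- name: Verify that all inventory hosts are reachable
  hosts: all
  gather_facts: false
  tasks:
    - name: Ping each host over the established SSH connection
      ansible.builtin.ping:

It would be run with something like ansible-playbook ping_check.yml -i atomika/atomika_inventory.yml --ask-pass, since the dedicated key pair is not installed at this point yet.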
Since our example project concerns a high-availability K8s cluster, the inventory consists of sections for the masters, co-masters, and workers. Each section can contain more than one machine. The root-enabled account atmin on display here was created by Ubuntu during installation.

The answer to the question should now be clear: the hosts key above specifies that every machine in the inventory will have an account called ansible created according to the specification of the YAML. The command to run the script is:

ansible-playbook --ask-pass bootstrap/bootstrap.yml -i atomika/atomika_inventory.yml -K

The locations of the user-bootstrapping YAML and the inventory file are specified. The command, furthermore, requests password authentication for the user from the inventory file. The -K switch, in turn, asks for the superuser password to be prompted; it is required by tasks that are specified to run as root and can be omitted should the script be run as root. Upon successful completion, one should be able to log in to the machines using the private key of the ansible user:

ssh ansible@172.28.110.233 -i ansible

Note that since this account is not for human use, the bash shell is not enabled. Nevertheless, one can access the home of root (/root) using 'sudo ls /root'. The user account can now be changed to ansible and the location of the private key added for each machine in the inventory file:

YAML
host1:
  ansible_host: "192.168.68.116"
  ansible_connection: ssh
  ansible_user: ansible
  ansible_ssh_common_args: "-o ControlMaster=no -o ControlPath=none"
  ansible_ssh_private_key_file: ./bootstrap/ansible

One Master To Rule Them All

We are now ready to boot the K8s master:

ansible-playbook atomika/k8s_master_init.yml -i atomika/atomika_inventory.yml --extra-vars='kubectl_user=atmin' --extra-vars='control_plane_ep=192.168.68.119'

The content of atomika/k8s_master_init.yml is:

YAML
# k8s_master_init.yml
- hosts: masters
  become: yes
  become_method: sudo
  become_user: root
  gather_facts: yes
  connection: ssh

  roles:
    - atomika_base

  vars_prompt:
    - name: "control_plane_ep"
      prompt: "Enter the DNS name of the control plane load balancer?"
      private: no
    - name: "kubectl_user"
      prompt: "Enter the name of the existing user that will execute kubectl commands?"
      private: no

  tasks:
    - name: Initializing Kubernetes Cluster
      become: yes
      # command: kubeadm init --pod-network-cidr 10.244.0.0/16 --control-plane-endpoint "{{ ansible_eno1.ipv4.address }}:6443" --upload-certs
      command: kubeadm init --pod-network-cidr 10.244.0.0/16 --control-plane-endpoint "{{ control_plane_ep }}:6443" --upload-certs
      # command: kubeadm init --pod-network-cidr 10.244.0.0/16 --upload-certs
      run_once: true
      # delegate_to: "{{ k8s_master_ip }}"

    - pause: seconds=30

    - name: Create directory for kube config of {{ ansible_user }}.
      become: yes
      file:
        path: /home/{{ ansible_user }}/.kube
        state: directory
        owner: "{{ ansible_user }}"
        group: "{{ ansible_user }}"
        mode: 0755

    - name: Copy /etc/kubernetes/admin.conf to user home directory /home/{{ ansible_user }}/.kube/config.
      copy:
        src: /etc/kubernetes/admin.conf
        dest: /home/{{ ansible_user }}/.kube/config
        remote_src: yes
        owner: "{{ ansible_user }}"
        group: "{{ ansible_user }}"
        mode: '0640'

    - pause: seconds=30

    - name: Remove the cache directory.
      file:
        path: /home/{{ ansible_user }}/.kube/cache
        state: absent

    - name: Create directory for kube config of {{ kubectl_user }}.
      become: yes
      file:
        path: /home/{{ kubectl_user }}/.kube
        state: directory
        owner: "{{ kubectl_user }}"
        group: "{{ kubectl_user }}"
        mode: 0755

    - name: Copy /etc/kubernetes/admin.conf to user home directory /home/{{ kubectl_user }}/.kube/config.
      copy:
        src: /etc/kubernetes/admin.conf
        dest: /home/{{ kubectl_user }}/.kube/config
        remote_src: yes
        owner: "{{ kubectl_user }}"
        group: "{{ kubectl_user }}"
        mode: '0640'

    - pause: seconds=30

    - name: Remove the cache directory.
      file:
        path: /home/{{ kubectl_user }}/.kube/cache
        state: absent

    - name: Create Pod Network & RBAC.
      become_user: "{{ ansible_user }}"
      become_method: sudo
      become: yes
      command: "{{ item }}"
      with_items:
        - kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml

    - pause: seconds=30

    - name: Configure kubectl command auto-completion for {{ ansible_user }}.
      lineinfile:
        dest: /home/{{ ansible_user }}/.bashrc
        line: 'source <(kubectl completion bash)'
        insertafter: EOF

    - name: Configure kubectl command auto-completion for {{ kubectl_user }}.
      lineinfile:
        dest: /home/{{ kubectl_user }}/.bashrc
        line: 'source <(kubectl completion bash)'
        insertafter: EOF
...

From the hosts keyword, one can see that these tasks are only enforced on the master node. However, two things are worth explaining.

The Way Ansible Roles

The first is the inclusion of the atomika_base role towards the top:

YAML
roles:
  - atomika_base

The official Ansible documentation states that "Roles let you automatically load related vars, files, tasks, handlers, and other Ansible artifacts based on a known file structure." The atomika_base role is included in all three of the Ansible YAML scripts that maintain the masters, co-masters, and workers of the cluster. Its purpose is to lay the base by making sure that tasks common to all three member types have been executed. As stated above, an Ansible role follows a specific directory structure that can contain file templates, tasks, and variable declarations, amongst other things. The Kubernetes and containerd versions are, for example, declared in the role's variables YAML:

YAML
k8s_version: 1.28.2-00
containerd_version: 1.6.24-1

In short, development can be fast-tracked through the use of roles that the Ansible community has developed and open-sourced at Ansible Galaxy.

Dealing the Difference

The second thing of interest is that although variables can be passed in from the command line using the --extra-vars switch, as seen higher up, Ansible can also be programmed to prompt when a value is not set:

YAML
vars_prompt:
  - name: "control_plane_ep"
    prompt: "Enter the DNS name of the control plane load balancer?"
    private: no
  - name: "kubectl_user"
    prompt: "Enter the name of the existing user that will execute kubectl commands?"
    private: no

Here, prompts are specified to ask for the user that should have kubectl access and the IP address of the control plane.
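For fully unattended runs where neither prompts nor long command lines are wanted, the same values could also be supplied from a small vars file. A sketch, with a hypothetical file name and the values taken from the command shown earlier:

YAML
# master_vars.yml (hypothetical) - pass it with:
#   ansible-playbook atomika/k8s_master_init.yml -i atomika/atomika_inventory.yml --extra-vars "@master_vars.yml"
control_plane_ep: "192.168.68.119"
kubectl_user: "atmin"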
Should the script execute without error, the state of the cluster should be:

atmin@kxsmaster2:~$ kubectl get pods -o wide -A
NAMESPACE      NAME                                 READY   STATUS    RESTARTS   AGE     IP               NODE         NOMINATED NODE   READINESS GATES
kube-flannel   kube-flannel-ds-mg8mr                1/1     Running   0          114s    192.168.68.111   kxsmaster2   <none>           <none>
kube-system    coredns-5dd5756b68-bkzgd             1/1     Running   0          3m31s   10.244.0.6       kxsmaster2   <none>           <none>
kube-system    coredns-5dd5756b68-vzkw2             1/1     Running   0          3m31s   10.244.0.7       kxsmaster2   <none>           <none>
kube-system    etcd-kxsmaster2                      1/1     Running   0          3m45s   192.168.68.111   kxsmaster2   <none>           <none>
kube-system    kube-apiserver-kxsmaster2            1/1     Running   0          3m45s   192.168.68.111   kxsmaster2   <none>           <none>
kube-system    kube-controller-manager-kxsmaster2   1/1     Running   7          3m45s   192.168.68.111   kxsmaster2   <none>           <none>
kube-system    kube-proxy-69cqq                     1/1     Running   0          3m32s   192.168.68.111   kxsmaster2   <none>           <none>
kube-system    kube-scheduler-kxsmaster2            1/1     Running   7          3m45s   192.168.68.111   kxsmaster2   <none>           <none>

All the pods required to make up the control plane run on the one master node. Should you wish to run a single-node cluster for development purposes, do not forget to remove the taint that prevents scheduling on the master node(s):

kubectl taint node --all node-role.kubernetes.io/control-plane:NoSchedule-

However, a cluster consisting of one machine is not a true cluster. This will be addressed next.

Kubelets of the Cluster, Unite!

Kubernetes, as an orchestration automaton, needs to be resilient by definition. Consequently, developers and a buggy CI/CD pipeline should not touch the master nodes by scheduling load on them. Kubernetes therefore increases resilience by expecting multiple worker nodes to join the cluster and carry the load:

ansible-playbook atomika/k8s_workers.yml -i atomika/atomika_inventory.yml

The content of k8s_workers.yml is:

YAML
# k8s_workers.yml
---
- hosts: workers, vmworkers
  remote_user: "{{ ansible_user }}"
  become: yes
  become_method: sudo
  gather_facts: yes
  connection: ssh

  roles:
    - atomika_base

- hosts: masters
  tasks:
    - name: Get the token for joining the nodes with the Kubernetes master.
      become_user: "{{ ansible_user }}"
      shell: kubeadm token create --print-join-command
      register: kubernetes_join_command

    - name: Generate the secret for joining the nodes with the Kubernetes master.
      become: yes
      shell: kubeadm init phase upload-certs --upload-certs
      register: kubernetes_join_secret

    - name: Copy join command to local file.
      become: false
      local_action: copy content="{{ kubernetes_join_command.stdout_lines[0] }} --certificate-key {{ kubernetes_join_secret.stdout_lines[2] }}" dest="/tmp/kubernetes_join_command" mode=0700

- hosts: workers, vmworkers
  # remote_user: k8s5gc
  # become: yes
  # become_method: sudo
  become_user: root
  gather_facts: yes
  connection: ssh

  tasks:
    - name: Copy join command to worker nodes.
      become: yes
      become_method: sudo
      become_user: root
      copy:
        src: /tmp/kubernetes_join_command
        dest: /tmp/kubernetes_join_command
        mode: 0700

    - name: Join the worker nodes with the master.
      become: yes
      become_method: sudo
      become_user: root
      command: sh /tmp/kubernetes_join_command
      register: joined_or_not

    - debug:
        msg: "{{ joined_or_not.stdout }}"
...

There are two blocks of tasks: one with tasks to be executed on the master and one with tasks for the workers. This ability of Ansible to direct blocks of tasks to different member types is vital for cluster formation. The first block extracts and augments the join command from the master, while the second block executes it on the worker nodes.
The top and bottom portions of the console output can be seen here:

janrb@dquick:~/atomika$ ansible-playbook atomika/k8s_workers.yml -i atomika/atomika_inventory.yml
[WARNING]: Could not match supplied host pattern, ignoring: vmworkers

PLAY [workers, vmworkers] *************************************************************************

TASK [Gathering Facts] ****************************************************************************
ok: [client1]
ok: [client2]
...........................................................................
TASK [debug] **************************************************************************************
ok: [client1] => {
    "msg": "[preflight] Running pre-flight checks\n[preflight] Reading configuration from the cluster...\n[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'\n[kubelet-start] Writing kubelet configuration to file \"/var/lib/kubelet/config.yaml\"\n[kubelet-start] Writing kubelet environment file with flags to file \"/var/lib/kubelet/kubeadm-flags.env\"\n[kubelet-start] Starting the kubelet\n[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...\n\nThis node has joined the cluster:\n* Certificate signing request was sent to apiserver and a response was received.\n* The Kubelet was informed of the new secure connection details.\n\nRun 'kubectl get nodes' on the control-plane to see this node join the cluster."
}
ok: [client2] => {
    "msg": "[preflight] Running pre-flight checks\n[preflight] Reading configuration from the cluster...\n[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'\n[kubelet-start] Writing kubelet configuration to file \"/var/lib/kubelet/config.yaml\"\n[kubelet-start] Writing kubelet environment file with flags to file \"/var/lib/kubelet/kubeadm-flags.env\"\n[kubelet-start] Starting the kubelet\n[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...\n\nThis node has joined the cluster:\n* Certificate signing request was sent to apiserver and a response was received.\n* The Kubelet was informed of the new secure connection details.\n\nRun 'kubectl get nodes' on the control-plane to see this node join the cluster."
}

PLAY RECAP ****************************************************************************************
client1 : ok=3  changed=1 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0
client1 : ok=23 changed=6 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0
client2 : ok=23 changed=6 unreachable=0 failed=1 skipped=0 rescued=0 ignored=0
host1   : ok=4  changed=3 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0

Four tasks were executed on the master node to determine the join command, while 23 tasks ran on each of the two clients to ensure they joined the cluster. The tasks from the atomika_base role account for most of the worker tasks.
The cluster now consists of the following nodes, with the master hosting the pods making up the control plane:

atmin@kxsmaster2:~$ kubectl get nodes -o wide
NAME         STATUS   ROLES           AGE   VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
k8xclient1   Ready    <none>          23m   v1.28.2   192.168.68.116   <none>        Ubuntu 20.04.6 LTS   5.4.0-163-generic   containerd://1.6.24
kxsclient2   Ready    <none>          23m   v1.28.2   192.168.68.113   <none>        Ubuntu 20.04.6 LTS   5.4.0-163-generic   containerd://1.6.24
kxsmaster2   Ready    control-plane   34m   v1.28.2   192.168.68.111   <none>        Ubuntu 20.04.6 LTS   5.4.0-163-generic   containerd://1.6.24

With Nginx deployed, the following pods will be running on the various members of the cluster:

atmin@kxsmaster2:~$ kubectl get pods -A -o wide
NAMESPACE      NAME                                 READY   STATUS    RESTARTS        AGE   IP               NODE         NOMINATED NODE   READINESS GATES
default        nginx-7854ff8877-g8lvh               1/1     Running   0               20s   10.244.1.2       kxsclient2   <none>           <none>
kube-flannel   kube-flannel-ds-4dgs5                1/1     Running   1 (8m58s ago)   26m   192.168.68.116   k8xclient1   <none>           <none>
kube-flannel   kube-flannel-ds-c7vlb                1/1     Running   1 (8m59s ago)   26m   192.168.68.113   kxsclient2   <none>           <none>
kube-flannel   kube-flannel-ds-qrwnk                1/1     Running   0               35m   192.168.68.111   kxsmaster2   <none>           <none>
kube-system    coredns-5dd5756b68-pqp2s             1/1     Running   0               37m   10.244.0.9       kxsmaster2   <none>           <none>
kube-system    coredns-5dd5756b68-rh577             1/1     Running   0               37m   10.244.0.8       kxsmaster2   <none>           <none>
kube-system    etcd-kxsmaster2                      1/1     Running   1               37m   192.168.68.111   kxsmaster2   <none>           <none>
kube-system    kube-apiserver-kxsmaster2            1/1     Running   1               37m   192.168.68.111   kxsmaster2   <none>           <none>
kube-system    kube-controller-manager-kxsmaster2   1/1     Running   8               37m   192.168.68.111   kxsmaster2   <none>           <none>
kube-system    kube-proxy-bdzlv                     1/1     Running   1 (8m58s ago)   26m   192.168.68.116   k8xclient1   <none>           <none>
kube-system    kube-proxy-ln4fx                     1/1     Running   1 (8m59s ago)   26m   192.168.68.113   kxsclient2   <none>           <none>
kube-system    kube-proxy-ndj7w                     1/1     Running   0               37m   192.168.68.111   kxsmaster2   <none>           <none>
kube-system    kube-scheduler-kxsmaster2            1/1     Running   8               37m   192.168.68.111   kxsmaster2   <none>           <none>

All that remains is to expose the Nginx pod to the outside world using a NodePort, a LoadBalancer, or an Ingress (a minimal sketch follows after the conclusion). Maybe more on that in another article...

Conclusion

This posting explained the basic concepts of Ansible using scripts that boot up a K8s cluster. The reader should now grasp enough concepts to understand tutorials and search-engine results, and to make a start at using Ansible to set up infrastructure using code.
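As a pointer for that final step, here is a minimal sketch of a NodePort Service that could expose the Nginx deployment above. The selector assumes the deployment carries the default app=nginx label; adjust it to match the actual labels in your cluster.

YAML
apiVersion: v1
kind: Service
metadata:
  name: nginx-nodeport          # hypothetical service name
spec:
  type: NodePort
  selector:
    app: nginx                  # assumed label on the Nginx pods
  ports:
    - port: 80                  # cluster-internal port of the service
      targetPort: 80            # container port served by Nginx
      nodePort: 30080           # any free port in the default 30000-32767 range

Once applied with kubectl apply -f, Nginx should answer on port 30080 of any node's IP address.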
If you follow the platform engineering trend, you'll have heard people talking about paved paths and golden paths. They're sometimes used as synonyms but can also reflect different approaches. In this article, I discuss the critical difference between paved paths and golden paths in platform engineering.

Paved Paths

If you were a city planner designing a park, you'd need to provide areas for people to stop and routes to pass through. The perfect park is a public space you can see, stroll through, and use for recreational activities. One of the tricky parts of park design is where to place the paths. People who want to take a leisurely walk prefer a winding scenic stroll with pleasant views. But people passing through from the coffee shop to their office prefer more direct routes. Most parks offer either winding trails or a series of direct paths forming a giant X crisscrossing the park.

As an alternative to planning the routes through the park, you can let people use it for a while. People crossing the park will wear tracks in the grass that indicate where paths may be most helpful. They literally vote with their feet. Building these paths after the demand for a route is visible means you're more likely to put them in the right place, though you can't please everyone. This approach is also dangerous, as Sam Walter Foss warned. His 1895 poem tells how a playful calf influences the design of a large city: the city's main road gets built around the trail the calf made through the woods some 300 years earlier.

Paved Paths in Software

You can use the paved path technique in software. You can observe how users currently achieve a goal and use what you find to generate a design for the software. Before people created software source control systems, the source wall was a common way to avoid change collisions. To create a source wall, you'd write each file name on a sticky note and add it to the wall. If you wanted to edit a file, you'd go to the source wall and find the file you wanted to change. If the sticky note was on the wall, you could take it back to your desk and make your edit. If you couldn't find the sticky note, you had to wait for its return before making your change. This manual process meant your changes would never clash, and you'd never overwrite another developer's changes.

The first source control system paved this path. You'd check out a file, and the system would prevent another developer from changing it until you checked it back in. This pattern was the paved path equivalent of the source wall. If you use a modern source control system, you'll notice it doesn't work this way. That's because something better has replaced the paved path: a golden path.

Golden Paths

Going back to the city park example, if you had a design in mind for the use of different spaces, you might want to tempt people to take a slightly longer route that lets you make better use of the overall space. Instead of optimizing the park for commuters, you want to balance the many different uses. In this case, you'd need to find ways to attract people to your preferred route to avoid them damaging the grass and planting. In Brisbane, the park in the South Bank area features just such a path. Instead of offering an efficient straight line between common destinations, it has sweeping curves along its entire length. The path has a decorative arbor that provides shelter from the hot sun and light showers.
Instead of attempting to block other routes with fences, people are attracted to the path because they can stay cool or dry. The Brisbane Grand Arbor walk is 150 meters longer than a straight-line route, but it creates spaces for restaurants, a pond, a rainforest walk, and a lagoon. Golden paths are a system-level design technique. They're informed by a deep understanding of the different purposes of the space.

Golden Paths in Platform Engineering

In platform engineering, golden paths are just like Brisbane's Grand Arbor. Instead of forcing developers to do things a certain way, you design the internal developer platform to attract developers by reducing their burden and removing pain points. It's the optimal space between anything goes and forced standardization. Golden paths provide a route toward alignment. Say you have 5 teams, all using different continuous integration tools. As a platform engineer, you'd work out the best way to build, test, and package all the software and provide this as a golden path. It needs to be better than what developers currently do and easy to adopt, as you can't force it on a team. The teams that adopt the golden path have an easy life as far as their continuous integration activities are concerned. Nothing makes a platform more attractive than seeing happy users.

When done well, an internal developer platform may feel like a paved path to the developers, but it should reduce the overall cognitive load. This often involves both consolidation and standardization. You won't solve all developer pain at once. Platform engineers will need to go and see what pain exists and think about how they might design a product that will remove it. When you start this journey, it's worth understanding the patterns and anti-patterns of platform engineering.

Take the High Road

World champion weightlifter Jerzy Gregorek once said: "Hard choices, easy life. Easy choices, hard life." You need to make many hard choices to create a great internal developer platform. You have to decide which problems the platform will solve and which it won't. You need to determine when a feature should flex to meet the needs of a development team and when you should let them strike out on their own path. These hard choices are the difference between a golden path and a paved path. With a paved path, you can reduce the burden on developers, but the pain just moves into your platform team. A golden path reduces the total cognitive load for everyone by dedicating the platform team to eliminating it.

Happy deployments!
The rise of Kubernetes, cloud native, and microservices spawned major changes in the architectures and abstractions that developers use to create modern applications. In this multi-part series, I talk with some of the leading experts across various layers of the stack, from networking infrastructure to application infrastructure and middleware to telemetry data and modern observability concerns, to understand the emergent platform engineering patterns that are affecting developer workflow around cloud native. The first participant in our series is Thomas Graf, CTO and co-founder of Isovalent and the creator of Cilium, an open source, cloud-native solution for providing, securing, and observing network connectivity between workloads, fueled by the revolutionary kernel technology eBPF.

Q: We are nearly a decade into containers and Kubernetes (K8s was first released in September 2014). How would you characterize how things look different today than ten years ago, especially in terms of the old world of systems engineers and network administrators and a big dividing line between these operations concerns and the developers on the other side of the wall? What do you think are the big changes that DevOps and the evolution of platform engineering and site reliability engineering have ushered in, especially from the networking perspective?

A: Platform engineering has brought traditional systems engineers and network administrators a lot closer to developers. The rise of containers has simplified deployment not only for developers but also for platform engineering teams. Instead of serving machines, we are finally hosting applications. Unlike serverless, Kubernetes preserved some of the existing infrastructure abstractions, thus offering a more approachable evolutionary step. This allowed systems engineers and network administrators to step up and evolve into platform engineering, and with them, they brought decades of experience in how to operate enterprise infrastructure.

At the same time, platform engineering has brought a radical modernization of the networking layer. Application teams look at the network like they look at the internet: a giant, untrusted connectivity plane connecting everyone and everything. This requires platform engineering to rethink network security and bring in micro-segmentation, zero-trust security concepts, and mutual authentication. At the same time, this new, exciting world has to be connected to the world of existing infrastructure, which requires mapping the world of identity-based network security to the old world of virtual networks, MPLS, and ACLs.

Q: The popularity of the major cloud service platforms and all of the thrust behind the last ten years of SaaS applications and cloud native created a ton of new abstractions for the level at which developers are able to interact with underlying cloud and network infrastructure. How has this trend of raising the abstraction for interacting with infrastructure affected developers, specifically?

A: The cost of developing an initial MVP for a new application has decreased enormously. A small team of developers can take an application to early product maturity within weeks or months. This is achieved with automation on the cloud infrastructure side, managed databases and cloud services, and the composability of microservices.
The cost of this shortcut is typically paid later in the form of the cloud bill, challenges in portability across cloud providers, and the inevitable consequence of having to develop a proper multi-cloud strategy as different application teams start developing on different cloud platforms. Developers are rightfully not concerned about infrastructure abstraction and infrastructure security early on; time to market is everything. Platform engineering teams then typically come in and help port the application to a Kubernetes platform to start standardizing it to corporate needs by elevating security and monitoring standards, decoupling dependencies, and preparing the new application for scale.

Q: What are the areas where it makes sense for developers to have to really think about underlying systems versus the ones where having a high degree of instrumentation or customization ("shift left") is going to be very important?

A: Every couple of years, there is a new term for what is a pretty logical software development practice: test early. Iterative development, agile development, test-driven development, and now shift-left. The combination of system-aware development and abstraction-based development has always been crucial, but equally important is to shift concern for resilience and supportability to the left. Looking back at the Apollo mission, all these concepts played a significant role. A lot of the software was obviously written in a system-specific language to be as efficient as possible. The navigation business logic, however, used an abstracted language in order to be able to perform complex vector computation. Last but not least, it was the concept of resilience that allowed the lander to overcome a faulty sensor that overloaded the computer, which would have prevented all required system components from grabbing enough CPU time.

Q: Despite the obvious immense popularity of cloud native, much of the world's infrastructure (especially in highly regulated industries) is still running in on-prem data centers. What does the future hold for all this legacy infrastructure and the millions of servers humming along in data centers? What are the implications for managing this mixed cloud-native infrastructure together with legacy data centers over time?

A: As with any other transformation, cloud native will take much longer than anticipated, but the benefits are so fundamental that anybody not undergoing the transformation is at risk of being disrupted from a technology perspective. The worlds of cloud native and data centers not only have to get to know each other but have to move in together for the foreseeable future. The typical enterprise-grade data center requirements have already come to cloud native. What we are now seeing is that some of the cloud-native concepts, such as further automation, better declarative approaches, and cleaner abstractions, are flowing from cloud native back into the world of data centers. Cloud-native solutions will have to learn to live in data centers, running as appliances, and typical data center requirements will have to be met in the world of the cloud in order for the two worlds to be able to talk to each other.

Q: What do you think are some of the modern checklist items that developers care most about in terms of their workflow and how platform engineering makes their lives more productive?
Broadly speaking, what are the conditions that are most desirable versus least desirable in terms of the build environment and toolchains that modern developers care about?

A: The checklist has changed quite a bit over the last few years. In the beginning, the checklist was all about getting one of each to try and form the best possible stack out of hundreds of possible options. As this usual phase of an early technology stack matures, the checklist has changed to limit the number of moving pieces and instead focus on core values such as developer efficiency, operational complexity, security risks, and total cost. This has led to a shift to managed offerings, wider platforms covering more aspects, and a focus on day-2 operational aspects and long-term cost over exclusively building the best possible platform. The least desirable outcome in building a platform for developers is to ramp it all up but fail to make it sustainable operationally and economically.

Q: What is the significance of eBPF in this overall context of the evolution of platform engineering and SRE patterns?

A: eBPF has become the amazing hidden helper below deck, making everything better and faster. Its magical value comes from its programmability. The operating system has become a little bit like hardware, which is really hard to change. Making the operating system agile and programmable again allows software infrastructure to keep up with the changing demands of platform engineering technology like Kubernetes. Fundamentally, eBPF is not only able to solve problems really well; even more importantly, it is a tremendous time-to-market hack for infrastructure and security software, and platform engineering is driven by continued innovation as platforms are still being built up and requirements keep piling up.

Q: What does Cilium give platform teams beyond the built-in capabilities of eBPF? What is the relationship between the two technologies, and what should platform engineering teams be doing with Cilium today (if they are not already)?

A: eBPF is fundamentally designed for kernel developers, and early adopters of eBPF were typically companies running their own kernel teams. Think of eBPF as you think of FPGAs or GPUs in the context of AI. As an enterprise, you can't go out there and buy a bunch of FPGAs or GPUs, stuff them into your data center, and then simply benefit from it. You will need people to build something with it. Cilium takes eBPF and utilizes it to implement the core networking, security, and observability needs of platform engineering teams. It does so without requiring platform engineering teams to learn how to build it themselves with eBPF. Kubernetes has created a whole new set of challenges in how to connect and secure workloads inside and outside of Kubernetes. eBPF is incredible at solving the problems of the new world of Kubernetes and is equally capable of translating that new world to the old legacy.

Q: How would you describe the overall evolution of networking, from the old scale-up days to distributed computing and commodity hardware, to virtual machines, then SDNs, and now where we are today? What do you think are the coolest trends in network infrastructure to watch today?

A: Networking has evolved over the years along with the needs of applications. What is interesting is that we are probably in the middle of one of the most significant shifts in networking but haven't fully realized it yet. Looking back, networking was all about connecting machines physically.
With Google and distributed computing, it became obvious that virtualization would play a massive role. As a consequence, software-defined networking was the networking shift that came along with it. But it still connected machines, and it inherited the vast majority of its building blocks from the physical networking world. Cloud networking took that network virtualization technology and added APIs and automation in front of it. With the rise of containers and Kubernetes, networking is now changing fundamentally, as we no longer connect machines; we connect applications. A modern cloud-native networking layer looks more like a messaging bus than a network to developers, but without requiring your applications to change in any way and while continuing to meet the strong security and performance requirements of a typical enterprise network. The cloud-native shift in networking will not just impact the Kubernetes networking layer. It will touch all aspects of connectivity, from L3/L4 north-south load balancers, network firewalls, and VPNs all the way up to L7 WAFs and L7 east-west load balancers.
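To ground the idea of connecting applications rather than machines, here is a minimal sketch of a label-based CiliumNetworkPolicy. The namespace, labels, and port are assumptions chosen for illustration: the policy selects workloads by identity (their labels) rather than by IP address or machine.

YAML
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-frontend-to-api   # hypothetical policy name
  namespace: demo               # assumed namespace
spec:
  endpointSelector:
    matchLabels:
      app: api                  # the policy applies to pods labeled app=api
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend       # only pods labeled app=frontend may connect
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP

Because the rule is expressed in terms of workload identity, it keeps working as pods are rescheduled and their IP addresses change.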