Maximizing Cloud Cost Efficiency: Intelligent Management of Non-Production Environments

Here's how to significantly reduce cloud costs for your non-production environments through intelligent resource management, along with the best practices to follow.

By Aruun Kumar · Jun. 03, 25 · Tutorial

In the fast-paced world of cloud computing, organizations continually seek ways to optimize their infrastructure spending. One of the most overlooked areas of potential cost savings lies in non-production environments, specifically the development, staging, and testing landscapes. Organizations transitioning to the cloud often carry over habits from traditional data centers, where maintaining multiple environments had minimal cost. This mindset persists despite the different cost dynamics in cloud environments. 

There are also misconceptions about resource management. For instance, many believe databases can't be stopped without risking data loss. However, modern cloud technologies allow for efficient start-stop operations, state persistence, and rapid environment restoration. Unlike mission-critical production infrastructure that requires constant availability, non-production environments can be dynamically managed with minimal friction and maximum cost efficiency. 

Ultimately, most organizations dramatically underestimate the potential savings achievable through intelligent resource management. Research indicates that companies can reduce their cloud infrastructure costs in non-production environments by 60-70% through strategic scheduling and optimization.

Understanding Cost Drain in Non-Production Environments

The problem is quite simple: your non-production environments often run continuously, even when no active work is being done. Even with global teams and distributed workflows, these environments consume compute power during off-hours and weekends when nobody is using them. The assumption that global teams require continuous 24/7 availability of non-production environments rarely holds, especially when most distributed teams operate across just 3-4 time zones.

Consider a typical global technology team: developers might span regions from the East Coast of the United States to India, covering a combined working-hours window of roughly 12-16 hours. This means that even in a "global" context, there are substantial windows of time when no active development is occurring.

The solution to this resource inefficiency is surprisingly straightforward. 

By implementing intelligent scheduling that aligns with actual team working hours, organizations can dramatically reduce unnecessary compute time. During off-peak periods, non-production environments can be safely powered down, preserving compute resources and significantly reducing cloud infrastructure costs. A mid-sized tech company reduced monthly cloud expenses by 65% through strategic scheduling, while a global enterprise saved more than $500,000 annually by optimizing non-production environments.

Comprehensive Resource Scheduling Strategy

The foundation of intelligent resource scheduling lies not in indiscriminately terminating resources, but in understanding the unique characteristics of each resource type. Successful optimization requires a holistic approach that balances cost reduction with development team productivity, ensuring that environments can be quickly brought online when needed while remaining efficiently managed during periods of inactivity.

Identifying Targetable Cloud Resources

Not all cloud resources are created equal when it comes to intelligent scheduling. Understanding which resources can be safely and efficiently managed requires a nuanced approach that considers their purpose, state, and impact on overall system performance.

  1. Compute Resources: Cloud compute resources represent processing power and typically include virtual machines, containers, and computational clusters. Unlike production systems that require constant availability, these environments often sit idle for significant periods. Development teams typically need intense computational resources, scaling capabilities, and replicas for failover testing during working hours, but these resources can be safely powered down during periods of low activity.
  2. Database and Caching Resources: Databases and caching layers have historically been considered "untouchable" when it comes to scheduling, simply because it was either impossible to "stop" these resources or to derive any cost savings from turning them off without risking data loss or significant performance degradation. However, contemporary cloud technologies have revolutionized this perspective. Targetable resources include:
    • Non-production database instances and clusters
    • Data warehouse instances and clusters
    • Cache instances and clusters
    • Database and cache replicas

Note that these services still incur charges for the underlying data storage; the savings come from pausing the compute charges.

  3. Network Resources: Network resources often represent a hidden cost in non-production environments. Load balancers, network interfaces, and auxiliary connectivity components continue consuming resources even when no active traffic is present. Potential optimization targets include:
    • Load balancers
    • Staging network interfaces
    • Network peering resources

The goal here is to maintain just enough connectivity to enable rapid environment restoration while eliminating unnecessary continuous resource allocation. Modern cloud platforms offer sophisticated mechanisms to preserve state and rapidly restart these environments, thereby minimizing disruptions.

Scheduling Optimization Approaches

Intelligent resource scheduling can significantly reduce cloud infrastructure costs. By strategically managing non-production environments, you can minimize unnecessary resource usage during off-peak hours while maintaining team productivity.

Key optimization strategies include:

  1. Work Week Optimization:
    • Automated resource shutdown during non-working hours
    • Dynamic scheduling across time zones
    • Reducing active resource time to 10-12 hours daily
  2. Weekend Shutdown:
    • Halting non-essential resources
    • Configurable exceptions for critical systems
    • Data preservation mechanisms
    • Automated Monday morning restarts

These approaches strike a balance between cost efficiency and operational flexibility, ensuring minimal disruption to development processes while maximizing savings on cloud infrastructure.

Technical Implementation Approach

The ideal approach for implementation is a comprehensive, serverless solution that significantly reduces costs without compromising development and testing capabilities. Below, we shall look at how this can be implemented on the AWS Cloud. 

At the heart of our solution are two AWS Step Functions state machines: one for shutdown and another for startup. Each state machine is triggered independently by Amazon EventBridge at scheduled times, initiating either the shutdown or startup sequence. These sequences are composed of multiple Lambda functions, each responsible for managing specific AWS resource types. Let's dive into the key components and their implementations:

1. AWS Step Functions State Machines

We create two separate state machines:

  • Shutdown state machine: Orchestrates the process of shutting down or scaling down resources.
  • Startup state machine: Manages the process of starting up or scaling up resources.

Each state machine contains a series of steps corresponding to different resource types (compute, database, caching, networking). This separation allows for independent scheduling and execution of shutdown and startup processes.

2. Lambda Functions

We implement separate AWS Lambda functions for each resource type. These functions are shared between both state machines but perform opposite actions based on whether they're called by the shutdown or startup process. They use the AWS SDK to interact with the respective service APIs. 

3. EventBridge Scheduler

We use Amazon EventBridge to trigger our Step Functions state machines at specific times:

  • Shutdown trigger: Scheduled to run every day at the end of the work day (e.g., 6:00 PM)
  • Startup trigger: Scheduled to run every weekday morning before the start of the work day (e.g., 8:00 AM)

These schedules can be adjusted based on your organization's specific working hours and needs.
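
As an illustration, here is a minimal sketch of the shutdown trigger using the EventBridge Scheduler API. The schedule name, ARNs, and time zone below are placeholders, not values from the solution itself:

Python
import boto3

scheduler = boto3.client('scheduler')

# Invoke the shutdown state machine at 6:00 PM on weekdays.
# All names and ARNs are placeholders for this sketch.
scheduler.create_schedule(
    Name='nonprod-shutdown',
    ScheduleExpression='cron(0 18 ? * MON-FRI *)',
    ScheduleExpressionTimezone='America/New_York',
    FlexibleTimeWindow={'Mode': 'OFF'},
    Target={
        'Arn': 'arn:aws:states:us-east-1:123456789012:stateMachine:ShutdownStateMachine',
        'RoleArn': 'arn:aws:iam::123456789012:role/SchedulerInvokeRole',
        'Input': '{"action": "shutdown"}'
    }
)

The startup trigger would be defined the same way, with a weekday-morning cron expression and an input of {"action": "startup"}.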

4. Resource-Specific Implementations

Let's explore how you could handle different types of AWS resources:

Compute Resources

EC2 Instances: Our Lambda function uses the EC2 API to stop instances during shutdown and start them during startup. A preferred approach to identify and filter instances that need to be stopped is to use tag-based identification. Below is a sample implementation in Python:

Python
 
import boto3

ec2 = boto3.client('ec2')

def lambda_handler(event, context):
    # Only act on instances whose current state matches the requested
    # action, so a shutdown run ignores already-stopped instances
    state = 'running' if event['action'] == 'shutdown' else 'stopped'

    instances = ec2.describe_instances(Filters=[
        {'Name': 'tag:Environment', 'Values': ['Development', 'Staging']},
        {'Name': 'instance-state-name', 'Values': [state]}
    ])

    instance_ids = [i['InstanceId']
                    for r in instances['Reservations']
                    for i in r['Instances']]

    if not instance_ids:
        return  # nothing to do

    if event['action'] == 'shutdown':
        ec2.stop_instances(InstanceIds=instance_ids)
    elif event['action'] == 'startup':
        ec2.start_instances(InstanceIds=instance_ids)


Using tags provides a lot of flexibility, as we can easily identify resources that need to be excluded, and we can mark which instances the automation itself stopped. For instance, you may have EC2 instances that were used for prototyping and were already stopped to save on cost. Unless the automation tags the instances it turns off, the startup run may inadvertently turn those instances back on.

Auto Scaling Groups: For Auto Scaling Groups, we adjust the desired capacity to zero during shutdown and restore it to the original value during startup. The automation can store the original capacity in an AWS Systems Manager Parameter Store, an Amazon DynamoDB table, or simply as a tag that can be referenced later.
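
As a sketch of this pattern, assuming the original sizing is saved in Parameter Store under a hypothetical /scheduler/asg/ prefix:

Python
import boto3

asg = boto3.client('autoscaling')
ssm = boto3.client('ssm')
PARAM_PREFIX = '/scheduler/asg/'  # hypothetical naming convention

def scale_asg(name, action):
    if action == 'shutdown':
        group = asg.describe_auto_scaling_groups(
            AutoScalingGroupNames=[name])['AutoScalingGroups'][0]
        # Record the original sizing so the startup run can restore it
        ssm.put_parameter(
            Name=PARAM_PREFIX + name,
            Value=f"{group['MinSize']},{group['MaxSize']},{group['DesiredCapacity']}",
            Type='String', Overwrite=True)
        asg.update_auto_scaling_group(AutoScalingGroupName=name,
                                      MinSize=0, MaxSize=0, DesiredCapacity=0)
    else:
        saved = ssm.get_parameter(
            Name=PARAM_PREFIX + name)['Parameter']['Value']
        min_size, max_size, desired = (int(v) for v in saved.split(','))
        asg.update_auto_scaling_group(AutoScalingGroupName=name,
                                      MinSize=min_size, MaxSize=max_size,
                                      DesiredCapacity=desired)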

Amazon EKS Node Groups: For EKS clusters with node groups, we can use the EKS API to save costs by scaling the node groups down to zero during off-hours, which in turn removes all provisioned EC2 instances in the node group. Once again, the node group's original minimum, maximum, and desired capacity values need to be stored prior to scaling down, using the approach mentioned earlier for Auto Scaling Groups.
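
A minimal sketch of the scale-down half, with the saved configuration persisted the same way as in the Auto Scaling Group example:

Python
import boto3

eks = boto3.client('eks')

def scale_down_nodegroup(cluster, nodegroup):
    # Capture the current scaling config so startup can restore it
    current = eks.describe_nodegroup(
        clusterName=cluster,
        nodegroupName=nodegroup)['nodegroup']['scalingConfig']
    # ... persist `current` (e.g., to Parameter Store), as shown earlier ...

    # Managed node groups require maxSize >= 1, so only minSize and
    # desiredSize can be taken all the way to zero
    eks.update_nodegroup_config(
        clusterName=cluster, nodegroupName=nodegroup,
        scalingConfig={'minSize': 0, 'maxSize': 1, 'desiredSize': 0})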

Amazon ECS Clusters: For EC2-backed clusters, we scale down the underlying Auto Scaling Group. For Fargate, we update the ECS service's desired count to zero after preserving the original count.
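
For the Fargate case, one way to preserve the original desired count is to stash it as a tag on the service itself; the tag key below is an assumed convention:

Python
import boto3

ecs = boto3.client('ecs')

def stop_fargate_service(cluster, service):
    desc = ecs.describe_services(cluster=cluster,
                                 services=[service])['services'][0]
    # Record the original desired count so the startup run can read it back
    ecs.tag_resource(resourceArn=desc['serviceArn'],
                     tags=[{'key': 'scheduler:desiredCount',
                            'value': str(desc['desiredCount'])}])
    ecs.update_service(cluster=cluster, service=service, desiredCount=0)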

Database & Caching Resources

As stated earlier, databases and cache instances are often overlooked when it comes to cost optimization solutions, due to a general perception that data loss will occur if an instance is stopped. With database workloads on the cloud, you are charged for both the compute and storage of the instances. Hence, just stopping your database instance can provide significant cost savings without risking any data loss. Let’s examine how to achieve this within the AWS ecosystem for various database and cache services.

Amazon RDS Instances: The RDS API supports stopping and starting database instances. For Aurora, you would stop and start the entire cluster rather than individual instances. As a reminder, stopping an instance saves the compute cost, but storage costs continue to accrue.
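
A minimal sketch of that logic (how you determine whether a given identifier is a standalone instance or an Aurora cluster is left to your tagging and discovery approach):

Python
import boto3

rds = boto3.client('rds')

def manage_rds(action, instance_id=None, cluster_id=None):
    if cluster_id:
        # Aurora: stop or start the whole cluster, not individual instances
        if action == 'shutdown':
            rds.stop_db_cluster(DBClusterIdentifier=cluster_id)
        else:
            rds.start_db_cluster(DBClusterIdentifier=cluster_id)
    elif action == 'shutdown':
        rds.stop_db_instance(DBInstanceIdentifier=instance_id)
    else:
        rds.start_db_instance(DBInstanceIdentifier=instance_id)

One caveat worth noting: AWS automatically restarts an RDS instance or Aurora cluster that has been stopped for seven consecutive days, so a daily stop/start schedule stays comfortably within that limit.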

Amazon DynamoDB: Unlike RDS and Aurora, DynamoDB has no notion of stopping. Instead, to optimize cost on provisioned-capacity tables, we adjust the read and write capacity units to their minimum values by calling the DynamoDB update APIs, after saving the original provisioned capacity values.
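
A sketch of the scale-down half for a provisioned-capacity table, with the original values persisted as in the earlier examples:

Python
import boto3

ddb = boto3.client('dynamodb')

def scale_down_table(table_name):
    current = ddb.describe_table(
        TableName=table_name)['Table']['ProvisionedThroughput']
    # ... persist current['ReadCapacityUnits'] and ['WriteCapacityUnits'] ...

    # 1 RCU / 1 WCU is the floor for provisioned-capacity tables
    ddb.update_table(TableName=table_name,
                     ProvisionedThroughput={'ReadCapacityUnits': 1,
                                            'WriteCapacityUnits': 1})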

Amazon ElastiCache: Similar to DynamoDB, ElastiCache does not provide a way to stop your instances. However, there are other ways to reduce costs: you could reduce the number of nodes to the minimum allowed, or switch to the smallest instance type that can handle the data volume. This requires additional logic in your Lambda functions, tailored to your needs. For instance, to maintain optimal performance, your application may require a two-node cluster with a large instance type from a memory-optimized family; as part of the stop function, you could reduce the cluster to a single node of a small instance type.
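
A sketch of that shutdown path for a Redis replication group; it assumes Multi-AZ is disabled (common in non-production) and that cache.t3.micro is an acceptable minimal node size:

Python
import boto3

elasticache = boto3.client('elasticache')

def shrink_replication_group(group_id):
    # Drop read replicas, leaving only the primary node
    # (NewReplicaCount=0 requires Multi-AZ to be disabled)
    elasticache.decrease_replica_count(ReplicationGroupId=group_id,
                                       NewReplicaCount=0,
                                       ApplyImmediately=True)
    # Then move the remaining node to a smaller instance type
    elasticache.modify_replication_group(ReplicationGroupId=group_id,
                                         CacheNodeType='cache.t3.micro',
                                         ApplyImmediately=True)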

Amazon OpenSearch Service: Similar to ElastiCache, you can scale down the number of data nodes and adjust instance types to the smallest feasible size.

Network Resources

When it comes to cost avoidance and optimization, network resources require special handling, as there is nothing to stop and start. Instead, the option is to tear down and recreate resources based on their type and usage.

As an example, resources such as NAT gateways, Transit Gateways, and VPC interface endpoints are charged by the hour, whereas others, such as VPC peering connections, gateway endpoints, or NACLs, are either free or do not incur hourly costs.

Hence, based on type and cost, certain network resources can be deleted during shutdown and recreated during startup, with configuration details stored in either a DynamoDB table or the Parameter Store. Another approach is to create a robust Infrastructure as Code (IaC) setup, allowing resources to be recreated as needed in a repeatable and consistent manner.
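
As a concrete example of the tear-down-and-recreate pattern, NAT gateways are billed hourly and could be handled as follows; the startup side assumes the subnet and Elastic IP allocation IDs were recorded at shutdown:

Python
import boto3

ec2 = boto3.client('ec2')

def shutdown_nat_gateway(nat_gateway_id):
    # The Elastic IP allocation survives deletion and can be reused at startup
    ec2.delete_nat_gateway(NatGatewayId=nat_gateway_id)

def startup_nat_gateway(subnet_id, allocation_id):
    new_id = ec2.create_nat_gateway(
        SubnetId=subnet_id,
        AllocationId=allocation_id)['NatGateway']['NatGatewayId']
    # Route table entries that pointed at the old gateway must be
    # updated to reference the new gateway ID
    return new_id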

By leveraging serverless technologies such as Step Functions, Lambda, and EventBridge, we gain several key advantages. For starters, the cost of running this solution is trivial compared to the savings it can achieve. It also scales automatically to environments of any size and carries minimal operational overhead.

Considerations and Best Practices

Implementing resource scheduling in your non-production environment isn't without challenges. Let's look at some considerations for implementing a solution like this, along with best practices to follow.

Minimal Workflow Disruption

  1. Design schedules around actual team working patterns.
  2. Provide clear communication about environment availability.
  3. Quick environment restart capabilities: Say your team decides to work over the weekend. In such situations, you don't want the team struggling to figure out how to bring the environment back up, or for the process to take too long. Referencing the solution discussed above, a simple way to address this is to give stakeholders an easy way to invoke the Startup state machine on demand, so stopped resources can be brought back quickly at any time.
  4. Implement proper startup and shutdown sequences: Designing correct startup and shutdown sequences is crucial for maintaining system integrity. For shutdown, begin with application servers and compute instances, followed by databases and caches, and finally turn off or delete networking components. Even among compute resources, order matters: if you use EKS node groups, scale down the node groups before touching the underlying Auto Scaling Groups or EC2 instances. For startup, reverse the shutdown order, and include health checks and dependency verifications at each step.
  5. Exclusion of critical resources: You may have critical resources that should never be turned off. Perhaps an EC2 instance needs to be available as a jump host at all times, or a database table is required for test automation that runs over the weekend. A tag-based approach to resource identification is a good practice here: add an exclusion tag, say "DO_NOT_TURN_OFF" set to "TRUE", to critical resources, and have the shutdown execution ignore resources carrying that tag (see the sketch after this list).
  6. Resource Tagging: Implementing a comprehensive tagging strategy to enable cost allocation and visibility into optimization is crucial. Use tags for environment, project, cost center, and other relevant metadata, so that it is easy to identify and filter resources. Additionally, as mentioned earlier, it is also wise to add a tag to resources as part of the shutdown process and remove this tag as part of the startup process. This eliminates the guesswork involved in determining how a specific resource was turned off. The other added advantage is avoiding the inadvertent startup of resources that were already in a shutdown state. 
  7. Context Management & Handling of Stateful Resources
  • Develop strategies for databases and persistent storage: While database resources, such as RDS instances, can be turned off and on without data loss, other resources, like caches, may require special handling, especially if the resource is deleted rather than stopped. In such cases, it is important to take proper snapshots before the shutdown or deletion.
  • Preserve instance state and configuration: For certain resources, it is essential to preserve the configuration at shutdown so that it can be restored to its original state later. For instance, when scaling an EKS node group down to zero, it is important to store the minimum, maximum, and desired count configuration in persistent storage; a DynamoDB table or the AWS SSM Parameter Store are good options in the AWS environment.
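
Extending the EC2 example from earlier, the exclusion check from item 5 could be a simple tag filter; the DO_NOT_TURN_OFF key is an example convention, not an AWS-defined tag:

Python
def is_excluded(tags):
    # Skip resources explicitly marked as always-on
    return any(t['Key'] == 'DO_NOT_TURN_OFF' and t['Value'].upper() == 'TRUE'
               for t in tags)

# Applied to the describe_instances response from the earlier Lambda:
instance_ids = [
    i['InstanceId']
    for r in instances['Reservations']
    for i in r['Instances']
    if not is_excluded(i.get('Tags', []))
]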

Other Strategies & Tools to Consider

While the solution above is specific to turning off resources when they are not needed, it is also essential to consider general best practices for cost optimization. A few of the most important ones are:

  • Remove orphaned resources: Periodically review resources that are no longer in use and clean them up. These may include unused database backups, buckets, log files, etc.
  • Avoid misconfigured or over-provisioned resources: Don't forget to release storage volumes when the compute instances they were attached to are terminated; otherwise, the storage costs continue. Similarly, pay close attention to resource sizing; it's easy to scale resources up in the cloud later, so start small.
  • Don't overlook newer technologies: Cloud providers frequently release new features, such as new instance types or serverless options, that may save you more money. Embrace these cost-saving features to achieve organizational benefits.

In addition to the AWS-native solution for turning off resources we discussed earlier, there are other open-source and third-party solutions available for cost optimization of non-production environments in the cloud.

Recommended solutions include:

  • Cloud Custodian: Enables you to manage your cloud resources by filtering, tagging, and then applying actions to them. It has a feature that supports “off-hours” scheduling for certain types of resources.
  • Cloudability: A FinOps platform that provides cost management and optimization across cloud providers. It includes features for rightsizing, budget alerts, and scheduled scale-down of resources.
  • CloudCheckr: Provides a cloud management platform with cost optimization and offers automated actions for resource management based on defined policies.

Conclusion: Beyond Immediate Savings

The journey to cloud cost efficiency starts with reimagining how you think about and manage your development infrastructure. By implementing intelligent resource scheduling, your organization can unlock significant financial and operational benefits. At the same time, cloud cost optimization isn't just about reducing monthly bills. It's about creating more efficient development practices, promoting responsible cloud resource utilization, and enabling more sustainable technological growth.

Opinions expressed by DZone contributors are their own.
