Configuring Autoscaling for Various Machine Learning Model Types

This tutorial explores how to choose the right autoscaling policy in AWS SageMaker to maximize resource utilization.

By Koushik Balaji Venkatesan · Dec. 17, 24 · Tutorial

AWS SageMaker has simplified the deployment of machine learning models at scale, but configuring effective autoscaling policies is still crucial for balancing performance and cost. This article demonstrates how to set up various autoscaling policies using the AWS CDK in TypeScript, focusing on request-, memory-, and CPU-based autoscaling for different ML model types.

Model Types Based on Invocation Patterns

At a high level, model deployment in SageMaker can be broken into three main categories based on invocation patterns:

1. Synchronous (Real-Time) Inference

Synchronous inference is suitable when end users need an immediate response, such as during an interaction with a website. It fits applications that demand quick response times with minimal delay. Examples include fraud detection in financial transactions and dynamic pricing in ride-sharing.

2. Asynchronous Inference

Asynchronous inference is ideal for handling queued requests when it is acceptable to process messages with a delay. This type of inference is preferred when the model is memory- or CPU-intensive and takes more than a few seconds to respond. Examples include video content moderation, analytics pipelines, and Natural Language Processing (NLP) over textbooks.

3. Batch Processing

Batch processing is ideal when data needs to be processed in chunks (batches) or at scheduled intervals. It is mostly used for non-time-sensitive tasks whose output only needs to be available periodically, such as daily or weekly. One example is periodic recommendation updates, where an online retailer generates personalized product recommendations for its customers each week. Another is predictive maintenance, where daily jobs predict which machines are likely to fail.

Types of Autoscaling in SageMaker With CDK

Autoscaling in SageMaker can be tailored to optimize different aspects of performance based on the model’s workload:

1. Request-Based Autoscaling

Use Case

Best for real-time (synchronous) inference models that need low latency.

Example

Scaling up during peak shopping seasons for an e-commerce recommendation model to meet high traffic.

2. Memory-Based Autoscaling

Use Case

Beneficial for memory-intensive models, such as large NLP models.

Example

Increasing the instance count when memory usage exceeds 80% for image-processing models that handle high-resolution inputs.

3. CPU-Based Autoscaling

Use Case

Ideal for CPU-bound models that require more processing power.

Example

Scaling for high-performance recommendation engines by adjusting instance count as CPU usage reaches 75%.
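
As a rough mental model of how a threshold such as 75% CPU or 80% memory turns into an instance count, the sketch below scales the current fleet in proportion to how far the observed metric is from the target. This is a simplification for illustration, not the service's exact algorithm: it ignores cooldowns and alarm evaluation periods, and the function name `desiredInstances` and its parameters are hypothetical.

```typescript
// Approximate instance count a target-based policy moves toward.
// Simplified model: real scaling also applies cooldowns and alarm evaluation periods.
function desiredInstances(
  currentInstances: number,
  metricValue: number, // observed per-instance metric, e.g. average CPU %
  targetValue: number, // policy target for the same metric
  minCapacity: number,
  maxCapacity: number,
): number {
  if (targetValue <= 0) {
    throw new Error('targetValue must be positive');
  }
  // Scale the fleet so the per-instance metric falls back to the target.
  const raw = Math.ceil((currentInstances * metricValue) / targetValue);
  // Never go below the minimum or above the maximum capacity.
  return Math.min(maxCapacity, Math.max(minCapacity, raw));
}

console.log(desiredInstances(2, 90, 75, 1, 3)); // 3: two instances at 90% CPU need a third
console.log(desiredInstances(2, 30, 75, 1, 3)); // 1: load is low enough for one instance
```

The clamp at the end is why the minimum and maximum capacity matter as much as the target value: a target that is too low simply pins the endpoint at its maximum capacity.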

Configuring Autoscaling Policies in TypeScript CDK

Below is an example configuration of different scaling policies using AWS CDK with TypeScript:

TypeScript
 
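import * as cdk from 'aws-cdk-lib';
import * as autoscaling from 'aws-cdk-lib/aws-applicationautoscaling';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as sagemaker from 'aws-cdk-lib/aws-sagemaker';
import { Construct } from 'constructs';

// Declared at module level so the private helper methods can read it
// (a const declared inside the constructor is out of scope in other methods).
const AUTO_SCALE_CONFIG = {
  MIN_CAPACITY: 1,
  MAX_CAPACITY: 3,
  TARGET_REQUESTS_PER_INSTANCE: 1000,
  CPU_TARGET_UTILIZATION: 70,
  MEMORY_TARGET_UTILIZATION: 80,
};

const ENDPOINT_NAME = 'YourEndpointName';  // Replace with your endpoint name
const VARIANT_NAME = 'prod';

export class SageMakerEndpointStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Create SageMaker endpoint configuration and endpoint
    const endpointConfig = new sagemaker.CfnEndpointConfig(this, 'EndpointConfig', {
      productionVariants: [{
        modelName: 'YourModelName',  // Replace with your model name
        variantName: VARIANT_NAME,
        initialInstanceCount: AUTO_SCALE_CONFIG.MIN_CAPACITY,
        initialVariantWeight: 1,
        instanceType: 'ml.c5.2xlarge',
      }],
    });

    const endpoint = new sagemaker.CfnEndpoint(this, 'Endpoint', {
      endpointName: ENDPOINT_NAME,
      // CfnEndpoint takes the configuration's name, not the construct itself
      endpointConfigName: endpointConfig.attrEndpointConfigName,
    });

    // Set up autoscaling. CfnEndpoint is an L1 construct with no autoscaling
    // helper methods, so the variant is registered with Application Auto Scaling directly.
    const scalableTarget = new autoscaling.ScalableTarget(this, 'ScalableTarget', {
      serviceNamespace: autoscaling.ServiceNamespace.SAGEMAKER,
      scalableDimension: 'sagemaker:variant:DesiredInstanceCount',
      resourceId: `endpoint/${ENDPOINT_NAME}/variant/${VARIANT_NAME}`,
      minCapacity: AUTO_SCALE_CONFIG.MIN_CAPACITY,
      maxCapacity: AUTO_SCALE_CONFIG.MAX_CAPACITY,
    });
    // The endpoint must exist before it can be registered as a scalable target
    scalableTarget.node.addDependency(endpoint);

    this.setupRequestBasedAutoscaling(scalableTarget);
    this.setupCpuBasedAutoscaling(scalableTarget);
    this.setupMemoryBasedAutoscaling(scalableTarget);
    this.setupStepAutoscaling(scalableTarget);
  }

  // CloudWatch metric published per endpoint variant (instance-level metrics)
  private endpointMetric(metricName: string): cloudwatch.Metric {
    return new cloudwatch.Metric({
      namespace: '/aws/sagemaker/Endpoints',
      metricName,
      dimensionsMap: { EndpointName: ENDPOINT_NAME, VariantName: VARIANT_NAME },
      statistic: 'Average',
      period: cdk.Duration.minutes(1),
    });
  }

  // Target tracking on invocations per instance (measured per minute)
  private setupRequestBasedAutoscaling(scalableTarget: autoscaling.ScalableTarget) {
    scalableTarget.scaleToTrackMetric('ScaleOnRequestCount', {
      predefinedMetric: autoscaling.PredefinedMetric.SAGEMAKER_VARIANT_INVOCATIONS_PER_INSTANCE,
      targetValue: AUTO_SCALE_CONFIG.TARGET_REQUESTS_PER_INSTANCE,
    });
  }

  // Target tracking on CPU utilization
  private setupCpuBasedAutoscaling(scalableTarget: autoscaling.ScalableTarget) {
    scalableTarget.scaleToTrackMetric('ScaleOnCpuUtilization', {
      customMetric: this.endpointMetric('CPUUtilization'),
      targetValue: AUTO_SCALE_CONFIG.CPU_TARGET_UTILIZATION,
    });
  }

  // Target tracking on memory utilization
  private setupMemoryBasedAutoscaling(scalableTarget: autoscaling.ScalableTarget) {
    scalableTarget.scaleToTrackMetric('ScaleOnMemoryUtilization', {
      customMetric: this.endpointMetric('MemoryUtilization'),
      targetValue: AUTO_SCALE_CONFIG.MEMORY_TARGET_UTILIZATION,
    });
  }

  // Example configuration of step scaling.
  // Changes the number of instances to scale up and down based on CPU usage
  private setupStepAutoscaling(scalableTarget: autoscaling.ScalableTarget) {
    scalableTarget.scaleOnMetric('StepScalingOnCpu', {
      metric: this.endpointMetric('CPUUtilization'),
      scalingSteps: [
        { upper: 30, change: -1 },
        { lower: 60, upper: 70, change: 0 },  // explicit upper bound avoids overlapping steps
        { lower: 70, upper: 100, change: 1 },
        { lower: 100, change: 2 },
      ],
      adjustmentType: autoscaling.AdjustmentType.CHANGE_IN_CAPACITY,
    });
  }
}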
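
To make the step table in the configuration above easier to reason about, the following sketch evaluates it for a given CPU reading. It assumes each step covers a non-overlapping interval (so the 60 step is read as ending at 70) and that readings outside every step cause no change. The `Step` type and `capacityChange` function are hypothetical helpers, not part of the CDK.

```typescript
// One row of a step-scaling table: metric interval and the capacity change it triggers.
interface Step {
  lower?: number; // inclusive lower bound; omitted means "no lower bound"
  upper?: number; // exclusive upper bound; omitted means "no upper bound"
  change: number; // instances to add (positive) or remove (negative)
}

// Same steps as the configuration above, with explicit non-overlapping bounds (assumed reading).
const CPU_STEPS: Step[] = [
  { upper: 30, change: -1 },
  { lower: 60, upper: 70, change: 0 },
  { lower: 70, upper: 100, change: 1 },
  { lower: 100, change: 2 },
];

// Returns the capacity change for a metric reading, or 0 when no step matches.
function capacityChange(metricValue: number, steps: Step[]): number {
  for (const step of steps) {
    const lower = step.lower ?? -Infinity;
    const upper = step.upper ?? Infinity;
    if (metricValue >= lower && metricValue < upper) {
      return step.change;
    }
  }
  return 0;
}

console.log(capacityChange(20, CPU_STEPS));  // -1
console.log(capacityChange(45, CPU_STEPS));  // 0 (falls in the gap between steps)
console.log(capacityChange(85, CPU_STEPS));  // 1
console.log(capacityChange(120, CPU_STEPS)); // 2
```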


Note: CPU metrics can exceed 100% on instances with multiple cores because SageMaker reports CPU utilization as the total across all cores, so the maximum is 100% multiplied by the number of vCPUs.
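
One practical consequence is that a target expressed as a percentage of a single core needs to be multiplied by the vCPU count before it is used as a metric target. The sketch below shows that conversion. The lookup table is a small illustrative subset rather than an exhaustive list, and the helper names are hypothetical.

```typescript
// vCPU counts for a few instance types (illustrative subset, extend as needed).
const VCPUS_BY_INSTANCE_TYPE: Record<string, number> = {
  'ml.c5.xlarge': 4,
  'ml.c5.2xlarge': 8,
  'ml.c5.4xlarge': 16,
};

function vcpusFor(instanceType: string): number {
  const vcpus = VCPUS_BY_INSTANCE_TYPE[instanceType];
  if (vcpus === undefined) {
    throw new Error(`Unknown instance type: ${instanceType}`);
  }
  return vcpus;
}

// Converts a summed-across-cores CPU reading into an average per-core percentage.
function perCoreUtilization(reportedCpuPercent: number, instanceType: string): number {
  return reportedCpuPercent / vcpusFor(instanceType);
}

// Converts a desired per-core percentage into the value to use as the metric target.
function metricTargetFor(perCoreTargetPercent: number, instanceType: string): number {
  return perCoreTargetPercent * vcpusFor(instanceType);
}

console.log(perCoreUtilization(560, 'ml.c5.2xlarge')); // 70
console.log(metricTargetFor(70, 'ml.c5.2xlarge'));     // 560
```

Under this reading, a target of 70 on an 8-vCPU instance corresponds to less than 9% average use per core, so it is worth checking which interpretation a given threshold is meant to express.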

Balancing Autoscaling Policies by Model Type

Autoscaling policies differ based on model requirements:

Batch Processing Models

Request- or CPU-based autoscaling is ideal here because capacity shrinks when traffic is low or absent, so you pay for fewer idle resources.

Synchronous Models

To respond swiftly to spikes in real-time requests, request-based autoscaling is recommended.

Asynchronous Models

CPU-based scaling with longer cooldowns prevents over-scaling and maintains efficiency.
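
The recommendations above can be captured as a small lookup so that each deployment picks its policy from the model type. This is only a sketch: the cooldown values are placeholder assumptions chosen to show async models using longer cooldowns, and the type and function names are hypothetical.

```typescript
type ModelType = 'batch' | 'synchronous' | 'asynchronous';
type ScalingMetric = 'requests' | 'cpu';

interface PolicyChoice {
  metrics: ScalingMetric[];    // metrics suggested for this model type
  scaleOutCooldownSec: number; // placeholder value, tune per workload
  scaleInCooldownSec: number;  // placeholder value, tune per workload
}

// Mapping of model type to suggested policy, following the guidance above.
const POLICY_BY_MODEL_TYPE: Record<ModelType, PolicyChoice> = {
  batch: { metrics: ['requests', 'cpu'], scaleOutCooldownSec: 120, scaleInCooldownSec: 300 },
  synchronous: { metrics: ['requests'], scaleOutCooldownSec: 60, scaleInCooldownSec: 300 },
  // Longer cooldowns for asynchronous models to avoid over-scaling.
  asynchronous: { metrics: ['cpu'], scaleOutCooldownSec: 300, scaleInCooldownSec: 900 },
};

function recommendPolicy(modelType: ModelType): PolicyChoice {
  return POLICY_BY_MODEL_TYPE[modelType];
}

console.log(recommendPolicy('asynchronous').metrics); // [ 'cpu' ]
```

In a CDK stack, the two cooldown values would typically be passed as durations through the scale-in and scale-out cooldown options of the scaling policy.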

Key Considerations for Effective Autoscaling

1. Cost Management

Tune metric thresholds to optimize cost without sacrificing performance.

2. Latency Requirements

For real-time models, prioritize low-latency scaling; batch and asynchronous models can handle slight delays.

3. Performance Monitoring

Regularly assess model performance and adjust configurations to adapt to demand changes.

As in the example above, you can combine more than one autoscaling policy to balance cost and performance, but doing so makes setup and management more complex.
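
Part of that complexity comes from how policies interact. Application Auto Scaling favors availability: it scales out when any policy calls for it and scales in only when all of them agree. A reasonable approximation, sketched below with a hypothetical `combinedDesiredCapacity` helper, is to take the largest capacity any single policy asks for and clamp it to the configured limits.

```typescript
// Approximates the capacity chosen when several policies target the same endpoint variant:
// the largest request wins, bounded by the configured minimum and maximum.
function combinedDesiredCapacity(
  desiredByPolicy: number[], // capacity each policy would choose on its own
  minCapacity: number,
  maxCapacity: number,
): number {
  if (desiredByPolicy.length === 0) {
    throw new Error('at least one policy is required');
  }
  const highest = Math.max(...desiredByPolicy);
  return Math.min(maxCapacity, Math.max(minCapacity, highest));
}

// Request policy wants 2, CPU policy wants 3, memory policy wants 1.
console.log(combinedDesiredCapacity([2, 3, 1], 1, 3)); // 3
// All policies are satisfied with one instance, so the fleet can scale in.
console.log(combinedDesiredCapacity([1, 1, 1], 1, 3)); // 1
```

This is why adding a policy can only keep capacity the same or raise it: a single conservative threshold is enough to hold the endpoint at a higher instance count.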

Conclusion

With AWS SageMaker's autoscaling options, you can effectively configure resource management for different types of ML models. By setting up request-based, memory-based, and CPU-based policies in CDK, you can optimize both performance and costs across diverse applications.

Autoscaling Machine learning

Opinions expressed by DZone contributors are their own.
