Configuring Autoscaling for Various Machine Learning Model Types
This tutorial explores how to choose the right autoscaling policy for maximum resource utilization with AWS SageMaker.
AWS SageMaker has simplified the deployment of machine learning models at scale. Configuring effective autoscaling policies is crucial for balancing performance and cost. This article demonstrates how to set up various autoscaling policies using the TypeScript CDK, focusing on request-, memory-, and CPU-based autoscaling for different ML model types.
Model Types Based on Invocation Patterns
At a high level, model deployment in SageMaker can be broken into three main categories based on invocation patterns:
1. Synchronous (Real-Time) Inference
Synchronous inference is suitable when an immediate response must be returned to end users, such as during a live website interaction. This approach is well suited to applications that demand quick response times with minimal delay, such as fraud detection in financial transactions and dynamic pricing in ride-sharing.
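For illustration, here is a minimal sketch of a synchronous invocation using the AWS SDK for JavaScript v3; the endpoint name and payload shape are placeholders to adapt to your model's contract:

import { SageMakerRuntimeClient, InvokeEndpointCommand } from '@aws-sdk/client-sagemaker-runtime';

const client = new SageMakerRuntimeClient({});

async function invokeRealtime() {
  // Hypothetical endpoint name and JSON payload; adapt to your model
  const response = await client.send(new InvokeEndpointCommand({
    EndpointName: 'YourEndpointName',
    ContentType: 'application/json',
    Body: JSON.stringify({ features: [1.2, 3.4] })
  }));
  // The prediction comes back inline, within the request/response cycle
  console.log(new TextDecoder().decode(response.Body));
}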
2. Asynchronous Inference
Asynchronous inference is ideal for handling queued requests when it is acceptable to process messages with a delay. This type of inference is preferred when the model is memory- or CPU-intensive and takes more than a few seconds to respond. Examples include video content moderation, analytics pipelines, and Natural Language Processing (NLP) on long texts such as textbooks.
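As a sketch, an asynchronous endpoint is invoked with a pointer to input staged in S3 rather than an inline payload; the endpoint name, bucket, and key below are placeholders, and the endpoint must be created with an async inference configuration:

import { SageMakerRuntimeClient, InvokeEndpointAsyncCommand } from '@aws-sdk/client-sagemaker-runtime';

const client = new SageMakerRuntimeClient({});

async function invokeAsync() {
  // Hypothetical S3 input location for the queued request
  const response = await client.send(new InvokeEndpointAsyncCommand({
    EndpointName: 'YourAsyncEndpointName',
    InputLocation: 's3://your-bucket/inputs/request-001.json'
  }));
  // The call returns immediately; results land at the endpoint's configured S3 output path
  console.log(response.OutputLocation);
}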
3. Batch Processing
Batch processing is ideal when data needs to be processed in chunks (batches) or at scheduled intervals. It is mostly used for non-time-sensitive tasks where the output only needs to be available periodically, such as daily or weekly. Examples include periodic recommendation updates, where an online retailer generates personalized product recommendations for its customers weekly, and predictive maintenance, where daily jobs predict which machines are likely to fail.
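In SageMaker, batch workloads typically run as transform jobs rather than against a persistent endpoint. A minimal sketch, with hypothetical job, model, and S3 names:

import { SageMakerClient, CreateTransformJobCommand } from '@aws-sdk/client-sagemaker';

const client = new SageMakerClient({});

async function runWeeklyBatch() {
  // Hypothetical names and S3 paths; trigger on a schedule, e.g., via EventBridge
  await client.send(new CreateTransformJobCommand({
    TransformJobName: 'weekly-recommendations-2024-01-01',
    ModelName: 'YourModelName',
    TransformInput: {
      DataSource: { S3DataSource: { S3DataType: 'S3Prefix', S3Uri: 's3://your-bucket/batch-input/' } }
    },
    TransformOutput: { S3OutputPath: 's3://your-bucket/batch-output/' },
    TransformResources: { InstanceType: 'ml.m5.xlarge', InstanceCount: 1 }
  }));
}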
Types of Autoscaling in SageMaker With CDK
Autoscaling in SageMaker can be tailored to optimize different aspects of performance based on the model’s workload:
1. Request-Based Autoscaling
Use Case
Best for real-time (synchronous) inference models that need low latency.
Example
Scaling up during peak shopping seasons for an e-commerce recommendation model to meet high traffic.
2. Memory-Based Autoscaling
Use Case
Beneficial for memory-intensive models, such as large NLP models.
Example
Increasing the instance count when memory usage exceeds 80% for image processing models that handle high-resolution inputs.
3. CPU-Based Autoscaling
Use Case
Ideal for CPU-bound models that require more processing power.
Example
Scaling for high-performance recommendation engines by adjusting instance count as CPU usage reaches 75%.
Configuring Autoscaling Policies in TypeScript CDK
Below is an example configuration of different scaling policies using the AWS CDK in TypeScript. SageMaker endpoint autoscaling is implemented through Application Auto Scaling, which registers the endpoint variant's instance count as a scalable target:
import * as cdk from 'aws-cdk-lib';
import * as sagemaker from 'aws-cdk-lib/aws-sagemaker';
import * as autoscaling from 'aws-cdk-lib/aws-applicationautoscaling';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import { Construct } from 'constructs';

// Declared at module scope so the helper methods below can read it
const AUTO_SCALE_CONFIG = {
  MIN_CAPACITY: 1,
  MAX_CAPACITY: 3,
  TARGET_REQUESTS_PER_INSTANCE: 1000,
  CPU_TARGET_UTILIZATION: 70,
  MEMORY_TARGET_UTILIZATION: 80
};

const ENDPOINT_NAME = 'YourEndpointName'; // Replace with your endpoint name
const VARIANT_NAME = 'prod';

export class SageMakerEndpointStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Create the SageMaker endpoint and its configuration
    const endpointConfig = new sagemaker.CfnEndpointConfig(this, 'EndpointConfig', {
      productionVariants: [{
        modelName: 'YourModelName', // Replace with your model name
        variantName: VARIANT_NAME,
        initialInstanceCount: AUTO_SCALE_CONFIG.MIN_CAPACITY,
        instanceType: 'ml.c5.2xlarge'
      }]
    });
    const endpoint = new sagemaker.CfnEndpoint(this, 'Endpoint', {
      endpointName: ENDPOINT_NAME,
      endpointConfigName: endpointConfig.attrEndpointConfigName
    });

    // Register the variant's instance count as an Application Auto Scaling target
    const scalableTarget = new autoscaling.ScalableTarget(this, 'ScalableTarget', {
      serviceNamespace: autoscaling.ServiceNamespace.SAGEMAKER,
      resourceId: `endpoint/${ENDPOINT_NAME}/variant/${VARIANT_NAME}`,
      scalableDimension: 'sagemaker:variant:DesiredInstanceCount',
      minCapacity: AUTO_SCALE_CONFIG.MIN_CAPACITY,
      maxCapacity: AUTO_SCALE_CONFIG.MAX_CAPACITY
    });
    scalableTarget.node.addDependency(endpoint); // The endpoint must exist first

    this.setupRequestBasedAutoscaling(scalableTarget);
    this.setupCpuBasedAutoscaling(scalableTarget);
    this.setupMemoryBasedAutoscaling(scalableTarget);
    this.setupStepAutoscaling(scalableTarget);
  }

  // Target tracking on the predefined invocations-per-instance metric
  private setupRequestBasedAutoscaling(scalableTarget: autoscaling.ScalableTarget) {
    scalableTarget.scaleToTrackMetric('ScaleOnRequestCount', {
      predefinedMetric: autoscaling.PredefinedMetric.SAGEMAKER_VARIANT_INVOCATIONS_PER_INSTANCE,
      targetValue: AUTO_SCALE_CONFIG.TARGET_REQUESTS_PER_INSTANCE
    });
  }

  // Target tracking on the endpoint's CPUUtilization metric
  private setupCpuBasedAutoscaling(scalableTarget: autoscaling.ScalableTarget) {
    scalableTarget.scaleToTrackMetric('ScaleOnCpuUtilization', {
      customMetric: this.endpointMetric('CPUUtilization'),
      targetValue: AUTO_SCALE_CONFIG.CPU_TARGET_UTILIZATION
    });
  }

  // Target tracking on the endpoint's MemoryUtilization metric
  private setupMemoryBasedAutoscaling(scalableTarget: autoscaling.ScalableTarget) {
    scalableTarget.scaleToTrackMetric('ScaleOnMemoryUtilization', {
      customMetric: this.endpointMetric('MemoryUtilization'),
      targetValue: AUTO_SCALE_CONFIG.MEMORY_TARGET_UTILIZATION
    });
  }

  // Example configuration of step scaling:
  // changes the instance count up or down based on CPU usage bands
  private setupStepAutoscaling(scalableTarget: autoscaling.ScalableTarget) {
    scalableTarget.scaleOnMetric('StepScalingOnCpu', {
      metric: this.endpointMetric('CPUUtilization'),
      scalingSteps: [
        { upper: 30, change: -1 },            // below 30% CPU: remove an instance
        { lower: 70, upper: 100, change: 1 }, // 70-100%: add one (30-70% is the no-op band)
        { lower: 100, change: 2 }             // above 100% (multi-core): add two
      ],
      adjustmentType: autoscaling.AdjustmentType.CHANGE_IN_CAPACITY
    });
  }

  // Per-instance metrics that SageMaker endpoints emit to CloudWatch
  private endpointMetric(metricName: string): cloudwatch.Metric {
    return new cloudwatch.Metric({
      namespace: '/aws/sagemaker/Endpoints',
      metricName,
      dimensionsMap: { EndpointName: ENDPOINT_NAME, VariantName: VARIANT_NAME },
      statistic: 'Average'
    });
  }
}
Note: SageMaker's CPUUtilization metric can exceed 100% on multi-core instances because it sums utilization across all cores; the step-scaling bands above account for this.
Balancing Autoscaling Policies by Model Type
Autoscaling policies differ based on model requirements:
Batch Processing Models
Request- or CPU-based autoscaling is ideal here, since capacity scales down when traffic is low and you do not pay for idle resources.
Synchronous Models
Request-based autoscaling is recommended so the endpoint can respond swiftly to spikes in real-time traffic.
Asynchronous Models
CPU-based scaling with longer cooldowns prevents over-scaling and maintains efficiency, as sketched below.
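For example, cooldowns can be attached to a target-tracking policy. A sketch that would sit inside the stack class above, reusing its scalable target and metric helper; the durations are assumptions to tune per workload:

// Longer cooldowns damp oscillation for asynchronous, CPU-bound models
scalableTarget.scaleToTrackMetric('AsyncCpuScaling', {
  customMetric: this.endpointMetric('CPUUtilization'),
  targetValue: AUTO_SCALE_CONFIG.CPU_TARGET_UTILIZATION,
  scaleOutCooldown: cdk.Duration.minutes(5),  // wait 5 minutes between scale-out steps
  scaleInCooldown: cdk.Duration.minutes(15)   // scale in more conservatively
});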
Key Considerations for Effective Autoscaling
1. Cost Management
Tune metric thresholds to optimize cost without sacrificing performance.
2. Latency Requirements
For real-time models, prioritize low-latency scaling; batch and asynchronous models can handle slight delays.
3. Performance Monitoring
Regularly assess model performance and adjust configurations to adapt to changes in demand; a CloudWatch alarm, as sketched below, can surface when thresholds need revisiting.
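As one monitoring sketch, an alarm on the endpoint's ModelLatency metric flags when scaling settings need review. This would also live inside the stack class above; the p90 statistic and 500 ms threshold are assumptions to adjust per model:

// Alarm when p90 latency stays above 500 ms (500,000 microseconds, a hypothetical threshold)
new cloudwatch.Alarm(this, 'ModelLatencyAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/SageMaker',
    metricName: 'ModelLatency', // emitted in microseconds
    dimensionsMap: { EndpointName: ENDPOINT_NAME, VariantName: VARIANT_NAME },
    statistic: 'p90'
  }),
  threshold: 500000,
  evaluationPeriods: 3
});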
As in the example above, more than one autoscaling policy can be combined to balance cost and performance, though doing so increases the complexity of setup and management.
Conclusion
With AWS SageMaker's autoscaling options, you can effectively configure resource management for different types of ML models. By setting up request-based, memory-based, and CPU-based policies in CDK, you can optimize both performance and costs across diverse applications.