Configuring Autoscaling for Various Machine Learning Model Types
This tutorial explores how to choose the right autoscaling policy for maximum resource utilization with AWS SageMaker.
AWS SageMaker has simplified the deployment of machine learning models at scale. Configuring effective autoscaling policies is crucial for balancing performance and cost. This article demonstrates how to set up various autoscaling policies using the TypeScript CDK, focusing on request-, memory-, and CPU-based autoscaling for different ML model types.
Model Types Based on Invocation Patterns
At a high level, model deployment in SageMaker can be broken into three main categories based on invocation patterns:
1. Synchronous (Real-Time) Inference
Synchronous inference is suitable when an immediate response must be returned to end users, such as during a live website interaction. This approach is well suited to applications that demand quick response times with minimal delay, such as fraud detection in financial transactions and dynamic pricing in ride-sharing.
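For illustration, here is a minimal sketch of a synchronous invocation using the AWS SDK for JavaScript v3; the endpoint name and payload shape are placeholders to adapt to your model's contract:

import { SageMakerRuntimeClient, InvokeEndpointCommand } from '@aws-sdk/client-sagemaker-runtime';

const client = new SageMakerRuntimeClient({});

async function invokeRealtime() {
  // Hypothetical endpoint name and JSON payload; adapt to your model
  const response = await client.send(new InvokeEndpointCommand({
    EndpointName: 'YourEndpointName',
    ContentType: 'application/json',
    Body: JSON.stringify({ features: [1.2, 3.4] })
  }));
  // The prediction comes back inline, within the request/response cycle
  console.log(new TextDecoder().decode(response.Body));
}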
2. Asynchronous Inference
Asynchronous inference is ideal for handling queued requests when it is acceptable to process messages with a delay. This type of inference is preferred when the model is memory- or CPU-intensive and takes more than a few seconds to respond. Examples include video content moderation, analytics pipelines, and Natural Language Processing (NLP) on long texts such as textbooks.
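As a sketch, an asynchronous endpoint is invoked with a pointer to input staged in S3 rather than an inline payload; the endpoint name, bucket, and key below are placeholders, and the endpoint must be created with an async inference configuration:

import { SageMakerRuntimeClient, InvokeEndpointAsyncCommand } from '@aws-sdk/client-sagemaker-runtime';

const client = new SageMakerRuntimeClient({});

async function invokeAsync() {
  // Hypothetical S3 input location for the queued request
  const response = await client.send(new InvokeEndpointAsyncCommand({
    EndpointName: 'YourAsyncEndpointName',
    InputLocation: 's3://your-bucket/inputs/request-001.json'
  }));
  // The call returns immediately; results land at the endpoint's configured S3 output path
  console.log(response.OutputLocation);
}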
3. Batch Processing
Batch processing is ideal when data needs to be processed in chunks (batches) or at scheduled intervals. It is mostly used for non-time-sensitive tasks where the output only needs to be available periodically, such as daily or weekly. Examples include periodic recommendation updates, where an online retailer generates personalized product recommendations for its customers weekly, and predictive maintenance, where daily jobs predict which machines are likely to fail.
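In SageMaker, batch workloads typically run as transform jobs rather than against a persistent endpoint. A minimal sketch, with hypothetical job, model, and S3 names:

import { SageMakerClient, CreateTransformJobCommand } from '@aws-sdk/client-sagemaker';

const client = new SageMakerClient({});

async function runWeeklyBatch() {
  // Hypothetical names and S3 paths; trigger on a schedule, e.g., via EventBridge
  await client.send(new CreateTransformJobCommand({
    TransformJobName: 'weekly-recommendations-2024-01-01',
    ModelName: 'YourModelName',
    TransformInput: {
      DataSource: { S3DataSource: { S3DataType: 'S3Prefix', S3Uri: 's3://your-bucket/batch-input/' } }
    },
    TransformOutput: { S3OutputPath: 's3://your-bucket/batch-output/' },
    TransformResources: { InstanceType: 'ml.m5.xlarge', InstanceCount: 1 }
  }));
}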
Types of Autoscaling in SageMaker With CDK
Autoscaling in SageMaker can be tailored to optimize different aspects of performance based on the model’s workload:
1. Request-Based Autoscaling
Use Case
Best for real-time (synchronous) inference models that need low latency.
Example
Scaling up during peak shopping seasons for an e-commerce recommendation model to meet high traffic.
2. Memory-Based Autoscaling
Use Case
Beneficial for memory-intensive models, such as large NLP models.
Example
Increasing the instance count when memory usage exceeds 80% for image processing models that handle high-resolution inputs.
3. CPU-Based Autoscaling
Use Case
Ideal for CPU-bound models that require more processing power.
Example
Scaling for high-performance recommendation engines by adjusting instance count as CPU usage reaches 75%.
Configuring Autoscaling Policies in TypeScript CDK
Below is an example configuration of different scaling policies using the AWS CDK in TypeScript. SageMaker endpoint autoscaling is implemented through Application Auto Scaling, which registers the endpoint variant's instance count as a scalable target:
import * as cdk from 'aws-cdk-lib';
import * as sagemaker from 'aws-cdk-lib/aws-sagemaker';
import * as autoscaling from 'aws-cdk-lib/aws-applicationautoscaling';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import { Construct } from 'constructs';

// Declared at module scope so the helper methods below can read it
const AUTO_SCALE_CONFIG = {
  MIN_CAPACITY: 1,
  MAX_CAPACITY: 3,
  TARGET_REQUESTS_PER_INSTANCE: 1000,
  CPU_TARGET_UTILIZATION: 70,
  MEMORY_TARGET_UTILIZATION: 80
};

const ENDPOINT_NAME = 'YourEndpointName'; // Replace with your endpoint name
const VARIANT_NAME = 'prod';

export class SageMakerEndpointStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Create the SageMaker endpoint and its configuration
    const endpointConfig = new sagemaker.CfnEndpointConfig(this, 'EndpointConfig', {
      productionVariants: [{
        modelName: 'YourModelName', // Replace with your model name
        variantName: VARIANT_NAME,
        initialInstanceCount: AUTO_SCALE_CONFIG.MIN_CAPACITY,
        instanceType: 'ml.c5.2xlarge'
      }]
    });
    const endpoint = new sagemaker.CfnEndpoint(this, 'Endpoint', {
      endpointName: ENDPOINT_NAME,
      endpointConfigName: endpointConfig.attrEndpointConfigName
    });

    // Register the variant's instance count as an Application Auto Scaling target
    const scalableTarget = new autoscaling.ScalableTarget(this, 'ScalableTarget', {
      serviceNamespace: autoscaling.ServiceNamespace.SAGEMAKER,
      resourceId: `endpoint/${ENDPOINT_NAME}/variant/${VARIANT_NAME}`,
      scalableDimension: 'sagemaker:variant:DesiredInstanceCount',
      minCapacity: AUTO_SCALE_CONFIG.MIN_CAPACITY,
      maxCapacity: AUTO_SCALE_CONFIG.MAX_CAPACITY
    });
    scalableTarget.node.addDependency(endpoint); // The endpoint must exist first

    this.setupRequestBasedAutoscaling(scalableTarget);
    this.setupCpuBasedAutoscaling(scalableTarget);
    this.setupMemoryBasedAutoscaling(scalableTarget);
    this.setupStepAutoscaling(scalableTarget);
  }

  // Target tracking on the predefined invocations-per-instance metric
  private setupRequestBasedAutoscaling(scalableTarget: autoscaling.ScalableTarget) {
    scalableTarget.scaleToTrackMetric('ScaleOnRequestCount', {
      predefinedMetric: autoscaling.PredefinedMetric.SAGEMAKER_VARIANT_INVOCATIONS_PER_INSTANCE,
      targetValue: AUTO_SCALE_CONFIG.TARGET_REQUESTS_PER_INSTANCE
    });
  }

  // Target tracking on the endpoint's CPUUtilization metric
  private setupCpuBasedAutoscaling(scalableTarget: autoscaling.ScalableTarget) {
    scalableTarget.scaleToTrackMetric('ScaleOnCpuUtilization', {
      customMetric: this.endpointMetric('CPUUtilization'),
      targetValue: AUTO_SCALE_CONFIG.CPU_TARGET_UTILIZATION
    });
  }

  // Target tracking on the endpoint's MemoryUtilization metric
  private setupMemoryBasedAutoscaling(scalableTarget: autoscaling.ScalableTarget) {
    scalableTarget.scaleToTrackMetric('ScaleOnMemoryUtilization', {
      customMetric: this.endpointMetric('MemoryUtilization'),
      targetValue: AUTO_SCALE_CONFIG.MEMORY_TARGET_UTILIZATION
    });
  }

  // Example configuration of step scaling:
  // changes the instance count up or down based on CPU usage bands
  private setupStepAutoscaling(scalableTarget: autoscaling.ScalableTarget) {
    scalableTarget.scaleOnMetric('StepScalingOnCpu', {
      metric: this.endpointMetric('CPUUtilization'),
      scalingSteps: [
        { upper: 30, change: -1 },            // below 30% CPU: remove an instance
        { lower: 70, upper: 100, change: 1 }, // 70-100%: add one (30-70% is the no-op band)
        { lower: 100, change: 2 }             // above 100% (multi-core): add two
      ],
      adjustmentType: autoscaling.AdjustmentType.CHANGE_IN_CAPACITY
    });
  }

  // Per-instance metrics that SageMaker endpoints emit to CloudWatch
  private endpointMetric(metricName: string): cloudwatch.Metric {
    return new cloudwatch.Metric({
      namespace: '/aws/sagemaker/Endpoints',
      metricName,
      dimensionsMap: { EndpointName: ENDPOINT_NAME, VariantName: VARIANT_NAME },
      statistic: 'Average'
    });
  }
}
Note: SageMaker's CPUUtilization metric can exceed 100% on multi-core instances because it sums utilization across all cores; the step-scaling bands above account for this.
Balancing Autoscaling Policies by Model Type
Autoscaling policies differ based on model requirements:
Batch Processing Models
Request- or CPU-based autoscaling is ideal here, since capacity scales down when traffic is low and you do not pay for idle resources.
Synchronous Models
Request-based autoscaling is recommended so the endpoint can respond swiftly to spikes in real-time traffic.
Asynchronous Models
CPU-based scaling with longer cooldowns prevents over-scaling and maintains efficiency, as sketched below.
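For example, cooldowns can be attached to a target-tracking policy. A sketch that would sit inside the stack class above, reusing its scalable target and metric helper; the durations are assumptions to tune per workload:

// Longer cooldowns damp oscillation for asynchronous, CPU-bound models
scalableTarget.scaleToTrackMetric('AsyncCpuScaling', {
  customMetric: this.endpointMetric('CPUUtilization'),
  targetValue: AUTO_SCALE_CONFIG.CPU_TARGET_UTILIZATION,
  scaleOutCooldown: cdk.Duration.minutes(5),  // wait 5 minutes between scale-out steps
  scaleInCooldown: cdk.Duration.minutes(15)   // scale in more conservatively
});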
Key Considerations for Effective Autoscaling
1. Cost Management
Tune metric thresholds to optimize cost without sacrificing performance.
2. Latency Requirements
For real-time models, prioritize low-latency scaling; batch and asynchronous models can handle slight delays.
3. Performance Monitoring
Regularly assess model performance and adjust configurations to adapt to changes in demand; a CloudWatch alarm, as sketched below, can surface when thresholds need revisiting.
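As one monitoring sketch, an alarm on the endpoint's ModelLatency metric flags when scaling settings need review. This would also live inside the stack class above; the p90 statistic and 500 ms threshold are assumptions to adjust per model:

// Alarm when p90 latency stays above 500 ms (500,000 microseconds, a hypothetical threshold)
new cloudwatch.Alarm(this, 'ModelLatencyAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/SageMaker',
    metricName: 'ModelLatency', // emitted in microseconds
    dimensionsMap: { EndpointName: ENDPOINT_NAME, VariantName: VARIANT_NAME },
    statistic: 'p90'
  }),
  threshold: 500000,
  evaluationPeriods: 3
});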
As in the example above, more than one autoscaling policy can be combined to balance cost and performance, though doing so increases the complexity of setup and management.
Conclusion
With AWS SageMaker's autoscaling options, you can effectively configure resource management for different types of ML models. By setting up request-based, memory-based, and CPU-based policies in CDK, you can optimize both performance and costs across diverse applications.