Serverless Machine Learning: Running AI Models Without Managing Infrastructure
This article shows developers how to deploy and serve ML models without managing servers, clusters, or VMs, reducing time-to-market and cognitive overhead.
Serverless machine learning refers to deploying ML inference code without provisioning or managing servers. Developers use Function-as-a-Service (FaaS) platforms (e.g., AWS Lambda, Azure Functions) to run model predictions on demand. This approach provides automatic scaling, pay-per-use billing, and low operational overhead.
Key advantages of serverless ML include:
- No infrastructure management: Focus on model logic instead of servers or clusters.
- Automatic scaling: Each request can spin up a new function instance, supporting high concurrency and variable traffic.
- Cost efficiency: You pay only for the compute time used, making it ideal for bursty or infrequent workloads.
Serverless functions have constraints: execution time limits (e.g., 15 minutes on AWS Lambda, 5–10 minutes on Azure Functions under the Consumption plan), limited memory and CPU, and stateless execution (models must be reloaded after cold starts). We address these constraints in the best practices section. Below, we examine implementations on AWS and Azure.
Serverless ML on AWS (Lambda With SageMaker and S3)
AWS offers two common serverless ML patterns: using AWS Lambda with SageMaker endpoints, or running inference entirely within Lambda with a model stored in S3 (or on EFS).
Serverless ML architecture on AWS: a user invokes an API, which triggers an AWS Lambda function. The Lambda loads the ML model from Amazon S3, runs the inference, and returns the prediction. In an alternative setup, the Lambda function calls an Amazon SageMaker endpoint to perform the inference. The entire pipeline scales automatically and requires no server management.
AWS Lambda + SageMaker
Amazon SageMaker is a fully managed ML service that hosts models at HTTPS endpoints. In this pattern, a Lambda function acts as the front-end to invoke the SageMaker endpoint. SageMaker handles model hosting and auto-scaling (including GPU instances), while the Lambda handles incoming requests. For example, an API Gateway can forward requests to the Lambda, which then calls SageMaker.
Implementation: Train and deploy your model to a SageMaker endpoint (real-time or serverless). Then write a Node.js Lambda function that calls SageMaker Runtime's invokeEndpoint API. Example:
// AWS Lambda handler (Node.js) to call a SageMaker endpoint for inference
const AWS = require('aws-sdk');
const sagemakerRuntime = new AWS.SageMakerRuntime();

exports.handler = async (event) => {
  // Parse input from the event (assuming HTTP API Gateway proxy integration)
  const inputData = JSON.parse(event.body || "{}");
  const params = {
    EndpointName: "my-model-endpoint", // SageMaker endpoint name
    Body: JSON.stringify(inputData),   // input payload for the model
    ContentType: "application/json"
  };
  try {
    // Invoke the SageMaker endpoint
    const result = await sagemakerRuntime.invokeEndpoint(params).promise();
    // The result Body is a binary buffer; convert to string, then JSON
    const prediction = JSON.parse(result.Body.toString());
    return {
      statusCode: 200,
      body: JSON.stringify({ prediction })
    };
  } catch (err) {
    console.error("Error invoking model endpoint:", err);
    return { statusCode: 500, body: "Inference error" };
  }
};
In this code, the Lambda parses the request, calls the SageMaker endpoint "my-model-endpoint", and returns its output. Ensure the Lambda's IAM role can invoke the endpoint, and configure the endpoint name (e.g., via an environment variable, as sketched below).
Use this Lambda+SageMaker pattern when your model is large or requires special hardware. SageMaker handles the heavy lifting of hosting and scaling, while Lambda provides a secure, serverless API layer.
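For configuration, here is a minimal sketch of reading the endpoint name from an environment variable rather than hard-coding it. The variable name SAGEMAKER_ENDPOINT_NAME and the buildInvokeParams helper are assumptions for illustration, not part of the original example; the Lambda's execution role also needs the sagemaker:InvokeEndpoint permission on that endpoint.
// Hypothetical configuration sketch: SAGEMAKER_ENDPOINT_NAME is an assumed
// environment variable name, set in the Lambda's configuration. The execution
// role also needs the sagemaker:InvokeEndpoint permission on this endpoint.
function buildInvokeParams(inputData) {
  const endpointName = process.env.SAGEMAKER_ENDPOINT_NAME || "my-model-endpoint";
  return {
    EndpointName: endpointName,        // read from configuration, not hard-coded
    Body: JSON.stringify(inputData),   // input payload for the model
    ContentType: "application/json"
  };
}

module.exports = { buildInvokeParams };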
AWS Lambda + S3 (Function-hosted Model)
For smaller models or infrequent inference, you can host the model inside Lambda. Store the model file (e.g., ONNX or TensorFlow) in Amazon S3. On a cold start, the Lambda downloads the model from S3 into memory or disk, then performs inference. Subsequent warm invocations reuse the cached model, reducing latency. This yields a fully serverless pipeline using only Lambda and S3 (and optionally EFS for very large models).
Packaging options:
- Include in deployment: For tiny models, embed the model in the Lambda deployment package (up to 250 MB unzipped) or in a container image (up to 10 GB). This avoids a runtime download, but a larger package can lengthen cold starts.
- Download from S3: Store the model in S3. In your code, check if the model is loaded; if not, retrieve it from S3 on the first invocation and initialize it.
- Use Amazon EFS: For very large models, mount an EFS filesystem to the Lambda. Keep the model on EFS to avoid repeated downloads.
Implementation (S3 download example): The Lambda uses the AWS SDK to fetch the model on cold start, then caches it. Example:
// Pseudocode for a Lambda that loads a model from S3 and runs inference
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

let model; // caches the loaded model in memory across warm invocations

exports.handler = async (event) => {
  if (!model) {
    // On cold start, load the model file from S3
    console.log("Loading model from S3...");
    const s3Object = await s3.getObject({
      Bucket: "my-model-bucket",
      Key: "models/awesome-model.onnx"
    }).promise();
    const modelBinary = s3Object.Body;
    // Initialize the ML model (using an appropriate library, e.g., ONNX Runtime)
    model = initializeModelFromBinary(modelBinary);
  }
  // Parse input and run prediction
  const inputData = JSON.parse(event.body || "{}");
  const prediction = model.predict(inputData);
  return {
    statusCode: 200,
    body: JSON.stringify({ prediction })
  };
};
This code checks whether model is loaded; if not, it downloads and initializes it. The model is then cached for future calls.
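The initializeModelFromBinary call above is a placeholder. As a minimal sketch of one way to implement it with onnxruntime-node, assuming an ONNX model with a single float32 input and an input payload shaped like { features: [...] } (both assumptions, not part of the original example):
// Hypothetical implementation sketch of initializeModelFromBinary using ONNX Runtime.
// The input/output handling assumes one float32 input of shape [1, n]; adjust to
// your model's actual input names, types, and shapes.
const ort = require('onnxruntime-node');

async function initializeModelFromBinary(modelBinary) {
  // Create an inference session directly from the downloaded model bytes
  const session = await ort.InferenceSession.create(modelBinary);
  return {
    // Wrap session.run in a small predict() helper matching the handler above
    predict: async (inputData) => {
      const values = Float32Array.from(inputData.features || []);
      const tensor = new ort.Tensor('float32', values, [1, values.length]);
      const feeds = { [session.inputNames[0]]: tensor };
      const results = await session.run(feeds);
      // Return the first output tensor's values as a plain array
      return Array.from(results[session.outputNames[0]].data);
    }
  };
}

module.exports = { initializeModelFromBinary };
With this sketch, both initialization and prediction are asynchronous, so the handler would use model = await initializeModelFromBinary(modelBinary) and await model.predict(inputData).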
Use this Lambda-only pattern for lightweight models or sporadic traffic; it avoids the overhead of a SageMaker endpoint. However, for very large models or high-throughput inference, you may hit Lambda's limits on memory, ephemeral storage, and execution time. In those cases, a SageMaker endpoint is more appropriate.
Serverless ML on Azure (Functions With Azure ML and Blob Storage)
Azure offers analogous patterns with Azure Functions. You can use Functions with Azure Machine Learning endpoints, or load models from Blob Storage in a Function.
Azure Functions + Azure ML (Managed Endpoints)
Azure Machine Learning provides online endpoints that serve models via REST APIs. A common pattern is to use an Azure Function as the API layer. The Function receives an HTTP request, posts the data to the Azure ML endpoint, and returns the prediction.
Implementation: Deploy your model to an Azure ML online endpoint (note its REST URL and key). Create a Node.js Azure Function with an HTTP trigger. Example:
// Azure Function (HTTP trigger) in Node.js that calls an Azure ML endpoint
const axios = require('axios');

module.exports = async function (context, req) {
  const inputData = req.body || {};
  // URL and API key for the Azure ML online endpoint (set in app settings)
  const endpointUrl = process.env.AZURE_ML_ENDPOINT_URL;
  const apiKey = process.env.AZURE_ML_API_KEY; // if the endpoint uses key auth
  if (!endpointUrl) {
    context.res = { status: 500, body: "ML endpoint URL not configured" };
    return;
  }
  try {
    // Call the Azure ML endpoint with the input data
    const response = await axios.post(endpointUrl, inputData, {
      headers: { 'Authorization': `Bearer ${apiKey}` }
    });
    const prediction = response.data;
    context.res = {
      status: 200,
      body: { prediction } // return the prediction result as JSON
    };
  } catch (err) {
    context.log.error("Inference call failed:", err);
    context.res = { status: 502, body: "Inference request failed" };
  }
};
This Function posts the incoming JSON to the Azure ML endpoint and returns the result.
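As a hypothetical usage sketch, a client could call the Function over HTTP; the URL, function key, and payload shape below are placeholders, not values from the original article:
// Hypothetical client-side call; FUNCTION_URL and FUNCTION_KEY are placeholders.
const axios = require('axios');

const functionUrl = process.env.FUNCTION_URL; // e.g., https://my-func-app.azurewebsites.net/api/predict
const functionKey = process.env.FUNCTION_KEY; // function key if authLevel is "function"

async function callPredictFunction() {
  const response = await axios.post(`${functionUrl}?code=${functionKey}`, {
    features: [1.2, 3.4, 5.6] // example input; the shape depends on your model
  });
  console.log(response.data.prediction);
}

callPredictFunction().catch(console.error);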
Depending on the endpoint type and configuration, Azure ML endpoints can scale down when idle. Together with Azure Functions on the Consumption plan, this achieves a mostly serverless solution, though check your endpoint's minimum instance requirements.
Azure Functions + Blob Storage
Similar to AWS Lambda+S3, store the model file in Azure Blob Storage and load it in the Function. On cold start, the Function downloads and initializes the model, then serves inference. This avoids using Azure ML.
Implementation: Upload your model (e.g., my-model.onnx) to a Blob container. In the Function, use the Azure Blob SDK to download it. Example:
// Pseudocode: Azure Function that loads a model from Blob Storage
const { BlobServiceClient } = require('@azure/storage-blob');

let model; // cached across warm invocations

// Helper to read a Node.js readable stream into a single Buffer
function streamToBuffer(readableStream) {
  return new Promise((resolve, reject) => {
    const chunks = [];
    readableStream.on('data', (chunk) =>
      chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk)));
    readableStream.on('end', () => resolve(Buffer.concat(chunks)));
    readableStream.on('error', reject);
  });
}

module.exports = async function (context, req) {
  if (!model) {
    // On cold start, download the model from Blob Storage
    const blobClient = BlobServiceClient
      .fromConnectionString(process.env.AZURE_STORAGE_CONNECTION_STRING)
      .getContainerClient("models")
      .getBlobClient("my-model.onnx");
    context.log("Downloading model blob...");
    const downloadResponse = await blobClient.download(0);
    const modelData = await streamToBuffer(downloadResponse.readableStreamBody);
    model = loadModelFromBytes(modelData); // initialize model (e.g., ONNX Runtime)
  }
  // Use the loaded model to predict
  const input = req.body || {};
  const prediction = model.predict(input);
  context.res = { status: 200, body: { prediction } };
};
This Function downloads the blob on first run and initializes the model.
Azure Functions on the Consumption plan have limited memory and CPU, so very large models may load slowly or fail. To mitigate this, consider the Premium plan (more memory/CPU and pre-warmed instances) or packaging the model in a custom container.
Best Practices
- Minimize cold starts: Keep the deployment package small. Load models during initialization (outside the request handler) so they persist in memory. Use AWS Provisioned Concurrency or Azure Premium plans to keep functions warm if low latency is critical.
- Efficient model loading: Initialize the model once per container so it stays cached. For multi-model functions, load on demand but cache frequently used models. Use fast storage (EFS, Azure Files) to avoid repeated downloads.
- Right-size resources: ML inference can be CPU- and memory-intensive. Allocate sufficient memory (AWS Lambda scales CPU with memory). In Azure, choose a plan with enough memory/CPU. Monitor usage to avoid out-of-memory or long garbage collection delays.
- Optimized formats: Use inference-optimized formats (ONNX, TensorRT, TensorFlow Lite) and quantize or trim models to reduce size. Smaller, optimized models load and infer faster.
- Handle timeouts: Break large tasks into smaller functions or asynchronous steps. Serverless functions have maximum execution times; if inference might exceed them, consider batch processing or workflows (e.g., AWS Step Functions, Azure Durable Functions); a sketch follows after this list.
- Concurrency and cost: Functions scale out by default. Ensure downstream systems (databases, APIs) can handle the load. Monitor usage and cost — serverless is cost-effective for spiky traffic, but at constant high volume a dedicated server or hybrid approach might be cheaper.
- Logging and monitoring: Use AWS CloudWatch or Azure Application Insights. Track invocation counts, durations, memory usage, and errors. Use distributed tracing (AWS X-Ray, Azure Monitor) to diagnose pipeline bottlenecks.
- Security: Do not hard-code secrets. Use AWS Secrets Manager or Azure Key Vault (or environment variables) for keys. Assign the least-privilege IAM role or managed identity to your function. Validate and sanitize inputs to prevent abuse.
- Testing: Test thoroughly with realistic data and concurrent loads to expose cold-start and concurrency issues. Validate both warm and cold scenarios, preferably in staging or local emulations.
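As a minimal sketch of the workflow approach mentioned in the timeouts item above, a Durable Functions orchestrator (Node.js) could fan a large inference job out across smaller activity invocations; the activity name RunInferenceChunk and the input shape are assumptions for illustration:
// Hypothetical Durable Functions orchestrator: splits a large job into chunks and
// fans out to an activity function named "RunInferenceChunk" (an assumed name).
const df = require('durable-functions');

module.exports = df.orchestrator(function* (context) {
  const batches = context.df.getInput(); // e.g., an array of input chunks
  const tasks = batches.map((batch) => context.df.callActivity('RunInferenceChunk', batch));
  // Wait for all chunk inferences to complete, then return the combined results
  const results = yield context.df.Task.all(tasks);
  return results;
});
Each activity invocation then stays well within the per-function execution limit.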
Conclusion
Serverless ML enables running AI models on-demand at scale, without managing infrastructure. By using AWS Lambda or Azure Functions with managed ML endpoints or cloud storage, teams can deploy models quickly and scale them automatically. This article outlined how to implement serverless ML on AWS (Lambda with SageMaker or S3) and on Azure (Functions with Azure ML or Blob Storage), including Node.js examples and architectural guidance.
This approach suits unpredictable workloads and microservice architectures, offering cost and maintenance advantages. However, it requires careful design to handle model size, cold starts, and execution limits. Many architectures adopt a hybrid strategy: for example, using serverless functions for lightweight or intermittent inference and dedicated endpoints for heavy continuous loads. Cloud providers continue to improve their serverless offerings, making it increasingly feasible to run even sophisticated models in a serverless fashion.