How to deploy a machine learning model to AWS in 2026

Deploying a machine learning model to AWS has more right answers in 2026 than it did three years ago. The old split between SageMaker for serious ML workloads and Lambda for everything else has gotten messier as Lambda’s memory limits grew, container support matured, and ECS Fargate emerged as a real middle ground. The wrong choice doesn’t break anything immediately; it produces billing or latency surprises that show up around the time you scale past the prototype.

I’ve deployed ML models to AWS across all three main paths over the past year – a SageMaker endpoint for a real-time recommendation service, Lambda containers for a sporadic batch classification job, and ECS Fargate for a model with custom Python dependencies that didn’t fit Lambda’s constraints. They aren’t interchangeable. Each one wins on a specific axis – cost shape, deployment simplicity, control – and picking the wrong one means either fighting AWS or paying for capability you don’t use.

What follows is the working guide: the three main deployment paths, working code for each, and the decision framework for picking between them.

Quick answer: deploying ML models to AWS

For most production ML workloads with steady traffic, deploy to a SageMaker endpoint – it’s the AWS-native option with the best operational story for ML specifically. For sporadic or low-traffic inference, deploy as a Lambda container – cheapest per invocation, with cold starts you can avoid by paying for provisioned concurrency. For custom Python dependencies or non-standard inference patterns, use ECS Fargate with FastAPI – more control, comparable cost to SageMaker for steady traffic. SageMaker endpoints run on instances ranging from small CPU types such as ml.t2.medium to GPU types such as ml.g4dn.xlarge; Lambda allows up to 10GB memory and 15-minute timeouts; Fargate tasks start at 0.25 vCPU with 0.5GB of memory and scale up from there.

The three main AWS deployment paths

The realistic ML deployment options on AWS in 2026 collapse to three categories that, in my experience, cover the large majority of use cases.

SageMaker endpoints are the AWS-recommended path for production ML serving. The service handles model hosting, autoscaling, monitoring integration with CloudWatch, A/B testing through endpoint variants, and built-in support for major ML frameworks (PyTorch, TensorFlow, scikit-learn, XGBoost, Hugging Face). The trade-off is operational complexity and cost – real-time SageMaker endpoints are always-on, which means you pay for the instance whether traffic is hitting it or not.

Lambda with container images is the right path for sporadic, low-volume, or unpredictable-traffic ML workloads. Since 2020, Lambda has supported container images up to 10GB, which is enough room for most non-frontier models. You pay only for invocations, which makes this dramatically cheaper than SageMaker for workloads that aren’t steady. The trade-off is cold-start latency (mitigated with provisioned concurrency at extra cost) and the 15-minute execution limit per invocation.

ECS Fargate is the middle-ground option for ML workloads that don’t fit cleanly into either SageMaker or Lambda. Common reasons to land here: custom Python dependencies that conflict with SageMaker’s preset containers, models that need persistent state across requests, or workloads where you want container-level control without managing EC2 instances. Fargate runs serverless containers that scale on demand.

For most teams, the choice comes down to traffic shape. Steady production traffic → SageMaker. Sporadic or low traffic → Lambda. Anything else → Fargate.

Path 1: Deploy to SageMaker

SageMaker deployment via the Python SDK takes about 20 lines of code. The example below deploys a scikit-learn model:

import sagemaker
from sagemaker.sklearn import SKLearnModel

session = sagemaker.Session()
model_artifact = session.upload_data(
    'model.tar.gz',
    bucket='my-ml-models-bucket',
    key_prefix='models/recommender',
)

model = SKLearnModel(
    model_data=model_artifact,
    role='arn:aws:iam::123456789012:role/SageMakerExecutionRole',
    entry_point='inference.py',
    framework_version='1.2-1',
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type='ml.t2.medium',
    endpoint_name='recommender-endpoint',
)

result = predictor.predict([[1.0, 2.5, 3.7, 0.2]])

The inference.py file contains four conventional functions SageMaker calls: model_fn (load model from disk), input_fn (parse request), predict_fn (run inference), and output_fn (format response). For scikit-learn, defaults often work and you only need model_fn.
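
For orientation, here is a minimal sketch of what those four functions can look like. It is an illustration, not the container’s reference implementation: the file name model.pkl, the JSON request shape with a "features" key, and the use of pickle are all assumptions. Note that it accepts JSON, whereas the SDK predictor in the example above sends NumPy-serialized data by default, so if you call the endpoint through predictor.predict you would either keep only model_fn or configure a JSON serializer.

```python
# inference.py - hypothetical handlers for the SageMaker scikit-learn container
import json
import os
import pickle


def model_fn(model_dir):
    # SageMaker unpacks model.tar.gz into model_dir before calling this
    with open(os.path.join(model_dir, "model.pkl"), "rb") as f:
        return pickle.load(f)


def input_fn(request_body, content_type):
    # Assumed request shape: {"features": [[...], [...]]}
    if content_type != "application/json":
        raise ValueError(f"unsupported content type: {content_type}")
    if isinstance(request_body, (bytes, bytearray)):
        request_body = request_body.decode("utf-8")
    return json.loads(request_body)["features"]


def predict_fn(input_data, model):
    # input_data is whatever input_fn returned; model is what model_fn returned
    return model.predict(input_data)


def output_fn(prediction, accept):
    # NumPy arrays are not JSON-serializable, so convert them when present
    values = prediction.tolist() if hasattr(prediction, "tolist") else list(prediction)
    return json.dumps({"prediction": values})
```

Keeping each function this small makes them easy to unit test locally with a stub model before paying for an endpoint.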

Cost runs $0.05-0.40 per hour for the cheapest instance types (ml.t2.medium, ml.m5.large) and scales up to several dollars per hour for GPU instances. The instance runs continuously – a $0.10/hour endpoint costs about $72/month (720 hours) even if traffic is minimal. SageMaker supports autoscaling based on invocations per instance, with a multi-knob configuration that takes some tuning.
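
To make the autoscaling knobs concrete, here is a hedged sketch of a target-tracking setup through Application Auto Scaling. The capacity bounds, the target of 100 invocations per instance, the cooldowns, and the policy name are placeholder assumptions to tune for your workload; "AllTraffic" is the variant name the SDK assigns when you don’t specify one. The builder only assembles request dictionaries, so it runs without AWS credentials.

```python
def build_autoscaling_requests(endpoint_name, variant_name="AllTraffic",
                               min_instances=1, max_instances=4,
                               target_invocations_per_instance=100.0):
    """Build kwargs for register_scalable_target and put_scaling_policy."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    target = {**common, "MinCapacity": min_instances, "MaxCapacity": max_instances}
    policy = {
        **common,
        "PolicyName": f"{endpoint_name}-invocations-target",  # hypothetical name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Average invocations per instance per minute to hold steady
            "TargetValue": target_invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
            "ScaleOutCooldown": 60,   # seconds; assumed starting point
            "ScaleInCooldown": 300,   # seconds; assumed starting point
        },
    }
    return target, policy


def apply_autoscaling(target, policy):
    import boto3  # imported lazily so the builder above works offline

    client = boto3.client("application-autoscaling")
    client.register_scalable_target(**target)
    client.put_scaling_policy(**policy)


target, policy = build_autoscaling_requests("recommender-endpoint")
```

Calling apply_autoscaling(target, policy) would register the policy against the endpoint from the deployment example.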

Path 2: Deploy as a Lambda container

Lambda with container images is the right pick when traffic is sporadic enough that paying for a continuously-running endpoint doesn’t pencil out. The deployment requires a container image with your model and inference code, pushed to ECR (Elastic Container Registry), then configured as a Lambda function.

The Dockerfile:

FROM public.ecr.aws/lambda/python:3.11

# Install dependencies first so this layer stays cached when only the model changes
RUN pip install --no-cache-dir scikit-learn==1.4.0 numpy

COPY model.pkl ${LAMBDA_TASK_ROOT}
COPY app.py ${LAMBDA_TASK_ROOT}

CMD ["app.lambda_handler"]

The Lambda handler:

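# app.py
import json
import pickle
import numpy as np

# Loaded once on cold start, reused across warm invocations
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

def lambda_handler(event, context):
    # API Gateway passes "body": null for empty requests, so guard against None
    try:
        body = json.loads(event.get('body') or '{}')
        features = np.array([body['features']])
    except (json.JSONDecodeError, KeyError, TypeError):
        return {
            'statusCode': 400,
            'body': json.dumps({'error': 'expected a JSON body with a "features" list'}),
        }

    prediction = model.predict(features).tolist()

    return {
        'statusCode': 200,
        'body': json.dumps({'prediction': prediction}),
    }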

Build, push to ECR, create the Lambda function pointing at the image, pair with API Gateway for HTTP access.
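
The function-creation step can be sketched with boto3 as follows. The image URI, role ARN, function name, and the 2048MB / 30-second settings are hypothetical placeholders; the range checks mirror Lambda’s 10GB memory and 15-minute timeout ceilings. The builder is kept separate from the AWS call so it can be checked without credentials.

```python
def build_create_function_kwargs(function_name, image_uri, role_arn,
                                 memory_mb=2048, timeout_s=30):
    """Assemble kwargs for lambda_client.create_function with a container image."""
    if not 128 <= memory_mb <= 10240:
        raise ValueError("Lambda memory must be between 128 and 10240 MB")
    if not 1 <= timeout_s <= 900:
        raise ValueError("Lambda timeout must be between 1 and 900 seconds")
    return {
        "FunctionName": function_name,
        "PackageType": "Image",           # container image rather than a zip package
        "Code": {"ImageUri": image_uri},  # image must already be pushed to ECR
        "Role": role_arn,
        "MemorySize": memory_mb,
        "Timeout": timeout_s,
    }


def create_function(kwargs):
    import boto3  # imported lazily so the builder above works offline

    return boto3.client("lambda").create_function(**kwargs)


# Hypothetical identifiers, for illustration only
kwargs = build_create_function_kwargs(
    function_name="ml-inference",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/ml-inference:latest",
    role_arn="arn:aws:iam::123456789012:role/LambdaExecutionRole",
)
```

Memory is worth setting generously for ML workloads because Lambda allocates CPU in proportion to it.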

The cost math is dramatic for sporadic workloads. A Lambda function handling 10,000 inferences per month at 500ms each comes to under $0.10/month at list price with 1GB of memory (about 5,000 GB-seconds), and the monthly free tier covers all of it. The same workload on a SageMaker endpoint costs about $72/month for a $0.10/hour always-on instance.
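
The arithmetic behind that comparison is easy to reproduce. The sketch below assumes us-east-1 x86 list prices as I understand them at the time of writing ($0.0000166667 per GB-second, $0.20 per million requests, a free tier of 400,000 GB-seconds and 1 million requests per month) and a 1GB memory setting; treat all of these as inputs to verify against the current pricing page.

```python
# Assumed list prices; verify against the current AWS pricing page
PRICE_PER_GB_SECOND = 0.0000166667
PRICE_PER_REQUEST = 0.20 / 1_000_000
FREE_GB_SECONDS = 400_000
FREE_REQUESTS = 1_000_000


def lambda_monthly_cost(invocations, seconds_each, memory_gb, free_tier=True):
    """Estimate the monthly Lambda bill for on-demand invocations."""
    gb_seconds = invocations * seconds_each * memory_gb
    requests = invocations
    if free_tier:
        gb_seconds = max(0.0, gb_seconds - FREE_GB_SECONDS)
        requests = max(0, requests - FREE_REQUESTS)
    return gb_seconds * PRICE_PER_GB_SECOND + requests * PRICE_PER_REQUEST


def breakeven_invocations(always_on_monthly, seconds_each, memory_gb):
    """Monthly invocations at which Lambda (no free tier) matches an always-on bill."""
    per_invocation = seconds_each * memory_gb * PRICE_PER_GB_SECOND + PRICE_PER_REQUEST
    return always_on_monthly / per_invocation


list_price = lambda_monthly_cost(10_000, 0.5, 1.0, free_tier=False)
with_free_tier = lambda_monthly_cost(10_000, 0.5, 1.0)
crossover = breakeven_invocations(72.0, 0.5, 1.0)

print(f"list price: ${list_price:.2f}/month")
print(f"with free tier: ${with_free_tier:.2f}/month")
print(f"break-even vs $72/month: {crossover:,.0f} invocations/month")
```

Under these assumptions the crossover lands in the millions of invocations per month, which is why sporadic workloads so rarely justify an always-on instance on price alone.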

Cold starts are the main trade-off. Lambda containers cold-start in 2-5 seconds for typical ML models. Provisioned concurrency eliminates cold starts but adds cost – usually still cheaper than SageMaker for sporadic traffic.
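
A rough way to sanity-check the provisioned concurrency claim is to price the keep-warm charge on its own. The rate below ($0.0000041667 per GB-second in us-east-1 on x86) is my assumption of the list price and should be verified; invocation duration is billed on top of it at a reduced rate, which this sketch leaves out.

```python
# Assumed keep-warm rate for provisioned concurrency; verify before relying on it
PROVISIONED_PRICE_PER_GB_SECOND = 0.0000041667
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 30-day month


def provisioned_concurrency_monthly(memory_gb, instances=1,
                                    seconds=SECONDS_PER_MONTH):
    """Cost of keeping `instances` execution environments warm all month."""
    return memory_gb * instances * seconds * PROVISIONED_PRICE_PER_GB_SECOND


keep_warm = provisioned_concurrency_monthly(memory_gb=1.0)
print(f"one warm 1GB environment: ${keep_warm:.2f}/month")
```

At that rate a single warm 1GB environment costs far less than a $72/month endpoint, but the gap narrows quickly as memory size and the number of warm environments grow.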

Path 3: Deploy on ECS Fargate

ECS Fargate is the right pick when you need more control than Lambda offers but don’t want SageMaker’s always-on cost or its constraints on dependencies. Common cases: custom Python packages that don’t fit within Lambda’s 10GB container image limit, models needing persistent in-memory state, or inference patterns that benefit from longer-running containers.

The deployment is a containerized FastAPI app. The Dockerfile:

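FROM python:3.11-slim
WORKDIR /app
# Install dependencies before copying code so the layer is cached across rebuilds
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY model.pkl app.py ./
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]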

The FastAPI handler is standard – a /predict endpoint that loads the model on startup, accepts JSON input, and returns predictions. Push the image to ECR, create an ECS task definition pointing at it, run it on a Fargate cluster behind an Application Load Balancer.

Fargate cost falls between SageMaker and Lambda. A small Fargate task (0.25 vCPU, 0.5GB memory) runs about $9-10/month in compute if continuously available; the Application Load Balancer in front of it is billed separately. Autoscaling brings the compute cost down for variable traffic. The flexibility is real – any Python dependency, any model framework, any inference pattern – at the cost of more setup than SageMaker requires.
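
The $9-10 figure can be reproduced from per-resource rates. The prices below ($0.04048 per vCPU-hour and $0.004445 per GB-hour, us-east-1 Linux on x86) are assumptions from the list prices as I understand them and should be checked for your region; the estimate covers task compute only.

```python
# Assumed Fargate list prices; verify for your region and architecture
PRICE_PER_VCPU_HOUR = 0.04048
PRICE_PER_GB_HOUR = 0.004445
HOURS_PER_MONTH = 730  # average month


def fargate_monthly_cost(vcpu, memory_gb, tasks=1, hours=HOURS_PER_MONTH):
    """Compute-only estimate for always-on Fargate tasks."""
    hourly = vcpu * PRICE_PER_VCPU_HOUR + memory_gb * PRICE_PER_GB_HOUR
    return hourly * hours * tasks


small_task = fargate_monthly_cost(vcpu=0.25, memory_gb=0.5)
print(f"0.25 vCPU / 0.5GB task: ${small_task:.2f}/month")
```

The same function makes it easy to see how quickly the bill grows once a model needs a full vCPU and several gigabytes of memory.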

When to use which AWS deployment path

The choice maps to traffic shape and operational constraints.

Pick SageMaker for steady production ML traffic where you want the AWS-native operational story. Built-in autoscaling, A/B testing through endpoint variants, deep CloudWatch integration, and managed model framework support all matter when ML is core to your product. The always-on cost is justified by the operational maturity.

Pick Lambda containers for sporadic, low-volume, or unpredictable-traffic inference. The cost savings over SageMaker are dramatic when workloads aren’t steady – sometimes 100x lower bills. Accept cold starts (or pay for provisioned concurrency to eliminate them) and the 15-minute execution limit.

Pick ECS Fargate when neither SageMaker nor Lambda fits. Custom Python dependencies, persistent in-memory state, longer-running inference, or the desire for container-level control without managing EC2 instances all push toward Fargate. The middle-ground cost works for steady traffic when SageMaker is overkill but Lambda is too constrained.

The compressed version: how steady is your traffic? Steady and high points at SageMaker. Sporadic points at Lambda. Anything that doesn’t fit either points at Fargate.
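
The framework can be written down as a small function, which is a handy way to make a team’s assumptions explicit. This is a simplified sketch: the Workload fields and their defaults are hypothetical, and the only hard limits encoded are Lambda’s 10GB image size and 15-minute execution time.

```python
from dataclasses import dataclass

LAMBDA_MAX_IMAGE_GB = 10
LAMBDA_MAX_SECONDS = 15 * 60


@dataclass
class Workload:
    steady_traffic: bool
    image_size_gb: float = 1.0
    max_request_seconds: float = 1.0
    needs_persistent_state: bool = False
    conflicts_with_sagemaker_containers: bool = False


def choose_path(w: Workload) -> str:
    """Map a workload description to 'sagemaker', 'lambda', or 'fargate'."""
    # State that must survive across requests rules out the managed paths
    if w.needs_persistent_state:
        return "fargate"
    if w.steady_traffic:
        # Steady traffic favors SageMaker unless its containers get in the way
        return "fargate" if w.conflicts_with_sagemaker_containers else "sagemaker"
    # Sporadic traffic favors Lambda as long as the workload fits its limits
    fits_lambda = (w.image_size_gb <= LAMBDA_MAX_IMAGE_GB
                   and w.max_request_seconds <= LAMBDA_MAX_SECONDS)
    return "lambda" if fits_lambda else "fargate"


print(choose_path(Workload(steady_traffic=False)))
```

Real decisions involve more inputs than this (latency budgets, team familiarity, GPU needs), so treat the function as a starting checklist rather than a verdict.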

Common gotchas when deploying ML models to AWS

A few production-relevant pitfalls show up consistently.

SageMaker model packaging requires model.tar.gz, not raw files. The SDK uploads whatever you give it, but the inference container expects the tarball structure. Run tar -czvf model.tar.gz model.pkl inference.py before uploading.
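
If you would rather build the archive from Python, for example inside a training script, the standard library’s tarfile module does the same job. This is a small sketch; the helper names are my own, and it places every file at the archive root regardless of where it lives on disk.

```python
import tarfile
from pathlib import Path


def package_model(files, output="model.tar.gz"):
    """Write the given files into a gzip-compressed tarball at the archive root."""
    with tarfile.open(output, "w:gz") as tar:
        for f in files:
            path = Path(f)
            # arcname drops parent directories so files sit at the top level
            tar.add(path, arcname=path.name)
    return output


def list_archive(path):
    """Return the sorted member names, useful as a pre-upload sanity check."""
    with tarfile.open(path, "r:gz") as tar:
        return sorted(tar.getnames())
```

Calling list_archive on the result before uploading catches the common mistake of nesting the model inside an extra directory.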

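Lambda container size affects cold-start time. Larger images generally take longer to start, so a 2GB image will usually cold-start faster than a 5GB one, though not strictly in proportion to size. Trim unused dependencies aggressively – ML containers often have 500MB+ of accidental bloat.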

Fargate autoscaling needs time to react. Scaling up takes 1-3 minutes from traffic spike to new tasks serving requests. For sharp spikes, keep baseline capacity warm or use SageMaker.

GPU instance availability varies by region. If you need GPU inference, check your target region has the instance type. Some regions run out of GPU capacity periodically.

FAQ

How do I deploy a machine learning model to AWS?

To deploy a machine learning model to AWS, pick a deployment path based on your traffic shape. For steady production traffic, use SageMaker endpoints with the SageMaker Python SDK – the process is to upload model artifacts to S3, create a SageMaker model object, and call .deploy() to create an endpoint. For sporadic traffic, package your model in a container image, push it to ECR, and run it as a Lambda function. For custom dependencies, use ECS Fargate with FastAPI. Each path takes about 30-60 lines of code for the basic deployment.

Should I use SageMaker or Lambda to deploy a model?

Use SageMaker for steady production ML traffic where the always-on instance cost is justified by AWS-native operational features (autoscaling, A/B testing, CloudWatch integration, framework support). Use Lambda for sporadic or unpredictable traffic where Lambda’s per-invocation pricing is dramatically cheaper than SageMaker’s continuous runtime. The decision rule: if your endpoint serves more than a few requests per minute consistently, SageMaker. If it serves bursts of traffic with idle periods, Lambda. On price alone the crossover sits higher than that rule suggests: at list prices, a 500ms function with 1GB of memory stays cheaper than a $72/month instance until sustained traffic reaches a few requests per second, so below that level the case for SageMaker rests on its operational features. The cost difference at scale can be 10-100x in either direction depending on which fits your workload.

What’s the cheapest way to deploy an ML model to AWS?

The cheapest way to deploy an ML model to AWS depends on traffic patterns. For sporadic inference, Lambda containers cost almost nothing – 10,000 invocations per month often falls within the free tier. For steady but small workloads, ECS Fargate with a small task (0.25 vCPU, 0.5GB memory) runs about $9-10/month. For steady high-volume workloads, SageMaker is most cost-efficient because it’s purpose-built for ML serving. Avoid leaving SageMaker endpoints running idle – the always-on cost makes them expensive for unused capacity.

Can I deploy a PyTorch model on AWS Lambda?

Yes, you can deploy a PyTorch model on AWS Lambda using container images. The Lambda container image can include PyTorch and your trained model up to 10GB total uncompressed size, which fits most non-frontier PyTorch models. Use the AWS Lambda Python base image, install PyTorch, copy your model file in, and define a lambda_handler function. Cold starts are typically 5-15 seconds for PyTorch models due to PyTorch’s import time – provisioned concurrency eliminates this if predictable latency matters. For models above 10GB, ECS Fargate or SageMaker is the better path. For GPU inference, use SageMaker: neither Lambda nor Fargate offers GPUs.

How do I serve ML predictions in real-time on AWS?

To serve ML predictions in real-time on AWS, deploy to a SageMaker endpoint for steady traffic, or to ECS Fargate with FastAPI for custom inference patterns. Avoid Lambda for hard real-time requirements unless you use provisioned concurrency, because cold starts add unpredictable latency. SageMaker provides managed autoscaling and monitoring, though the scaling policy still needs tuning. For the tightest latency budgets, consider AWS Inferentia instances on SageMaker – they’re purpose-built for ML inference and can lower latency and cost per inference for models that compile to them.

If you’ve deployed an ML model to AWS in production and have honest numbers on what changed when you picked one path over another – cost, latency, operational friction – that writeup is the gap worth filling. AWS documentation covers the basic setup well; what’s scarce is real engineering reports on whether the path you picked was the right one in hindsight.