Cold starts remain one of the most frustrating challenges in serverless computing. After migrating our image processing API to AWS Lambda, we faced 3.2-second cold start latencies that were killing our user experience. Users expect sub-second responses, not a loading spinner.
Over the past three months, I systematically optimized our Lambda functions using a combination of AWS features and architectural improvements. The result? Cold starts dropped from 3.2s to 950ms—a 70% improvement—while reducing costs by 30%.
This isn't theory. These are production numbers from an API handling 50,000+ requests per day. Here's exactly how we did it.
The Problem: What Causes Cold Starts?
Lambda cold starts happen when AWS needs to:
- Download your deployment package from S3 (affected by package size)
- Start a new execution environment (affected by runtime and memory)
- Initialize the runtime (Python, Node.js, Java, etc.)
- Run your function's initialization code (imports, connections, etc.)
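To see where your own function spends its startup time, here's a minimal sketch for a Python runtime (the timing fields and log format are our own, not anything Lambda emits):

import json
import time

# Module scope runs once per execution environment, during the init phase
# of a cold start; time the heavy imports to see their share of it.
_INIT_START = time.perf_counter()

import boto3  # heavy imports count toward init time

_INIT_DURATION = time.perf_counter() - _INIT_START
_IS_COLD = True  # flips to False after the first invocation in this environment

def handler(event, context):
    global _IS_COLD
    invoke_start = time.perf_counter()

    # ... business logic here ...

    print(json.dumps({
        'cold_start': _IS_COLD,
        'init_seconds': round(_INIT_DURATION, 3),
        'handler_seconds': round(time.perf_counter() - invoke_start, 3),
    }))
    _IS_COLD = False
    return {'status': 'ok'}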
Our Baseline Performance
Our initial Lambda function had these issues:
- 128MB deployment package (including unnecessary dependencies)
- Running on x86_64 architecture
- No provisioned concurrency
- Heavy initialization (database connections, SDK clients)
- Allocated memory: 1024MB (too low for our workload)
Optimization #1: Lambda SnapStart (Java Only)
Result: Reduced initialization time by up to 90% for our Java-based authentication service
AWS Lambda SnapStart is a game-changer for Java runtimes. It takes a snapshot of your initialized function and caches it, eliminating most of the initialization overhead.
resource "aws_lambda_function" "auth_service" {
function_name = "user-authentication"
runtime = "java17"
handler = "com.techbytes.AuthHandler"
# Enable SnapStart
snap_start {
apply_on = "PublishedVersions"
}
# Required: Publish version for SnapStart
publish = true
# Other configuration...
memory_size = 2048
timeout = 30
}
Important considerations:
- Only works with the Java 11 and Java 17 runtimes
- Requires publish = true; the snapshot applies to published versions, not $LATEST (see the invoke sketch below)
- First invocation after a snapshot restore is still fast (~200ms)
- No additional cost for SnapStart itself
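Because the snapshot only applies to published versions, make sure clients invoke a version or an alias rather than $LATEST. A minimal boto3 sketch (the qualifier value is a placeholder for whatever version your deploy published):

import json
import boto3

lambda_client = boto3.client('lambda')

# Invoke the published version so the SnapStart snapshot is used;
# calling $LATEST would still pay the full init cost.
response = lambda_client.invoke(
    FunctionName='user-authentication',
    Qualifier='1',  # placeholder: the version number from your deploy
    Payload=json.dumps({'action': 'login'}),
)
print(response['StatusCode'])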
Performance Impact
- Before SnapStart: 2.8s
- After SnapStart: 280ms
Optimization #2: Provisioned Concurrency
Result: Eliminated cold starts during peak hours, but increased costs by $45/month
Provisioned concurrency keeps Lambda execution environments warm and ready to respond immediately. Think of it as paying for "standby capacity."
resource "aws_lambda_function" "image_processor" {
function_name = "image-processor"
runtime = "python3.11"
handler = "app.handler"
memory_size = 3008 # More on this later
timeout = 60
# Publish version (required for provisioned concurrency)
publish = true
}
# Create alias pointing to latest version
resource "aws_lambda_alias" "prod" {
name = "prod"
function_name = aws_lambda_function.image_processor.function_name
function_version = aws_lambda_function.image_processor.version
}
# Provisioned concurrency on the alias
resource "aws_lambda_provisioned_concurrency_config" "prod" {
function_name = aws_lambda_function.image_processor.function_name
provisioned_concurrent_executions = 5
qualifier = aws_lambda_alias.prod.name
}
# Auto scaling target
resource "aws_appautoscaling_target" "lambda_target" {
max_capacity = 20
min_capacity = 5
resource_id = "function:${aws_lambda_function.image_processor.function_name}:${aws_lambda_alias.prod.name}"
scalable_dimension = "lambda:function:ProvisionedConcurrentExecutions"
service_namespace = "lambda"
}
# Auto scaling policy (target tracking)
resource "aws_appautoscaling_policy" "lambda_policy" {
name = "lambda-scaling-policy"
policy_type = "TargetTrackingScaling"
resource_id = aws_appautoscaling_target.lambda_target.resource_id
scalable_dimension = aws_appautoscaling_target.lambda_target.scalable_dimension
service_namespace = aws_appautoscaling_target.lambda_target.service_namespace
target_tracking_scaling_policy_configuration {
predefined_metric_specification {
predefined_metric_type = "LambdaProvisionedConcurrencyUtilization"
}
target_value = 0.7 # Scale when 70% utilized
}
}
Cost Optimization Tip
Use scheduled scaling to reduce provisioned concurrency during low-traffic hours (nights, weekends). We dropped from 5 to 1 instance during off-peak, saving $30/month.
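Here's a minimal sketch of that schedule using boto3 and the Application Auto Scaling API, assuming the scalable target registered above; the action names and cron expressions are placeholders for our off-peak window:

import boto3

autoscaling = boto3.client('application-autoscaling')

RESOURCE_ID = 'function:image-processor:prod'  # function:<name>:<alias>
DIMENSION = 'lambda:function:ProvisionedConcurrentExecutions'

# Drop to a single warm environment overnight (times in UTC).
autoscaling.put_scheduled_action(
    ServiceNamespace='lambda',
    ScheduledActionName='scale-down-overnight',
    ResourceId=RESOURCE_ID,
    ScalableDimension=DIMENSION,
    Schedule='cron(0 22 * * ? *)',
    ScalableTargetAction={'MinCapacity': 1, 'MaxCapacity': 1},
)

# Restore peak capacity each morning.
autoscaling.put_scheduled_action(
    ServiceNamespace='lambda',
    ScheduledActionName='scale-up-morning',
    ResourceId=RESOURCE_ID,
    ScalableDimension=DIMENSION,
    Schedule='cron(0 6 * * ? *)',
    ScalableTargetAction={'MinCapacity': 5, 'MaxCapacity': 20},
)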
When to Use Provisioned Concurrency
It makes the most sense for latency-sensitive, user-facing APIs with predictable peak traffic. For asynchronous or batch workloads that can tolerate an occasional cold start, the extra cost usually isn't worth it.
Optimization #3: Switch to ARM64 (Graviton2)
Result: 20% faster execution and 20% cost reduction compared to x86_64
AWS Graviton2 processors (ARM64 architecture) deliver better performance and cost efficiency for Lambda functions. The migration is usually straightforward.
resource "aws_lambda_function" "api_handler" {
function_name = "api-handler"
runtime = "python3.11"
# Simply change architecture
architectures = ["arm64"] # Was ["x86_64"]
# No other changes needed (for pure Python)
handler = "app.handler"
memory_size = 1024
timeout = 30
}
Migration checklist:
- Rebuild any native dependencies (compiled wheels, C extensions) for ARM64, e.g. with Docker's --platform linux/arm64 flag as shown below
- Republish Lambda layers with arm64 as a compatible architecture
# Use Docker to build ARM64 dependencies
# (override the image entrypoint so pip runs directly)
docker run --platform linux/arm64 --entrypoint "" \
  -v "$PWD":/var/task \
  public.ecr.aws/lambda/python:3.11 \
  pip install -r requirements.txt -t python/

# Package the layer
zip -r lambda-layer-arm64.zip python/

# Upload as a Lambda layer
aws lambda publish-layer-version \
  --layer-name my-dependencies-arm64 \
  --compatible-runtimes python3.11 \
  --compatible-architectures arm64 \
  --zip-file fileb://lambda-layer-arm64.zip
Optimization #4: Memory Allocation (Counter-Intuitive)
Surprising finding: Increasing memory from 1024MB to 3008MB reduced both latency AND cost
Lambda allocates CPU proportionally to memory. More memory = more CPU = faster execution. Sometimes paying for more memory actually costs less due to reduced execution time.
# Install Lambda Power Tuning (open-source tool)
git clone https://github.com/alexcasalboni/aws-lambda-power-tuning
cd aws-lambda-power-tuning
sam deploy --guided

# Run power tuning
aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:us-east-1:123456789:stateMachine:powerTuningStateMachine \
  --input '{
    "lambdaARN": "arn:aws:lambda:us-east-1:123456789:function:my-function",
    "powerValues": [128, 256, 512, 1024, 1536, 2048, 3008],
    "num": 50,
    "payload": "{\"test\": \"data\"}"
  }'

# Results show the cost vs performance trade-off for each memory size
Our Results: Image Processing Lambda
| Memory | Execution Time | Cost (1M invocations) | Verdict |
|---|---|---|---|
| 1024 MB | 450ms | $9.45 | ❌ Too slow |
| 1536 MB | 320ms | $7.89 | ⚠️ Better |
| 3008 MB | 180ms | $7.20 | ✅ OPTIMAL |
Key insight: 3008MB executes 2.5x faster and costs 24% less than 1024MB due to proportional CPU allocation.
Optimization #5: Code-Level Improvements
Beyond AWS configuration, optimizing your function code dramatically reduces cold start times.
5.1 Lazy Loading Dependencies
# ❌ Before: ALL imports happen on cold start (slow!)
import boto3
import requests
import pandas as pd
import numpy as np
from PIL import Image
import tensorflow as tf

def handler(event, context):
    # Most of these imports aren't used every time
    if event['action'] == 'simple':
        return {'status': 'ok'}
    # Heavy libraries loaded unnecessarily

# ✅ After: only import what's always needed
import json

# Global variable for caching the loaded model
_tf_model = None

def handler(event, context):
    action = event.get('action')
    if action == 'simple':
        return {'status': 'ok'}  # Fast path
    elif action == 'ml_inference':
        # Import heavy libraries only when needed
        global _tf_model
        if _tf_model is None:
            import tensorflow as tf
            _tf_model = tf.keras.models.load_model('/tmp/model')
        # Use the cached model
        result = _tf_model.predict(event['data'])
        return {'prediction': result}
5.2 Reuse Connections (Critical!)
import os

import boto3
import pymysql

# ✅ Initialize clients OUTSIDE the handler (global scope)
# These are reused across warm invocations
s3_client = boto3.client('s3')
dynamodb = boto3.resource('dynamodb')

# Database connection pooling
db_connection = None

def get_db_connection():
    global db_connection
    if db_connection is None or not db_connection.open:
        db_connection = pymysql.connect(
            host=os.environ['DB_HOST'],
            user=os.environ['DB_USER'],
            password=os.environ['DB_PASSWORD'],
            database=os.environ['DB_NAME'],
            connect_timeout=5
        )
    return db_connection

def handler(event, context):
    # Reuse existing connections (fast on warm starts)
    conn = get_db_connection()
    cursor = conn.cursor()

    # Your business logic here
    cursor.execute("SELECT * FROM users WHERE id = %s", (event['user_id'],))
    user = cursor.fetchone()

    return {'user': user}
Common Mistake
Initializing clients inside the handler function means you recreate connections on EVERY invocation, negating the warm execution performance benefit.
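For contrast, a minimal sketch of that anti-pattern, using the same libraries and environment variables as the example above:

import os

import boto3
import pymysql

def handler(event, context):
    # ❌ A new S3 client and DB connection are created on EVERY invocation,
    # even when the execution environment is already warm
    s3_client = boto3.client('s3')
    db_connection = pymysql.connect(
        host=os.environ['DB_HOST'],
        user=os.environ['DB_USER'],
        password=os.environ['DB_PASSWORD'],
        database=os.environ['DB_NAME'],
        connect_timeout=5
    )

    cursor = db_connection.cursor()
    cursor.execute("SELECT * FROM users WHERE id = %s", (event['user_id'],))
    user = cursor.fetchone()
    db_connection.close()

    return {'user': user}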
5.3 Reduce Deployment Package Size
# Remove unnecessary files from packages
find . -type d -name "__pycache__" -exec rm -r {} +
find . -type d -name "*.dist-info" -exec rm -r {} +
find . -type d -name "tests" -exec rm -r {} +
# Remove .pyc files
find . -name "*.pyc" -delete
# Strip binaries (for native dependencies)
find . -name "*.so" -exec strip {} +
# Result: Reduced our package from 128MB to 45MB
Final Results: Production Metrics
Before vs After Optimization
- ❌ Before: 3.2s cold starts
- ✅ After: 950ms cold starts (70% improvement)
Business Impact
- User satisfaction up 35% (measured via NPS surveys)
- API timeout errors dropped 90% (from 8% to 0.8%)
- Cost savings: $115/month ($1,380 annually)
- Eliminated need for warm-up scripts (saved 200 lines of hacky code)
Key Takeaways
1. ARM64 is a no-brainer for most workloads
20% faster, 20% cheaper, minimal migration effort. Start here.
2. More memory often = lower cost
Use Lambda Power Tuning to find the sweet spot. We found 3008MB was optimal despite seeming expensive.
3. SnapStart is magic (but Java-only)
If you're using Java, enable SnapStart immediately. It's free performance.
4. Provisioned concurrency costs money but eliminates cold starts
Use it selectively for user-facing APIs during peak hours with auto-scaling.
5. Code optimization matters just as much as infrastructure
Lazy loading, connection reuse, and package size reduction are free wins.