Cold starts remain one of the most frustrating challenges in serverless computing. After migrating our image processing API to AWS Lambda, we faced 3.2-second cold start latencies that were killing our user experience. Users expect sub-second responses, not a loading spinner.
Over the past three months, I systematically optimized our Lambda functions using a combination of AWS features and architectural improvements. The result? Cold starts dropped from 3.2s to 950ms—a 70% improvement—while reducing costs by 30%.
This isn't theory. These are production numbers from an API handling 50,000+ requests per day. Here's exactly how we did it.
The Problem: What Causes Cold Starts?
Lambda cold starts happen when AWS needs to:
- Download your deployment package from S3 (affected by package size)
- Start a new execution environment (affected by runtime and memory)
- Initialize the runtime (Python, Node.js, Java, etc.)
- Run your function's initialization code (imports, connections, etc.)
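To see where your own function spends its startup time, here's a minimal sketch for a Python runtime (the timing fields and log format are our own, not anything Lambda emits):

import json
import time

# Module scope runs once per execution environment, during the init phase
# of a cold start; time the heavy imports to see their share of it.
_INIT_START = time.perf_counter()

import boto3  # heavy imports count toward init time

_INIT_DURATION = time.perf_counter() - _INIT_START
_IS_COLD = True  # flips to False after the first invocation in this environment

def handler(event, context):
    global _IS_COLD
    invoke_start = time.perf_counter()

    # ... business logic here ...

    print(json.dumps({
        'cold_start': _IS_COLD,
        'init_seconds': round(_INIT_DURATION, 3),
        'handler_seconds': round(time.perf_counter() - invoke_start, 3),
    }))
    _IS_COLD = False
    return {'status': 'ok'}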
Our Baseline Performance
Our initial Lambda function had these issues:
- 128MB deployment package (including unnecessary dependencies)
- Running on x86_64 architecture
- No provisioned concurrency
- Heavy initialization (database connections, SDK clients)
- Allocated memory: 1024MB (too low for our workload)
Optimization #1: Lambda SnapStart (Java Only)
Result: Reduced initialization time by up to 90% for our Java-based authentication service
AWS Lambda SnapStart is a game-changer for Java runtimes. It takes a snapshot of your initialized function and caches it, eliminating most of the initialization overhead.
resource "aws_lambda_function" "auth_service" {
function_name = "user-authentication"
runtime = "java17"
handler = "com.techbytes.AuthHandler"
# Enable SnapStart
snap_start {
apply_on = "PublishedVersions"
}
# Required: Publish version for SnapStart
publish = true
# Other configuration...
memory_size = 2048
timeout = 30
}
Important considerations:
- Only works with the Java 11 and Java 17 runtimes
- Requires publish = true; the snapshot applies to published versions, not $LATEST (see the invoke sketch below)
- First invocation after a snapshot restore is still fast (~200ms)
- No additional cost for SnapStart itself
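Because the snapshot only applies to published versions, make sure clients invoke a version or an alias rather than $LATEST. A minimal boto3 sketch (the qualifier value is a placeholder for whatever version your deploy published):

import json
import boto3

lambda_client = boto3.client('lambda')

# Invoke the published version so the SnapStart snapshot is used;
# calling $LATEST would still pay the full init cost.
response = lambda_client.invoke(
    FunctionName='user-authentication',
    Qualifier='1',  # placeholder: the version number from your deploy
    Payload=json.dumps({'action': 'login'}),
)
print(response['StatusCode'])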
Performance Impact
- Before SnapStart: 2.8s
- After SnapStart: 280ms
Optimization #2: Provisioned Concurrency
Result: Eliminated cold starts during peak hours, but increased costs by $45/month
Provisioned concurrency keeps Lambda execution environments warm and ready to respond immediately. Think of it as paying for "standby capacity."
resource "aws_lambda_function" "image_processor" {
function_name = "image-processor"
runtime = "python3.11"
handler = "app.handler"
memory_size = 3008 # More on this later
timeout = 60
# Publish version (required for provisioned concurrency)
publish = true
}
# Create alias pointing to latest version
resource "aws_lambda_alias" "prod" {
name = "prod"
function_name = aws_lambda_function.image_processor.function_name
function_version = aws_lambda_function.image_processor.version
}
# Provisioned concurrency on the alias
resource "aws_lambda_provisioned_concurrency_config" "prod" {
function_name = aws_lambda_function.image_processor.function_name
provisioned_concurrent_executions = 5
qualifier = aws_lambda_alias.prod.name
}
# Auto scaling target
resource "aws_appautoscaling_target" "lambda_target" {
max_capacity = 20
min_capacity = 5
resource_id = "function:${aws_lambda_function.image_processor.function_name}:${aws_lambda_alias.prod.name}"
scalable_dimension = "lambda:function:ProvisionedConcurrentExecutions"
service_namespace = "lambda"
}
# Auto scaling policy (target tracking)
resource "aws_appautoscaling_policy" "lambda_policy" {
name = "lambda-scaling-policy"
policy_type = "TargetTrackingScaling"
resource_id = aws_appautoscaling_target.lambda_target.resource_id
scalable_dimension = aws_appautoscaling_target.lambda_target.scalable_dimension
service_namespace = aws_appautoscaling_target.lambda_target.service_namespace
target_tracking_scaling_policy_configuration {
predefined_metric_specification {
predefined_metric_type = "LambdaProvisionedConcurrencyUtilization"
}
target_value = 0.7 # Scale when 70% utilized
}
}
Cost Optimization Tip
Use scheduled scaling to reduce provisioned concurrency during low-traffic hours (nights, weekends). We dropped from 5 to 1 instance during off-peak, saving $30/month.
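Here's a minimal sketch of that schedule using boto3 and the Application Auto Scaling API, assuming the scalable target registered above; the action names and cron expressions are placeholders for our off-peak window:

import boto3

autoscaling = boto3.client('application-autoscaling')

RESOURCE_ID = 'function:image-processor:prod'  # function:<name>:<alias>
DIMENSION = 'lambda:function:ProvisionedConcurrentExecutions'

# Drop to a single warm environment overnight (times in UTC).
autoscaling.put_scheduled_action(
    ServiceNamespace='lambda',
    ScheduledActionName='scale-down-overnight',
    ResourceId=RESOURCE_ID,
    ScalableDimension=DIMENSION,
    Schedule='cron(0 22 * * ? *)',
    ScalableTargetAction={'MinCapacity': 1, 'MaxCapacity': 1},
)

# Restore peak capacity each morning.
autoscaling.put_scheduled_action(
    ServiceNamespace='lambda',
    ScheduledActionName='scale-up-morning',
    ResourceId=RESOURCE_ID,
    ScalableDimension=DIMENSION,
    Schedule='cron(0 6 * * ? *)',
    ScalableTargetAction={'MinCapacity': 5, 'MaxCapacity': 20},
)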
When to Use Provisioned Concurrency
It makes the most sense for latency-sensitive, user-facing APIs with predictable peak traffic. For asynchronous or batch workloads that can tolerate an occasional cold start, the extra cost usually isn't worth it.
Optimization #3: Switch to ARM64 (Graviton2)
Result: 20% faster execution and 20% cost reduction compared to x86_64
AWS Graviton2 processors (ARM64 architecture) deliver better performance and cost efficiency for Lambda functions. The migration is usually straightforward.
resource "aws_lambda_function" "api_handler" {
function_name = "api-handler"
runtime = "python3.11"
# Simply change architecture
architectures = ["arm64"] # Was ["x86_64"]
# No other changes needed (for pure Python)
handler = "app.handler"
memory_size = 1024
timeout = 30
}
Migration checklist:
- Rebuild any native dependencies (compiled wheels, C extensions) for ARM64, e.g. with Docker's --platform linux/arm64 flag as shown below
- Republish Lambda layers with arm64 as a compatible architecture
# Use Docker to build ARM64 dependencies
# (override the image entrypoint so pip runs directly)
docker run --platform linux/arm64 --entrypoint "" \
  -v "$PWD":/var/task \
  public.ecr.aws/lambda/python:3.11 \
  pip install -r requirements.txt -t python/

# Package the layer
zip -r lambda-layer-arm64.zip python/

# Upload as a Lambda layer
aws lambda publish-layer-version \
  --layer-name my-dependencies-arm64 \
  --compatible-runtimes python3.11 \
  --compatible-architectures arm64 \
  --zip-file fileb://lambda-layer-arm64.zip
Optimization #4: Memory Allocation (Counter-Intuitive)
Surprising finding: Increasing memory from 1024MB to 3008MB reduced both latency AND cost
Lambda allocates CPU proportionally to memory. More memory = more CPU = faster execution. Sometimes paying for more memory actually costs less due to reduced execution time.
# Install Lambda Power Tuning (open-source tool)
git clone https://github.com/alexcasalboni/aws-lambda-power-tuning
cd aws-lambda-power-tuning
sam deploy --guided

# Run power tuning
aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:us-east-1:123456789:stateMachine:powerTuningStateMachine \
  --input '{
    "lambdaARN": "arn:aws:lambda:us-east-1:123456789:function:my-function",
    "powerValues": [128, 256, 512, 1024, 1536, 2048, 3008],
    "num": 50,
    "payload": "{\"test\": \"data\"}"
  }'

# Results show the cost vs performance trade-off for each memory size
Our Results: Image Processing Lambda
| Memory | Execution Time | Cost (1M invocations) | Verdict |
|---|---|---|---|
| 1024 MB | 450ms | $9.45 | ❌ Too slow |
| 1536 MB | 320ms | $7.89 | ⚠️ Better |
| 3008 MB | 180ms | $7.20 | ✅ OPTIMAL |
Key insight: 3008MB executes 2.5x faster and costs 24% less than 1024MB due to proportional CPU allocation.
Optimization #5: Code-Level Improvements
Beyond AWS configuration, optimizing your function code dramatically reduces cold start times.
5.1 Lazy Loading Dependencies
# ❌ Before: ALL imports happen on cold start (slow!)
import boto3
import requests
import pandas as pd
import numpy as np
from PIL import Image
import tensorflow as tf

def handler(event, context):
    # Most of these imports aren't used every time
    if event['action'] == 'simple':
        return {'status': 'ok'}
    # Heavy libraries loaded unnecessarily

# ✅ After: only import what's always needed
import json

# Global variable for caching the loaded model
_tf_model = None

def handler(event, context):
    action = event.get('action')
    if action == 'simple':
        return {'status': 'ok'}  # Fast path
    elif action == 'ml_inference':
        # Import heavy libraries only when needed
        global _tf_model
        if _tf_model is None:
            import tensorflow as tf
            _tf_model = tf.keras.models.load_model('/tmp/model')
        # Use the cached model
        result = _tf_model.predict(event['data'])
        return {'prediction': result}
5.2 Reuse Connections (Critical!)
import os

import boto3
import pymysql

# ✅ Initialize clients OUTSIDE the handler (global scope)
# These are reused across warm invocations
s3_client = boto3.client('s3')
dynamodb = boto3.resource('dynamodb')

# Database connection pooling
db_connection = None

def get_db_connection():
    global db_connection
    if db_connection is None or not db_connection.open:
        db_connection = pymysql.connect(
            host=os.environ['DB_HOST'],
            user=os.environ['DB_USER'],
            password=os.environ['DB_PASSWORD'],
            database=os.environ['DB_NAME'],
            connect_timeout=5
        )
    return db_connection

def handler(event, context):
    # Reuse existing connections (fast on warm starts)
    conn = get_db_connection()
    cursor = conn.cursor()

    # Your business logic here
    cursor.execute("SELECT * FROM users WHERE id = %s", (event['user_id'],))
    user = cursor.fetchone()

    return {'user': user}
Common Mistake
Initializing clients inside the handler function means you recreate connections on EVERY invocation, negating the warm execution performance benefit.
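For contrast, a minimal sketch of that anti-pattern, using the same libraries and environment variables as the example above:

import os

import boto3
import pymysql

def handler(event, context):
    # ❌ A new S3 client and DB connection are created on EVERY invocation,
    # even when the execution environment is already warm
    s3_client = boto3.client('s3')
    db_connection = pymysql.connect(
        host=os.environ['DB_HOST'],
        user=os.environ['DB_USER'],
        password=os.environ['DB_PASSWORD'],
        database=os.environ['DB_NAME'],
        connect_timeout=5
    )

    cursor = db_connection.cursor()
    cursor.execute("SELECT * FROM users WHERE id = %s", (event['user_id'],))
    user = cursor.fetchone()
    db_connection.close()

    return {'user': user}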
5.3 Reduce Deployment Package Size
# Remove unnecessary files from packages
find . -type d -name "__pycache__" -exec rm -r {} +
find . -type d -name "*.dist-info" -exec rm -r {} +
find . -type d -name "tests" -exec rm -r {} +
# Remove .pyc files
find . -name "*.pyc" -delete
# Strip binaries (for native dependencies)
find . -name "*.so" -exec strip {} +
# Result: Reduced our package from 128MB to 45MB
Final Results: Production Metrics
Before vs After Optimization
- ❌ Before: 3.2s cold starts
- ✅ After: 950ms cold starts (70% improvement)
Business Impact
- User satisfaction up 35% (measured via NPS surveys)
- API timeout errors dropped 90% (from 8% to 0.8%)
- Cost savings: $115/month ($1,380 annually)
- Eliminated need for warm-up scripts (saved 200 lines of hacky code)
Key Takeaways
1. ARM64 is a no-brainer for most workloads
20% faster, 20% cheaper, minimal migration effort. Start here.
2. More memory often = lower cost
Use Lambda Power Tuning to find the sweet spot. We found 3008MB was optimal despite seeming expensive.
3. SnapStart is magic (but Java-only)
If you're using Java, enable SnapStart immediately. It's free performance.
4. Provisioned concurrency costs money but eliminates cold starts
Use it selectively for user-facing APIs during peak hours with auto-scaling.
5. Code optimization matters just as much as infrastructure
Lazy loading, connection reuse, and package size reduction are free wins.