Agents · 6 min · 2026-06-03

Why We Deploy AI Systems on Modal Instead of AWS Lambda

Serverless GPU changed what's economically viable for production AI. Cold-start under 1 second, pay per millisecond of GPU time, scale to zero — Modal makes inference infrastructure a non-issue for mid-market AI systems.

The problem isn't building an AI model; it's getting it to production without drowning in infrastructure costs or operational complexity. We've seen countless promising AI pilots at Verel Systems die a slow death, not because the models were bad, but because the deployment strategy was a mess. Companies accumulate "AI debt"—tangled prompt chains, unmonitored agents, demo-quality RAG—and a major contributor is the mismatch between traditional cloud compute and the variable, GPU-hungry nature of real-world AI workloads.

AWS Lambda, for all its serverless glory, has no GPU. That immediately disqualifies it for the vast majority of serious AI work. When you're dealing with embedding generation, image processing, or batch inference, CPU-only serverless functions are a non-starter. The alternative, for many, is a painful choice between overengineering and overspending.

The AI Deployment Dilemma: Overkill or Overcost

Consider the typical AI workload: it's often bursty. You might need to process 100,000 documents for a RAG pipeline, generate a few hundred images, or run an hourly batch inference job. These tasks demand significant GPU power, but only for short, unpredictable durations.

This is where the standard cloud offerings fall short:

▸AWS Lambda: As mentioned, no GPU. End of discussion for AI.
▸ECS/EKS (Container Orchestration): These are powerful tools, undeniably. But for variable AI workloads, they're often overengineered. Setting up and maintaining Kubernetes or even ECS clusters for occasional GPU jobs introduces immense operational overhead. We've seen mid-market companies spend months trying to get a stable GPU-enabled EKS cluster running, only to have their AI project stall in "pilot purgatory" because the infrastructure became a full-time job. It's a prime example of AI debt building up before a single model even sees production.
▸Provisioned GPU Instances (EC2 G5, G4dn): This is the most straightforward option, but also the most wasteful for intermittent tasks. An A10G instance (like a g5.xlarge) costs around $1.006 per hour on-demand in us-east-1. If your embedding pipeline runs for 3 hours a day, you're paying $3.018 for compute. But if that instance sits idle for the remaining 21 hours, you're still paying for it. Over a month, that's $724.32 for an instance that's actively working less than 15% of the time. For many companies, especially those without dedicated MLOps teams, this cost is prohibitive and makes scaling AI projects financially unsustainable. They can't afford dedicated GPU servers for occasional inference jobs, leading to abandoned POCs and wasted budget.
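
The idle-cost arithmetic in the last bullet can be reproduced in a few lines. This is a minimal sketch using only the figures quoted above ($1.006 per hour, 3 active hours per day, a 30-day month); the function and variable names are illustrative.

```python
# Idle-cost arithmetic for an always-on GPU instance.
# Figures are the ones quoted in the text; names are illustrative.
HOURLY_RATE_USD = 1.006      # g5.xlarge on-demand, us-east-1 (as quoted)
ACTIVE_HOURS_PER_DAY = 3     # embedding pipeline runtime
DAYS_PER_MONTH = 30


def always_on_monthly_cost(hourly_rate: float, days: int = DAYS_PER_MONTH) -> float:
    """Cost of leaving the instance running 24 hours a day."""
    return hourly_rate * 24 * days


def utilization(active_hours_per_day: float) -> float:
    """Fraction of each day the GPU does useful work."""
    return active_hours_per_day / 24


monthly_cost = always_on_monthly_cost(HOURLY_RATE_USD)
useful_cost = HOURLY_RATE_USD * ACTIVE_HOURS_PER_DAY * DAYS_PER_MONTH
idle_cost = monthly_cost - useful_cost

print(f"Monthly bill:  ${monthly_cost:.2f}")
print(f"Useful work:   ${useful_cost:.2f}")
print(f"Paid for idle: ${idle_cost:.2f}")
print(f"Utilization:   {utilization(ACTIVE_HOURS_PER_DAY):.1%}")
```

Of the $724.32 monthly bill, only $90.54 buys active compute; the remaining $633.78 pays for an idle GPU.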

This fundamental mismatch is why we, at Verel Systems, began looking for a better way to operationalize AI. We needed something that offered GPU compute, scaled automatically, and charged only for what we used.

Modal: Serverless GPU Compute That Just Works

Modal emerged as the clear solution. It's a serverless GPU compute platform built for Python, designed from the ground up to handle the very problems described above. It removes the infrastructure burden, letting our engineers focus on the AI, not the YAML.

Here’s what Modal delivers:

▸Python-Native Serverless GPU: You write standard Python functions, decorate them, and Modal handles the rest. It provides access to high-performance GPUs like the NVIDIA A10G and H100, crucial for any serious deep learning task.
▸Sub-Second Cold Starts: This is a major differentiator. For pre-warmed containers, Modal can spin up a GPU function in under a second. This responsiveness is critical for user-facing applications or interactive workloads, bridging the gap between traditional serverless and dedicated instances.
▸Simple Decorators: Converting a Python function into a cloud-deployable, GPU-accelerated function is as simple as adding a decorator such as @app.function(gpu="A10G") (@stub.function in older Modal releases, where modal.App was called modal.Stub). This abstracts away Dockerfiles, Kubernetes manifests, and cloud-specific SDKs, drastically reducing development and deployment time.
▸Automatic Container Builds: Modal intelligently builds your container image based on your requirements.txt or explicit pip_install commands within your image definition. This eliminates the need for manual Dockerfile maintenance for most use cases, speeding up iteration cycles.
▸Secrets Management: Production systems require secure handling of API keys and credentials. Modal's modal.Secret object stores them outside your codebase and injects them into the function's environment at runtime, so you never hardcode them or commit them to source control.
▸Job Queues and Automatic Scaling: While not explicitly exposed as a separate service, Modal’s serverless model inherently provides job queueing and automatic scaling. You simply call a remote function, and Modal provisions the necessary compute, scales up for concurrent tasks, and scales down to zero when idle, ensuring you only pay for active execution time.
▸Pay-per-millisecond Billing: This is the economic game-changer. Instead of paying for an instance that sits idle, you are billed only for the exact milliseconds your code runs on the allocated GPU.

Where Verel Systems Uses Modal

At Verel, we've integrated Modal across several critical parts of our AI production pipelines, helping our clients avoid AI debt and move beyond stalled pilots.

▸Embedding Pipelines: Our most frequent use case. When we ingest large datasets—say, 100,000 documents for a RAG system—we need to generate embeddings. This is a highly parallelizable, GPU-intensive task, but it’s not continuous. Spinning up an A10G instance for a few hours, then tearing it down, is exactly the kind of manual, error-prone process Modal eliminates. We define our embedding function, set it to run on an A10G, and let Modal handle the scaling and execution. The cost efficiency here is unmatched; we pay only for the actual compute time, not for idle GPU hours.
▸Image Generation Jobs: For clients requiring custom image generation (e.g., for marketing assets, product mockups, or internal tools), Modal is ideal. Stable Diffusion or similar models are GPU hogs, but the jobs are often bursty. A client might generate 500 images in an hour, then nothing for the next three. Modal scales up instantly for the demand and scales down to zero, ensuring cost-effectiveness.
▸Model Inference Endpoints for Clients: Many of our mid-market clients can't justify the expense of a dedicated GPU server for occasional inference. Modal allows us to expose highly specific model inference endpoints as web endpoints, providing production-grade performance without the client needing to manage any infrastructure. This enables them to operationalize AI without the upfront capital expenditure or ongoing operational burden. It’s how we move them from a demo to a deployed, valuable service.
▸Scheduled ML Jobs: Tasks like daily model performance monitoring, weekly data drift detection, or periodic retraining checks also fit Modal perfectly. These are scheduled, often short-duration, GPU-accelerated tasks that don't warrant a continuously running server. We define these as Modal functions, schedule them, and forget about the underlying infrastructure.
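
Why per-use billing suits the embedding case can be seen with a back-of-the-envelope model: total GPU-seconds are fixed by the workload, so fanning out across more containers shortens wall-clock time without changing the compute bill. The sketch below assumes a hypothetical 0.036 seconds of GPU time per document (chosen so that 100,000 documents need one GPU-hour) and the $0.00015 per second A10G rate quoted later in this article. It ignores container start-up and model-loading time, so treat it as a rough estimate only.

```python
import math

GPU_RATE_USD_PER_SEC = 0.00015   # A10G rate quoted in this article
SECONDS_PER_DOC = 0.036          # hypothetical: 100,000 docs == 1 GPU-hour


def fan_out_estimate(num_docs: int, containers: int) -> tuple[float, float]:
    """Return (wall_clock_seconds, compute_cost_usd) for an even split.

    Ignores cold starts and model loading, which add a fixed
    per-container overhead in practice.
    """
    docs_per_container = math.ceil(num_docs / containers)
    wall_clock = docs_per_container * SECONDS_PER_DOC
    # Billing follows total GPU-seconds, not the number of containers.
    total_gpu_seconds = num_docs * SECONDS_PER_DOC
    return wall_clock, total_gpu_seconds * GPU_RATE_USD_PER_SEC


for n in (1, 10, 50):
    wall, cost = fan_out_estimate(100_000, n)
    print(f"{n:>3} containers: {wall / 60:6.1f} min wall-clock, ${cost:.2f}")
```

Under these assumptions the job costs the same whether it runs on one container for an hour or on fifty for about a minute, which is why bursty, parallelizable jobs fit this billing model well.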

A Concrete Example: S3-Triggered Embedding Pipeline

Let's look at a simplified example of how we might set up an embedding pipeline on Modal. This function processes new documents uploaded to an S3 bucket, generates embeddings using a GPU, and stores them in Qdrant.


import os
import uuid

import modal

# Define the Modal app. modal.App replaces modal.Stub from older Modal releases.
app = modal.App("s3-embedding-pipeline")

# Define the image for our Modal functions.
# We use a slim Debian base and install our Python dependencies;
# Modal rebuilds the image when this definition changes.
embedding_image = (
    modal.Image.debian_slim(python_version="3.10")
    .pip_install(
        "boto3",
        "qdrant-client",
        "sentence-transformers",
        "torch",  # Required for sentence-transformers with GPU
        "transformers",
        "accelerate",  # For better GPU utilization
    )
    .apt_install("git")  # Needed by some models
)

# Modal Secrets for Qdrant and AWS, configured in Modal's UI or CLI.
# They are injected as environment variables at runtime. This example assumes
# "qdrant-api-key" provides QDRANT_HOST and QDRANT_API_KEY, and
# "aws-credentials" provides AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
qdrant_secret = modal.Secret.from_name("qdrant-api-key")
aws_secret = modal.Secret.from_name("aws-credentials")

COLLECTION_NAME = "document_embeddings"

# The GPU-accelerated embedding function
@app.function(
    image=embedding_image,
    gpu="A10G",  # Allocate an A10G GPU
    secrets=[qdrant_secret, aws_secret],
    timeout=600,  # 10 minutes timeout
    memory=1024,  # 1 GB memory (value is in MiB)
)
def process_document_for_embedding(s3_uri: str):
    # Import heavy dependencies inside the function so they only need to be
    # installed in the container image, not on the machine launching the app.
    import boto3
    from qdrant_client import QdrantClient, models
    from sentence_transformers import SentenceTransformer

    print(f"Processing document from S3: {s3_uri}")

    # boto3 reads the AWS credentials from the injected environment variables
    s3_client = boto3.client("s3")
    bucket_name, _, object_key = s3_uri.removeprefix("s3://").partition("/")

    # Download document from S3
    try:
        response = s3_client.get_object(Bucket=bucket_name, Key=object_key)
        document_content = response["Body"].read().decode("utf-8")
        print(f"Downloaded {len(document_content)} characters from {s3_uri}")
    except Exception as e:
        print(f"Error downloading {s3_uri}: {e}")
        return

    # Load the embedding model (runs on GPU automatically if available).
    # Using a common, efficient model for demonstration. Note that this
    # reloads the model on every call, which is acceptable for a demo only.
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    print("Embedding model loaded.")

    # Generate the embedding and convert it to a plain list for Qdrant
    embedding = model.encode(document_content).tolist()

    # Initialize Qdrant client
    qdrant_client = QdrantClient(
        url=os.environ["QDRANT_HOST"],
        api_key=os.environ["QDRANT_API_KEY"],
    )

    # Create the collection only if it does not exist yet.
    # (recreate_collection would drop previously stored vectors on every call.)
    try:
        if not qdrant_client.collection_exists(COLLECTION_NAME):
            qdrant_client.create_collection(
                collection_name=COLLECTION_NAME,
                vectors_config=models.VectorParams(
                    size=len(embedding), distance=models.Distance.COSINE
                ),
            )
            print(f"Collection '{COLLECTION_NAME}' created.")
    except Exception as e:
        # Another container may have created the collection concurrently
        print(f"Could not create collection, it may already exist: {e}")

    # Store the embedding in Qdrant
    try:
        qdrant_client.upsert(
            collection_name=COLLECTION_NAME,
            points=[
                models.PointStruct(
                    # Deterministic ID: the same URI always maps to the same
                    # point, so re-processing a document overwrites it.
                    # (Built-in hash() is salted per process and is not stable.)
                    id=str(uuid.uuid5(uuid.NAMESPACE_URL, s3_uri)),
                    vector=embedding,
                    payload={"s3_uri": s3_uri, "text_preview": document_content[:200]},
                )
            ],
        )
        print(f"Embeddings for {s3_uri} stored in Qdrant.")
    except Exception as e:
        print(f"Error storing embeddings for {s3_uri} in Qdrant: {e}")

# This local entrypoint simulates triggering the function
@app.local_entrypoint()
def main():
    # Example S3 URIs to process
    s3_documents = [
        "s3://my-document-bucket/path/to/doc1.txt",
        "s3://my-document-bucket/path/to/doc2.pdf",  # Assume PDF parsing happens before this step
        "s3://my-document-bucket/path/to/doc3.md",
    ]
    # .map() fans the inputs out across containers in parallel;
    # consuming the iterator waits for every document to finish.
    # (A plain .remote() call in a loop would process them one at a time.)
    for _ in process_document_for_embedding.map(s3_documents):
        pass
    print("All documents processed.")

This code snippet illustrates several key Modal advantages:

▸Declarative GPU Allocation: gpu="A10G" is all it takes to specify the hardware.
▸Dependency Management: embedding_image.pip_install(...) handles all Python package installations.
▸Secrets Integration: secrets=[qdrant_secret, aws_secret] ensures sensitive data is handled securely.
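▸Scalability: A single process_document_for_embedding.remote() call runs one invocation in the cloud and waits for its result. Fanning a list of inputs out with .map(), or submitting calls without waiting via .spawn(), lets Modal queue the work and run many containers in parallel. If we had 100,000 documents, Modal would spin up multiple A10G containers concurrently to process them, then scale down to zero when done.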
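
Two details in a pipeline like this are easy to get subtly wrong: splitting the S3 URI and choosing a point ID. The sketch below shows one way to handle both with the standard library. `parse_s3_uri` and `stable_point_id` are hypothetical helper names, not part of Modal, boto3, or Qdrant. A UUID derived from the URI stays the same across runs and containers, whereas Python's built-in `hash()` for strings is salted per process by default, so it does not.

```python
import uuid


def parse_s3_uri(s3_uri: str) -> tuple[str, str]:
    """Split 's3://bucket/key/with/slashes' into (bucket, key)."""
    prefix = "s3://"
    if not s3_uri.startswith(prefix):
        raise ValueError(f"Not an S3 URI: {s3_uri!r}")
    # partition() splits on the first '/', so slashes in the key survive
    bucket, _, key = s3_uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"S3 URI needs both a bucket and a key: {s3_uri!r}")
    return bucket, key


def stable_point_id(s3_uri: str) -> str:
    """Derive a deterministic UUID string from the URI.

    The same URI always yields the same ID, so re-ingesting a
    document overwrites its earlier vector instead of duplicating it.
    """
    return str(uuid.uuid5(uuid.NAMESPACE_URL, s3_uri))


uri = "s3://my-document-bucket/path/to/doc1.txt"
bucket, key = parse_s3_uri(uri)
print(bucket, key)
print(stable_point_id(uri))
```

Failing loudly on a malformed URI, instead of indexing into a split list, turns a confusing downstream S3 error into an immediate and readable one.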

Modal vs. The Alternatives: A Cost Breakdown

Let's compare the costs for a typical 100,000 document embedding job. Assume this job requires 1 hour of active A10G GPU compute.

▸Modal: An A10G on Modal costs approximately $0.54 per hour ($0.00015/sec). For 1 hour of compute, the cost is $0.54. This is pure compute cost; no idle time.
▸AWS EC2 (g5.xlarge - A10G equivalent): On-demand, a g5.xlarge costs $1.006 per hour.
  ▸If you manage to perfectly provision and de-provision for exactly 1 hour: $1.006.
  ▸More realistically, if you keep the instance running for a full working day (8 hours) just to be ready: $1.006 * 8 = $8.048.
  ▸If you keep it running 24/7 for a month for occasional jobs: $1.006 * 24 * 30 = $724.32. This is the AI debt we talk about.
▸RunPod: Offers on-demand A10G instances for around $0.49-$0.69 per hour. While cheaper than EC2, you still manage the instance lifecycle, cold boots, and software setup. For 1 hour: ~$0.60. Still better than EC2, but not fully serverless.
▸Replicate: A serverless GPU platform, often higher-level for specific models. For custom embedding models, it might not be as flexible or could be more expensive per invocation. Assuming a similar A10G-equivalent, a 1-hour job could cost $1.00 - $2.00 depending on their pricing model (which often includes a premium for API simplicity).
▸AWS SageMaker: A fully managed ML platform. While powerful, it introduces its own complexity and cost structure. For a custom embedding job, you'd likely use a SageMaker endpoint or a processing job. A ml.g5.xlarge instance for a processing job might cost around $1.40 per hour. The managed overhead often makes it more expensive than raw EC2 for the same compute, and it comes with platform lock-in. For 1 hour: ~$1.40+.

Conclusion: For bursty, intermittent AI workloads, Modal is the most cost-effective fully serverless option in this comparison and the simplest to operate. RunPod can match it on raw hourly price, but you manage the instance lifecycle yourself. Modal removes the need to worry about idle GPU time, which is the primary driver of AI infrastructure debt.
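
The per-job comparison above reduces to hours billed times hourly rate. The sketch below tabulates it using only the prices quoted in this section. The scenario labels are illustrative, RunPod uses the ~$0.60 figure from the text, and Replicate is left out because the text gives only a range for it.

```python
# Cost of a job needing 1 hour of active A10G compute, using the
# hourly rates quoted in this section. Labels are illustrative.
scenarios = [
    # (label, hourly rate in USD, hours billed)
    ("Modal (pay per use)",          0.54,  1),
    ("EC2 g5.xlarge, exact 1 h",     1.006, 1),
    ("EC2 g5.xlarge, 8 h standby",   1.006, 8),
    ("EC2 g5.xlarge, 24/7 for 30 d", 1.006, 24 * 30),
    ("RunPod A10G (approx.)",        0.60,  1),
    ("SageMaker ml.g5.xlarge",       1.40,  1),
]

costs = {label: rate * hours for label, rate, hours in scenarios}

# Print cheapest first
for label, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{label:<30} ${cost:>8.2f}")
```

The spread between the rows comes almost entirely from the hours billed, not from the hourly rate, which is the point of the comparison.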

When Modal is NOT the Right Answer

No tool is a silver bullet. While Modal excels for many AI workloads, there are specific scenarios where we recommend alternative approaches:

▸Persistent Low-Latency Inference (Sub-100ms P99): For truly real-time, high-throughput inference endpoints that demand extremely low P99 latencies (e.g., less than 100ms), a dedicated, always-on instance with pre-loaded models is usually a better choice. While Modal's cold starts are fast (under 1 second), they are not zero. For consistent, millisecond-level response times under heavy load, the overhead of even a fast cold start can be too much. This is where you might tolerate the cost of an always-on EC2 instance or a highly optimized SageMaker endpoint.
▸On-Premises Requirements: Modal is a cloud-native platform. If your organization has strict on-premises data processing requirements due to regulatory compliance or internal policy, Modal is not an option. There's no hybrid or self-hosted version.
▸Strict Data Sovereignty/Residency: For clients in regions with stringent data sovereignty laws (e.g., some Gulf Cooperation Council countries, specific EU regulations), where data cannot leave a particular geographic boundary, Modal might not be suitable if its underlying cloud infrastructure doesn't reside in that specific region. While Modal leverages major cloud providers, direct regional control can be a constraint. This is a common hurdle we navigate with international deployments, often leading to custom private cloud setups.
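
The latency point in the first bullet follows from how percentiles work: P99 is set by the slowest 1% of requests, so once more than 1 request in 100 lands on a cold container, the cold-start time dominates P99 no matter how fast warm requests are. The sketch below is a rough illustration with hypothetical numbers (20 ms warm latency, 1 s cold start); it is not a measurement of Modal.

```python
import math


def p99(latencies_ms: list[float]) -> float:
    """Nearest-rank 99th percentile."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]


def p99_with_cold_starts(
    cold_fraction: float,
    n: int = 10_000,
    warm_ms: float = 20.0,     # hypothetical warm-request latency
    cold_ms: float = 1000.0,   # hypothetical cold-start latency
) -> float:
    """P99 when a given fraction of n requests pays the cold-start penalty."""
    cold_count = round(n * cold_fraction)
    samples = [cold_ms] * cold_count + [warm_ms] * (n - cold_count)
    return p99(samples)


for fraction in (0.0, 0.005, 0.02):
    result = p99_with_cold_starts(fraction)
    print(f"cold fraction {fraction:.1%}: P99 = {result:.0f} ms")
```

With these assumed values, P99 stays at the warm latency while cold requests are under 1% of traffic and jumps to the full cold-start time once they exceed it. Keeping containers warm lowers the cold fraction, but a sub-100ms P99 target leaves very little room for any cold starts.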

The Economics: RAG Ingestion on Modal vs. Reserved Instance

Let's quantify the economics with a common scenario: a daily RAG ingestion job that requires 2 hours of A10G GPU compute.

▸Modal Cost:
  ▸2 hours/day * $0.54/hour (A10G) = $1.08/day
  ▸Monthly cost: $1.08/day * 30 days = $32.40/month
▸AWS EC2 (g5.xlarge - On-Demand):
  ▸If you try to stop/start the instance daily, you still pay for minimum billing increments and the operational overhead of managing that lifecycle. Even then, you might incur costs for the instance being "on" for longer than 2 hours due to startup/shutdown times. This approach is prone to human error and adds significant management burden.
  ▸If you leave a g5.xlarge instance running 24/7 for a month (often the default for teams without strict MLOps): $1.006/hour * 24 hours/day * 30 days/month = $724.32/month.

The difference is stark: $32.40 vs $724.32 per month for the same amount of active GPU work. This is the core of why Modal saves money for bursty AI workloads. It eliminates the cost of idle infrastructure.

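When Modal saves money: For any workload that isn't running close to 24/7. The more intermittent, the more Modal shines. At the rates quoted above, Modal's $0.54/hour is below the $1.006/hour on-demand price, so it beats an on-demand EC2 instance at any utilization. The comparison that matters is against a discounted, always-on instance, where the break-even is 24 * (discounted hourly rate / $0.54) hours of GPU use per day. As a rule of thumb, if your GPU usage averages less than, say, 12-16 hours per day, Modal is almost certainly the more cost-effective choice.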

When Modal doesn't save money: If you have a truly continuous, high-throughput workload that keeps an A10G (or similar GPU) busy for 20+ hours a day, 7 days a week, then a reserved EC2 instance will eventually become cheaper due to bulk discounts and the absence of per-invocation overheads. However, such consistently heavy batch AI workloads are rarer than you might think; most deep learning training falls into this category, but not most inference or data processing jobs.
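
The break-even logic in the last two paragraphs can be written down directly: an always-on instance costs 24 times its hourly rate per day regardless of use, while per-use billing costs active hours times the per-use rate. The sketch below uses the Modal and on-demand EC2 rates quoted above. The 55% reserved-instance discount is a hypothetical placeholder, not a quoted AWS price, so substitute the rate you can actually obtain.

```python
MODAL_RATE = 0.54                       # USD per active GPU-hour (as quoted)
EC2_ON_DEMAND_RATE = 1.006              # USD per hour, billed around the clock
HYPOTHETICAL_RESERVED_DISCOUNT = 0.55   # assumption, not an AWS quote


def monthly_costs(active_hours_per_day: float, always_on_rate: float,
                  days: int = 30) -> tuple[float, float]:
    """Return (per_use_cost, always_on_cost) for one month."""
    per_use = active_hours_per_day * MODAL_RATE * days
    always_on = 24 * always_on_rate * days
    return per_use, always_on


def break_even_hours_per_day(always_on_rate: float) -> float:
    """Daily active hours above which the always-on instance is cheaper.

    A result above 24 means per-use billing is cheaper at any utilization.
    """
    return 24 * always_on_rate / MODAL_RATE


per_use, always_on = monthly_costs(2, EC2_ON_DEMAND_RATE)
print(f"2 h/day: per-use ${per_use:.2f} vs always-on ${always_on:.2f}")

reserved_rate = EC2_ON_DEMAND_RATE * (1 - HYPOTHETICAL_RESERVED_DISCOUNT)
print(f"Break-even vs on-demand: {break_even_hours_per_day(EC2_ON_DEMAND_RATE):.1f} h/day")
print(f"Break-even vs reserved:  {break_even_hours_per_day(reserved_rate):.1f} h/day")
```

With the assumed discount, the always-on instance only wins above roughly 20 active hours per day, which matches the "20+ hours a day" threshold described above. A smaller discount pushes the break-even past 24 hours, meaning per-use billing stays cheaper at any utilization.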

The Pragmatic Choice for Production AI

At Verel Systems, our mission is to take AI from spaghetti to production. That means building systems that are not just performant, but also sustainable, cost-effective, and easy to operate. For the vast majority of AI workloads outside of truly persistent, ultra-low-latency serving, Modal is the pragmatic choice. It allows our engineers to focus on model quality and pipeline logic, rather than wrestling with Kubernetes or fretting over idle GPU costs.

The decision is clear: if you're building AI systems that require GPU compute but don't need a dedicated server running 24/7, Modal is how you get to production without accumulating crippling AI debt. Choose the right tool for the job, and your AI projects will move from pilot purgatory to delivered value.
