Daft Is What Pandas Should Have Been for AI Data Pipelines
Most RAG and ML pipelines use Pandas or custom scripts for data prep. At scale, this breaks. Daft is a Rust-native distributed dataframe engine built for AI workloads — multimodal, GPU-aware, and petabyte-capable.
The problem isn't just that Pandas is slow. It's that it fundamentally breaks when you try to use it for serious AI data pipelines. You hit memory walls, your GPUs sit idle, and your entire data processing strategy devolves into a series of duct-taped scripts. We see this all the time at Verel Systems. Clients come to us with pilot projects stuck in purgatory – RAG systems that can't scale beyond a few thousand documents, multimodal models that choke on their own data, and embedding jobs that crash with OutOfMemoryError every other day. This is AI debt, pure and simple: demo-quality code pushed into production, and it costs companies millions in wasted compute and abandoned initiatives.
Pandas, for all its utility in exploratory data analysis and small-to-medium datasets, was never built for the realities of modern AI. It’s a single-machine, in-memory tool. That design choice, once a strength, is now a critical vulnerability for anything involving large-scale embeddings, multimodal data, or distributed processing.
Pandas' Fatal Flaws for AI Scale
Let's be direct about why Pandas fails for AI workloads:
- ▸Memory Exhaustion is Inevitable: Try processing a batch of 100,000 documents for RAG, each needing to be chunked and then embedded. Or consider a 50GB image dataset. Pandas loads everything into RAM. Your 64GB or even 128GB machine runs out of memory fast, especially when dealing with the high-dimensional vectors that embeddings produce. A typical text-embedding-3-large embedding is 3072 floats, or about 12KB per chunk as float32, regardless of chunk length. A million chunks is roughly 12GB of raw vectors before any intermediate copies, and at a hundred million chunks you are past a terabyte. Pandas cannot handle this without constant, manual memory management, which means writing more spaghetti code.
- ▸No Native GPU Support: GPUs are the workhorses of AI. For batch inference, fine-tuning, or even just vector database operations, you need to push data to the GPU and execute operations there. Pandas is CPU-bound. If you have a dataframe of image paths and want to run them through a vision transformer, you're forced to pull each image, load it, send it to the GPU, get the output, and then manually re-integrate it into your dataframe. This is inefficient, serial, and wastes the very hardware designed to accelerate your AI tasks. Your expensive A100s or H100s sit idle while your CPU churns through Python loops.
- ▸Single-Threaded ETL Bottlenecks: Data loading, cleaning, and transformation for AI often involves I/O-heavy operations (reading from S3, parsing JSON, resizing images) or computationally intensive ones (complex regex, tokenization). Pandas executes these operations on a single CPU core. Even with apply or map, the underlying engine doesn't leverage all available cores for complex, user-defined functions. This means your data ingestion pipeline for a RAG system, processing 100,000+ documents, can take hours, not minutes, creating a significant bottleneck in your development and deployment cycles.
- ▸No First-Class Multimodal Data Types: AI isn't just about structured tables anymore. We're dealing with images, audio, video, point clouds. Pandas has no native understanding or efficient representation for these data types. You store paths to files, or perhaps raw bytes as blobs, but you can't perform operations like "resize all images," "extract audio features," or "decode video frames" directly on a Pandas DataFrame column with any efficiency. This forces developers to manage these transformations outside the dataframe abstraction, leading to disconnected, hard-to-maintain codebases that are impossible to optimize.
These limitations are why 80-95% of AI projects die in pilot purgatory. You build a demo that works on a small sample, then hit these fundamental data engineering roadblocks when you try to scale.
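To put rough numbers on the embedding memory problem, here is a back-of-the-envelope estimate. It assumes vectors are stored as float32 and ignores container, index, and intermediate-copy overhead, so real usage will be higher; the helper names are ours, for illustration only.

```python
# Back-of-the-envelope size of raw embedding vectors (float32, no overhead).
BYTES_PER_FLOAT32 = 4


def embedding_bytes(num_vectors: int, dim: int, bytes_per_value: int = BYTES_PER_FLOAT32) -> int:
    """Raw size in bytes of `num_vectors` dense vectors of width `dim`."""
    return num_vectors * dim * bytes_per_value


def human(num_bytes: int) -> str:
    """Format a byte count using decimal units (1 GB = 10**9 bytes)."""
    for unit, scale in (("TB", 10**12), ("GB", 10**9), ("MB", 10**6), ("KB", 10**3)):
        if num_bytes >= scale:
            return f"{num_bytes / scale:.1f} {unit}"
    return f"{num_bytes} B"


# 3072 is the vector width mentioned above for text-embedding-3-large.
for n in (100_000, 1_000_000, 100_000_000):
    print(f"{n:>11,} vectors x 3072 dims -> {human(embedding_bytes(n, 3072))}")
```

The raw vectors alone reach double-digit gigabytes at a million chunks, and that is before the source text, intermediate copies, and Python object overhead are counted.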
Daft: What Pandas Should Have Been for AI
Enter Daft. If Pandas was designed for single-machine, tabular analysis, Daft was designed from the ground up for distributed, multimodal AI data pipelines. It’s the data engine you need to take AI from spaghetti to production.
At its core, Daft is a high-performance distributed dataframe library written in Rust. This isn't just an implementation detail; it's a foundational choice that enables everything else:
- ▸Rust Core & Apache Arrow Zero-Copy: The entire data plane of Daft is built in Rust, leveraging Apache Arrow for in-memory data representation. This means operations are fast, and memory layouts are optimized for columnar processing. Crucially, Daft uses Arrow's zero-copy semantics. When data moves between Daft operations or to other Arrow-compatible libraries (such as Polars or PyArrow), it often doesn't need to be copied, just referenced. This reduces memory overhead and improves throughput. We've seen Daft use 5x less memory than alternatives for complex operations, which is critical when processing large embedding batches.
- ▸Lazy Evaluation for Efficiency: Unlike Pandas, Daft uses lazy evaluation. When you chain operations (df.where().with_column().select()), Daft doesn't execute them immediately. Instead, it builds an execution plan and optimizes it. Only when you explicitly ask for results (e.g., df.collect()) does it execute the graph. This allows Daft to fuse operations, push down predicates, and optimize memory usage across the pipeline. It also keeps local iteration fast: previewing a few rows of a very large dataset reads only what the preview needs, so you can iterate on your logic from a laptop without waiting for full dataset scans.
- ▸Distributed Execution (Laptop to Cluster): Daft's API is designed to be identical whether you're running on your local machine with a few cores or on a large cluster with hundreds of nodes (distributed execution runs on Ray, which can itself be deployed on Kubernetes). The same UDF applied with df.with_column() that works locally will scale out across your cluster. You write your pipeline once, and Daft handles the distribution, scheduling, and fault tolerance. This makes the move from POC to production much easier, because you don't rewrite code for different environments.
- ▸Native GPU Scheduling for Batch Inference: This is where Daft truly shines for AI. Daft can directly schedule work onto GPUs. You define a function that performs GPU inference (e.g., running a batch of images through a vision model), declare its GPU requirement, and apply it to a DataFrame column. Daft handles batching the column, placing the function on GPU workers, and running it in parallel across multiple GPUs if available; inside the function, your model code moves tensors to and from the device. This means you can finally use your expensive GPU resources efficiently for batch embedding, multimodal feature extraction, and other inference tasks, integrating them cleanly into your data pipeline.
- ▸First-Class Multimodal Data Types: Daft understands images as a native column type, alongside tensors and embeddings, not just blobs or file paths. You can apply operations like df["image_url"].url.download().image.decode() followed by .image.resize(224, 224) directly on a column. Other media, such as audio or video, can be carried as bytes or URLs and decoded inside UDFs in the same pipeline. This allows you to build end-to-end multimodal pipelines within the dataframe abstraction, from raw media files to processed features, without resorting to messy external scripts.
Daft isn't just a faster Pandas; it's a fundamentally different approach to data processing, built for the unique demands of AI at scale.
Where We Use Daft at Verel Systems
At Verel, we build production AI systems. That means rescuing failed POCs, cleaning up AI debt, and delivering solutions that actually work in enterprise environments. Daft is a critical tool in our arsenal for several key areas:
- ▸RAG Ingestion Pipelines for 100K+ Documents: For clients building advanced RAG systems, the ingestion pipeline is often the first bottleneck. We use Daft to process hundreds of thousands, sometimes millions, of documents.
  - ▸Cleaning and Normalization: We read raw PDFs, HTML, or Markdown, extract text, and then use Daft to apply a series of cleaning functions (e.g., removing boilerplate, standardizing whitespace) across the entire corpus in parallel.
  - ▸Chunking: Daft's UDFs and explode operation are a good fit for creating overlapping text chunks from documents. We define our chunking logic as a UDF that returns a list of chunks per document, apply it, and then explode the DataFrame to have one chunk per row, ready for embedding.
  - ▸Deduplication: Before embedding, we often deduplicate chunks to avoid redundant work and improve retrieval quality. Daft's distributed capabilities allow us to efficiently compute hashes and filter duplicates across massive datasets.
  This entire process, which would OOM or take days with Pandas, runs in hours on a cluster with Daft.
- ▸Batch Embedding Jobs with Native GPU Scheduling: This is where Daft pays for itself. Instead of writing custom distributed inference scripts or relying on clunky orchestrators, we define our embedding function and let Daft handle the GPU scheduling.
  - ▸We load our chunked text data into a Daft DataFrame.
  - ▸We define a Python function that takes a batch of text and runs it through an embedding model (e.g., sentence-transformers/all-MiniLM-L6-v2) on a GPU. Hosted models such as text-embedding-3-large are called over an API instead and don't need a local GPU.
  - ▸We then declare the GPU requirement on the UDF (e.g., @daft.udf(return_dtype=..., num_gpus=1)) and apply it with df.with_column(). Daft schedules the function onto GPU workers, feeds it batches of text, and collects the resulting embeddings. This keeps our GPUs busy and cuts inference time and cost for millions of embeddings.
- ▸Multimodal Pipelines with Images and Structured Data: Many of our clients are moving beyond text-only AI. We build systems that combine product images with descriptions, sensor data with video feeds, or medical scans with patient records. Daft's native multimodal support simplifies these pipelines.
  - ▸We can read a DataFrame containing image paths and structured metadata (e.g., product IDs, categories).
  - ▸Use df["image_path"].url.download().image.decode() to load images directly into the DataFrame.
  - ▸Apply a vision transformer (e.g., CLIP, ViT) as a UDF with GPU scheduling to extract image features.
  - ▸Perform joint processing or fusion of these features with structured data, all within the same Daft DataFrame, leading to cleaner, more efficient, and easier-to-debug code.
Without an engine like Daft, these types of pipelines quickly become a tangled mess of Python scripts, shell commands, and custom distributed logic – precisely the "spaghetti" we rescue clients from.
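The chunking and deduplication steps described above are simple at the level of a single document; the engine's job is to run them across the whole corpus. Here is a minimal, engine-independent sketch of the per-document logic. It assumes whitespace-separated words as a stand-in for model tokens, and the function names are ours rather than part of any library.

```python
import hashlib
from typing import Iterable, Iterator, List, Sequence


def sliding_chunks(tokens: Sequence[str], size: int, overlap: int) -> Iterator[List[str]]:
    """Yield windows of `size` tokens; consecutive windows share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    for start in range(0, len(tokens), step):
        yield list(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end; skip a redundant tail chunk


def dedupe_chunks(chunks: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each chunk, comparing hashes of normalized text."""
    seen = set()
    unique: List[str] = []
    for chunk in chunks:
        normalized = " ".join(chunk.split()).lower()  # collapse whitespace, ignore case
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique


tokens = "the quick brown fox jumps over the lazy dog again".split()
chunks = [" ".join(window) for window in sliding_chunks(tokens, size=4, overlap=1)]
print(chunks)
print(dedupe_chunks(chunks + ["The  quick brown fox"]))
```

In a dataframe pipeline, the first function becomes the body of a chunking UDF followed by an explode, and the second becomes a hash column plus a distinct-style filter.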
A Concrete Code Example: RAG Ingestion with Daft to Qdrant
Let's look at a simplified, yet representative, pipeline for ingesting documents into a vector database like Qdrant using Daft. This pipeline reads Parquet files, chunks text, generates embeddings, and writes them to Qdrant.
import functools
import uuid

import daft
import torch
from daft import col, udf
from qdrant_client import QdrantClient, models
from transformers import AutoModel, AutoTokenizer

# --- Configuration ---
PARQUET_PATH = "s3://my-enterprise-data/docs_raw/*.parquet"
QDRANT_HOST = "localhost"
QDRANT_PORT = 6333
QDRANT_COLLECTION = "enterprise_rag_chunks"
EMBEDDING_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # Example model
EMBEDDING_DIM = 384  # Dimension for all-MiniLM-L6-v2
CHUNK_SIZE = 256  # Tokens per chunk
CHUNK_OVERLAP = 50  # Tokens shared between consecutive chunks
UPSERT_BATCH_SIZE = 1024  # Points sent to Qdrant per request


@functools.lru_cache(maxsize=1)
def load_tokenizer():
    """Load the tokenizer once per worker process instead of once per batch."""
    return AutoTokenizer.from_pretrained(EMBEDDING_MODEL_NAME)


@functools.lru_cache(maxsize=1)
def load_model(device):
    """Load the embedding model once per worker process and move it to `device`."""
    return AutoModel.from_pretrained(EMBEDDING_MODEL_NAME).to(device).eval()


# --- UDF for Text Chunking ---
# Returns one list of chunks per input row, so the column type is list[string].
@udf(return_dtype=daft.DataType.list(daft.DataType.string()))
def chunk_text_udf(text_series):
    """Split each document into overlapping token windows. Runs on Daft workers."""
    tokenizer = load_tokenizer()
    step = CHUNK_SIZE - CHUNK_OVERLAP
    chunks_per_doc = []
    for text in text_series.to_pylist():
        if not text:
            chunks_per_doc.append([])  # Keep the output aligned with the input rows
            continue
        tokens = tokenizer.encode(text, add_special_tokens=False)
        chunks_per_doc.append(
            [tokenizer.decode(tokens[i : i + CHUNK_SIZE]) for i in range(0, len(tokens), step)]
        )
    return chunks_per_doc


# --- UDF for Embedding (GPU-accelerated) ---
# num_gpus=1 on the decorator asks Daft to run this UDF on a worker with a GPU.
@udf(return_dtype=daft.DataType.list(daft.DataType.float32()), num_gpus=1)
def embed_batch_udf(chunk_series):
    """Embed a batch of text chunks, returning one vector per input row."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = load_tokenizer()
    model = load_model(device)
    chunks = chunk_series.to_pylist()
    embeddings = []
    # Daft batches at a higher level; these mini-batches are sized for the model.
    for i in range(0, len(chunks), 32):
        encoded = tokenizer(
            chunks[i : i + 32], padding=True, truncation=True, return_tensors="pt"
        ).to(device)
        with torch.no_grad():
            output = model(**encoded)
        # Mean pooling over real tokens only: padding positions are masked out.
        mask = encoded["attention_mask"].unsqueeze(-1).float()
        summed = (output.last_hidden_state * mask).sum(dim=1)
        pooled = summed / mask.sum(dim=1).clamp(min=1e-9)
        embeddings.extend(pooled.cpu().tolist())
    return embeddings


# --- Daft Pipeline Definition ---
def build_rag_pipeline():
    # 1. Read raw documents from Parquet files.
    #    Daft handles reading from S3 directly and distributing it.
    df = daft.read_parquet(PARQUET_PATH)
    # 2. Drop documents with missing or very short text, using native expressions.
    df = df.where(col("text").not_null() & (col("text").str.length() > 50))
    # 3. Chunk the text. The UDF returns a list of chunks per document,
    #    so we explode to get one chunk per row.
    df = df.with_column("chunks", chunk_text_udf(col("text")))
    df = df.explode(col("chunks"))
    # 4. Drop empty or trivially short chunks.
    df = df.where(col("chunks").not_null() & (col("chunks").str.length() > 10))
    # 5. Generate embeddings; the GPU request is declared on the UDF decorator.
    df = df.with_column("embedding", embed_batch_udf(col("chunks")))
    # 6. Keep only the columns Qdrant needs.
    return df.select(col("doc_id"), col("chunks").alias("text_chunk"), col("embedding"))


# --- Main Execution ---
if __name__ == "__main__":
    pipeline_df = build_rag_pipeline()

    # Start from an empty collection (this deletes any existing data in it).
    qdrant_client = QdrantClient(host=QDRANT_HOST, port=QDRANT_PORT)
    if qdrant_client.collection_exists(QDRANT_COLLECTION):
        qdrant_client.delete_collection(QDRANT_COLLECTION)
    qdrant_client.create_collection(
        collection_name=QDRANT_COLLECTION,
        vectors_config=models.VectorParams(size=EMBEDDING_DIM, distance=models.Distance.COSINE),
    )

    print("Starting Daft pipeline execution and Qdrant ingestion...")
    # iter_rows() streams results as dicts, so the full result never sits in memory.
    # In a real production system, you might implement a custom Daft sink for Qdrant.
    points, total = [], 0
    for row in pipeline_df.iter_rows():
        points.append(
            models.PointStruct(
                # Qdrant IDs must be unsigned ints or UUIDs; derive a stable UUID per chunk.
                id=str(uuid.uuid5(uuid.NAMESPACE_URL, f"{row['doc_id']}:{row['text_chunk']}")),
                vector=[float(x) for x in row["embedding"]],
                payload={"doc_id": row["doc_id"], "text_chunk": row["text_chunk"]},
            )
        )
        if len(points) >= UPSERT_BATCH_SIZE:
            qdrant_client.upsert(collection_name=QDRANT_COLLECTION, wait=True, points=points)
            total += len(points)
            print(f"Ingested {total} points into Qdrant.")
            points = []
    if points:  # Flush the final partial batch
        qdrant_client.upsert(collection_name=QDRANT_COLLECTION, wait=True, points=points)
        total += len(points)
    print(f"Daft pipeline completed; {total} points ingested into Qdrant.")
This code snippet illustrates how Daft's API allows you to define a complex, distributed pipeline with GPU acceleration in a clear, declarative manner. The embed_batch_udf will run on your GPUs, managed by Daft. The chunk_text_udf will parallelize across CPU cores. The read_parquet operation will distribute data loading. This is how you build production RAG ingestion, not with a Pandas script that crashes after 10,000 documents.
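Two details of the ingestion loop are worth isolating, because they can be tested without Daft or a running Qdrant instance. Qdrant accepts only unsigned integers or UUIDs as point IDs, so an arbitrary string built from a document ID will be rejected; deriving a UUID from the document ID and chunk text gives IDs that are valid and stable across re-runs. Grouping rows into fixed-size batches keeps each upsert request bounded. The helper names below are hypothetical, not part of either library.

```python
import uuid
from itertools import islice
from typing import Any, Iterable, Iterator, List


def point_id(doc_id: str, chunk_text: str) -> str:
    """Deterministic UUID (version 5) derived from the document ID and chunk text."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{doc_id}:{chunk_text}"))


def batched(rows: Iterable[Any], batch_size: int) -> Iterator[List[Any]]:
    """Group an iterable of rows into lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


rows = [{"doc_id": "doc-1", "text_chunk": f"chunk {i}"} for i in range(5)]
for batch in batched(rows, batch_size=2):
    ids = [point_id(r["doc_id"], r["text_chunk"]) for r in batch]
    print(len(batch), ids[0])
```

Because the ID depends only on the content, re-ingesting the same chunk overwrites the existing point instead of creating a duplicate.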
Daft vs. Spark vs. Pandas for AI
Choosing the right data engine is critical. Here’s how we position Daft against its common counterparts:
- ▸Pandas: Use it for small datasets (< 1 million rows), local development, and quick exploratory analysis where memory fits comfortably on a single machine and GPUs aren't a concern. If your AI project is still a tiny POC with static, curated data, Pandas is fine. The moment you hit real-world data volumes or need to use a GPU, Pandas becomes AI debt.
- ▸Apache Spark: A mature, general-purpose big data processing engine. If you already have a massive Spark cluster and an existing data lake infrastructure, and your primary concern is large-scale ETL that isn't specifically AI-centric (e.g., complex SQL joins, batch processing of diverse formats), Spark is a valid choice. However, Spark's Python UDFs can be slow, its GPU integration is less native and often requires more boilerplate, and its overhead for highly iterative AI workflows can be significant. For new, AI-native builds focused on performance and multimodal data, Spark can be overkill or underpowered in specific areas.
- ▸Daft: The clear choice for AI-native workloads and new builds where performance, GPU utilization, and multimodal data are paramount.
  - ▸When you need to process terabytes or petabytes of data for RAG, multimodal models, or large-scale batch inference.
  - ▸When you need to efficiently utilize GPUs for embedding generation, feature extraction, or inference, integrating them directly into your data pipeline.
  - ▸When you are dealing with images, audio, or video as first-class citizens in your data pipeline.
  - ▸When you want a unified API that scales from your laptop to a distributed cluster without code changes.
  - ▸When you are building production-grade AI systems and cannot afford memory errors, slow iteration, or manual cluster management.
The Production Reality: What Breaks Without a Real Data Engine
Without a proper data engine like Daft, your AI projects are doomed to stay in pilot purgatory. We've seen it repeatedly:
- ▸Memory Errors as a Feature, Not a Bug: Your data scientists spend more time debugging OutOfMemoryError messages than building models. Every time the input data size changes, or a new feature is added, the pipeline breaks. This isn't engineering; it's whack-a-mole.
- ▸Reprocessing from Scratch: No lazy evaluation, no optimized execution plans, no checkpointing. If your Pandas script crashes halfway through a 10-hour run, you restart from the beginning. This wastes compute, developer time, and delays product launches. It's the hallmark of unmonitored agents and demo-quality RAG.
- ▸No Checkpointing or Fault Tolerance: Production systems need to be resilient. If a worker fails, the job should recover. Pandas has no concept of this. Your entire pipeline is a single point of failure. Daft, by contrast, with its distributed execution, offers fault tolerance and the ability to resume from partial failures, which is non-negotiable for enterprise deployments.
- ▸Unnecessary Compute Costs: Idle GPUs, inefficient CPU utilization, and constant reprocessing add up. What looks like a small script on your laptop can balloon into an expensive, inefficient cloud bill when scaled. This is the financial cost of AI debt.
Verel Systems exists to rescue clients from this reality. We take AI from spaghetti to production. That means implementing robust data pipelines that can handle the scale and complexity of real-world AI.
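To make the "restart from the beginning" problem concrete, here is a minimal sketch of batch-level checkpointing. It is a generic pattern shown for illustration, not a description of how Daft implements recovery internally; the function name and the JSON manifest format are assumptions of ours.

```python
import json
from pathlib import Path
from typing import Callable, List, Sequence


def run_with_checkpoints(
    batches: Sequence[List[str]],
    process: Callable[[List[str]], None],
    manifest_path: Path,
) -> List[int]:
    """Process batches in order, skipping those recorded as done in the manifest.

    Returns the indices of the batches processed during this call.
    """
    done = set(json.loads(manifest_path.read_text())) if manifest_path.exists() else set()
    processed_now: List[int] = []
    for index, batch in enumerate(batches):
        if index in done:
            continue  # finished in an earlier run
        process(batch)  # if this raises, the manifest still lists every finished batch
        done.add(index)
        manifest_path.write_text(json.dumps(sorted(done)))
        processed_now.append(index)
    return processed_now
```

A job that crashes on batch 7 of 10 can then be re-run and will pick up at batch 7. A production version would also write the manifest atomically and make each batch idempotent, which this sketch leaves out.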
When Daft is the Right Call vs. Overkill
Daft is not always the answer. If you're building a proof-of-concept for a few hundred text documents, or your data fits comfortably in memory and doesn't require GPUs, Pandas is perfectly adequate. It's fast to get started with, and the ecosystem is mature.
However, Daft becomes the right call the moment you encounter any of these:
- ▸Your data volume exceeds single-machine RAM (tens of GBs or more).
- ▸You need to perform batch inference or feature extraction on GPUs.
- ▸You are working with multimodal data (images, video, audio) at scale.
- ▸Your team is spending significant time debugging memory errors or optimizing Python loops for performance.
- ▸You need to move an AI pipeline from local development to a distributed production environment without rewriting core logic.
- ▸You're building a critical RAG system, a multimodal search engine, or any AI application where data ingestion and processing performance directly impact user experience or business outcomes.
Don't let your AI projects accumulate debt. The choice of your data engine is a foundational decision that determines whether your AI initiative will thrive in production or die in pilot purgatory. For modern AI workloads, Daft provides the performance, scalability, and developer experience that Pandas simply cannot.
