Five lessons from building a video, audio, and document intelligence platform that survived contact with production. None of them are about which model to pick.
We've spent the last few years building production multimodal AI for customers in markets where being wrong is expensive: sports and entertainment intelligence, regulated industries, enterprise platforms with real users and real downstream decisions. The work has taught us a fairly specific set of things — most of them not what we expected to learn.
A short list of lessons we keep underlining for ourselves, and for the engineering teams we work with.
1. The model is rarely the bottleneck
This is the lesson we wish someone had handed us in writing on day one.
Every multimodal AI project we've seen — including the ones we've been hired to rescue — over-indexed on the model layer at the start. Which model. Which provider. Which fine-tuning approach. Which context window. We had the same instinct early on. It was wrong.
The model is the most replaceable part of the system. It is, at any given moment, the part that's improving fastest. If your platform survives swapping out the model six months from now, you've probably built it correctly. If swapping the model would require rebuilding the system, you've built the wrong thing.
The bottleneck is almost always the pipeline around the model. Ingestion that breaks the first time a source schema drifts. Validation that wasn't designed in from the beginning and has to be retrofitted at scale. Vector indexes that work fine for the first ten thousand documents and fall over at a million. Evaluation infrastructure that doesn't exist, so nobody knows whether the system is getting better or worse week-over-week.
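To make the "validation designed in from the beginning" point concrete, here is a minimal sketch of what that looks like at the ingestion boundary, in Python with pydantic as one option among many. The record fields and the quarantine mechanism are illustrative assumptions, not a prescription; the point is that schema drift should fail loudly at the door instead of silently downstream.

```python
# Minimal sketch: validate every record at the ingestion boundary.
# Field names and the quarantine list are illustrative.
from datetime import datetime
from pydantic import BaseModel, ValidationError


class IngestRecord(BaseModel):
    """Contract for one incoming asset. Unknown fields are rejected."""
    model_config = {"extra": "forbid"}  # schema drift -> hard failure

    asset_id: str
    modality: str      # e.g. "video", "audio", "document"
    source_uri: str
    captured_at: datetime


def ingest(raw: dict, quarantine: list[dict]) -> IngestRecord | None:
    try:
        return IngestRecord(**raw)
    except ValidationError as err:
        # Quarantine rather than drop: drifted records are evidence
        # of an upstream schema change you need to know about.
        quarantine.append({"raw": raw, "errors": err.errors()})
        return None
```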
The first six months of any multimodal AI project should over-invest in everything that isn't the model. By the time you have ingestion, validation, indexing, and evaluation working, the model decision is going to be the easy part, and the model you choose will probably be different from the one you would have picked at the start anyway.
2. RAG without monitoring is a demo
Retrieval-augmented generation is one of the most powerful patterns in production AI. It is also one of the most quietly fragile. The failure mode isn't catastrophic; it's gradual.
A RAG pipeline ships. The retrieval works. The model generates plausible answers grounded in retrieved context. Stakeholders are happy. The pipeline goes into production.
Six weeks later, content has drifted. New documents are in the corpus. Old documents have been updated. The embedding distribution has shifted. The retrieval that was working at launch is now returning subtly worse context, and the model — being faithful to the context it's given — is generating subtly worse answers. Nobody notices because the answers still sound right.
This is the failure mode that breaks more multimodal AI deployments than any model regression we've seen.
The fix is unglamorous and architectural: continuous retrieval evaluation, drift detection on embeddings, sample-based human review of generated outputs, and structured feedback loops that flow evaluation results back into ranking and re-indexing decisions. This is the scaffolding that determines whether a RAG system survives year two. Most teams under-build it because it's not the part that demos well. The teams whose systems are still in production a year later built it from the start.
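To make "drift detection on embeddings" slightly more concrete, here is one minimal version of the idea, sketched in Python with numpy. The snapshot-and-compare approach and the threshold are illustrative assumptions; a real system would pair this with retrieval-quality checks against a golden query set, and would alert rather than return a value.

```python
# Minimal sketch of embedding drift detection: compare recently indexed
# embeddings against a frozen baseline snapshot taken at launch.
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def embedding_drift(baseline: np.ndarray, recent: np.ndarray,
                    threshold: float = 0.98) -> dict:
    """baseline, recent: (n, d) arrays of document embeddings."""
    centroid_sim = cosine(baseline.mean(axis=0), recent.mean(axis=0))
    return {
        "centroid_similarity": centroid_sim,
        "drifted": centroid_sim < threshold,  # crude but catches shifts
    }
```

Centroid similarity is a blunt instrument, which is fine: its job is to tell you when to look, and the sample-based human review tells you what you're looking at.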
3. CV alone is brittle. CV plus an LLM is not.
For visual workloads in production — anything where the system has to look at images or video and produce structured outputs — pure computer vision pipelines are surprisingly fragile in the wild.
CV models are good at the thing they were trained on, but the real world is full of edge cases that the training distribution didn't cover. Lighting that's different from the training set. Occlusion patterns the labeler didn't anticipate. Compositions that require contextual reasoning the model doesn't have. Each individual edge case is rare; the aggregate of edge cases is the long tail that determines whether the system is trusted.
The pattern that has worked consistently for us: CV does the localization, an LLM does the validation.
CV identifies candidate regions, candidate frames, candidate detections. An LLM with appropriate context — what the system is looking for, what counts as a valid detection, what the edge cases are — validates each candidate before it becomes a structured output. The CV layer keeps the volume manageable; the LLM layer keeps the precision high.
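In skeleton form the pattern looks something like the following Python, where `Candidate` and `ask_llm` are placeholders for whatever detector output and LLM client a given stack uses, and the prompt is a toy version of the context a real validation step would carry.

```python
# Skeleton of the CV-localizes / LLM-validates pattern. ask_llm is a
# placeholder for a real LLM client; all names here are illustrative.
from dataclasses import dataclass


@dataclass
class Candidate:
    frame_id: int
    bbox: tuple[int, int, int, int]
    label: str
    score: float


def validate(candidates: list[Candidate], ask_llm, criteria: str,
             min_cv_score: float = 0.3) -> list[Candidate]:
    accepted = []
    for cand in candidates:
        if cand.score < min_cv_score:
            continue  # the CV layer keeps the volume manageable
        # The LLM layer keeps the precision high: it sees the task
        # definition and edge cases, and returns a yes/no judgment.
        verdict = ask_llm(
            f"Task: {criteria}\n"
            f"Candidate: label={cand.label}, score={cand.score:.2f}.\n"
            "Does this detection satisfy the task? Answer ACCEPT or REJECT."
        )
        if verdict.strip().upper().startswith("ACCEPT"):
            accepted.append(cand)
    return accepted
```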
This is how we got 90%+ reductions in manual review on production visual pipelines. Not by training a better CV model. By accepting that CV models have a precision ceiling and putting a reasoning layer downstream of them.
4. The data architecture decision is the architecture decision
Most multimodal AI platforms we've been hired to fix share the same root cause: a data architecture that accreted as the platform grew instead of being designed up front.
The decision that matters more than any other is: where does your raw data live, where does your processed data live, where does your indexed data live, and what are the contracts between those layers? If those contracts are explicit and stable, the rest of the platform composes cleanly. If those contracts were never made explicit — if the indexed embeddings were generated from a query against the warehouse that nobody documented, and now nobody knows what subset of the data is actually represented in the vector store — the platform is one outage away from being unrecoverable.
We deploy variations of the same data architecture for nearly every multimodal AI engagement. Raw multimodal data in object storage. Structured derivations in an analytical warehouse. Embeddings and indexes in a vector store. Streaming layer between sources and warehouse for the data that needs to be real-time. Orchestration layer that knows about all of them. Transformation discipline (dbt or equivalent) between them, version-controlled, tested, with lineage maintained from source to consumed table.
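One way to make a layer contract explicit and checkable, sketched in Python with invented table and index names: assert that every vector in the index traces back to a row in the warehouse table it was derived from, and surface both directions of mismatch.

```python
# Minimal sketch of a contract check between the warehouse layer and the
# vector store layer. The fetch_* callables and all names are hypothetical.
def check_index_contract(fetch_warehouse_ids, fetch_index_ids) -> None:
    warehouse_ids = set(fetch_warehouse_ids("processed.documents"))
    indexed_ids = set(fetch_index_ids("documents_v2"))

    orphaned = indexed_ids - warehouse_ids   # indexed, but no source row
    unindexed = warehouse_ids - indexed_ids  # in warehouse, never embedded

    if orphaned:
        raise RuntimeError(
            f"{len(orphaned)} vectors have no source row; lineage is broken"
        )
    if unindexed:
        print(f"warning: {len(unindexed)} documents are not yet indexed")
```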
The specifics will vary. The discipline of making the contracts between layers explicit will not.
5. Agentic systems are operational systems
The agentic AI patterns most engineering teams are trying to ship — multi-step tool use, planning, autonomous execution against external systems — are not really model problems. They're operational systems problems wearing the costume of model problems.
Once an LLM is calling tools, executing actions, and orchestrating work across multiple steps, the system you're building is closer to a distributed-systems problem than to anything that came out of the NLP literature. You need observability across every step. You need the ability to replay agent traces against new model versions to understand whether behavior has changed. You need clear escalation paths to humans for any decision the agent isn't authorized to make autonomously. You need audit trails that hold up under enterprise compliance scrutiny. You need evaluation infrastructure that knows the difference between an agent succeeding for the right reasons and an agent succeeding by accident.
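To give the trace-and-replay piece of that list a concrete shape, here is a minimal sketch in Python. The `Step` structure and function names are invented for illustration; the idea is that every tool call is recorded as structured data, so a run against a new model version can be diffed against the old one.

```python
# Minimal sketch of agent trace recording and replay comparison. The
# step log is the unit of observability, audit, and regression review.
import time
from dataclasses import dataclass


@dataclass
class Step:
    tool: str
    args: dict
    result: str
    model_version: str
    timestamp: float


def record(trace: list[Step], tool: str, args: dict, result: str,
           model_version: str) -> None:
    trace.append(Step(tool, args, result, model_version, time.time()))


def replay_diff(old: list[Step], new: list[Step]) -> list[str]:
    """Diff the action sequences of two runs on the same task input.
    A changed tool call is a behavior change to review, not necessarily
    a failure."""
    diffs = [
        f"step {i}: {o.tool}({o.args}) -> {n.tool}({n.args})"
        for i, (o, n) in enumerate(zip(old, new))
        if (o.tool, o.args) != (n.tool, n.args)
    ]
    if len(old) != len(new):
        diffs.append(f"trace length changed: {len(old)} -> {len(new)}")
    return diffs
```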
None of that is in the prompt. None of it is in the model. All of it is the engineering work that has to surround the model for the system to be production-grade.
The teams shipping agentic systems that work are the teams treating them as operational systems first and AI systems second. The teams treating them as AI systems first are mostly still in pilot.
The thread running through all five of these is the same: production multimodal AI is mostly not about AI. The AI part is real and important and improving fast — but the work that determines whether a multimodal platform is still alive a year after launch is data engineering, evaluation discipline, and operational systems thinking. The teams that win in this market are the teams that internalize that early.
We write more about the systems work behind production AI deployments at dhlabs.ai. If you're building a multimodal AI platform and the lessons above sound familiar, get in touch — we've shipped enough of them to know quickly whether what we'd do matches what you need.
DeHaze Labs builds production AI and data platforms for the physical economy.