From perception to reasoning over perception

Computer vision and large language models used to be different conversations. In production, they're becoming the same architecture — and the consequences for how teams build perception systems are larger than most teams have noticed.

For most of the last decade, "computer vision" and "language models" were different engineering disciplines that mostly didn't talk to each other.
CV teams built models that turned pixels into structured outputs — bounding boxes, classifications, tracks, segmentation masks. The output was a structured representation of what was in an image or a video. That output got handed downstream to a rules engine, a database, or a human operator. The CV team's job ended at the structured output; what happened to that output afterward was someone else's problem.
Language model teams worked on a different stack. Text in, text out. RAG, agents, summarization, structured extraction. The work happened in a world where the input was already symbolic — already in a form a model could reason about — and the question was how to extract value from it.
These two worlds are merging. Not because anyone planned the merger, but because production systems keep ending up at the same architecture independently. We've been watching this convergence happen across our own engagements — in brand intelligence pipelines, in smart-city perception platforms, in industrial sensor systems, in operational telemetry — and the pattern is consistent enough that we want to write it down.
The pattern is: perception produces candidates; reasoning validates them.
And given this pattern, the consequences for how teams should build perception systems are larger than most have noticed.

What "perception produces candidates, reasoning validates them" means in practice

Take any production system that has to look at the physical world and produce decisions.
A brand intelligence system has to identify when a brand appears in a video — visually as a logo, verbally as a sponsorship mention. A traffic enforcement system has to identify when a vehicle commits a violation in a busy intersection. An industrial inspection system has to identify when a piece of equipment is showing a fault pattern. A logistics system has to identify when a container is loaded incorrectly.
In each of these, a CV pipeline can identify candidates with high recall — patches of pixels that might be a logo, vehicles that might have committed a violation, equipment readings that might indicate a fault, container configurations that might be wrong.
The hard part is precision. Each candidate has to be evaluated in context. Was that patch of pixels actually a logo, or just a similar-looking shape? Did that vehicle actually commit the violation, or is the camera angle deceiving the bounding-box detector? Is that fault pattern a real fault, or a known sensor artifact? Is that container configuration actually wrong, or an unusual but acceptable load?
These context-dependent questions are where pure CV pipelines tend to plateau. A CV model can be excellent at the visual task and still produce too many false positives — not because the model is bad, but because the world has too many edge cases for any visual classifier to handle alone.
The pattern that works is to put a reasoning layer downstream of the perception layer. The reasoning layer reads the candidate, the surrounding context, and the relevant taxonomy or knowledge base, and decides whether the candidate is real. CV does the localization; the LLM does the validation.
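To make the shape concrete, here's a minimal sketch in Python. Everything in it is illustrative rather than any particular framework's API: `detect_candidates` stands in for a trained detector run at a deliberately low threshold, and `llm` is whatever completion client you already have.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Candidate:
    label: str          # what the detector thinks it saw
    confidence: float   # detector score, tuned for recall, not precision
    context: str        # surrounding context the reasoning layer will need

def detect_candidates(frame) -> list[Candidate]:
    # Stands in for a trained CV model run at a low threshold.
    # Hardcoded here so the sketch runs as-is.
    return [
        Candidate("logo", 0.41, "lower-third overlay during a replay"),
        Candidate("logo", 0.38, "pattern on a spectator's shirt"),
    ]

def validate(candidate: Candidate, llm: Callable[[str], str]) -> bool:
    # Reasoning layer: evaluate the candidate in context and decide
    # whether it is real. The prompt is deliberately simplistic.
    prompt = (
        f"A detector flagged a possible {candidate.label} "
        f"(score {candidate.confidence:.2f}). "
        f"Context: {candidate.context}. "
        "Is this a genuine brand appearance? Answer YES or NO."
    )
    return llm(prompt).strip().upper().startswith("YES")

def process(frame, llm) -> list[Candidate]:
    # CV localizes; the LLM validates; only validated candidates survive.
    return [c for c in detect_candidates(frame) if validate(c, llm)]
```

In production the validator sees far richer context than a one-line string; what that richer context should contain is exactly what the next section is about.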
This isn't a theoretical claim. We've shipped this pattern across multiple engagements, and the same architectural shape produces the same kind of result every time: meaningful precision improvements without compromising the recall that makes the perception layer valuable in the first place.

What this means for how to build perception systems

If perception is going to feed a reasoning layer, the requirements on the perception system change.
The output schema needs to be designed for downstream consumption. Most CV systems we've seen were built to output bounding boxes and class labels. Once a reasoning layer is in the picture, the output needs more — confidence scores, candidate context (the surrounding region, the temporal neighborhood, the spatial relationships), and metadata that the reasoning layer can use to decide. The CV team that designs the output schema for downstream reasoning ships a more useful system than the CV team that ships bounding boxes alone.
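As a sketch of what "more than bounding boxes" can mean, here's one possible output record. The field names are ours, invented for illustration, not a standard; the point is that the reasoning layer's questions should shape the schema.

```python
from dataclasses import dataclass, field

@dataclass
class PerceptionRecord:
    # The classic CV output...
    label: str
    bbox: tuple[int, int, int, int]   # (x, y, w, h) in pixels
    confidence: float
    # ...plus what a downstream reasoning layer actually needs.
    frame_id: int
    crop_uri: str                     # the candidate region itself
    context_crop_uri: str             # a wider region around the candidate
    neighbor_frame_ids: list[int] = field(default_factory=list)      # temporal neighborhood
    spatial_relations: dict[str, str] = field(default_factory=dict)  # e.g. {"overlaps": "track_17"}
    detector_version: str = ""        # provenance the reasoner can weigh
```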
Recall becomes more important than precision at the perception layer. When the perception layer is the final word, precision and recall both matter. When the perception layer feeds a reasoning layer, the right tradeoff shifts: catch everything that might be relevant (recall), and let the reasoning layer prune the false positives (precision). This changes how you train, how you validate, and how you set thresholds. Most production CV systems we look at are still tuned as if they were the final word.
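One concrete consequence: detection thresholds get set against a recall target instead of an F1 optimum. A minimal sketch, assuming you have detector scores and ground-truth labels for a held-out validation set:

```python
import numpy as np

def recall_first_threshold(scores, labels, target_recall=0.98):
    """Return the highest threshold that still keeps `target_recall`
    of the true positives. Precision is deliberately not optimized
    here; pruning false positives is the reasoning layer's job."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    positive_scores = np.sort(scores[labels == 1])
    if positive_scores.size == 0:
        return 0.0
    # Drop at most (1 - target_recall) of the positives off the bottom.
    k = int(np.floor((1.0 - target_recall) * positive_scores.size))
    return float(positive_scores[k])
```

Keeping everything with a score at or above that threshold guarantees the recall target on the validation set. What it costs in precision shows up as load on the reasoning layer, and that load, not the detector's standalone precision, is the tradeoff worth tuning.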
The boundary between CV and reasoning becomes a real engineering interface. When CV and LLM teams worked separately, the interface was a database table — bounding boxes go in, decisions come out. When CV and LLM are part of the same system, the interface is an active conversation. The reasoning layer needs to query the perception layer ("show me the surrounding frames"), the perception layer needs to expose richer outputs to the reasoning layer ("here's the context you asked for"), and both need to share an evaluation framework so improvements at one layer don't degrade the other. This is a real engineering effort, and it doesn't happen by accident.
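Here's what that "active conversation" can look like once it's written down, sketched as a Python Protocol. The method names are illustrative, not a real API; the substance is that the reasoning layer can ask for context instead of receiving a fixed dump.

```python
from typing import Protocol

class PerceptionService(Protocol):
    """The contract the reasoning layer programs against."""

    def candidates(self, stream_id: str, since_ts: float) -> list[dict]:
        """High-recall candidates, with confidence and provenance."""
        ...

    def surrounding_frames(self, candidate_id: str,
                           before: int = 5, after: int = 5) -> list[bytes]:
        """'Show me the surrounding frames': temporal context on demand."""
        ...

    def rerun(self, candidate_id: str, detail: str = "high") -> dict:
        """Re-run perception on a region at higher fidelity when the
        reasoner needs a closer look."""
        ...
```

A shared evaluation framework then runs end to end against this interface, so a change in the perception layer is scored by its effect on final validated outputs, not just on detector metrics.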

The same pattern, different verticals

The interesting thing is how cleanly this pattern generalizes across verticals once you start looking for it.
In multimodal content intelligence, the perception layer detects logos in video frames or extracts brand mentions from transcripts. The reasoning layer validates against a brand asset library, distinguishes sponsorships from casual mentions, and produces structured outputs.
In smart city perception, the perception layer detects vehicles, pedestrians, and infrastructure events in real-world video. A reasoning layer can contextualize the detections, explain anomalies, route exception cases for human review, and reduce the operational burden on the people consuming the system's outputs.
In industrial inspection, the perception layer detects fault patterns in sensor or vision data. A reasoning layer reasons over the historical context, the operational state, and the known artifact patterns, distinguishing real faults from sensor noise.
In operational telemetry (the kind of work EdgeTelemetry does), the "perception" layer is a unified telemetry stream rather than CV. But the same pattern applies: the substrate produces signal, the reasoning layer validates and acts on it.
In every case, the architectural shape is the same: a perception or detection layer that produces high-recall candidates, a reasoning layer that prunes them with context, and an interface between the two that's been thought through as a first-class engineering concern.

Where this is going

We think this convergence describes most of what "real" production AI is going to look like for the next several years.
The pure-CV systems will continue to exist for narrow tasks where context isn't required. The pure-LLM systems will continue to exist for purely textual workloads. But the high-value production AI systems — the ones that have to operate in physical-world contexts, in domains where false positives are expensive and edge cases dominate — are going to converge on the perception-plus-reasoning pattern. They already are.
The teams that recognize this convergence early have a real advantage. They build their perception systems with reasoning in mind. They build their reasoning systems with perception inputs in mind. They build the interface between them as an explicit engineering surface, not an accidental one. They evaluate the system as a system, not as two separately optimized parts. And they ship platforms that get more valuable as both layers improve, instead of platforms that have to be rebuilt every time a new model lands.
For us, this is a lot of what we do across our practice. The Edge & Perception work and the Multimodal work used to be separate conversations with different customers; they aren't anymore. The customers who've shipped serious production AI in the last year understand that perception and reasoning are different stages of the same pipeline, and they're hiring for it accordingly.
If your team is building perception systems and the reasoning layer hasn't entered the conversation yet, it will. Better to design for it now than to retrofit it later.

DeHaze Labs builds production AI and data platforms for the physical economy — including the perception, reasoning, and integration layers behind systems that have to operate in the real world. Get in touch at hello@dhlabs.ai.