Industrial data pipelines, built to carry AI

Why we keep telling industrial customers the same thing: the data foundation is the project. Everything else is the consequence.

The pattern goes like this.

A manufacturing or industrial operator decides they want to do something with AI. Maybe it's predictive maintenance. Maybe it's supply chain optimization. Maybe it's a copilot for plant operators or supplier risk reasoning over long-form contracts. They've read the same articles everyone else has. They have a sense of what's possible. They've talked to vendors with impressive demos. Now they want to put a project in motion.

We get the call. The conversation almost always starts with the AI use case. We almost always end up talking about data engineering.

Not because we're trying to upsell into a bigger engagement. Because the AI use case they want is, in nearly every case, blocked by the data substrate underneath it. The ERP doesn't talk to the warehouse management system. The transportation management system updates nightly when the operations team needs hourly. Sensor and IoT feeds live in their own world, structured by the OEM that shipped the equipment, with a schema that doesn't connect to anything else the company runs. The supplier integrations are a patchwork of EDI files, vendor portals, and one spreadsheet someone updates by hand on Tuesdays.

Until that substrate is unified, validated, and queryable in real time, the AI project is going to be a pilot forever. Not because the model can't do the thing — because the data to feed the model can't be assembled reliably enough to put any output into production.

This post is about one of those engagements. We can't name the customer — they operate in a competitive vertical and the engagement is under standard confidentiality — but the architecture pattern is portable, and most industrial operators we talk to recognize their own situation in it.

What was actually broken

Before we built anything, the customer's operational data lived in roughly six places that didn't agree with each other.

The ERP held authoritative master data — what materials they made, who their customers were, what shipped when. The warehouse management system held inventory state. The transportation management system held shipment status, but updated on its own cadence. Several supplier integrations brought in data from outside the company in formats negotiated bilaterally with each supplier. Sensor feeds from production lines came through a historian system originally installed for compliance, not analytics. And there was a layer of operational decisions captured in spreadsheets and emails that never made it into any system at all.

Reporting was a multi-day exercise involving multiple analysts, a lot of manual reconciliation, and a frankly heroic level of institutional memory about which system to trust for which question. Anything that needed real-time visibility — line-level performance, in-flight shipment status, supplier delivery health — was either approximated from yesterday's batch data or asked of a human who walked over and looked.

In this state, every AI use case the customer wanted was blocked. Not because the use cases were unrealistic, but because the data to support them couldn't be assembled fast enough or trusted enough.

What we built

We integrated four things into a unified real-time data platform: Snowflake as the analytical core, Snowstreams (the customer was already invested in the Snowflake ecosystem) for real-time ingestion, Airflow for orchestration, and dbt for transformation discipline.

Snowflake as the analytical core. A centralized analytical warehouse, modeled to support both operational queries and downstream analytics workloads. The schema was designed for queries that didn't exist yet — because we knew the operational team would discover questions to ask once the data was actually queryable, and we wanted the model to handle that gracefully rather than require a rebuild.

Real-time streams via Snowstreams. This was the capability that turned the platform from "data warehouse" into "operational data layer." Operational data started flowing in as it happened, not in nightly batches. CDC from the source systems where it was supported, change-event streams where the source system could push, and polling on the systems too old to do better. Each source had its own ingestion contract — speed, reliability, schema — and we documented those contracts explicitly so the team knew what to expect.
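To make "ingestion contract" concrete, here is a minimal sketch in Python of what one might look like as a checkable object. It is illustrative only, not the customer's implementation: the class names, the fields, and the example `TMS_CONTRACT` values are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class IngestionMode(Enum):
    CDC = "cdc"
    STREAM_EVENTS = "stream_events"
    SCHEDULED_POLLING = "scheduled_polling"


@dataclass(frozen=True)
class IngestionContract:
    """Hypothetical per-source contract: speed, reliability, schema."""
    source: str
    mode: IngestionMode
    max_staleness: timedelta         # speed: how old the newest record may be
    delivery: str                    # reliability: e.g. "at_least_once"
    required_fields: frozenset       # schema: fields every record must carry

    def violations(self, record: dict, last_arrival: datetime,
                   now: datetime) -> list:
        """Return human-readable contract violations (empty list if none)."""
        problems = []
        missing = sorted(self.required_fields - set(record))
        if missing:
            problems.append(f"missing fields: {missing}")
        if now - last_arrival > self.max_staleness:
            problems.append(f"stale: last arrival {now - last_arrival} ago")
        return problems


# Example values are assumptions, not figures from the engagement.
TMS_CONTRACT = IngestionContract(
    source="tms",
    mode=IngestionMode.SCHEDULED_POLLING,
    max_staleness=timedelta(hours=1),
    delivery="at_least_once",
    required_fields=frozenset({"shipment_id", "status"}),
)
```

The point of writing the contract down as data is that a monitoring job can evaluate it on every batch, so "what to expect" from a source is something the platform checks rather than something people remember.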

Airflow for orchestration. Workflow orchestration across ingestion, transformation, and downstream pipelines. Reliable scheduling, dependency management, and observability across the entire data lifecycle. The unsexy plumbing that determines whether the platform survives the first time a source system has an outage at 2 a.m. on a Saturday.
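As a rough illustration of the two behaviors that matter at 2 a.m., dependency ordering and retries, here is a small standard-library Python sketch. It deliberately does not use Airflow's API; `run_pipeline`, its parameters, and the task names in the usage note are hypothetical stand-ins for what an orchestrator provides.

```python
from graphlib import TopologicalSorter
from typing import Callable


def run_pipeline(tasks: dict[str, Callable[[], None]],
                 depends_on: dict[str, set[str]],
                 max_attempts: int = 3) -> list[str]:
    """Run tasks in dependency order, retrying each up to max_attempts times.

    depends_on maps a task name to the set of tasks that must finish first.
    Returns the names of completed tasks in the order they ran.
    """
    completed = []
    # static_order() yields every task after all of its predecessors.
    for name in TopologicalSorter(depends_on).static_order():
        for attempt in range(1, max_attempts + 1):
            try:
                tasks[name]()
                break
            except Exception:
                # Give up only after the final attempt; downstream tasks
                # never run on top of a failed upstream task.
                if attempt == max_attempts:
                    raise
        completed.append(name)
    return completed
```

Called with tasks `ingest`, `transform`, and `publish`, where `transform` depends on `ingest` and `publish` depends on `transform`, a transient failure in `ingest` is retried before anything downstream starts. A real orchestrator adds scheduling, backoff, alerting, and run history on top of this core.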

dbt for transformation discipline. dbt as the transformation layer between raw ingestion and the analytical schema. Modular SQL transformations, tested and version-controlled, with clear lineage from source to consumed table. The discipline that prevents the data platform from devolving into illegible spaghetti as scope grows — and scope always grows.
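In dbt itself, tests are declared against models and compiled to SQL. To show the idea without a warehouse, the Python sketch below mimics what dbt's built-in `not_null` and `unique` tests assert; the function names and the row-as-dict representation are assumptions for illustration.

```python
from collections import Counter


def not_null_failures(rows: list, column: str) -> list:
    """Rows where `column` is missing or None (what a not_null test flags)."""
    return [row for row in rows if row.get(column) is None]


def unique_failures(rows: list, column: str) -> list:
    """Non-null values of `column` that occur more than once.

    Nulls are skipped here; catching them is the job of the not_null check.
    """
    counts = Counter(
        row[column] for row in rows if row.get(column) is not None
    )
    return sorted(value for value, n in counts.items() if n > 1)
```

A transformation is considered healthy when both functions return empty lists for its key columns. Running checks like these on every build is what keeps a growing set of models trustworthy.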

What changed for the operations team

The headline outcome wasn't the AI use case. It was the operations team.

Before the platform, the operations team spent a meaningful fraction of their week assembling pictures of what was happening: pulling reports from three systems, reconciling against the spreadsheet someone had updated on Tuesday, calling the warehouse to verify what shipped against what was supposed to ship. The platform didn't eliminate that work. It collapsed it. The picture they used to assemble manually was now a query against the unified data model. The reconciliation work that used to take a day now took minutes.
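The shipped-versus-expected check is a good example of reconciliation becoming a small, repeatable computation once both sides live in one model. A minimal Python sketch, assuming each side has already been reduced to order ID and quantity (the function name and the data shapes are hypothetical):

```python
def reconcile(expected: dict[str, int],
              shipped: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Return order_id -> (expected_qty, shipped_qty) for every mismatch.

    An order present on only one side counts as quantity 0 on the other.
    """
    mismatches = {}
    for order_id in sorted(expected.keys() | shipped.keys()):
        expected_qty = expected.get(order_id, 0)
        shipped_qty = shipped.get(order_id, 0)
        if expected_qty != shipped_qty:
            mismatches[order_id] = (expected_qty, shipped_qty)
    return mismatches
```

In practice this would be a join in the warehouse rather than Python, but the logic is the same: compare the two sides on a shared key and surface only the exceptions.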

Once that floor was in place, the AI use cases the customer originally called us about became scoping conversations rather than feasibility conversations. We no longer had to start each one with "first we need to fix the data." The data was fixed. The conversation could start with what to build with it.

This is the lesson we keep underlining for industrial customers, and we're going to keep underlining it because it keeps mattering: the customers who invest in unified data foundations get to do real AI six months later. The customers who skip this step run pilots forever.

The architecture pattern, in shorthand

If you're contemplating something like this, the shape we converge on for industrial customers looks roughly like:

SOURCES → erp · wms · tms · supplier_integrations · sensor_historian · operational_spreadsheets

REAL-TIME INGESTION → cdc · stream_events · scheduled_polling (each source with explicit reliability contract)

ANALYTICAL WAREHOUSE → snowflake (or equivalent) modeled for query flexibility

TRANSFORMATION LAYER → dbt · tested · version-controlled · clear_lineage

ORCHESTRATION → airflow · dependency-managed · observable

CONSUMED BY → operational_reporting · ai_workloads · real_time_decisions · supplier_analytics

Each layer is intentionally boring. The art isn't in any one component — it's in the discipline of treating the whole pipeline as a system, with explicit contracts at each boundary and lineage maintained from source to consumer. Most industrial data platforms we've seen fail not because any individual choice was wrong but because the boundaries between layers were never made explicit.
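One way to picture "lineage maintained from source to consumer" is as a small graph you can query. The Python sketch below assumes lineage is recorded as a mapping from each table to the tables it is built from, with no cycles; the function name and the table names in the usage note are made up for illustration.

```python
def upstream_sources(lineage: dict[str, set[str]], table: str) -> set[str]:
    """Return every root source that ultimately feeds `table`.

    lineage maps a table to the set of tables it is built from. A table
    with no recorded parents is treated as a root source. Assumes the
    lineage graph is acyclic.
    """
    parents = lineage.get(table, set())
    if not parents:
        return {table}
    roots = set()
    for parent in parents:
        roots |= upstream_sources(lineage, parent)
    return roots
```

If a hypothetical `supplier_health` table is built from `stg_deliveries` and `stg_orders`, which in turn come from `tms` and `erp`, the function returns `{"tms", "erp"}`. When a source has an outage, the same graph read in the other direction tells you which consumers to warn.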

Where this goes next

The natural next chapter for platforms like this one is the reasoning layer. A unified industrial data platform is the precondition for every interesting AI workload that comes after it: natural-language operational queries against the data model, supplier risk reasoning over long-form contracts, anomaly explanation across heterogeneous sensor and operational signals, operator copilots for plant managers and ops leads.

Frontier LLMs with appropriate tool use are well-suited to this — long-context reasoning over operational state, structured outputs that survive enterprise compliance scrutiny, exception handling that pure rule-based logic can't do well. The architectural discipline is the same as for any production AI system: model-agnostic interfaces, clear human-escalation paths, evaluation and monitoring at every step.
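As a sketch of what "model-agnostic interfaces" and "clear human-escalation paths" can mean in code, the Python below defines a minimal interface any model wrapper could satisfy, plus a guard that routes low-confidence answers to a person. Everything here is an assumption for illustration: the `Answer` shape, the idea that the wrapper reports a confidence score, and the 0.8 threshold.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Answer:
    text: str
    confidence: float  # assumed to be in [0.0, 1.0], reported by the wrapper


class ReasoningModel(Protocol):
    """Any model wrapper exposing this method can be swapped in."""

    def answer(self, question: str) -> Answer: ...


def answer_or_escalate(model: ReasoningModel, question: str,
                       min_confidence: float = 0.8) -> str:
    """Return the model's answer, or an escalation notice if unsure."""
    result = model.answer(question)
    if result.confidence < min_confidence:
        # Hypothetical escalation path: a real system would open a ticket
        # or notify an on-call operator instead of returning a string.
        return f"ESCALATED to human review: {question}"
    return result.text
```

Because callers depend only on `ReasoningModel`, changing vendors or model versions means writing a new wrapper, not touching the escalation logic or the evaluation harness around it.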

We'll write about the reasoning layer specifically once we have a deployment we can talk about in detail. The point worth making clearly today is the foundational one: AI on top of fragmented industrial data isn't going to work, no matter how good the model gets. The data foundation is the project. Everything else is the consequence.

If you're running an industrial operation and the AI projects you've been looking at keep stalling at the data layer, we'd like to talk. We've done this enough times to recognize the shape quickly, and we're frank about whether a given engagement is the right fit before any contract gets signed.

DeHaze Labs builds production AI and data platforms for the physical economy. Get in touch at hello@dhlabs.ai.