Why "ML for logistics" projects mostly fail at the workflow integration, not at the modeling — and what we built when a US logistics platform asked us to fix that.
There's a pattern we've watched play out enough times in the logistics space that we've stopped being surprised by it.
A logistics operator decides to invest in ML. They have lots of data — driver locations, container states, port and warehouse conditions, customer SLAs, historical shipment timing. They've talked to vendors with promising demos. The ML use case is obvious in principle: routing optimization, dispatch decisions, container unloading workflows, predictive ETAs. The data is there. The use case is clear. The team is bought in.
A year later, the project has produced a model with reasonable benchmark accuracy and zero operational impact.
The reason this happens isn't that the model was wrong. It's that the model never got connected to the operational decision flow in a way that actually changed what dispatchers, drivers, and warehouse leads did. The model produced predictions. The dispatchers ignored them, or used them inconsistently, or used them for the wrong decisions. The container unloading sequence kept running on the same logic it had before. The optimizer became a dashboard everyone admired and no one acted on.
This post is about an engagement where we got the call after exactly this pattern. The customer — a US logistics platform whose operations cover trucking, container moves, and port-warehouse coordination — had real data, a real ML team, and an unusable result. They needed someone to get the ML actually into the operational workflow.
We can't name the customer; the engagement was under standard confidentiality. But the architectural pattern we deployed there is portable enough that most logistics operators reading this will recognize their own situation in it.
The actual problem
The customer's problem statement, when we got the call, was the obvious one: we built a routing optimizer, it works on benchmark data, but it isn't producing operational impact.
The deeper problem, when we looked at it, was more interesting.
The routing optimizer was producing recommendations in a context the dispatcher couldn't trust. The recommendations were shown in a separate dashboard from the dispatcher's primary tool. The recommendations didn't show their reasoning, so a dispatcher who disagreed had no way to argue with them or refine them. The recommendations were updated on a cadence that didn't match the cadence of dispatcher decisions. The recommendations didn't account for soft constraints the dispatchers knew about (a difficult driver, a finicky customer, a yard with unreliable equipment) that were never going to be in the model's training data. So the dispatchers either rejected recommendations on instinct, or accepted recommendations on instinct, but in either case the model wasn't doing the work it was supposed to be doing.
This is the operational integration problem. It almost never gets diagnosed as the operational integration problem, because it manifests as "model accuracy" or "dispatcher adoption" or "user training." It's none of those. It's a question about how ML output gets shaped into a form that an operational decision-maker can actually use, and whether the workflow around the output supports that use.
What we built
We built three things. None of them were a better model.
A reasoning layer that shaped optimizer outputs into dispatcher-usable recommendations. This is the most consequential thing we built. The raw optimizer output was a ranking — possible routings, scored. The reasoning layer took that ranking, the dispatcher's current operational state, the soft constraints we'd encoded from dispatcher interviews, and any context we could pull about the specific drivers, customers, and yards involved. It produced a recommendation in the form a dispatcher actually thinks in: "Recommend driver X for shipment Y because of timing and yard fit; the optimizer's second choice was Z but here's why we'd flag it; here are the two soft constraints you should verify before accepting." This shape — a recommendation that exposes its reasoning and acknowledges its uncertainty — is what makes the difference between a model the dispatcher uses and a model the dispatcher overrides.
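To make that shape concrete, here's a minimal sketch of such a reasoning layer in Python. Everything in it is illustrative: `ScoredRouting`, `SoftConstraint`, the factor names, and the confidence floor are our stand-ins, not the customer's actual schema or thresholds.

```python
from dataclasses import dataclass

@dataclass
class ScoredRouting:
    """One optimizer candidate: a driver/shipment pairing with a score."""
    driver_id: str
    shipment_id: str
    score: float
    factors: dict[str, float]  # e.g. {"timing": 0.9, "yard_fit": 0.7}

@dataclass
class SoftConstraint:
    """A constraint encoded from dispatcher interviews, not training data."""
    entity_id: str  # the driver, customer, or yard it applies to
    note: str       # e.g. "yard 14 gate equipment unreliable after 6pm"

@dataclass
class Recommendation:
    """The dispatcher-facing shape: a pick, its reasoning, what to verify."""
    pick: ScoredRouting
    runner_up: ScoredRouting | None
    reasoning: list[str]
    verify_before_accepting: list[SoftConstraint]
    needs_human: bool

def shape_recommendation(
    ranking: list[ScoredRouting],
    constraints: list[SoftConstraint],
    confidence_floor: float = 0.6,  # hypothetical threshold
) -> Recommendation:
    """Turn a raw optimizer ranking into a recommendation a dispatcher can argue with."""
    pick = ranking[0]
    runner_up = ranking[1] if len(ranking) > 1 else None

    # Expose the reasoning: name the top factors driving the pick.
    top = sorted(pick.factors.items(), key=lambda kv: kv[1], reverse=True)[:2]
    reasoning = [f"{name} favors driver {pick.driver_id} ({w:.2f})" for name, w in top]
    if runner_up is not None:
        reasoning.append(
            f"second choice {runner_up.driver_id} scored {runner_up.score:.2f} "
            f"vs {pick.score:.2f}; flag if local knowledge favors it"
        )

    # Surface, rather than silently absorb, soft constraints touching this pick.
    touched = [c for c in constraints
               if c.entity_id in (pick.driver_id, pick.shipment_id)]

    # Acknowledge uncertainty: low-confidence picks route to a human.
    return Recommendation(
        pick=pick,
        runner_up=runner_up,
        reasoning=reasoning,
        verify_before_accepting=touched,
        needs_human=pick.score < confidence_floor,
    )
```

The `needs_human` flag is the design choice that matters most here: a recommendation that admits when it shouldn't be trusted is one the dispatcher can calibrate against, instead of writing off after the first bad call.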
Integration into the dispatcher's primary workflow. The recommendations had to live where the dispatcher already worked, on the cadence the dispatcher already operated on, with the data the dispatcher already trusted. Building the model into the existing dispatcher tool, rather than alongside it, was the second-most-consequential change. Dispatchers are not going to flip between two tools in a real-time operational context. They will use one tool. We made that tool the one with the ML inside it.
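What that looks like in code matters less than where the code runs, but a rough sketch helps. `DispatcherTool`, `rank_candidates`, and `log_outcome` are all assumed names standing in for the customer's existing tool and optimizer API; the point is that the recommendation is computed at decision time, inside the view the dispatcher already opens:

```python
class DispatcherTool:
    """Stand-in for the existing dispatcher tool; only the extension points are sketched."""

    def __init__(self, optimizer, reasoning_layer, soft_constraints):
        self.optimizer = optimizer              # assumed: exposes rank_candidates()
        self.reasoning_layer = reasoning_layer  # e.g. shape_recommendation above
        self.soft_constraints = soft_constraints

    def shipment_detail_view(self, shipment_id: str) -> dict:
        """The view the dispatcher already opens to assign a shipment.

        The recommendation is computed here, at decision time, so its cadence
        matches the dispatcher's cadence rather than a batch refresh schedule.
        """
        ranking = self.optimizer.rank_candidates(shipment_id)
        rec = self.reasoning_layer(ranking, self.soft_constraints)
        return {
            "shipment_id": shipment_id,
            # ...the fields the view already shows...
            "recommendation": rec,  # rendered inline, not in a separate dashboard
        }

    def record_decision(self, shipment_id: str, accepted: bool,
                        override_reason: str | None) -> None:
        """Every accept or override flows back as a training signal."""
        self.optimizer.log_outcome(shipment_id, accepted, override_reason)
```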
Container unloading workflow optimization with the same pattern applied. The same architectural shape worked for the container unloading use case. The optimizer figures out the best unloading sequence given equipment availability, dock assignments, and downstream truck schedules. The reasoning layer translates that sequence into instructions the warehouse lead can act on, with reasoning exposed and uncertainty acknowledged. The same operational discipline — model lives in the lead's primary tool, recommendations expose their reasoning, edge cases route to humans — produced the same kind of result.
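A sketch of the unloading variant, again with names and thresholds that are ours rather than the customer's, shows how little changes: a different optimizer and a different operator, but the same recommendation discipline.

```python
from dataclasses import dataclass

@dataclass
class UnloadRecommendation:
    """Same shape as the routing recommendation, addressed to the warehouse lead."""
    sequence: list[str]                 # container IDs in recommended order
    reasoning: list[str]                # why this order: equipment, docks, trucks
    verify_before_accepting: list[str]  # soft constraints to check first
    needs_human: bool

def shape_unload_plan(
    scored_sequences: list[tuple[list[str], float, dict[str, float]]],
    soft_notes: list[str],
    confidence_floor: float = 0.6,      # hypothetical threshold, as before
) -> UnloadRecommendation:
    """Reshape the unloading optimizer's top sequence for the warehouse lead."""
    sequence, score, factors = max(scored_sequences, key=lambda s: s[1])
    top = sorted(factors.items(), key=lambda kv: kv[1], reverse=True)[:2]
    return UnloadRecommendation(
        sequence=sequence,
        reasoning=[f"{name} drives this order ({w:.2f})" for name, w in top],
        verify_before_accepting=soft_notes,
        needs_human=score < confidence_floor,  # edge cases route to the lead
    )
```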
What changed for the operations team
The headline outcome wasn't that the model got more accurate. The model got modestly better through retraining on production feedback, but most of its accuracy was already there before we showed up.
The headline outcome was that the dispatchers and warehouse leads started using the recommendations. The optimizer's output started actually shaping operational decisions. Routing decisions that used to take a dispatcher several minutes of phone calls and mental modeling now took seconds, with the model handling the first-pass analysis and the dispatcher providing the soft-constraint judgment the model couldn't.
This is a less satisfying headline than "we improved the model." It is, in our experience, the more honest one for ML-for-logistics work.
The architectural pattern, in shorthand
If you're contemplating production ML for an operational logistics workflow, the shape we converge on looks roughly like:
SOURCES              → driver_state · vehicle_state · yard_state
                       customer_data · historical_shipments
                       port_warehouse_conditions

OPTIMIZER            → ml_model producing scored recommendations
                       (the part most teams over-invest in)

REASONING LAYER      → reshape recommendations for the operator
                       expose reasoning · acknowledge uncertainty
                       route edge cases to human

OPERATOR INTEGRATION → recommendation lives in operator's primary tool
                       cadence matches operator's decision cadence
                       feedback flows back to optimizer

OUTCOME              → operational decisions actually informed by ML
                       (the part most teams under-invest in)
Each layer is replaceable. The optimizer can be a classical OR-tools solver, an ML-trained ranker, or a hybrid; what matters is that the output is structured enough for the reasoning layer to consume. The reasoning layer can be a frontier LLM with appropriate context, a smaller specialized model, or a rules engine; what matters is that the output is shaped for an operator. The operator integration can be a dispatcher tool, a warehouse management system, or a driver mobile app; what matters is that the recommendation lives in the operator's primary surface.
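One way to hold that replaceability in practice is to pin down the contracts between layers rather than any implementation. Here is a sketch of the seams as Python protocols; the method names are ours, not a standard:

```python
from typing import Any, Protocol

class Optimizer(Protocol):
    """OR solver, ML ranker, or hybrid; only the output structure is fixed."""
    def rank(self, decision_context: dict[str, Any]) -> list[dict[str, Any]]:
        """Scored candidates, structured enough for the reasoning layer to consume."""
        ...

class ReasoningLayer(Protocol):
    """Frontier LLM, specialized model, or rules engine; only the operator-facing shape is fixed."""
    def shape(self, ranking: list[dict[str, Any]],
              operator_state: dict[str, Any]) -> dict[str, Any]:
        """A recommendation that exposes reasoning and acknowledges uncertainty."""
        ...

class OperatorSurface(Protocol):
    """Dispatcher tool, WMS, or driver app; whatever the operator's primary surface is."""
    def render(self, recommendation: dict[str, Any]) -> None:
        """Show the recommendation inside the tool the operator already uses."""
        ...

    def collect_feedback(self) -> list[dict[str, Any]]:
        """Accept/override outcomes, flowing back to retrain the optimizer."""
        ...
```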
The discipline of treating the operator integration as the project is the part that doesn't change.
Where this generalizes
This pattern generalizes well beyond logistics. Any production ML system whose value comes from changing operational decisions has roughly the same shape: the model produces output, the operator decides whether to act on it, and the difference between a useful ML deployment and a decorative one is whether the operator's decision context was designed for the model's output to land in.
The customers we see succeed in production ML are the ones who treat operational integration as a first-class engineering problem on the same level as the model itself. The customers we see stuck in pilot are the ones who treat operational integration as someone else's responsibility, downstream of the "real ML work."
For logistics operators specifically: if your ML investments have produced impressive demos and underwhelming operational impact, the diagnosis is almost certainly not at the model layer. We'd be glad to help you figure out where it actually is.
DeHaze Labs builds production AI and data platforms for the physical economy. Get in touch at hello@dhlabs.ai.
