How we turned weeks of GPU rack onboarding into hours of automated validation — and built the foundation for autonomous data center operations along the way.
Every GPU data center, colocation provider, and infrastructure operator we've worked with has the same problem.
Telemetry exists. There's plenty of it — GPU drivers stream metrics, host operating systems emit logs, cooling sensors report continuously, power systems publish state, network fabric chatters at every layer. The data is there. It's just fragmented across a dozen vendor schemas, inconsistent in sample rate, spread across systems that don't talk to each other, and slow to become trusted enough to act on.
The result is what you'd expect. Operators run partially blind during deployment. Bringing a new rack online takes weeks of manual checks, vendor-specific dashboards, and engineers reading logs across five different consoles. Debugging is reactive — by the time something is visible in the dashboard the customer has already noticed. The autonomous operations everyone is talking about — systems that diagnose their own faults, plan their own remediations, and only escalate to humans when they actually need to — stay theoretical because the data substrate isn't there to support them.
We watched team after team build this unified telemetry layer from scratch. Each one got it about 70% right, shipped it as a one-off, and moved on to the next thing. After we'd done it ourselves enough times across customer engagements, we realized the same pattern was a product, not a consulting deliverable.
That's EdgeTelemetry.
What it does
EdgeTelemetry is a real-time edge control layer for modern GPU and data center environments. It does four things, in order:
1. Heterogeneous source ingestion. Real-time data from GPU telemetry, host metrics, cooling sensors, power systems, and network fabric — pulled from each source via the most reliable channel available. Kafka where the source supports it. Vendor APIs where it doesn't. NiFi for the legacy stuff. The collection layer is intentionally boring; resilience comes from accepting that every source has a different reliability profile and engineering for that.
2. Unified schema normalization. Everything that gets ingested is normalized into a single operational schema before it's stored. Validation, enrichment, lineage tracking. The point isn't elegant data modeling — it's that downstream consumers should never have to write per-vendor parsing logic again. The schema is the abstraction that makes everything else possible.
3. Readiness validation. Before a rack gets declared operationally ready, EdgeTelemetry runs it through a configurable battery of validation checks. Are the GPUs reporting expected temperatures? Are power draws within bounds? Is network latency to peers what it should be? Are dependencies between subsystems behaving correctly? This is the capability that turns rack onboarding from a multi-week manual process into an automated gate.
4. A reasoning layer interface. EdgeTelemetry exposes the unified telemetry stream and validation state through an interface designed to be consumed by a reasoning layer — currently we point this at frontier LLMs (Claude is our canonical example) for tasks like anomaly explanation, root cause hypothesis, and remediation planning. The reasoning layer sits above EdgeTelemetry; we don't bake any single model into the platform.
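To make the normalization step concrete, here is a minimal sketch of a vendor adapter mapping one raw payload into a unified record. Every name here (the `TelemetryRecord` fields, the nvidia-smi-style input dict, the adapter name) is illustrative, not EdgeTelemetry's actual schema:

```python
from dataclasses import dataclass

# Hypothetical unified record; the real operational schema is richer
# and versioned. The point is the shape: one canonical form per sample.
@dataclass
class TelemetryRecord:
    source: str        # e.g. "gpu_telemetry", "cooling_sensors"
    device_id: str
    metric: str        # canonical metric name
    value: float
    unit: str          # canonical unit after conversion
    ts_epoch_ms: int
    lineage: str       # which connector/version produced this record

def normalize_nvidia_smi(raw: dict) -> TelemetryRecord:
    """Map one imagined nvidia-smi-style payload into the unified schema.

    Each vendor format gets its own small adapter like this, so
    downstream consumers never write per-vendor parsing logic.
    """
    return TelemetryRecord(
        source="gpu_telemetry",
        device_id=raw["gpu_uuid"],
        metric="gpu_temp_c",
        value=float(raw["temperature.gpu"]),  # already Celsius here
        unit="celsius",
        ts_epoch_ms=int(raw["timestamp_ms"]),
        lineage="nvidia_smi_adapter/v1",
    )

record = normalize_nvidia_smi({
    "gpu_uuid": "GPU-1234",
    "temperature.gpu": "61",
    "timestamp_ms": 1700000000000,
})
print(record.metric, record.value)  # gpu_temp_c 61.0
```

The lineage stamp is the piece teams most often skip and most often regret skipping: when a downstream consumer questions a value, the record itself says which adapter, at which version, produced it.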
The architecture, briefly
The pattern is straightforward, and that's the point. The hard work isn't the architecture diagram — it's getting each layer to be production-grade in isolation so the whole thing actually holds up under real conditions.
SOURCES               →  gpu_telemetry · host_metrics · cooling_sensors ·
                         power_systems · network_fabric
          ↓
UNIFIED SCHEMA        →  normalize · validate · enrich · stamp_lineage
          ↓
READINESS VALIDATION  →  check_thresholds · check_dependencies ·
                         gate_for_operations
          ↓
REASONING LAYER       →  interpret_state · explain_anomaly ·
(LLM, e.g. Claude)       plan_remediation · escalate_with_context
          ↓
ACTION                →  diagnose · remediate · escalate
Each layer is replaceable. Source connectors are the easiest things to add or swap. The schema evolves over time as new vendor formats arrive — versioned, with a migration discipline that doesn't break downstream consumers. The validation logic is configurable per-customer, because what counts as "ready" looks different for a hyperscaler capacity partner than for an enterprise colocation customer. And the reasoning layer is intentionally model-agnostic, because we expect frontier models to keep improving and we don't want the platform to require a forklift upgrade every time a better one ships.
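A per-customer readiness gate of this shape can be sketched as follows. The check names and thresholds are illustrative assumptions, not actual EdgeTelemetry configuration:

```python
# Hypothetical per-customer validation config: each check is a named
# predicate over the latest normalized sample for that metric.
CHECKS = {
    "gpu_temp_c":      lambda v: v < 85.0,            # burn-in bound
    "power_draw_w":    lambda v: 50.0 <= v <= 700.0,  # within rated range
    "peer_latency_us": lambda v: v < 10.0,            # fabric health
}

def validate_rack(samples: dict) -> tuple:
    """Run every configured check; the rack gates as ready only if
    all checks pass and no expected metric is missing."""
    failures = [name for name, check in CHECKS.items()
                if name not in samples or not check(samples[name])]
    return (not failures, failures)

ready, failures = validate_rack(
    {"gpu_temp_c": 62.0, "power_draw_w": 430.0, "peer_latency_us": 3.2})
print(ready, failures)  # True []
```

Treating a missing metric as a failure, rather than skipping the check, is the design choice that makes this a gate: a rack that isn't reporting a signal is not ready, even if nothing has tripped a threshold.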
Why the rack-onboarding number matters
The headline outcome we point to is GPU rack onboarding — from weeks to hours.
That number is doing more work than it looks like at first. The reason rack onboarding takes weeks today isn't that any individual check is slow. It's that the checks are spread across vendor consoles that don't talk to each other, run by humans who have to assemble the picture by hand, and gated by the slowest piece of vendor documentation in the loop. EdgeTelemetry collapses all of that into automated validation against a unified schema. You don't get an order of magnitude speedup by making any single thing faster — you get it by removing the coordination cost between systems that should never have been separate in the first place.
The downstream effect compounds. Customers who deploy EdgeTelemetry start with rack onboarding, but the same unified telemetry stream drives capacity planning, incident triage, predictive maintenance, and eventually autonomous operations. The rack-onboarding metric is the one we lead with because it's measurable and obvious. The platform value is the rest of it.
Where we're headed
We're working with a small number of operators on early access right now. The current focus is on operationalizing the readiness validation tier — turning what works in our environments into something that holds up across the diversity of vendor configurations our customers actually run.
The longer arc is autonomous operations. The reasoning layer interface in EdgeTelemetry is designed for it from day one — not bolted on after the fact. Once a reasoning layer (a frontier LLM with appropriate tool use) sits on top of unified, validated telemetry, building systems that diagnose their own faults and execute well-defined remediation playbooks becomes an architecture problem rather than a research problem. The hard parts have always been the substrate underneath. That's what EdgeTelemetry exists to solve.
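One way to picture that interface is the context bundle handed to the reasoning layer when a readiness check fails. This payload shape is an illustration of the idea, not EdgeTelemetry's actual API, and every field name in it is hypothetical:

```python
import json

# Illustrative escalation payload: enough validated, lineage-stamped
# context for a reasoning layer to explain the anomaly and propose a
# remediation, without re-querying vendor systems itself.
escalation = {
    "event": "readiness_check_failed",
    "rack_id": "rack-illustrative-01",           # hypothetical identifier
    "failed_checks": ["peer_latency_us"],
    "recent_samples": {"peer_latency_us": [14.1, 13.8, 15.2]},
    "lineage": "readiness_validator/v1",
    "ask": "explain_anomaly_and_plan_remediation",
}

# Serialized, this becomes part of the prompt context for whichever
# model sits in the reasoning layer; the platform stays model-agnostic.
prompt_context = json.dumps(escalation, indent=2)
```

Because the payload is already normalized and validated, the model's job narrows from "make sense of five vendor consoles" to "reason over one trusted state summary", which is where current frontier models are strongest.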
We'll write more about the reasoning-layer side of this in a follow-up post once we have customer deployments to talk about specifically. For now, the foundational claim is the one worth making clearly: the AI buildout is the largest capex cycle in tech, the operational layer beneath it is underbuilt, and unified telemetry is the precondition for everything interesting that comes next.
If you're operating GPU capacity, running a colo serving AI workloads, or building infrastructure for hyperscaler partners — and you're tired of rebuilding the same telemetry layer from scratch every time — we'd like to talk.
EdgeTelemetry is built by DeHaze Labs. Get in touch at hello@dhlabs.ai.