NextFin News - On November 11, 2025, Clockwork hosted a 60‑minute virtual panel, "Enterprise AI at Scale: The Infrastructure Behind Real‑World Impact," bringing together industry practitioners and analysts to discuss how to translate AI ambition into reliable, efficient infrastructure. The event, held at 10:00 a.m. Pacific / 1:00 p.m. Eastern, was recorded and presented by Clockwork Systems (Clockwork.io). (campaigns.clockwork.io)
The conversation was moderated by Suresh Vasudevan, CEO of Clockwork, and featured Dylan Patel (CEO & Chief Analyst, SemiAnalysis), Neil Pinto (VP of AI Infrastructure, formerly LinkedIn) and Jack Hogan (VP, Advanced Growth Technologies, SHI). The panel examined the operational, networking and data‑architecture causes of low utilization in large GPU fleets and practical steps operators can take to raise ROI. (campaigns.clockwork.io)
Why infrastructure efficiency matters: three sources of wasted GPU value
The panel opened with a shared framework for why large GPU clusters often run at only 30–50% utilization. Three causes were named repeatedly: (1) GPU allocated utilization — whether software is actually running on the GPU hours that were purchased; (2) effective training time ratio (ETTR) — how much of a running job's time the GPUs spend computing rather than waiting on communication or I/O; and (3) model FLOPs utilization (MFU) — how closely the model's achieved operations approach a GPU's theoretical peak. As the moderator framed it, "there are billions and billions, tens of billions, hundreds of billions being deployed in AI infrastructure… and yet study after study shows cluster utilizations in large scale GPU clusters run at 30 to 50%." (Moderator paraphrase from the discussion.)
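To make the framework concrete, here is a minimal back-of-the-envelope sketch of how the three ratios compose; every figure is illustrative, not a number from the panel:

```python
# Illustrative decomposition of delivered GPU value into the three
# ratios named on the panel. All figures below are made-up examples.

purchased_gpu_hours = 10_000   # GPU hours bought for the period
allocated_gpu_hours = 7_000    # hours with a workload actually scheduled
computing_gpu_hours = 5_600    # hours spent computing, not waiting on comms/I/O
achieved_tflops = 140.0        # sustained throughput per GPU (illustrative)
peak_tflops = 312.0            # theoretical peak at the chosen precision

allocation_utilization = allocated_gpu_hours / purchased_gpu_hours    # cause (1)
effective_training_time = computing_gpu_hours / allocated_gpu_hours   # cause (2)
model_flops_utilization = achieved_tflops / peak_tflops               # cause (3)

# The three ratios multiply: the fraction of the purchased fleet's
# theoretical peak that is converted into useful model FLOPs.
end_to_end = allocation_utilization * effective_training_time * model_flops_utilization
print(f"allocation {allocation_utilization:.0%} x ETTR {effective_training_time:.0%} "
      f"x MFU {model_flops_utilization:.0%} = {end_to_end:.1%} end to end")
```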
Enterprise deployment: supply chain, allocation and the move to observability (Neil Pinto)
Neil Pinto described LinkedIn's experience moving from V100/A100 deployments to the H100 era, and the operational consequences once large language models arrived in 2022. He recounted a rapid influx of capacity requests and the first challenge: sourcing H100 hardware amid supply‑chain constraints. Neil explained that after an initial period of operational scramble the organization formalized allocation by prioritizing teams and ROI, observing that "we had to come up with a roadmap in terms of how we deployed" and rationing scarce capacity to the teams that produced business return. (Transcript: Neil Pinto.)
On utilization, Neil described three concrete changes that moved the needle: introducing observability into GPU use, improving orchestration so jobs no longer sat in long queues designed for big‑data systems, and staging data closer to the GPUs (moving it off spinning disks and into faster caches) so "you're continuously feeding the beast." He reported that these steps helped raise utilization across LinkedIn's fleet, and that by the time of the panel the company had deployed roughly 15,000 GPUs across four data centers. (Transcript: Neil Pinto.)
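The "feeding the beast" point reduces to a familiar pattern: overlap data movement with compute so accelerators never idle on I/O. A minimal sketch of that staging idea, with placeholder functions rather than LinkedIn's actual tooling:

```python
# Minimal sketch of staging data ahead of compute: a background thread
# prefetches the next batch into a bounded in-memory buffer (the fast
# "cache" tier) while the current batch is being consumed.
# load_batch and train_step are placeholders.
import queue
import threading

def load_batch(i):
    # Placeholder for a slow read from remote or spinning storage.
    return f"batch-{i}"

def train_step(batch):
    # Placeholder for the GPU-side work.
    pass

def prefetcher(n_batches, buf: queue.Queue):
    for i in range(n_batches):
        buf.put(load_batch(i))   # blocks when the buffer is full
    buf.put(None)                # sentinel: no more data

buf = queue.Queue(maxsize=4)     # small staging buffer near the compute
threading.Thread(target=prefetcher, args=(100, buf), daemon=True).start()

while (batch := buf.get()) is not None:
    train_step(batch)            # compute overlaps with the next fetch
```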
NeoCloud economics, ClusterMAX and the business of speed and reliability (Dylan Patel)
Dylan Patel framed the NeoCloud market as highly competitive and margin‑sensitive, noting that over 200 NeoCloud providers have emerged in recent years. He described the ClusterMAX benchmarking effort that SemiAnalysis publishes to evaluate GPU cloud providers across metrics like lifecycle management, reliability, networking and security. Dylan emphasized that the operators who ranked highest on ClusterMAX also booked far more revenue, and that the three differentiators for NeoClouds were the ability to deploy GPUs fastest, to maintain high reliability (minimizing credits and downtime), and to demonstrate end‑to‑end performance for prospective customers. (Transcript: Dylan Patel; see ClusterMAX 2.0.) (newsletter.semianalysis.com)
Dylan stressed the economics of delayed deployment: "three months delay is a lot of the useful life of these GPUs" and therefore an erosion of potential profit. He also mapped the margin spread: top operators running smart infrastructure and observability can achieve gross margins in the mid‑30 to 40 percent range, while many NeoClouds that lack those practices are losing money. (Transcript: Dylan Patel.)
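The arithmetic behind that claim is straightforward. A back-of-the-envelope sketch, where the depreciation horizon, fleet size, and rental rate are our assumptions rather than figures from the panel:

```python
# Revenue forgone by a deployment delay, assuming a fixed useful life
# for the hardware. All numbers are illustrative assumptions.
useful_life_months = 48     # ~4-year depreciation horizon (assumption)
delay_months = 3
gpus = 10_000               # fleet size (assumption)
rate_per_gpu_hour = 2.00    # $/GPU-hour rental price (assumption)

share_of_life_lost = delay_months / useful_life_months
lost_revenue = gpus * rate_per_gpu_hour * 24 * 30 * delay_months
print(f"{share_of_life_lost:.1%} of useful life forgone; "
      f"~${lost_revenue / 1e6:.0f}M of rental revenue at list price")
```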
Data architecture and the AI stack: portability, metadata and feeding the factory (Jack Hogan)
Jack Hogan focused on the AI stack and the data layer that feeds GPU factories. He argued that where workloads land is determined by where the data lives, and that data often sits in many silos — S3 buckets, NFS shares, block storage — which must be stitched together so that processing can happen without long waits. Jack urged an approach that treats the stack as portable: "when that stack is portable, it can run in a CSP hyperscaler, it can run in a Neo cloud, it can run in an on‑premises environment" and that portability lets operators choose the best landing zone for uptime and ROI. (Transcript: Jack Hogan.)
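One way to picture the payoff of portability: once the job definition is environment-agnostic, placement becomes a selection problem across landing zones. A minimal sketch in which all names, prices, and the selection rule are hypothetical:

```python
# Portable-stack placement sketch: the job spec carries no environment
# assumptions, and a thin adapter picks whichever landing zone
# (hyperscaler, NeoCloud, on-prem) currently offers the best capacity.
job = {
    "image": "registry.example.com/train:latest",  # placeholder image
    "gpus": 64,
    "data_uri": "s3://bucket/dataset/",            # placeholder URI
}

targets = {  # illustrative $/GPU-hour prices and free capacity
    "hyperscaler": {"price": 4.10, "free_gpus": 512},
    "neocloud":    {"price": 2.30, "free_gpus": 128},
    "on_prem":     {"price": 1.20, "free_gpus": 32},
}

def place(job, targets):
    # Keep only zones with enough free GPUs, then take the cheapest.
    feasible = {k: v for k, v in targets.items() if v["free_gpus"] >= job["gpus"]}
    return min(feasible, key=lambda k: feasible[k]["price"])

print(place(job, targets))   # -> "neocloud": cheapest feasible zone
```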
Jack further highlighted that much of what is needed for fast iteration is metadata rather than moving full datasets: "in a lot of cases, you don't need to move the data, you just need the metadata to process that fuel." He described global data estates and resilient data layers as critical for checkpoints, fine‑tuning, and minimizing scheduling delays. (Transcript: Jack Hogan.)
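The metadata-first idea can be illustrated in a few lines: an HTTP HEAD request retrieves an object's size, ETag, and modification time without pulling the bytes, which is often enough to decide whether any transfer needs to be scheduled at all. A minimal sketch; the URL is a placeholder:

```python
# Decide whether a dataset needs (re)processing from object metadata
# alone (an HTTP HEAD), never moving the data itself.
import urllib.request

def object_metadata(url: str) -> dict:
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return {
            "size": int(resp.headers.get("Content-Length", 0)),
            "etag": resp.headers.get("ETag"),
            "modified": resp.headers.get("Last-Modified"),
        }

def needs_reprocessing(meta: dict, last_seen_etag: str | None) -> bool:
    # A changed ETag means the object changed; only then is a full
    # transfer to the GPU-adjacent tier worth scheduling.
    return meta["etag"] != last_seen_etag

meta = object_metadata("https://example.com/dataset.parquet")  # placeholder URL
print(meta, needs_reprocessing(meta, last_seen_etag=None))
```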
Networks, fabrics and the long arc from InfiniBand to Ethernet (Dylan Patel and panel)
The panel debated the evolving balance between InfiniBand and Ethernet fabrics. Dylan and others recounted that early high‑performance clusters often required InfiniBand to avoid leaving performance on the table, but that Ethernet has since been extensively tuned and increasingly matches InfiniBand in many modern deployments. Dylan pointed to broad trends in which major labs and cloud deployments have moved from InfiniBand toward Ethernet or Ethernet variants (for example, Spectrum‑based Ethernet), and argued the industry is moving toward software approaches that enable fabric convergence and programmability. (Transcript: Dylan Patel.)
Panelists agreed that running multiple network technologies (GPU‑to‑GPU RDMA, front‑end TCP, storage RDMA) increases complexity, and that software‑driven fabrics that partition and guarantee quality of service can enable converged deployments without physical segregation. As Dylan put it, the long‑term vision is for "software‑driven approaches" that allow fabrics to be converged while preserving performance and isolation. (Transcript: Dylan Patel.)
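A classic building block for that kind of software-enforced isolation is a per-class token bucket: each traffic class gets a guaranteed share of link bandwidth without physically separate networks. A minimal sketch; the rates and class names are illustrative, not Clockwork's implementation:

```python
# Software QoS on a converged fabric: one token bucket per traffic
# class, sized to its guaranteed share of the link. Illustrative only.
import time

class TokenBucket:
    def __init__(self, rate_gb_per_s: float, burst_gb: float):
        self.rate = rate_gb_per_s    # refill rate = guaranteed bandwidth
        self.capacity = burst_gb     # how much burst the class may absorb
        self.tokens = burst_gb
        self.last = time.monotonic()

    def try_send(self, size_gb: float) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= size_gb:
            self.tokens -= size_gb
            return True
        return False                 # class over its share: queue or defer

# Partition a 400 Gb/s (50 GB/s) link; guarantees sum to the capacity.
classes = {
    "gpu_rdma": TokenBucket(rate_gb_per_s=37.5, burst_gb=1.0),
    "storage":  TokenBucket(rate_gb_per_s=10.0, burst_gb=0.5),
    "frontend": TokenBucket(rate_gb_per_s=2.5,  burst_gb=0.1),
}
print(classes["gpu_rdma"].try_send(0.5))  # within its guarantee -> True
```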
Reliability, fault‑tolerance and non‑disruptive workload migration (Neil Pinto)
Reliability dominated the discussion as a primary driver of utilization loss. Neil described the operational pain of link flaps and failing optics, and how a single failed GPU or network transceiver can cascade into large job restarts and lost cluster time. He recounted Clockwork's practical impact at LinkedIn: by routing around link flaps, the system could preserve most of a cluster's capacity instead of failing whole jobs, and he stressed that visibility into network congestion and telemetry was crucial. Neil urged operators to deploy these capabilities early: "deploy this from day one because trying to solve these problems later on is a pain point." (Transcript: Neil Pinto.)
The panel also discussed the value of non‑disruptive workload migration and fault‑tolerant training libraries that allow jobs to continue when GPUs disappear or links fail, reducing costly reruns and scheduling contention. (Transcript: multiple panelists.)
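The core contract of those fault-tolerant libraries is simple: checkpoint often enough that a failure costs only the work since the last checkpoint, never the whole job. A minimal sketch of the pattern, with stand-ins for the failure and training logic:

```python
# Checkpoint-and-resume pattern: on a simulated device loss, fall back
# to the last checkpoint instead of restarting the entire run.
import json
import os
import random

CKPT = "checkpoint.json"

def load_checkpoint() -> int:
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)["step"]
    return 0

def save_checkpoint(step: int) -> None:
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, CKPT)        # atomic rename: never a torn checkpoint

def train_step(step: int) -> None:
    if random.random() < 0.01:   # simulated GPU or link failure
        raise RuntimeError(f"device lost at step {step}")

step, total_steps = load_checkpoint(), 1_000
while step < total_steps:
    try:
        train_step(step)
        step += 1
        if step % 100 == 0:
            save_checkpoint(step)
    except RuntimeError:
        step = load_checkpoint()  # resume: only the work since the last
                                  # checkpoint is repeated, not the job
print("done")
```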
Practical prescriptions: observability, co‑design and future proofing
Each panelist closed with a practical recommendation. Jack emphasized observability as the first step: measure what you run and where the bottlenecks are. Dylan promoted hardware‑software co‑design and smarter software that avoids brute force (for example, offloading the KV cache and skipping unnecessary prefill compute) to materially cut inference costs. Neil urged long‑term infrastructure planning — future‑proof racks, cooling, network fungibility, and hiring technicians who understand high‑density GPU systems. Together their prescriptions echoed a single theme: instrument the full stack, design fabrics and data planes for resilience, and prioritize fault‑tolerant automation so clusters keep computing instead of restarting. (Transcript: multiple panelists.)
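To illustrate the KV-cache-offload idea Dylan alluded to: keep hot attention state in fast memory, spill cold entries to host memory, and restore them on reuse so a repeated prompt skips prefill compute. A minimal sketch with plain dicts standing in for the two memory tiers; the names and eviction policy are illustrative:

```python
# Two-tier KV cache: evict least-recently-used entries from the fast
# (GPU) tier to the slow (host) tier instead of discarding them, so a
# returning prompt avoids the expensive prefill pass.
from collections import OrderedDict

GPU_SLOTS = 2
gpu_tier: OrderedDict[str, bytes] = OrderedDict()  # fast, scarce
host_tier: dict[str, bytes] = {}                   # slow, abundant

def prefill(prompt: str) -> bytes:
    return prompt.encode()        # stand-in for the expensive prefill pass

def get_kv(prompt: str) -> bytes:
    if prompt in gpu_tier:
        gpu_tier.move_to_end(prompt)     # hit in fast memory: no compute
        return gpu_tier[prompt]
    kv = host_tier.pop(prompt, None)     # hit in host memory: no compute
    if kv is None:
        kv = prefill(prompt)             # miss everywhere: pay for prefill
    if len(gpu_tier) >= GPU_SLOTS:       # make room: offload LRU to host
        evicted, evicted_kv = gpu_tier.popitem(last=False)
        host_tier[evicted] = evicted_kv
    gpu_tier[prompt] = kv
    return kv

for p in ["a", "b", "c", "a"]:    # the second "a" is served without prefill
    get_kv(p)
```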
References and further viewing
Replay and panel details: Clockwork — Enterprise AI at Scale: The Infrastructure Behind Real‑World Impact (Recorded Nov 11, 2025). (campaigns.clockwork.io)
ClusterMAX 2.0 (SemiAnalysis benchmark referenced by Dylan Patel): ClusterMAX™ 2.0 — The Industry Standard GPU Cloud Rating System. (newsletter.semianalysis.com)
Clockwork product and AI infra materials: Clockwork — AI Infra Summit / Software‑driven fabrics. (clockwork.io)
For readers wanting to follow the conversation directly, the full recorded webinar and the ClusterMAX write‑up provide the event video and the benchmarking detail discussed on the panel. (campaigns.clockwork.io)
