Jobs & Community
- Post your job here!
  Anywhere, anytime - Telecom Ramblings
- Featured Job Listings
  List one or more jobs!
  Reach thousands of industry insiders!
- Visit our Jobs Board

Community Resources

How Will Data Center Networks Hold Up Under an Avalanche of AI Workloads?

May 17th, 2024 by Industry Viewpoints · Leave a Comment

This Industry Viewpoint was authored by Aniket Khosla, Spirent’s Vice President of Wireline Product Management

For a technology limited to sci-fi movies just a few years ago, artificial intelligence (AI) is suddenly everywhere. In a McKinsey survey conducted in April 2024, less than a year after ChatGPT and other first-generation tools debuted, 55% of respondents said they were already using AI. More than two thirds reported that they were planning to increase AI investments.

Ready or not, AI is about to bring big changes to every industry. And nowhere will these changes be felt more quickly or directly than in the data centers tasked with processing AI workloads.

Hyperscale operators are already investing in huge numbers of specialized central processing units (CPUs), graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and other “xPU” accelerators to build out AI clusters. Now, they need to connect those clusters at speeds and scales that far exceed what they can achieve with conventional infrastructures. How they’ll do it is still an open question, though one that networking vendors are working furiously to help answer.

What’s so different about AI workloads that pushes traditional data center networks to the breaking point? And how are hyperscalers and vendors responding? Let’s take a closer look.

New Applications, New Network Demands

You’d think that no one would be better prepared to meet the demands of emerging AI applications than the companies running the world’s largest data centers. Yet it’s hyperscalers sounding the alarm about the limitations of current infrastructures and urging vendors to accelerate timetables for new solutions. To understand why, let’s review what happens inside a typical AI cluster.

AI workload processing encompasses two distinct types of work:

Model training: In this phase, AI clusters ingest huge data sets for AI models to train on. The model analyzes relationships between inputs and outputs in the training data and, over time, learns how to classify data and make predictions with ever-greater accuracy.
Inferencing: In this phase, AI models apply their training to new data. This can encompass a variety of real-time scenarios, such as AI chatbots responding to customer service queries, large language models (LLMs) generating images from text prompts, or AI vision applications monitoring live video in manufacturing plants to identify production flaws.

Both training and inferencing can involve extremely compute- and data-intensive workloads, demanding thousands of specialized processors and the ability to support thousands of synchronized jobs in parallel. It’s in the inferencing phase, however, where the need for extreme speeds and throughput is most urgent and where conventional networking approaches break down.

Compared to training, inferencing generates five times more traffic per xPU and requires networks with five times more bandwidth. Given the real-time nature of many AI inferencing scenarios, workloads must also progress through many nodes as quickly as possible, demanding extremely low latency at massive scales. After all, depending on the application, any lag in an AI cluster’s response could lead to timeouts, poor customer experiences, expensive errors, or worse.

These are not hypothetical concerns. According to Meta, 33% of current AI workload processing time is spent waiting for the network. And that’s for current AI models, which are effectively still in their infancy and growing 1,000x more complex every three years.

Scaling Data Center Networks

Data center operators can’t meet exploding AI application requirements by simply adding more servers and fiber runs as they have in the past. They need new architectural approaches, especially for inferencing workloads that can require tens of thousands of xPUs processing billions—soon trillions—of dense parameters. Here, operators are deploying dedicated back-end infrastructures and adopting the latest, fastest networking technologies as quickly as vendors can deliver them.

Indeed, customer demand for AI is so great, hyperscalers have no choice to but to start upgrading infrastructures now, even if that means implementing new high-speed networking silicon long before standards are finalized. At the same time, there is still no consensus around the optimal back-end cluster design or network technologies to adopt. Google, Microsoft, and Amazon, for example, are all currently pursuing different strategies.

Fortunately, we’re starting to get clues about how the market for AI networking technologies will ultimately shake out. In back-end networks where operators face the most urgent need, the transition to next-generation interfaces is already well under way, and according to Dell’Oro Group, nearly all ports will be 800G by 2027. For front-end connectivity to AI clusters (which initially will be driven by the need for high-speed data ingestion), upgrades will progress more slowly. Dell’Oro expects one third of all Ethernet ports will be 800G or higher in the same timeframe.

Which interface technology hyperscalers will prefer for front- and back-end networks is also still an open question—one that likely won’t have a clear answer anytime soon. Here, operators can choose high-speed Ethernet, with its broad industry acceptance and the economics that come with standardization, or proprietary technologies like InfiniBand that deliver lossless performance. Dell’Oro expects Ethernet to dominate front-end connectivity, while back-end networks will include both Ethernet and InfiniBand for the foreseeable future.

Looking Ahead

Which technologies and architectural approaches will the industry ultimately settle on to meet evolving AI requirements? For now, no one can say for sure, for two big reasons.

First, these questions don’t have a one-size-fits-all answer. Whether an operator chooses Ethernet, InfiniBand, or both will depend on multiple factors including size and number of clusters, mix of AI applications, preference for standardized versus proprietary technologies, comfort with a given technology’s roadmap, and other considerations. Additionally, AI is evolving so quickly, we just don’t know exactly what the future will hold. This uncertainty poses significant challenges for operators, who risk making the wrong calls—and potentially finding their infrastructure obsolete within just a few years.

As always, the best way to mitigate that risk—especially when implementing standards that are still being defined—is via rigorous testing. As hyperscalers and vendors speed next-generation solutions through development, they’ll need next-generation tools to test and validate them. For vendors, the ability to emulate hyperscale AI infrastructures (instead of attempting to rebuild them in the lab) will also be essential. Fortunately, test and emulation providers are innovating alongside vendors and are ready to address these needs.

While many questions remain, there’s one we can answer definitively: However quickly vendors and hyperscalers can implement new AI-optimized infrastructures, customers will be ready to use them.

If you haven't already, please take our Reader Survey! Just 3 questions to help us better understand who is reading Telecom Ramblings so we can serve you better!

Discuss this Post

Ramblings’ Jobs

Post a Job - Just $99/30days
Event Calendar

PNW on The Bear Has a Book: Bandwidth to Drop Soon!: “I worked for a short time as a GM for Zayo, reporting to one of his managers Glenn which was…” Apr 16, 14:16

Priyanka Patil on Trenching Your Own Fiber: “Wow, that’s some serious commitment! Not sure I’d dig my own fiber trench, but I do love when services give…” Apr 14, 03:32

Mary Bridges on Trenching Your Own Fiber: “Fascinating approach by Lyse Tele—empowering customers to take ownership of the last-mile connection truly builds long-term loyalty. It’s a brilliant…” Apr 14, 02:01

Peter Radizeski on The Bear Has a Book: Bandwidth to Drop Soon!: “Rob, you even got a mention in the book.” Mar 27, 11:46

Vin Terraciano on 2025 – The Year of Practical AI: Chain of thought is more powerful, productive, & present in our everyday lives: “What do we make of Zayo acquiring Crown Castle, is that a rumor or will it happen?” Mar 14, 11:15

Fred Thomasson on The Bear Has a Book: Bandwidth to Drop Soon!: “Well, I read the book and I decided to play a drinking game by taking a shot of tequila every…” Mar 14, 10:28

Fred Thomasson on The Bear Has a Book: Bandwidth to Drop Soon!: “This site has always been a shill for Dan Caruso, going back to his time at Zayo. This book will…” Feb 17, 22:41

Eli on Metro Fiber Maps: “Here is an updated link to the ACD fiber map: https://acd.net/fibermap/index.cfm And another for residential fiber to the home: https://acd.net/Residential/viewMap.cfm…” Feb 10, 16:53

sprintpark on What the Golden Resignation Will Mean for Companies, and How Out of Band Can Help: “Great insights, thanks for sharing! Network Engineering Services” Feb 7, 04:55

Rob Powell on Zayo to Buy Crown Castle’s Fiber?: “I imagine that if objections arise, they would likely just be required to divest some assets in certain markets. Especially…” Jan 30, 10:01

Telecom Ramblings

Jobs & Community

Community Resources

Featured Articles

How Will Data Center Networks Hold Up Under an Avalanche of AI Workloads?

May 17th, 2024 by Industry Viewpoints · Leave a Comment

Discuss this Post

Leave a Comment

Ramblings’ Jobs

Event Calendar

Recent Comments