virtualizationvelocity
  • Home
  • About
  • VMware Explore
    • VMware Explore 2025
    • VMware Explore 2024
    • VMware Explore 2023
    • VMware Explore 2022
  • VMworld
    • VMworld 2021
    • VMworld 2020
    • VMworld 2019
    • VMworld 2018
    • VMworld 2017
    • VMworld 2016
    • VMworld 2015
    • VMworld 2014
  • vExpert
  • The Class Room
  • VMUG Advantage
  • AI Model Compute Planner
  • AI-Q Game
  • Video Hub
  • Tech-Humor
  • Contact

Your Definitive Source for Actionable Insights on Cloud, Virtualization & Modern Enterprise IT

The Hidden Bottlenecks in LLM Inference

1/24/2026


Why TFLOPs and VRAM Are the Least Interesting Parts of Production AI


Introduction: The GPU Fallacy

When organizations plan large-scale LLM inference, the conversation almost always starts with hardware:
  • How many GPUs?
  • How much VRAM?
  • How many TFLOPs?
  • What’s the max tokens per second?
Those numbers matter — but they are not where most production latency or cost comes from.

This fixation on raw compute is a textbook example of what I’ve previously called the AI Illusion: the belief that advanced infrastructure automatically produces outcomes. In reality, inference performance is determined far more by the system’s behavior than by GPU specs.

This article breaks down the hidden bottlenecks that dominate real-world LLM inference and explains why architects who model only TFLOPs and VRAM are consistently surprised in production.

Inference Is Not a Single Step — It’s a Pipeline

Most mental models of inference look like this:
Prompt → GPU → Response

Production inference actually looks more like:

Inference Pipeline (What Actually Happens)

Client
 → Tokenization (CPU)
 → Request queue
 → Scheduler & batching
 → KV-cache lookup
 → Network hops
 → GPU execution
 → KV-cache update
 → Detokenization
 → Token streaming response
Only one of those steps is dominated by GPU math.

Everything else is where latency, jitter, and cost quietly accumulate.

1. Tokenization: The First Invisible Latency Tax

Tokenization is almost always:
  • CPU-bound
  • Poorly parallelized
  • Repeated on every request

Why this matters
  • Long prompts can add tens of milliseconds before inference even starts
  • Multi-tenant systems often serialize tokenization under load
  • Tokenization throughput frequently becomes the first scaling wall

Common architectural mistake: tokenization is rarely included in latency budgets or capacity models. Teams benchmark GPU throughput while quietly ignoring the CPU path that feeds it.

This is why many inference stacks show excellent GPU utilization but still miss latency SLAs.
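To make the tax concrete, here is a minimal sketch that times repeated CPU-side tokenization. The whitespace split is a deliberately crude stand-in for a real BPE tokenizer (which does far more work per character), and the prompt length and request count are illustrative assumptions, not benchmarks:

```python
import re
import time

def toy_tokenize(text):
    # Stand-in for a real tokenizer: real BPE tokenizers do iterative
    # byte-pair merges and are typically slower per character than this.
    return re.findall(r"\S+", text)

prompt = "some words " * 2000          # a long prompt, ~2000 "tokens"

start = time.perf_counter()
for _ in range(100):                    # 100 requests arriving back-to-back,
    toy_tokenize(prompt)                # tokenized serially on one CPU core
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"100 requests tokenized serially in {elapsed_ms:.1f} ms")
```

Even with a toy tokenizer, serialized CPU work accumulates before the GPU sees a single token; a production tokenizer under multi-tenant load only makes the wall arrive sooner.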

2. KV-Cache: Where VRAM Actually Goes

KV-cache is the single most misunderstood component of inference.
It:
  • Grows linearly with sequence length × layers × KV heads × head dimension
  • Consumes VRAM faster than most architects expect
  • Determines maximum concurrency more than the model size does

What breaks in production
  • Fragmentation reduces usable VRAM
  • Cache eviction introduces unpredictable latency spikes
  • High concurrency forces tradeoffs between batch size and context length
In many real deployments, KV-cache memory exceeds the model weights themselves.

Architectural illusion: “Model fits in memory” does not mean “system scales.”
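A back-of-envelope sizing sketch shows why. The model shape below (32 layers, 32 KV heads, head dimension 128, fp16) is an illustrative assumption roughly in the range of a 7B-class model, not a measurement of any specific deployment:

```python
def kv_cache_bytes(seq_len, batch, n_layers=32, n_kv_heads=32,
                   head_dim=128, bytes_per_elem=2):
    # 2x for keys and values; stored per layer, per head, per token.
    # All model-shape defaults are illustrative assumptions.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

gib = 1024 ** 3
one_user = kv_cache_bytes(seq_len=4096, batch=1) / gib    # single 4k context
full_load = kv_cache_bytes(seq_len=4096, batch=32) / gib  # 32 concurrent 4k contexts

print(f"1 x 4k context: {one_user:.0f} GiB, 32 x 4k contexts: {full_load:.0f} GiB")
```

At these assumed dimensions, one 4k-token context consumes about 2 GiB of cache and 32 concurrent users consume about 64 GiB — more than the fp16 weights of a 7B-class model. Concurrency, not model size, is what exhausts VRAM.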

3. Networking: Death by a Thousand Microseconds

Inference traffic is fundamentally different from training traffic:
  • East-west, not north-south
  • Bursty, not steady
  • Latency-sensitive, not throughput-optimized

Hidden costs
  • Token streaming dramatically increases packet counts
  • Multi-GPU and multi-node inference introduce synchronization delays
  • CPU↔GPU↔NIC handoffs add jitter under load

Common mistake
Designing inference networks like training fabrics — or worse, like general IT traffic — guarantees inconsistent tail latency.
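A rough packet-count estimate illustrates the streaming cost. The one-write-per-token behavior and byte sizes below are simplifying assumptions (real stacks may coalesce writes), but the order-of-magnitude gap is the point:

```python
import math

TOKEN_BYTES = 4   # average payload bytes per streamed token (assumption)
MSS = 1400        # usable bytes per packet (assumption)

def streamed_packets(n_tokens):
    # SSE-style streaming: one small packet per token when nothing
    # coalesces the writes.
    return n_tokens

def bulk_packets(n_tokens):
    # Whole completion returned at once, filling packets to the MSS.
    return math.ceil(n_tokens * TOKEN_BYTES / MSS)

print(streamed_packets(500), "packets streamed vs", bulk_packets(500), "bulk")
```

Hundreds of tiny packets per response, per user, is a very different traffic profile from the large, steady flows most data center fabrics are tuned for.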

4. Contention: The Bottleneck Nobody Benchmarks

Contention exists everywhere in inference systems:
  • CPU cores handling tokenization and scheduling
  • PCIe lanes shared across accelerators
  • Memory bandwidth during concurrent KV-cache access
  • Network queues under burst traffic

Why benchmarks lie
Most benchmarks:
  • Run in isolation
  • Avoid multi-tenant contention
  • Measure averages instead of P95/P99
This explains why proofs-of-concept look great while production deployments feel “mysteriously slow.”
This pattern shows up repeatedly when organizations move from discovery to AI outcomes — the exact transition where architectural shortcuts are exposed.
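A quick synthetic experiment shows how averages hide the tail. The latency distribution below is invented purely for illustration: 99% of requests are fast, 1% hit contention (cache eviction, queue bursts):

```python
import random
import statistics

random.seed(0)
latencies = [random.gauss(50, 5) for _ in range(990)]     # normal requests, ~50 ms
latencies += [random.gauss(800, 100) for _ in range(10)]  # 1% contention spikes

latencies.sort()
mean = statistics.mean(latencies)
p50 = latencies[int(0.50 * len(latencies))]
p99 = latencies[int(0.99 * len(latencies))]

print(f"mean={mean:.0f} ms  p50={p50:.0f} ms  p99={p99:.0f} ms")
```

The mean barely moves, while P99 is an order of magnitude worse than the median. A benchmark that reports only the average would call this system healthy.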

5. Batching Policies: Throughput vs. User Experience

Batching improves GPU efficiency — but at a cost.
  • Larger batches increase time-to-first-token (TTFT)
  • Interactive workloads suffer
  • Tail latency becomes unpredictable

The real tradeoff
  • Optimize for throughput → unhappy users
  • Optimize for responsiveness → idle GPUs
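The tradeoff can be sketched with a minimal batching model. The arrival rate and kernel time below are assumptions for illustration, and the throughput figure is GPU-side capacity, not sustained load:

```python
ARRIVAL_MS = 10   # one request every 10 ms (assumption)
KERNEL_MS = 40    # GPU forward pass per batch (assumption)

def ttft_ms(batch_size):
    # Worst case for the first request in a batch: it waits for
    # batch_size - 1 more arrivals before the kernel launches.
    fill_wait = (batch_size - 1) * ARRIVAL_MS
    return fill_wait + KERNEL_MS

def throughput_rps(batch_size):
    # One kernel launch serves batch_size requests.
    return batch_size / (KERNEL_MS / 1000)

for b in (1, 8, 32):
    print(f"batch={b:2d}  TTFT={ttft_ms(b):4.0f} ms  "
          f"GPU capacity={throughput_rps(b):5.0f} req/s")
```

Under these assumptions, going from batch 1 to batch 32 multiplies GPU capacity 32× but pushes worst-case TTFT from 40 ms to 350 ms. That is the throughput-versus-responsiveness knob in one loop.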

Most teams optimize for averages and are shocked when P99 latency explodes.

This is a classic Amplification Trap: small inefficiencies scale linearly with usage and rapidly dominate cost.

6. Runtime Choices: Same Model, Radically Different Results

Inference behavior varies wildly depending on the runtime stack.
Differences emerge in:
  • Scheduler design
  • KV-cache layout
  • Tensor parallelism strategy
  • Memory allocation behavior
  • Token streaming architecture

Two teams can deploy the same model on the same GPUs and see 2–5× differences in latency and cost.

Treating the inference runtime as an “implementation detail” is one of the most expensive mistakes teams make.

The Real Bottleneck Stack (What Architects Should Model)

Instead of starting with GPUs, architects should model:
  1. Prompt length distributions
  2. Tokenization throughput per CPU core
  3. KV-cache growth vs. concurrency
  4. Queueing and scheduling behavior
  5. Network topology and jitter
  6. Batching policies aligned to SLAs
  7. Runtime memory behavior
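As a first pass, those factors can be rolled into a simple time-to-first-token budget. Every number below is an assumption for illustration; the takeaway is that the GPU forward pass is one line item among several:

```python
# Illustrative TTFT budget across the pipeline stages described above.
# All values are assumed for the sketch, not measured.
budget_ms = {
    "tokenization (CPU)":          12,
    "queueing / scheduling":       25,
    "batch fill wait":             30,
    "network hops":                 8,
    "GPU forward pass":            40,
    "detokenize + stream setup":    5,
}

total = sum(budget_ms.values())
gpu = budget_ms["GPU forward pass"]

print(f"total TTFT budget: {total} ms; GPU share: {gpu / total:.0%}")
```

In this sketch the GPU accounts for roughly a third of TTFT. Doubling GPU speed would shave 20 ms; fixing the queueing and batching policy could shave more, for far less money.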

Only after this do TFLOPs become relevant.

This aligns directly with the failure patterns outlined in Why AI Projects Fail – The 5 Pillars: inference failures are rarely about models alone. They are architectural, operational, and economic failures.

Why This Keeps Happening

Because:
  • Hardware specs are easy to reason about
  • GPUs are visible on budgets
  • Software bottlenecks don’t show up on invoices

The result is a familiar pattern: plenty of GPUs, poor latency, and rising inference costs with no obvious explanation.

Inference Is a Systems Problem

LLM inference is not:
  • A model problem
  • A GPU problem
  • Even strictly an ML problem

It is a distributed systems problem with tight latency constraints and brutal cost sensitivity.

This is why inference architecture fits naturally into an AI Factory mindset: inference must be designed, measured, governed, and optimized as a production system — not bolted onto general infrastructure after the fact.

If you only size for TFLOPs and VRAM, you’re optimizing the least interesting part of the stack.

Related Reading on Virtualization Velocity

  • The AI Illusion: Why Most AI Investments Don’t Deliver Outcomes
  • From Discovery to AI Outcomes: A Proven Framework for Enterprise AI
  • Why AI Projects Fail – The 5 Pillars
  • The Amplification Trap: How AI Scales Cost Faster Than Value

Virtualization Velocity

© 2025 Brandon Seymour. All rights reserved.
