
Real-World Lessons in Rightsizing AI Workloads with VMware Cloud Foundation

8/29/2025

At VMware Explore 2025, I sat in on one of the most practical and eye-opening technical sessions of the week: Real-World Lessons in Rightsizing VMware Cloud Foundation for On-Premises AI Workloads (INVB1300LV), led by Frank Denneman (Chief Technologist AI, Broadcom) and Johan van Amersfoort (Chief Evangelist & AI Lead, ITQ).

This Technical 300-level breakout wasn’t about flashy demos or abstract strategy — it was about the nuts and bolts of running real AI workloads on VMware Cloud Foundation (VCF) with NVIDIA GPUs, and how to right-size infrastructure to meet the growing demands of developers and enterprises alike.

Speakers and Session Context

[Image: Session title slide for Real-World Lessons in Rightsizing VMware Cloud Foundation for On-Premises AI Workloads (INVB1300LV), presented by Frank Denneman (Broadcom) and Johan van Amersfoort (ITQ).]

Developers vs. IT Admins: Two Worlds, One Reality

Frank and Johan started by illustrating a cultural tension that’s shaping AI adoption:
  • Modern Developers: trained to consume cloud-native services with the assumption that resources are unlimited and build times are measured in seconds.
  • IT Admins: responsible for CapEx-heavy private clouds, where resources feel finite, build times stretch longer, and every GPU allocation must be justified.
[Image: IT Admins view cloud through the lens of private infrastructure — finite, resource-heavy, and CapEx-driven.]
[Image: Modern Developers expect infinite resources and cloud-speed build times measured in seconds.]

Bridging this gap requires FinOps for AI and early conversations about sizing, scalability, and the impact of model choices on infrastructure.

The RAG Baseline: Why Most Enterprises Start Here

The presenters made it clear: Retrieval-Augmented Generation (RAG) is the most common enterprise AI workload today.
  • Ingestion Pipeline: Data sources flow through a CPU-driven framework, enriched by GPU-based embedding models, and stored in vector databases.
  • Retrieval Pipeline: User prompts are transformed by embeddings, searched against the vector DB, and routed through LLMs and optional reranker models for precision.
[Image: RAG Ingestion: Data frameworks, embeddings, and vector DBs form the backbone of enterprise AI pipelines.]
[Image: RAG Retrieval: User prompts flow through embeddings, LLMs, and reranker models for accuracy and precision.]

The lesson: RAG spans CPUs and GPUs, and right-sizing requires a holistic view of both ingestion and retrieval stages.
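Since the session framed RAG as two cooperating pipelines, a minimal sketch may help make the flow concrete. Everything here is a stand-in, not any product’s implementation: the toy embed() (a bag-of-words vector) plays the role of the GPU-based embedding model, a plain Python list plays the vector database, and the final print marks where the LLM and optional reranker would take over.

```python
import math

VOCAB: dict[str, int] = {}

def embed(text: str) -> dict[int, float]:
    # Toy embedding: a sparse bag-of-words vector. In a real pipeline this
    # is the GPU-hosted embedding model from the ingestion diagram.
    vec: dict[int, float] = {}
    for word in text.lower().split():
        idx = VOCAB.setdefault(word, len(VOCAB))
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec

def cosine(a: dict[int, float], b: dict[int, float]) -> float:
    dot = sum(v * b.get(i, 0.0) for i, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Ingestion pipeline: chunk -> embed -> store in the "vector DB" (a list).
documents = [
    "VCF exposes NVIDIA vGPU profiles to AI workloads",
    "KVCache grows with the context window size",
    "NVLink raises bandwidth between GPUs in a scale-up cluster",
]
vector_db = [(doc, embed(doc)) for doc in documents]

# Retrieval pipeline: embed the prompt -> similarity search -> LLM context.
prompt = "how does context window size affect memory"
query = embed(prompt)
best_doc, _ = max(vector_db, key=lambda item: cosine(query, item[1]))
print("Context handed to the LLM:", best_doc)
# A reranker model would re-score the top-k hits here before the LLM call.
```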

Understanding Memory: Static vs. Dynamic

One of the most important takeaways was the distinction between static and dynamic memory consumption in AI models:
  • Static Memory: The baseline cost of loading model weights into GPU memory (e.g., a 70B parameter model at FP16 = ~140 GB).
  • Dynamic Memory: Scales with context window size, architecture, and key-value caches (KVCache). At scale, KVCache can dwarf static memory requirements.
[Image: Static load is only the beginning — dynamic memory usage (KVCache) grows with context size and can dwarf baseline requirements.]

👉 For example: a LLaMA 70B model with a 32K context window can require up to 860 GB of dynamic memory — far beyond the static footprint.

This is why GPU sizing must account for peak usage, not just model parameters.
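To see where numbers like these come from, here is a back-of-the-envelope calculator for both memory classes. The KVCache accounting (two cached tensors per layer per token, sized by KV heads x head dimension x bytes per value, multiplied by context length and concurrent sessions) is standard transformer math; the architecture values below are my own illustrative assumptions chosen to land near the session’s ~860 GB figure, not its exact inputs.

```python
def static_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    # Weights loaded once into GPU memory: 70B params at FP16 -> ~140 GB.
    return params_billion * bytes_per_param  # (1e9 params x bytes) / 1e9 B/GB

def kvcache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
               context_len: int, sessions: int,
               bytes_per_value: int = 2) -> float:
    # Per token, every layer caches one key and one value tensor (the 2x).
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_len * sessions / 1e9

# Illustrative (assumed) architecture: 80 layers, 64 KV heads of dim 128
# (i.e. no grouped-query attention), 32K context, 10 concurrent sessions.
print(f"Static : {static_memory_gb(70):.0f} GB")                 # ~140 GB
print(f"Dynamic: {kvcache_gb(80, 64, 128, 32_768, 10):.0f} GB")  # ~859 GB
```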

Scale Up or Scale Out?

The session provided a clear framework:
  • Scale Up (more GPUs per model):
    • Complex workloads (e.g., Mixture of Experts).
    • Long context windows.
    • Latency-sensitive use cases.
    • Requires NVLink or NVSwitch interconnects.
  • Scale Out (more model replicas):
    • High user concurrency.
    • Short to moderate context windows.
    • Load-balanced architectures.
    • Works without GPU interconnects.
[Image: A practical guide: scale up with more GPUs per model for complex workloads and long context windows; scale out with replicas for high concurrency.]
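The framework translates almost directly into code. The sketch below is my own restatement of the session’s decision logic, not an official sizing rule; the inputs are the workload traits named above.

```python
def scaling_strategy(mixture_of_experts: bool, long_context: bool,
                     latency_sensitive: bool, high_concurrency: bool) -> str:
    # Restatement of the session's framework as a simple decision function.
    if mixture_of_experts or long_context or latency_sensitive:
        # One bigger model instance spanning GPUs; efficiency depends on
        # NVLink/NVSwitch-class interconnects between them.
        return "scale up: more GPUs per model"
    if high_concurrency:
        # Many users with short/moderate contexts: replicate the model
        # behind a load balancer; no GPU interconnect required.
        return "scale out: more model replicas"
    return "a single GPU may suffice; measure before buying"

# Example: a long-context, latency-tolerant document assistant.
print(scaling_strategy(mixture_of_experts=False, long_context=True,
                       latency_sensitive=False, high_concurrency=False))
```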

GPUs and Interconnects Matter

Choosing the right GPU (and interconnect) can make or break performance:
  • L40S (48 GB PCIe): Great for small models, notebooks, and testing.
  • H100 SXM (80 GB): Standard in 4–8 GPU HGX servers with NVLink/NVSwitch.
  • H100 NVL (94 GB): PCIe with NVLink — standard for large LLM clusters.
  • H200 NVL (141 GB): Flagship 2025 option for long-context models.
  • AMD MI300X (192 GB OAM): Powerful, but not supported under VMware VCF/vGPU today.
[Image: GPU selection isn’t just about memory size — interconnect bandwidth determines efficiency at scale.]
[Image: NVLink and NVSwitch interconnects dramatically increase bandwidth for scale-up LLM clusters.]
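Putting the memory math and the GPU table together suggests a quick fitting exercise. The catalog below mirrors the session’s lineup; the ceiling-division sizing is a deliberate oversimplification that ignores interconnect topology, vGPU profile granularity, and runtime overhead.

```python
import math

GPU_MEMORY_GB = {           # memory per card, from the session's lineup
    "L40S (PCIe)": 48,
    "H100 SXM":    80,
    "H100 NVL":    94,
    "H200 NVL":    141,
}

def cards_needed(total_gb: float) -> dict[str, int]:
    # Naive ceiling division: how many cards just to hold this much memory.
    return {gpu: math.ceil(total_gb / mem) for gpu, mem in GPU_MEMORY_GB.items()}

# Example: ~140 GB static weights + ~86 GB KVCache for one 32K session.
for gpu, count in cards_needed(226).items():
    print(f"{gpu:>12}: {count} card(s)")
```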

Tooling: Model-to-GPU Sizing Toolkit

A highlight of the session was VMware’s GPU Sizing Toolkit, which connects to Hugging Face to pull model specs and calculate:
  • Static vs. dynamic memory requirements.
  • Tokens/sec throughput.
  • GPU recommendations based on context windows and session concurrency.
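I have not seen the toolkit’s source, but the “pull model specs from Hugging Face” step can be approximated with the public huggingface_hub client. This sketch assumes the target repo publishes safetensors metadata (exposed as the safetensors attribute on model_info); gated models would additionally need an access token, and the repo id is just an example.

```python
from huggingface_hub import HfApi

def static_fp16_gb(repo_id: str) -> float:
    # Fetch model metadata from the Hugging Face Hub and estimate the
    # static weight footprint: total parameter count x 2 bytes (FP16).
    info = HfApi().model_info(repo_id)
    if info.safetensors is None:  # repo without safetensors metadata
        raise ValueError(f"{repo_id} does not publish a parameter count")
    return info.safetensors.total * 2 / 1e9

# Example with a public (non-gated) model repo:
print(f"{static_fp16_gb('mistralai/Mistral-7B-v0.1'):.0f} GB static at FP16")
```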
[Image: The GPU Memory Calculator provides quick insights into static and dynamic model requirements.]
[Image: The Model-to-GPU Sizing Toolkit helps map Hugging Face models to real infrastructure decisions.]

The roadmap includes support for gated models (Nemotron), advanced KVCache calculations, and coverage for more workflows like training, computer vision, and agentic AI.

Bridging the Gap & Key Takeaways

[Image: Bridging the cultural divide: sizing and scaling decisions require Dev experience + IT discipline.]
  1. RAG is where most enterprises start — and it requires balancing CPU and GPU workloads.
  2. GPU sizing must account for both static and dynamic memory, especially KVCache growth.
  3. Scale Up vs. Scale Out depends on workload type: complex long-context LLMs vs. high concurrency, short-context applications.
  4. Interconnects are critical — NVLink and NVSwitch unlock scale-up efficiency.
  5. Tools like the GPU Sizing Toolkit help translate model specs into real infrastructure decisions.
  6. Bridging the Dev vs. IT gap is as important as technical right-sizing. Developers want instant cloud-like experiences; IT leaders must balance budgets and capacity.
This session delivered exactly what the title promised: real-world lessons. If you’re architecting AI workloads on VMware Cloud Foundation, the message is clear: get sizing right, or you’ll pay for it in performance, cost, or both.

References:

GPU Memory Calculator

