At VMware Explore 2025, I sat in on one of the most practical and eye-opening technical sessions of the week: Real-World Lessons in Rightsizing VMware Cloud Foundation for On-Premises AI Workloads (INVB1300LV), led by Frank Denneman (Chief Technologist AI, Broadcom) and Johan van Amersfoort (Chief Evangelist & AI Lead, ITQ). This Technical 300-level breakout wasn't about flashy demos or abstract strategy; it was about the nuts and bolts of running real AI workloads on VMware Cloud Foundation (VCF) with NVIDIA GPUs, and how to right-size infrastructure to meet the growing demands of developers and enterprises alike.

Speakers and Session Context
Session: Real-World Lessons in Rightsizing VMware Cloud Foundation for On-Premises AI Workloads (INVB1300LV), presented by Frank Denneman (Broadcom) and Johan van Amersfoort (ITQ).

Developers vs. IT Admins: Two Worlds, One Reality
Frank and Johan started by illustrating a cultural tension that's shaping AI adoption:
IT Admins view cloud through the lens of private infrastructure: finite, resource-heavy, and CapEx-driven.
Modern Developers expect infinite resources and cloud-speed build times measured in seconds.
Bridging this gap requires FinOps for AI and early conversations about sizing, scalability, and the impact of model choices on infrastructure.

The RAG Baseline: Why Most Enterprises Start Here
The presenters made it clear: Retrieval-Augmented Generation (RAG) is the most common enterprise AI workload today.
RAG Ingestion: Data frameworks, embeddings, and vector databases form the backbone of enterprise AI pipelines.
RAG Retrieval: User prompts flow through embedding models, the LLM, and reranker models for accuracy and precision.
The lesson: RAG spans CPUs and GPUs, and right-sizing requires a holistic view of both the ingestion and retrieval stages. A simplified sketch of such a pipeline follows below.
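To make the ingestion/retrieval split concrete, here is a minimal, illustrative RAG skeleton. The library (sentence-transformers), the embedding model, and the in-memory index are my own assumptions for the sketch and were not part of the session; a production pipeline would add chunking, a real vector database, a reranker model, and the generating LLM.

```python
# Minimal RAG sketch: ingestion embeds documents into an in-memory index,
# retrieval embeds the prompt and returns the most similar chunks.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # CPU- or GPU-backed embedding model

# --- Ingestion: chunk -> embed -> store ---
docs = ["VCF sizing guide ...", "GPU interconnect notes ...", "KV cache basics ..."]
doc_vectors = embedder.encode(docs, normalize_embeddings=True)  # shape (n_docs, dim)

# --- Retrieval: embed prompt -> cosine similarity -> top-k context ---
def retrieve(prompt: str, k: int = 2) -> list[str]:
    q = embedder.encode([prompt], normalize_embeddings=True)[0]
    scores = doc_vectors @ q                      # cosine similarity (vectors are normalized)
    top = np.argsort(scores)[::-1][:k]
    return [docs[i] for i in top]

context = retrieve("How do I size GPUs for long context windows?")
print(context)
# A reranker model and the generating LLM would sit after this step;
# each adds its own GPU memory and latency budget.
```

Even in this toy form, the split is visible: ingestion is embarrassingly parallel and often CPU-friendly, while retrieval-time components (embedding, reranking, generation) are the ones competing for GPU memory.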
Understanding Memory: Static vs. Dynamic
One of the most important takeaways was the distinction between static and dynamic memory consumption in AI models:
Static load (the model weights) is only the beginning; dynamic memory usage (the KV cache) grows with context size and can dwarf the baseline requirement.
👉 For example: a LLaMA 70B model with a 32K context window can require up to 860 GB of dynamic memory, far beyond the static footprint.
This is why GPU sizing must account for peak usage, not just model parameters. The arithmetic behind that kind of figure is sketched below.
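For intuition, the dynamic footprint follows the standard KV-cache formula: two tensors (K and V) per layer, sized by the number of KV heads, the head dimension, the context length, the batch size, and the bytes per value. The concrete layer/head counts and the FP16 assumption below are mine, not the session's, so treat the outputs as order-of-magnitude estimates rather than the source of the 860 GB number.

```python
# Back-of-the-envelope KV-cache sizing:
#   bytes = 2 (K and V) * layers * kv_heads * head_dim * context_len * batch * bytes_per_value
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_len: int, batch: int = 1, bytes_per_value: int = 2) -> float:
    total = 2 * layers * kv_heads * head_dim * context_len * batch * bytes_per_value
    return total / 1e9

# 70B-class model, 32K context, FP16: multi-head attention vs. grouped-query attention
print(kv_cache_gb(layers=80, kv_heads=64, head_dim=128, context_len=32_768))  # ~86 GB per sequence
print(kv_cache_gb(layers=80, kv_heads=8,  head_dim=128, context_len=32_768))  # ~10.7 GB per sequence
```

Roughly ten concurrent 32K-token sequences under the first (full multi-head attention, FP16) set of assumptions would land near the 860 GB figure quoted above; the session's exact assumptions were not spelled out here, but the takeaway is the same: context length and concurrency, not parameter count, drive peak memory.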
Scale Up or Scale Out?
The session provided a clear, practical framework: scale up with more GPUs per model instance for complex workloads and long context windows; scale out with additional replicas for high concurrency, as the serving sketch below illustrates.
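One way this trade-off shows up in practice is in how a serving stack is configured. The sketch below assumes a vLLM-style stack; vLLM was not named in the session, and the model ID and GPU counts are placeholders, so read it as an illustration of the scale-up/scale-out distinction rather than a recommendation.

```python
from vllm import LLM, SamplingParams

# Scale up: shard one model instance across more GPUs (tensor parallelism)
# to fit a large model and a long context window.
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model ID
          tensor_parallel_size=4)                      # 4 GPUs serve one replica

# Scale out: for higher concurrency, run several such replicas behind a
# load balancer (e.g., multiple pods), each with its own GPU allocation.
out = llm.generate(["Summarize our VCF sizing guidelines."],
                   SamplingParams(max_tokens=128))
print(out[0].outputs[0].text)
```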
GPUs and Interconnects Matter
Choosing the right GPU (and interconnect) can make or break performance:
GPU selection isn't just about memory size; interconnect bandwidth determines efficiency at scale.
NVLink and NVSwitch interconnects dramatically increase GPU-to-GPU bandwidth for scale-up LLM clusters.
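A practical first step before committing to a scale-up layout is simply checking how the GPUs in a host are connected. The small wrapper below shells out to the standard NVIDIA topology query; it is a convenience sketch, not something from the session.

```python
# Print the GPU interconnect topology matrix. In the output, NV# entries
# indicate NVLink paths, while PIX/PHB/SYS indicate PCIe or system hops.
import subprocess

topology = subprocess.run(["nvidia-smi", "topo", "-m"],
                          capture_output=True, text=True, check=True)
print(topology.stdout)
```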
Tooling: Model-to-GPU Sizing Toolkit
A highlight of the session was VMware's GPU Sizing Toolkit, which connects to Hugging Face to pull model specs and calculate sizing requirements:
The GPU Memory Calculator provides quick insights into static and dynamic model requirements.
The Model-to-GPU Sizing Toolkit helps map Hugging Face models to real infrastructure decisions.
The roadmap includes support for gated models (Nemotron), advanced KV-cache calculations, and coverage for more workflows such as training, computer vision, and agentic AI.
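The toolkit itself isn't reproduced here; the sketch below only illustrates the underlying idea of pulling a model's config from Hugging Face and estimating its static weight footprint. The LLaMA-style parameter approximation, the TinyLlama model ID, and the FP16 assumption are mine, so expect rough numbers rather than the toolkit's output.

```python
# Rough static-memory estimate for a LLaMA-style decoder, driven by the
# model's Hugging Face config (no weights are downloaded).
from transformers import AutoConfig

def static_weight_gb(model_id: str, bytes_per_param: int = 2) -> float:
    cfg = AutoConfig.from_pretrained(model_id)
    h, layers = cfg.hidden_size, cfg.num_hidden_layers
    head_dim = h // cfg.num_attention_heads
    kv_dim = cfg.num_key_value_heads * head_dim
    per_layer = (2 * h * h + 2 * h * kv_dim            # Q, O + K, V projections
                 + 3 * h * cfg.intermediate_size)       # SwiGLU MLP
    params = cfg.vocab_size * h * 2 + layers * per_layer  # embeddings + lm_head + blocks
    return params * bytes_per_param / 1e9

print(static_weight_gb("TinyLlama/TinyLlama-1.1B-Chat-v1.0"))  # roughly ~2.2 GB in FP16
```

Add the dynamic KV-cache estimate from the earlier sketch on top of this static figure, and you have the core of a model-to-GPU sizing exercise.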
Bridging the Gap & Key Takeaways
Bridging the cultural divide: sizing and scaling decisions require Dev experience plus IT discipline.
This session delivered exactly what the title promised: real-world lessons. If you're architecting AI workloads on VMware Cloud Foundation, the message is clear: get sizing right, or you'll pay for it in performance, cost, or both.