
The Double Descent: Why Bigger Models Demand Smarter Infrastructure

4/4/2026

For a long time, there was a rule everyone in modeling followed—whether you were in finance, statistics, or early machine learning:

Keep the model simple.

The reasoning was straightforward. If you added too many parameters, your model would overfit—memorize the past instead of learning something that generalizes. Simpler models were safer. More stable. Easier to trust.

That rule shaped decades of thinking in finance in particular. Factor models stayed small. Linear relationships dominated. Parsimony wasn’t just a preference; it was doctrine.
But something has changed.

Recent work in financial machine learning—and increasingly, real-world practice—has revealed a pattern that directly contradicts that intuition:

Models with more parameters than data points can perform better out of sample.
​
This isn’t just theory. At the Future Alpha quant event, in a session on Machine Learning, Market Risk, and the Future of Asset Pricing, the message was clear: leading firms are moving away from small, interpretable models toward highly parameterized ones that better reflect the actual structure of markets.

[Image: "Where the shift toward model complexity is being actively discussed in finance."]
To understand why, you must start by questioning the original assumption.

The Hidden Assumption Behind Simplicity

When we say, “keep models simple,” we’re implicitly assuming something deeper:
That the system we’re modeling is simple enough to be captured that way.
​
In finance, that assumption doesn’t hold.

Markets are not governed by clean, linear relationships. The effect of one variable depends on the state of others. Signals interact. Regimes shift. Noise dominates.

Take something as basic as predicting returns. A simple model might assume that valuation or momentum independently explains returns. But in reality, those relationships are conditional. Momentum behaves differently in high-volatility environments than in low-volatility ones. Liquidity, macro conditions, and positioning all interact.

A linear model flattens all of that into additive effects. It doesn’t fail loudly; it fails quietly, by missing structure.

For years, that failure was interpreted as noise in the data.

But increasingly, it looks like something else:
The model wasn’t too complex. It was too simple.
[Image: "We don’t know the true function—so we approximate it."]

What Double Descent Actually Means

The concept of double descent gives us a way to understand what has changed.

In the traditional view of modeling, there is a tradeoff between underfitting and overfitting. As a model becomes more complex, its performance improves at first because it can capture more patterns in the data. But beyond a certain point, adding more parameters is expected to hurt performance: the model becomes too flexible, starts memorizing the training data, and fails to generalize. This produces the familiar U-shaped curve.

Double descent shows that the story does not end there.

As model complexity continues to increase, something unexpected happens. After the point where the model has just enough capacity to perfectly fit the training data—the most unstable point—performance does not keep getting worse. Instead, it begins to improve again. The curve doesn’t simply go down and then up. It goes down, spikes, and then descends a second time, often reaching lower error than simpler models ever achieved.

To make this more concrete, it helps to define a simple ratio:
C = number of parameters ÷ number of data points

This ratio determines the regime your model is operating in.

When C is less than 1, the model does not have enough capacity to fully capture the structure of the data. This is the classical regime—stable but often underfit.

As C approaches 1, the model reaches a critical point. It now has just enough parameters to perfectly interpolate the training data. This is where instability peaks. Small changes in the data can lead to large changes in the model, and generalization suffers. This is the “danger zone” traditional approaches were designed to avoid.

But when C becomes much greater than 1, the behavior changes again. The model enters an overparameterized regime where it is flexible enough to represent many possible solutions. Instead of locking into a fragile fit, the learning process implicitly favors solutions that generalize better.

This is the second descent—and the point where traditional intuition breaks down.

A useful way to think about it is this:
The most dangerous model is often not the biggest one.
It is the one sitting right at the edge of having just enough capacity to fit the data, but not enough scale to become stable again.
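Here is a minimal numerical sketch of that curve in Python (the data-generating function, sample sizes, and random ReLU features are illustrative assumptions, not taken from the papers cited below). It fits minimum-norm least squares on random features and sweeps the parameter count across the interpolation threshold:

import numpy as np

# Minimal double-descent sketch: minimum-norm least squares on random ReLU features.
rng = np.random.default_rng(0)
n_train, n_test, d = 100, 1000, 5

def target(X):
    # A nonlinear "true" function the model must approximate (illustrative choice).
    return np.sin(X[:, 0]) + X[:, 1] * X[:, 2]

X_train = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))
y_train = target(X_train) + 0.1 * rng.normal(size=n_train)
y_test = target(X_test)

for n_params in [20, 50, 90, 100, 110, 200, 500, 2000]:
    W = rng.normal(size=(d, n_params)) / np.sqrt(d)     # fixed random projection
    phi_train = np.maximum(X_train @ W, 0.0)            # random ReLU features
    phi_test = np.maximum(X_test @ W, 0.0)
    beta = np.linalg.pinv(phi_train) @ y_train          # minimum-norm least squares
    mse = np.mean((phi_test @ beta - y_test) ** 2)
    print(f"C = {n_params / n_train:6.2f}   test MSE = {mse:.3f}")

With typical seeds, the out-of-sample error spikes near C = 1, where the model has just enough capacity to interpolate the noisy training data, and then falls again as C grows well past 1: the second descent.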
[Image: "Performance improves again as models become highly complex."]

Why Bigger Models Don’t Behave the Way We Expected

At first glance, this seems impossible. More parameters should mean more variance, more instability, more overfitting.

But that intuition assumes each parameter behaves independently.

In large models, that’s not what happens.

Instead, the model distributes information across many parameters. No single parameter carries the burden of explaining the data. The system becomes redundant in a useful way. Small errors in one part are absorbed by others.

A helpful way to think about it is structural.

A small model is like a rigid frame. It either fits or it doesn’t. There’s no flexibility.

A large model is more like a flexible mesh. It can conform to the underlying structure of the data without relying on any single component.

What emerges is something that looks like regularization, but it isn’t explicitly designed that way. It’s a property of scale.

This is what the research describes as implicit shrinkage. The model becomes both more expressive and more stable at the same time.
[Image: "Large models stabilize through implicit shrinkage."]

What This Looks Like in Finance

This isn’t abstract; it shows up directly in financial modeling.

Consider return prediction using a standard set of predictors—valuation metrics, spreads, momentum signals. In traditional models, these are fed into a linear regression. Each variable contributes independently.

Now take the same inputs and pass them through a nonlinear model—say, a neural network. You haven’t added new data. You’ve changed how the data can be used.

What happens is not just a better fit. The model begins to capture interactions: when signals reinforce each other, when they cancel out, when they matter only in certain regimes.
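As a toy illustration (synthetic data and an assumed “true” return process of my own construction, not any real dataset or the models used in the cited research), feed the same three inputs into an additive linear regression and into a small neural network, where momentum only pays off in low-volatility regimes:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
n = 4000
momentum = rng.normal(size=n)
volatility = rng.uniform(0.1, 1.0, size=n)
value = rng.normal(size=n)

# Assumed "true" process: momentum matters only in low-volatility regimes
# (an interaction), plus a small additive value effect and noise.
ret = momentum * (volatility < 0.4) + 0.2 * value + rng.normal(scale=0.5, size=n)

X = np.column_stack([momentum, volatility, value])
X_tr, X_te, y_tr, y_te = X[:3000], X[3000:], ret[:3000], ret[3000:]

linear = LinearRegression().fit(X_tr, y_tr)
mlp = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0).fit(X_tr, y_tr)

print("linear OOS R^2:", round(r2_score(y_te, linear.predict(X_te)), 3))
print("MLP    OOS R^2:", round(r2_score(y_te, mlp.predict(X_te)), 3))

On this toy setup the nonlinear model typically recovers noticeably more out-of-sample R², because the linear fit can only average the regime-dependent momentum effect into a single coefficient.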

Empirically, what you see is that as you increase the number of parameters—holding the input data fixed—out-of-sample performance improves and then stabilizes. It doesn’t collapse.

The same pattern appears in asset pricing. Traditional factor models use a handful of linear factors. When those same factors are used in a high-dimensional nonlinear model, performance improves dramatically—not because the inputs changed, but because the representation did.

The limitation was never the data.
It was the model’s capacity to use it.
[Image: "The tradeoff: simplicity vs representational power."]

The Infrastructure Reality: Every Parameter Has a Cost

This is where the conversation shifts from modeling to systems.

A parameter is not an abstract concept. It is a number that must be stored, moved, and accessed during computation.

That means:
Every parameter must be loaded into memory to be used.

As models grow, often by an order of magnitude year over year, their memory footprint grows with them. A model with 100 billion parameters requires roughly 200 GB of memory just to hold the weights in FP16 (two bytes per parameter).

That doesn’t fit on a single GPU. It doesn’t even fit comfortably across a few.
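The arithmetic behind that claim is worth making explicit. A rough sketch (the 100-billion-parameter size, FP16 weights, and 80 GB per-device capacity are illustrative assumptions, not a specific product spec):

# Back-of-the-envelope memory math for the model weights alone.
params = 100e9              # 100B-parameter model (illustrative)
bytes_per_param = 2         # FP16 stores each weight in 2 bytes
weights_gb = params * bytes_per_param / 1e9

gpu_vram_gb = 80            # e.g., one 80 GB accelerator (illustrative)
print(f"weights alone: {weights_gb:.0f} GB")                        # ~200 GB
print(f"devices needed just for weights: {weights_gb / gpu_vram_gb:.1f}")
# Optimizer state, activations, and KV caches add substantially more on top.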
So, the problem becomes architectural.

You have to shard the model across devices. You must move activations between GPUs. You must coordinate computation across nodes. At that point, the limiting factor is no longer raw compute.

It’s memory capacity and memory bandwidth.

This is why the real bottlenecks in modern AI systems are:
  • VRAM capacity
  • interconnect speed (NVLink, InfiniBand)
  • communication overhead
Not FLOPS.

Data Isn’t the Limiting Factor We Thought It Was

In finance, this creates a particularly interesting tension.

Data is scarce. You don’t get millions of independent samples. You get time series—hundreds of observations, maybe thousands if you’re lucky.

By classical logic, that should force you into small models.

But the empirical evidence shows the opposite. Larger models still perform better.

The reason is subtle but important:
Data determines how much information is available.
Model capacity determines how much of that information you can extract.

A small model leaves signal on the table. A larger model can capture structure that would otherwise be lost—not by adding data, but by using it more effectively.

Model vs System Complexity

This is where the discussion benefits from refinement.

It’s tempting to say, “larger models mean more complexity.” But that’s not quite right.

A model—even a large one—is still just a function. It maps inputs to outputs. It can be complex in representation, but it is conceptually self-contained.

The real operational complexity shows up elsewhere.

As highlighted in work from Berkeley AI Research, modern AI applications are often compound systems—pipelines that involve multiple models, retrieval steps, tools, and orchestration layers.

That’s where engineering complexity explodes:
  • dependencies between components
  • failure modes across steps
  • latency accumulation
  • state management

A system built from many small pieces can become extremely complex to operate.

This leads to a more precise framing:
Model complexity is intentional. System complexity is emergent.

And that leads to a real design decision.

The Tradeoff We’re Actually Making

You don’t eliminate complexity in AI systems.

You decide where it lives.

If you use small models, you often compensate with:
  • manual feature engineering
  • multiple pipelines
  • rule-based logic

The complexity doesn’t disappear. It moves into the system.

If you use large models, more of that complexity is absorbed into the learned representation. The system around it can often be simpler.

So, the question becomes:
Do you want complexity expressed in code and infrastructure—or learned inside the model?

The Real Advantage

This is where the original statement needs to be refined.

It’s not that complexity itself is valuable.

Unnecessary complexity is always a liability.

But in systems that are inherently complex, like financial markets, insufficient model capacity is also a liability.

The advantage comes from knowing how to balance the two:
placing complexity where it can be managed and where it creates value.

Final Thought

The shift we’re seeing isn’t from simple systems to complex ones.
It’s from manually constructed simplicity to learned complexity.

For years, we simplified problems to fit our models.
Now, we are building models capable of fitting the problem.

That changes where the burden of complexity lives.

It no longer sits in handcrafted features, brittle pipelines, and layers of rules.
It moves into the model itself—where it can be learned, optimized, and continuously improved.

The organizations that win won’t be the ones with the simplest models, or the most elaborate systems.

They will be the ones that understand this distinction—and act on it.

Great systems minimize operational complexity.
Great models absorb real-world complexity.

And the real advantage?

Knowing where complexity belongs—and having the infrastructure to support it once you put it there.

References:

Primary Source (Financial Modeling & Core Thesis)
Kelly, B. (2023). The Virtue of Complexity in Return Prediction. The Journal of Finance, 78(6), 3109–3159.

Event Context (Industry Application)
Kelly, B. (2026). The Virtue of Complexity. Presented at Future Alpha: Machine Learning, Market Risk, and the Future of Asset Pricing.
(Concepts in this article are informed by this session and related research.)

Machine Learning & Double Descent
Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences (PNAS).
Nakkiran, P., et al. (2020). Deep Double Descent: Where Bigger Models and More Data Hurt. arXiv.

Financial Machine Learning (Empirical Support)
Gu, S., Kelly, B., & Xiu, D. (2020). Empirical Asset Pricing via Machine Learning. Review of Financial Studies.
Goyal, A., & Welch, I. (2008). A Comprehensive Look at The Empirical Performance of Equity Premium Prediction. Review of Financial Studies.

Foundations of Statistical Modeling
Box, G. E. P., & Jenkins, G. M. (1970). Time Series Analysis: Forecasting and Control.
Statistical Model — foundational definition of models used throughout statistics and machine learning

System vs Model Complexity (Modern AI Systems)
Berkeley AI Research (2024). Compound AI Systems.
https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/
