The Illusion of Deep Learning: Why "Stacking Layers" Is No Longer Enough
With nested learning, Google Research proposes shifting from static AI to a dynamic architecture, inspired by brain frequencies.
Have we reached the limit of what we can achieve with our current AI models? At the very heart of the race for parameters and power waged by Big Tech players, a fundamental question emerges: Do our AIs truly understand the changing world, or are they simply reciting a frozen past?
In their paper "Nested Learning: The Illusion of Deep Learning Architectures" (1), the Google Research team is unequivocal: our large language models (LLMs) suffer from a form of "anterograde amnesia." Like patient Henry Molaison, a famous clinical case (2) who was incapable of forming new memories after his operation, our models are frozen once their training is complete.
To move past this, stacking neural layers is no longer enough. Since 2020, the industry has followed the "Scaling Laws" (3), which describe a power-law relationship between a model's size (along with its training data and compute) and its performance. To appreciate the scale involved, consider that in 2020, OpenAI's GPT-3 stacked 96 successive neural layers for 175 billion parameters. In a way, every word we sent it had to travel through that 96-story tower to be processed. In 2024, Meta's Llama 3.1, with 405 billion parameters, pushed this logic of density to the extreme with 126 layers.
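To make the power-law idea concrete, here is a minimal Python sketch of the model-size law from the Scaling Laws paper (3), L(N) = (N_c / N)^alpha_N. The two constants are the values reported by Kaplan et al. for their own setup (non-embedding parameters, loss in nats per token); the function names are hypothetical, and applying the formula to other model families is an illustration only, not a prediction.

```python
# Illustrative sketch: power-law form and constants as reported by
# Kaplan et al. (2020) for their own experimental setup.
ALPHA_N = 0.076   # exponent for model size
N_C = 8.8e13      # scale constant for model size (non-embedding parameters)


def predicted_loss(n_params: float) -> float:
    """Loss given by L(N) = (N_c / N) ** alpha_N, assuming data and compute are not limiting."""
    return (N_C / n_params) ** ALPHA_N


def loss_ratio(scale_factor: float) -> float:
    """Factor by which the predicted loss is multiplied when the model grows by scale_factor."""
    return scale_factor ** (-ALPHA_N)


if __name__ == "__main__":
    # Parameter counts taken from the article, used here purely as inputs.
    for n in (1.3e9, 175e9, 405e9):
        print(f"{n:.3g} parameters -> predicted loss {predicted_loss(n):.3f}")
    print(f"Doubling the model multiplies the loss by {loss_ratio(2):.3f}")
```

With such a small exponent, each doubling of the parameter count reduces the predicted loss by only about 5%, which is one way to see why stacking ever more layers yields diminishing returns.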
From Static (Colossal!) Architecture to Nested Optimization
To understand the origin of the problem, we must look under the hood of LLMs, at the very heart of their architecture: deep learning (DL). Until now, we have built DL models like a mille-feuille, superimposing layers to gain capacity for abstraction. But this approach freezes a reality that is in fact dynamic. The "Nested Learning" (NL) paper proposes a radically different vision. Instead of seeing a model as a monolithic block, NL decomposes it into a series of nested optimization problems, operating in parallel or in a hierarchy, each with its own context flow (the stream of information it learns from).
The major revelation for technical experts lies in the redefinition of the optimizer. We were used to viewing the optimization algorithm as a simple tool for updating weights: a fixed-rule mechanism that applies principles such as gradient descent to adjust parameters and reduce error. Nested Learning shows that an optimizer is actually a form of associative memory. Concretely, when an optimizer "learns," it does more than apply a formula: it compresses gradients (past error signals) into its internal state and uses that compressed history to adjust the model. It's not just a calculation. It’s an active memorization of the learning experience. The idea is to design "Deep Optimizers," components that do not merely follow a fixed rule but possess their own "depth" and capacity for adaptation.
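The "optimizer as memory" reading can be illustrated with ordinary momentum. The sketch below is not the paper's formulation, only a minimal Python example under simple assumptions (scalar gradients, hypothetical function names). It shows that the single momentum state carried from step to step is exactly an exponentially weighted summary of every past gradient, in other words a lossy compression of the learning history.

```python
from typing import Callable, List


def momentum_memory(grads: List[float], beta: float = 0.9) -> float:
    """Recurrent form: one scalar state, updated as m = beta * m + g."""
    m = 0.0
    for g in grads:
        m = beta * m + g
    return m


def explicit_recall(grads: List[float], beta: float = 0.9) -> float:
    """Unrolled form: exponentially weighted sum over the full gradient history."""
    t = len(grads)
    return sum(beta ** (t - 1 - i) * g for i, g in enumerate(grads))


def sgd_momentum(grad_fn: Callable[[float], float], w: float, steps: int,
                 lr: float = 0.1, beta: float = 0.9) -> float:
    """Minimise a 1-D function; besides w, the only state is the memory m."""
    m = 0.0
    for _ in range(steps):
        m = beta * m + grad_fn(w)   # write the new gradient into memory
        w = w - lr * m              # read the memory to update the weight
    return w


if __name__ == "__main__":
    history = [0.5, -1.0, 2.0, 0.25]
    print(momentum_memory(history), explicit_recall(history))
    # Gradient of (w - 3)^2 is 2 * (w - 3); the minimum is at w = 3.
    print(sgd_momentum(lambda w: 2.0 * (w - 3.0), w=0.0, steps=300))
```

The two forms return the same number, but the recurrent one never stores the history: it keeps a fixed-size state that "remembers" recent gradients strongly and older ones faintly. A "deep" optimizer, in the article's sense, would replace this fixed linear rule with a richer, learnable memory.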
Bio-Inspiration: Synchronizing AI's Clocks
We often compare AI's performance to that of our brains, so it is natural to wonder: if our brains are so efficient at continual learning without forgetting the past, why do AIs fail? The answer may lie in rhythm. The human brain does not operate on a single clock (4). It is governed by neuronal oscillations of different frequencies: slow delta waves for consolidation, and faster rhythms (such as theta and gamma) for encoding and immediate processing. This temporal diversity allows it to manage short-term memory and long-term storage simultaneously.
NL applies this biomimetic principle to neural network architecture. Instead of updating all parameters at the same speed, the model is divided into components operating at different frequencies. Thus, "high-frequency" neurons react instantly to the context (this is the equivalent of working memory) while "low-frequency" neurons integrate the information slowly, consolidating long-term knowledge.
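As a rough illustration of updating components at different frequencies, the Python sketch below trains two parameters on the same signal: a "fast" one updated at every step and a "slow" one updated only every eight steps, using the average of the gradients accumulated in between. This is not the HOPE implementation; the function name, the `periods` mapping, and the toy objective are all assumptions made for the example.

```python
from typing import Callable, Dict

GradFn = Callable[[Dict[str, float], int], Dict[str, float]]


def multi_frequency_train(grad_fn: GradFn, params: Dict[str, float],
                          periods: Dict[str, int], steps: int,
                          lr: float = 0.1) -> Dict[str, int]:
    """Update each parameter only every periods[name] steps, using the average
    of the gradients accumulated since its last update.
    Modifies params in place and returns the number of updates per parameter."""
    accum = {name: 0.0 for name in params}
    updates = {name: 0 for name in params}
    for step in range(1, steps + 1):
        grads = grad_fn(params, step)
        for name, g in grads.items():
            accum[name] += g
            if step % periods[name] == 0:
                params[name] -= lr * accum[name] / periods[name]
                accum[name] = 0.0
                updates[name] += 1
    return updates


def toward_one(params: Dict[str, float], step: int) -> Dict[str, float]:
    """Toy gradient of 0.5 * (p - 1)^2 for every parameter (target value 1.0)."""
    return {name: value - 1.0 for name, value in params.items()}


if __name__ == "__main__":
    params = {"fast": 0.0, "slow": 0.0}
    counts = multi_frequency_train(toward_one, params,
                                   periods={"fast": 1, "slow": 8}, steps=64)
    print(counts)   # number of updates each component received
    print(params)   # the fast component sits much closer to the target
```

After the same 64 steps, the fast parameter has almost reached the target while the slow one has moved only part of the way: the first behaves like working memory that tracks the current context, the second like a store that integrates information gradually.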
The HOPE architecture presented in the study was designed on this principle. It combines a self-modifying model with a continuum memory system, in which memory modules are updated at different frequencies. HOPE does more than just process information; it learns to modify its own update algorithm in real time.
Why Is This Important for Decision-Makers?
This new trajectory holds concrete strategic implications for decision-makers and IT management.
First, frugal efficiency (5). The results are impressive: a HOPE model of only 760 million parameters challenges much heavier classical Transformer architectures on complex reasoning and language modeling tasks. And from 1.3 billion parameters (which remains very small compared to Llama models with 70 or 405 billion parameters), HOPE's advantage becomes striking: it clearly surpasses these classical Transformers (6).
These concrete examples demonstrate that intelligence does not reside solely in the model's size, but in the dynamics of its learning.
Next, transparency. By decomposing the model into mathematically defined optimization sub-problems, NL offers a transparent approach that facilitates real-world adoption. Unlike the impenetrable "opaque box" of classical deep learning, where gradient flows are hidden, NL makes each context flow explicit. For regulated industries, this is a promising avenue towards a more explainable AI (7).
Finally, continual learning, which meets the demand for AI that can evolve. Where current models remain frozen in time after their initial training, a "Nested" architecture is designed to evolve constantly. It becomes capable of integrating new knowledge without overwriting the old. For companies, this potentially means the end of massive, long, and costly retraining cycles, in favor of an agile AI that adapts continuously.
Conclusion
If deep learning has revolutionized information processing, nested learning is now tackling memory and adaptability. We are leaving the era of static architecture engineering to progressively enter that of temporal dynamics.
The paradigm shift is real. The study suggests that the optimizer should no longer be viewed as a tool, but almost as a memory in its own right. By drawing inspiration not only from the brain's structure but also from its learning rhythms, we are making a genuine technological leap.
As for the horizon of AGI, if it is attainable, it probably no longer resides in the gigantism of models, but in a machine’s capacity to master its own temporality.
Sources and References
1. A. Behrouz, M. Razaviyayn, P. Zhong, V. Mirrokni: “Nested Learning: The Illusion of Deep Learning Architectures” [link]
2. W. Beecher Scoville and B. Milner: “Loss of recent memory after bilateral hippocampal lesions” [link]
3. J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, D. Amodei: “Scaling Laws for Neural Language Models” [link]
4. D. B. Headley, D. Paré: “Common oscillatory mechanisms across multiple memory systems” [link]
5. F. Jacquet: “Frugal AI: How Efficiency is Reshaping the Future of Tech” [link]
6. A. Behrouz, V. Mirrokni: “Introducing Nested Learning: A new ML paradigm for continual learning” [link]
7. M.A. Iehl, F. Jacquet: “Toward Explainable AI (Part I): Bridging Theory and Practice—Why AI Needs to Be Explainable” [link]
Opinions expressed by DZone contributors are their own.