Series (2/4): Toward a Shared Language Between Humans and Machines — From Multimodality to World Models: Teaching Machines to Experience
Explores technologies attempting to bridge the gap through perception: multimodal systems, digital twins, and research efforts to create World Models.
What if the key to a shared language lay in experience itself?
Researchers are now exploring approaches that connect text with images, sounds, and interactions within a three-dimensional world. Sensorimotor grounding, multimodal perception, and world models: all of these paths aim to give machines the kind of anchoring they still so painfully lack.
Since the machine shares neither our cultural memory nor our perception of the world, several ways can be imagined to bridge that gap.
Connecting Language With Real-World Experience
The first is to “root symbols in sensorimotor experience.” As early as the 1990s, Stevan Harnad proposed a hybrid model: words should be linked to both iconic representations (images, direct perceptions) and categorical ones (learned invariants), rather than floating in a purely symbolic space.
To understand “cat,” then, is not merely to manipulate the word, but above all to connect its use to a perceptual experience: to see the animal and, ideally one day, to hear and even touch it.
This idea now inspires multimodal approaches, where text and vision are combined to bring linguistic processing closer to grounding in the real world.
In practical terms, this amounts to giving the machine a richer form of “experience.” For example, if a model is shown thousands of images of cats accompanied by the caption “cat,” it learns to associate the word not only with other words but also with shapes, colors, and postures. When later asked to describe a photo, it no longer merely manipulates text; it retrieves visual features that refer to a perceptual experience. This combination is what now allows a multimodal model to recognize that “a cat is sleeping on a couch,” instead of merely predicting a string of words unrelated to the image.
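To make the idea of associating a word with visual features more tangible, here is a minimal sketch in Python. It assumes that images and captions have already been projected into a shared vector space; the four-dimensional embeddings, the `CAPTION_EMBEDDINGS` table, and the `best_caption` helper are all hypothetical and hand-written for illustration. A real multimodal model learns such vectors from many image-caption pairs rather than having them typed in.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical caption embeddings. The dimensions loosely stand for
# (feline, canine, indoor, vehicle); a trained model would learn them.
CAPTION_EMBEDDINGS = {
    "a cat is sleeping on a couch": [0.9, 0.1, 0.8, 0.0],
    "a dog is running on a beach": [0.1, 0.9, 0.1, 0.0],
    "a car is parked in a street": [0.0, 0.0, 0.1, 0.9],
}

def best_caption(image_embedding, captions=CAPTION_EMBEDDINGS):
    """Return the caption whose embedding is closest to the image embedding."""
    return max(captions, key=lambda text: cosine(image_embedding, captions[text]))

# Hypothetical embedding of a photo showing a cat on a couch.
photo = [0.8, 0.2, 0.7, 0.1]
print(best_caption(photo))
```

The point of the sketch is only the mechanism: the caption is selected because its vector lies close to the vector of the image, not because of the words that happen to surround it.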
But here again, the gap between the cognitive abilities of a human and a machine is enormous. As research in vision and cognition reminds us, a young child can recognize a new category with very few examples, sometimes just one, while artificial systems require dozens, hundreds, or even thousands of examples.
Teaching Machines to Understand the World
From this perspective, spatial intelligence and “world models” play a central role in research. Fei-Fei Li emphasizes the need for AI to reason within a 3D universe, where objects have permanence and physical laws impose constraints. Yann LeCun extends this vision with the concept of “world models”: internal representations that allow systems to simulate, predict, and plan before acting. IBM is moving in the same direction, working on digital twins for industry and medical research.
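The phrase “simulate, predict, and plan before acting” can be illustrated with a toy example. The sketch below assumes a one-dimensional corridor with walls at positions 0 and 10; the `world_model` and `plan` functions and the `ACTIONS` set are hypothetical names chosen for illustration. This is an exhaustive search over a hand-written model, not the learned architecture that LeCun describes.

```python
from itertools import product

ACTIONS = (-1, 0, 1)  # hypothetical action set: step left, stay, step right

def world_model(position, action):
    """Internal model: predict the next position. Walls at 0 and 10
    play the role of physical constraints the agent cannot violate."""
    return min(10, max(0, position + action))

def plan(start, goal, horizon=3):
    """Simulate every action sequence of length `horizon` inside the model
    and return the first one that ends closest to the goal."""
    best_plan, best_distance = None, float("inf")
    for candidate in product(ACTIONS, repeat=horizon):
        position = start
        for action in candidate:
            position = world_model(position, action)  # imagined, not executed
        distance = abs(goal - position)
        if distance < best_distance:
            best_plan, best_distance = candidate, distance
    return best_plan

print(plan(start=2, goal=5))  # (1, 1, 1): three steps to the right
```

Nothing is executed in the “real” world until the plan has been chosen: every candidate sequence is first played out inside the model, which is the essence of planning before acting.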
In concrete terms, these digital twins do not merely represent a “snapshot” of a system, but its “movie.” They make it possible to model both the shape and the evolution of a phenomenon, whether it involves tracking atmospheric currents or understanding how genes interact with one another.
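The difference between a “snapshot” and a “movie” can be sketched in a few lines. The example below assumes a deliberately simple physical system, a tank cooling toward room temperature according to Newton's law of cooling; the `TankTwin` class, its parameters, and the numeric values are hypothetical and far simpler than any industrial or biomedical twin.

```python
from dataclasses import dataclass, field

@dataclass
class TankTwin:
    temperature: float            # current state: the "snapshot"
    ambient: float = 20.0         # assumed room temperature (degrees Celsius)
    cooling_rate: float = 0.1     # hypothetical coefficient, per minute
    history: list = field(default_factory=list)  # past states: the "movie"

    def step(self, minutes=1.0):
        """Advance the twin in time with one explicit Euler step of
        Newton's law of cooling: dT/dt = -k * (T - T_ambient)."""
        change = -self.cooling_rate * (self.temperature - self.ambient) * minutes
        self.temperature += change
        self.history.append(self.temperature)
        return self.temperature

    def sync(self, measured):
        """Replace the predicted state with a sensor reading from the real tank."""
        self.temperature = measured
        self.history.append(measured)

twin = TankTwin(temperature=80.0)
for _ in range(3):
    twin.step()          # predict the next three minutes
twin.sync(66.0)          # hypothetical sensor reading corrects the prediction
print(twin.history)
```

The `temperature` field alone would be a snapshot; it is the combination of an evolution rule (`step`) and regular correction from measurements (`sync`) that turns the representation into something closer to a movie.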
All these approaches aim to bring machines closer to the way humans connect language, perception, and action.
For my part, I am convinced that these approaches are not mutually exclusive but complementary. None of them will be enough on its own: it is likely by combining embodied perception, efficient processing capabilities, and a solid ethical framework that we will truly be able to move forward.
Other lines of research aim instead to adapt operational languages to human intentions. The TransCoder project, for instance, has shown that AI can perform accurate translations between different programming languages (C++, Java, Python) without human supervision. To achieve this, it learned on its own to align the structures and libraries of these languages. The task is easier than understanding human language, since between “machine” languages, meaning is operational and strictly defined.
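One consequence of meaning being “operational and strictly defined” is that a translation between programs can be checked by running it. The sketch below illustrates that idea under stated assumptions: both functions are written in Python as stand-ins for a source program and its machine translation, and the names `reference_gcd`, `translated_gcd`, `behaves_the_same`, and `TEST_INPUTS` are hypothetical. It is not TransCoder's code, only an illustration of comparing behavior rather than wording.

```python
def reference_gcd(a, b):
    """Stand-in for the source program: iterative Euclidean algorithm."""
    while b:
        a, b = b, a % b
    return a

def translated_gcd(a, b):
    """Stand-in for a machine translation: same algorithm, recursive form."""
    return a if b == 0 else translated_gcd(b, a % b)

def behaves_the_same(original, translation, test_inputs):
    """A translation is accepted if it returns the same result as the
    original on every test input, whatever its surface form."""
    return all(original(*args) == translation(*args) for args in test_inputs)

TEST_INPUTS = [(12, 18), (7, 13), (0, 5), (100, 75)]
print(behaves_the_same(reference_gcd, translated_gcd, TEST_INPUTS))
```

No comparable test exists for a sentence in human language, where two phrasings can be equally valid and still differ in tone, implication, or cultural weight.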
From that starting point, one can hope that it will one day be possible to build an analogous bridge between human language and machine language. The idea would not be to try to imitate our emotions, but to formalize our intentions within an executable protocol.
To Be Continued...
These approaches outline a future where artificial language would finally be linked to a world of perceptions and actions. But other researchers are choosing a radically different path: rather than imitating human experience, they seek to go beyond its limits by harnessing the power of quantum computing. In the next part, we will dive into the emerging world of Quantum Natural Language Processing.
Links to the previous articles published in this series:
- Series: Toward a Shared Language Between Humans and Machines
- Series (1/4): Toward a Shared Language Between Humans and Machines — Why Machines Still Struggle to Understand Us
References
- Abbaszade, Mina; Zomorodi, Mariam; Salari, Vahid; Kurian, Philip. "Toward Quantum Machine Translation of Syntactically Distinct Languages". [link]
- Brodsky, Sascha. "World models help AI learn what five-year-olds know about gravity". IBM. [link]
- Gubelmann, Reto. "Pragmatic Norms Are All You Need – Why The Symbol Grounding Problem Does Not Apply to LLMs". [link]
- Harnad, Stevan. "The Symbol Grounding Problem". [link]
- LEO (Linguist Education Online). "Human Intelligence in the Age of AI: How Interpreters and Translators Can Thrive in 2025". [link]
- Meta AI. "Yann LeCun on a vision to make AI systems learn and reason like animals and humans". [link]
- Opara, Chidimma. "Distinguishing AI-Generated and Human-Written Text Through Psycholinguistic Analysis". [link]
- Qi, Zia; Perron, Brian E.; Wang, Miao; Fang, Cao; Chen, Sitao; Victor, Bryan G. "AI and Cultural Context: An Empirical Investigation of Large Language Models' Performance on Chinese Social Work Professional Standards". [link]
- Roziere, Baptiste; Lachaux, Marie-Anne; Chanussot, Lowik; Lample, Guillaume. "Unsupervised Translation of Programming Languages". [link]
- Strickland, Eliza. "AI Godmother Fei-Fei Li Has a Vision for Computer Vision". IEEE Spectrum. [link]
- Trott, Sean. "Humans, LLMs, and the symbol grounding problem (pt. 1)". [link]
- Nature. "Chip-to-chip photonic quantum teleportation over optical fibers, 2025". [link]