Generative AI is sweeping through society. Large language models (LLMs) and related generative models have achieved remarkable results in text (ChatGPT) and image (DALL-E) generation: given only a few prompt words, they can produce content that exceeds expectations (such as the cover image of each issue).
The progress of generative AI, represented by large language models, prompts us to ask: does ChatGPT really understand what it is "talking about"? Or is it just an instance of Searle's "Chinese Room"*? Can it "capture" external reality, or is it merely mimicry spawned by natural language data? More deeply, is generative AI the right path to artificial understanding? Beyond copying data, can it understand the "meaning" of words, perceptions, and actions? Or does it simply mark the end of a self-limiting approach?
*Chinese Room: A thought experiment proposed by the American philosopher John Searle to refute the idea of strong artificial intelligence. According to strong AI, a computer running an appropriate program can be said to have cognitive states of its own and to understand as humans do. The Chinese Room argues that even if a computer can answer questions posed in human language, it does not grasp the semantics of that language and therefore does not understand it; it only manipulates symbols mechanically according to rules. See John R. Searle, "Minds, Brains, and Programs." [2014-07-23].
In "Generating meaning: active inference and the scope and limits of passive AI," Giovanni Pezzulo, Thomas Parr, Paul Cisek, Andy Clark, and Karl Friston compare the active inference models of living organisms with the passive generative models of AI, in order to identify the true basis of "understanding" and to consider whether generative AI can acquire the ability to understand.
▷Figure 1. Generating meaning: active inference and the scope and limits of passive AI, https://doi.org/10.1016/j.tics.2023.10.002. Image source: Cell
Limitations of generative AI
Biological systems and active inference
Many philosophers (such as Andy Clark and Merleau-Ponty), psychologists (such as James Gibson and Lawrence Barsalou), and neuroscientists have reached a consensus: the basic function of the brain is not to accumulate knowledge but to control the exchange of information and energy with the world. Moreover, specific interactions reliably change the state of things in specific ways (e.g., eating reduces hunger, and escaping from predators reduces danger). What matters, therefore, is not the truth of knowledge but the stability achieved through interaction with the world.
In this interaction, certain features of the world are particularly important to us because they determine how we can act. Gibson calls such features affordances*: the possibilities for action that the environment provides. Biological systems typically respond to affordances with sensorimotor actions. A flat surface, for example, affords support, sitting, and storage.
*Note: Affordance, the noun form of "afford," was first systematically elaborated by Gibson in the book "The Ecological Approach to Visual Perception." Affordances are the action possibilities that the environment provides to an organism, and they may be good or bad. An affordance is neither an objective nor a subjective quality, but the product of the interaction between an organism and its environment.
Another characteristic of biological systems is their ability to predict the outcomes of actions, based on their knowledge of a dynamic world, before interacting with it. This kind of prediction is the cornerstone of active inference. Simply put, active inference holds that the perception and behavior of living organisms are fundamentally predictive rather than randomly triggered, and rest on a model of the world that provides affordances.
Two generative models
Generative AI and active inference share a common premise: both emphasize prediction based on generative models. However, although both rest on generative models (Figure 2), their operating mechanisms differ.
▷Figure 2: Generative models in generative AI and active inference. Source: original paper.
In active inference, the generative model is used not only for prediction but also underwrites agency: it supports goal-directed inference, decision-making, and planning with respect to the external or internal world. In offline states, such as introspection or sleep, the generative model also simulates counterfactual pasts (i.e., "what if the past had been otherwise" reasoning) and possible futures, which optimizes the generative model and, in turn, behavioral strategies.
In contrast, generative AI is based on deep networks and builds its generative models from data through self-supervised learning. Large language models, for example, typically use autoregressive models and transformer architectures to infer the next word in a sentence. After training on large-scale samples, large language models can flexibly predict and generate entirely new content. They also perform well on some downstream tasks (such as summarizing text and answering questions), and can handle further tasks (such as writing science fiction) when fine-tuned on domain-specific datasets.
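As a rough illustration of autoregressive next-word prediction, the toy sketch below counts which word follows which in a tiny corpus and then generates text one word at a time, each step conditioned on the previous output. It is only a stand-in under stated assumptions: real large language models use transformer networks over subword tokens rather than bigram counts, and the corpus and the names train_bigram and generate are hypothetical, introduced here for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count, for every word, how often each other word follows it."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, start, max_len=5):
    """Greedy autoregressive generation: always append the most frequent follower."""
    out = [start]
    for _ in range(max_len - 1):
        followers = counts.get(out[-1])
        if not followers:  # word never seen with a follower: stop
            break
        out.append(followers.most_common(1)[0][0])
    return out


# Hypothetical toy corpus (an assumption, not data from the paper).
corpus = "go north then go south then go north".split()
model = train_bigram(corpus)
print(" ".join(generate(model, "go", max_len=4)))  # prints "go north then go"
```

The toy model reproduces a statistical regularity of its training text ("go" is most often followed by "north") without any notion of what moving north would involve.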
The key difference between the two kinds of generative model is that active inference yields responses that are meaningful, and this meaning is grounded in sensorimotor experience. For example, responding to the instruction "go north" or "go south" is tied to specific possibilities for action in physical space, and also engages the multisensory and affective states of neural processing. Although an artificial system can learn the statistical regularities of spatial transitions through training, such transitions mean very different things to an organism that can move through space and to an artificial system that cannot. For the former, a spatial transition concerns the possibility of action and a causal understanding of the world.
How living organisms understand meaning
Successful generative models extract "latent variables" from data that help to explain and predict it. Generative AI can use latent variables that reflect statistical regularities to go beyond the boundaries of its training data; living organisms may refine latent variables in order to better predict the state of the world. Although both extract latent variables, active inference and generative AI do so differently. In the generative models of active inference, latent variables are understood and used as the basis for concept formation.
For humans and other living things, interacting with the world means exploring specific properties of the world. A table is not merely a wooden object made of legs and a tabletop, but a collection of affordances: it can hold plates, seat people, and serve as a shelter during an earthquake. These affordances are the table's latent variables. The word "table" is just a symbol, or an abbreviation, for "an object you can put things on, sit on, and hide under." The concept of a table is therefore a constellation of latent variables associated with the outcomes of actions. Living organisms learn about objects through sensorimotor experience, and abstract concepts such as weight and size develop from the information these multiple senses provide.
Language ability is likewise grounded in sensory modalities and develops through interaction (i.e., communication). From an embodied perspective, communication is a sensorimotor interaction. Its significance lies not in speech and grammar but in the social interactions that communication predicts. Although human linguistic communication has taken abstraction to an extreme, it is still grounded in interaction and control. Words are shorthand for meaningful interactions and are agreed upon in interaction, and we acquire the meaning of linguistic symbols by interacting with others of our kind. Cognitive machines built around language acquisition would likewise need to develop linguistic and symbolic abilities in the context of goal-directed action. Large language models and other generative AI, by contrast, simply learn passively from massive amounts of text and multimodal data.
In short, our understanding of linguistic symbols comes from interaction with the living world, not merely from using natural language. The latent variables of generative AI may capture statistical regularities of the world, but they skip the process by which such variables are formed. Generative AI only inherits the linguistic legacy of human communication; it does not take part in the interactive process that gives words their meaning. In large language models, only the people who produce the training text, and those who read the text the model produces, understand the meaning of the words.
Action-based embodied intelligence
"Children do not simply acquire knowledge; they construct it through experience and interaction with the environment." (Maria Montessori)
If generative AI is given more data, can it gain understanding? To answer, we first need to identify the true basis of understanding.
In fact, generative AI acquires concepts quite differently from living organisms (Figure 3). Living organisms learn through sensorimotor interaction with their environment. This involves not only mastering statistical regularities but, more importantly, forming the basis for perceiving and understanding causality in the world. Through sensorimotor experience and movement in the environment, living organisms acquire various representations of it, such as affordances, space, objects, situations, a sense of self, and a sense of agency. Our brains also encode our interactions with the environment and the affordances it offers. Studies have shown that the hippocampus and entorhinal cortex integrate self-motion information through path integration and develop spatial codes (including codes for abstract conceptual spaces)*. The prefrontal cortex also contains spatial circuits for detecting affordances. This embodied intelligence is the basis for developing abstract conceptual thought.
▷Figure 3: How generative AI and living organisms learn generative models to solve the pathfinding task in Figure 2. Source: original paper.
*Note: The hippocampus maps concept space, not feature space. J. Neurosci. 2020; 40: 7318-7325.
By contrast, the so-called "understanding" of current generative AI is not grounded in action. These systems only passively reflect statistical regularities in data rather than capturing causal laws about the world. Because they neither actively select their data nor intervene during training, they cannot develop an understanding of the causal relationship between actions and their outcomes, nor can they distinguish predictions from observations.
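The gap between passively observed statistics and the effect of acting can be made concrete with a small worked example. The sketch below assumes a hypothetical three-variable world: a hidden cause Z influences both an action X and an outcome Y, while X itself has no effect on Y. All probabilities and function names are made up for illustration and do not come from the original paper. It compares what a passive observer would estimate, P(Y=1 | X=1), with what an agent that intervenes would find, P(Y=1 | do(X=1)), computed with the standard adjustment over Z.

```python
# Hypothetical world (assumed numbers): Z -> X and Z -> Y, but X does not cause Y.
P_Z = {0: 0.5, 1: 0.5}


def p_x_given_z(x, z):
    """In passively collected data, the action X usually matches the hidden cause Z."""
    return 0.9 if x == z else 0.1


def p_y_given_xz(y, x, z):
    """The outcome Y depends only on Z; the argument x is deliberately ignored."""
    p_y1 = 0.8 if z == 1 else 0.2
    return p_y1 if y == 1 else 1.0 - p_y1


def observational(x):
    """P(Y=1 | X=x): what co-occurrence statistics alone suggest."""
    num = sum(P_Z[z] * p_x_given_z(x, z) * p_y_given_xz(1, x, z) for z in P_Z)
    den = sum(P_Z[z] * p_x_given_z(x, z) for z in P_Z)
    return num / den


def interventional(x):
    """P(Y=1 | do(X=x)): setting X by acting cuts the link from Z to X."""
    return sum(P_Z[z] * p_y_given_xz(1, x, z) for z in P_Z)


print(round(observational(1), 3), round(interventional(1), 3))
```

Under these assumed numbers the observational estimate is about 0.74 while the interventional one is 0.5: the data make X look predictive of Y, but acting on X changes nothing. Only a system that can intervene is in a position to tell the two apart.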
Generative AI often relies on model complexity to improve prediction accuracy, but this approach brings limitations of its own. These systems perform well on specific tasks but have difficulty generalizing to other, similar tasks. This limitation cannot be overcome simply by adding more data, because understanding context-sensitive language requires not only a large amount of data but also the ability to extract deep meanings and patterns from it.
Generative AI and biological organisms also decide what information to attend to in different ways. The attention mechanism of the transformer models used in generative AI acts as a filter, determining which information is valuable by assigning it different weights. The attention of living organisms, by contrast, involves active choices whose purpose is to resolve uncertainty.
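The weighting step described above can be sketched as scaled dot-product attention, the form commonly used in transformer models. This is a minimal pure-Python illustration, not any particular model's implementation; the query, key, and value vectors are made-up numbers chosen only to show how the weights arise.

```python
import math


def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Weight each value by how well its key matches the query."""
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, output


# Made-up example: the query matches the first key better than the second.
weights, output = attention(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)
print(weights, output)
```

The weights are fixed by the learned vectors and the input alone. Nothing in this computation selects new data or acts to reduce uncertainty, which is the contrast the text draws with biological attention.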
Over the course of evolution, organisms have developed distinctive generative models under the pressure of natural selection. Our emotions, for example, are rooted in the feeling that something "is important to me," which gives meaning and purpose to our understanding of the world. In active inference, we use interoceptive predictions to guide actions and decisions in a way that lets us better understand the causes and consequences of our actions. Interoceptive, exteroceptive, and proprioceptive predictions work together to promote the survival of living beings. Unlike generative AI, therefore, biological active inference models form naturally and do not need to keep learning fine-grained, complicated tasks as AI does.
Moreover, in order to survive, living organisms cannot simply wait passively for signals to stimulate them; they must proactively interact with the world in a purposeful manner. This means that the generative models of living organisms must support careful trade-offs and flexible choices between exploring new patterns and exploiting old ones. To generalize well, the model must also be not only accurate but energy-efficient. Within an ecological niche, these trade-offs support action and perception at different time scales. In active inference, both the trade-off between exploratory and exploitative behavior and the trade-off between the efficiency and accuracy of the generative model can be addressed by minimizing free energy. Generative AI has yet to achieve this kind of context-sensitive, flexible control.
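In the active inference literature, one common way to write the quantity minimized when choosing among policies is the expected free energy, decomposed into risk (how far predicted outcomes diverge from preferred ones) plus ambiguity (how uninformative the predicted states are about observations). The article itself does not give this formula, so the sketch below is an illustration under stated assumptions: a hypothetical world with two hidden states and two observations, made-up probabilities, and each policy leading with certainty to one state.

```python
import math


def kl(q, p):
    """Kullback-Leibler divergence KL[q || p] for discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def expected_free_energy(q_states, likelihood, preferred_obs):
    """Risk + ambiguity for one policy.

    q_states[s]      : predicted probability of hidden state s under the policy
    likelihood[s][o] : probability of observation o in state s
    preferred_obs[o] : prior preference over observations
    """
    n_states, n_obs = len(q_states), len(preferred_obs)
    q_obs = [
        sum(q_states[s] * likelihood[s][o] for s in range(n_states))
        for o in range(n_obs)
    ]
    risk = kl(q_obs, preferred_obs)
    ambiguity = sum(q_states[s] * entropy(likelihood[s]) for s in range(n_states))
    return risk + ambiguity


# Assumed numbers: state 0 gives clear observations, state 1 gives noisy ones.
likelihood = [[0.9, 0.1], [0.5, 0.5]]
preferred = [0.8, 0.2]  # the agent prefers observation 0

g_precise = expected_free_energy([1.0, 0.0], likelihood, preferred)
g_ambiguous = expected_free_energy([0.0, 1.0], likelihood, preferred)
print(g_precise < g_ambiguous)  # the informative, preferred policy scores lower
```

A single score thus folds together a pull toward preferred outcomes (exploitation) and a pull toward situations that resolve uncertainty (exploration), which is one reading of how minimizing free energy addresses the trade-off.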
▷Picture source: Midjourney
Finally, from the perspective of their phylogenetic trajectories, generative AI and active inference differ in essence. Living organisms capable of abstract thought and speech can develop a special kind of mental representation, which we call "detached representation." Although such representations originate in sensorimotor experience, they can eventually become independent of their original context and take on an autonomous existence. For example, we are able to discuss objects through imagination and language without directly perceiving them.
This representational ability, independent of direct sensory experience, is the basis for higher cognitive functions such as planning, imagining, and discussing abstract or absent things. A complex mental life requires this ability, which lets us move from direct, pragmatic representations to semantic, descriptive ones. The transformation occurs through complex social interaction and deep engagement with the world, expanding the boundaries of our understanding and meaning. Current generative AI follows a completely different developmental path, obtaining knowledge directly from text. This process has been driven by the availability of advanced technologies such as large datasets and efficient transformer models.
In short, true "understanding" is active: it is grounded in the sensorimotor interaction of organisms with the world and in their active exploration of the environment. Deeper understanding requires the capacity for detached representation, that is, the ability to plan, imagine, and discuss abstract concepts beyond the immediate situation, even though this capacity still rests on interaction with the world. Such understanding is not merely a mastery of statistical regularities but a grasp of the causal structure behind the world model.
Where is the way out for generative AI?
Is continuing to scale up generative AI along the same old path a viable route to true intelligence?
For generative AI to generate meaning and acquire the ability to understand, there are currently two options. One is to stick to the original method and develop it in ever more complex directions. The other is to change course and emphasize the active selection of training data.
Current research mostly follows the first option, increasing the complexity of generative AI to improve its performance. This complexity shows up mainly as more model parameters and more training data, but also as more diverse types of input and additional functions and capabilities aimed at more advanced AI applications. However, a potentially more profound approach is often overlooked: letting the model make active choices by interacting with the world, so that it gains knowledge about the world while pursuing intrinsic goals.
Current large language models use our descriptions of the world as an intermediary for understanding reality. Simply building a large text-based language model and then trying to connect it to the world afterwards may not be the most effective way to gain understanding of the world. A more reliable approach may be to let the AI system first learn from interactions with the real world, and then combine these experiences with large language models. However, this "interaction first, model later" approach has not yet been systematically studied.
Artificial intelligence as a mirror of humanity
Generative AI can only produce results from given prompt words or text; it cannot actively reason or generate its own reasons for acting, as planning requires. This has several basic implications:
First, real planning involves agency, and only agents have an "action-outcome" generative model. Second, the generative models of active inference do not rely solely on data input but require real-time sensorimotor interaction with the world. In other words, the generative model is grounded in a world model: an "action-outcome" model can reveal the causal structure of the world, whereas passively collected information only implicitly reflects that causal structure through statistical regularities.
From a practical perspective, generative AI is not an ideal model for autonomous robots or autonomous driving. Furthermore, because generative AI has no affordances, it has no curiosity-driven mechanism for active learning. By comparison, embodied intelligence may be a more effective model.
Despite these limitations, generative AI still has a profound impact on our ecosystem. It leads us to reflect on the process of human understanding and to look for bridges between world models and flows of information. We humans are constantly externalizing our thoughts, creating entirely new objects that themselves require careful examination. Generative AI is a vivid case in point, revealing a way of constructing the cognitive self that we had not previously noticed.
It can be said that generative AI is like a mirror of humanity in the 21st century. We see ourselves in it, but unfortunately, there is no one behind the mirror.