
AI Model Exhibits Child-like Generalization Abilities

Researchers in the Cognitive Neurorobotics Research Unit at the Okinawa Institute of Science and Technology (OIST) have developed an embodied intelligence model with a novel architecture. This design provides access to the neural network's internal states and mimics how children generalize during learning.

Figure: Comparison of average error between unlearned object positions (blue) and unlearned compositions of words (orange) among groups trained on different numbers of compositions. Group A is trained on 40 compositions of 5 colors and 8 verb clauses, such as 'grasp yellow,' 'move red to the left,' and 'put blue on green.' As the number of compositions in training decreases, error increases, especially when encountering unlearned compositions. This suggests that greater variety in training is crucial for learning to generalize to different combinations of words, including unlearned ones. Image Credit: Vijayaraghavan et al., 2025

Humans excel at generalization. For example, if a child is taught to recognize the color red by observing a red ball, a red truck, and a red rose, they are likely to identify the color of a tomato even when seeing one for the first time.

Compositionality, the ability to break wholes into reusable parts and recombine them in new ways, such as identifying the color of an object independently of its shape, is a critical aspect of generalization. Understanding how this skill develops is a central question in developmental neuroscience and AI research.
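The combinatorial payoff of compositionality can be illustrated with a toy example. The vocabulary and training split below are hypothetical, loosely echoing the study's verb-color commands: a learner that stores verbs and colors as separate, recombinable parts covers far more commands than it was ever shown.

```python
from itertools import product

verbs = ["grasp", "move", "put"]
colors = ["red", "blue", "green", "yellow"]

# All possible verb-color compositions...
all_commands = {f"{v} {c}" for v, c in product(verbs, colors)}

# ...but train on only a small subset of them.
training = {"grasp red", "grasp blue", "move green", "put yellow", "move red"}

# A compositional learner extracts the parts it has seen...
learned_verbs = {cmd.split()[0] for cmd in training}
learned_colors = {cmd.split()[1] for cmd in training}

# ...and recombines them, covering compositions it never saw in training.
generalized = {f"{v} {c}" for v, c in product(learned_verbs, learned_colors)}
unseen_covered = (generalized - training) & all_commands

print(len(all_commands), len(training), len(unseen_covered))  # 12 total, 5 trained, 7 novel
```

Five training examples suffice to cover all twelve commands, because every verb and every color appears at least once; a rote learner, by contrast, would cover only the five it memorized.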

The first neural networks, which have evolved into large language models (LLMs) that are now widely influential, were initially designed to study information processing in the human brain. Ironically, as these models became more advanced, their internal processing pathways grew increasingly opaque, with some now containing trillions of tunable parameters.

This paper demonstrates a possible mechanism for neural networks to achieve compositionality. Our model achieves this not by inference based on vast datasets, but by combining language with vision, proprioception, working memory, and attention, just like toddlers do.

Dr. Prasanna Vijayaraghavan, Study First Author and Junior Research Fellow, Okinawa Institute of Science and Technology

Perfectly Imperfect

LLMs use transformer network architectures to analyze the statistical relationships between words in large text datasets. They use this data to estimate the most likely response to a given prompt, effectively accessing a vast range of phrases and contexts.

In contrast, the new model is based on the PV-RNN (Predictive-coding-inspired Variational Recurrent Neural Network) framework. It was trained through embodied interactions using three simultaneous sensory inputs: proprioception, which captures limb movements and joint angles of the robot arm; vision, represented by a video of a robot arm manipulating colored blocks; and language instructions, such as “put red on blue.”

The model was then tasked with generating either a language instruction based on sensory input or a visual prediction and corresponding joint angles in response to a language instruction.

The system draws on the Free Energy Principle, which posits that the human brain predicts sensory inputs based on prior experiences and acts to minimize the difference between prediction and observation, a measure known as “free energy.” Reducing free energy helps maintain cognitive stability.
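The core loop of this idea can be sketched in code as gradient descent on prediction error, a simplified stand-in for free energy. The linear generative model, dimensions, and learning rate below are illustrative assumptions, not the study's architecture: the point is only that the system updates its internal belief, rather than the world, to reduce the mismatch between prediction and observation.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 4))   # fixed generative weights: state -> predicted observation
state = np.zeros(4)                      # internal latent state (the "belief")
observation = np.array([1.0, -0.5, 0.3, 0.8])

lr = 0.1
errors = []
for step in range(200):
    prediction = W @ state
    error = observation - prediction      # prediction error ("surprise")
    # Gradient step on squared error with respect to the internal state:
    # the belief moves toward whatever best explains the observation.
    state += lr * (W.T @ error)
    errors.append(float(error @ error))

print(f"initial error: {errors[0]:.3f}, final error: {errors[-1]:.3f}")
```

Because the update is a gradient step on a quadratic objective with a small learning rate, the squared prediction error decreases monotonically, which is the loose analogue of "reducing free energy" in the text above.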

Unlike LLMs, this AI model incorporates cognitive constraints such as limited working memory and attention span, requiring it to process and update predictions sequentially rather than simultaneously. By analyzing the model's internal information flow, researchers can investigate how it integrates sensory and linguistic inputs to simulate behavior.
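As a toy illustration of one such constraint, a sequential learner might form each prediction from only the last few items it has seen. The bounded buffer and the frequency-based predictor below are my own assumptions for the sketch, not the study's mechanism:

```python
from collections import deque

def predict_next(window):
    # Hypothetical predictor: guess the most frequent recently seen symbol.
    return max(set(window), key=window.count) if window else None

def process(stream, capacity=3):
    memory = deque(maxlen=capacity)   # bounded working memory
    predictions = []
    for symbol in stream:
        predictions.append(predict_next(list(memory)))
        memory.append(symbol)         # update memory one step at a time
    return predictions

preds = process(["red", "red", "blue", "red", "blue", "blue"])
print(preds)  # [None, 'red', 'red', 'red', 'red', 'blue']
```

Because the deque has a fixed capacity, older observations are evicted as new ones arrive, so each prediction reflects only a limited, sequentially updated window, rather than the whole history at once.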

This modular design has provided insights into how compositionality, which is critical for generalization, may emerge in infants.

Dr. Vijayaraghavan recounted, “We found that the more exposure the model has to the same word in different contexts, the better it learns that word. This mirrors real life, where a toddler will learn the concept of the color red much faster if she’s interacted with various red objects in different ways, rather than just pushing a red truck on multiple occasions.”

Opening the Black Box

“Our model requires a significantly smaller training set and much less computing power to achieve compositionality. It does make more mistakes than LLMs do, but it makes mistakes that are similar to how humans make mistakes,” stated Dr. Vijayaraghavan.

For cognitive scientists and AI researchers seeking to understand their models' decision-making processes, the PV-RNN's transparency makes it a valuable tool. Unlike LLMs, which prioritize scalability and effectiveness, the PV-RNN focuses on providing insight into its information processing pathways. Its relatively shallow architecture allows researchers to observe the network’s latent state, which represents the evolving internal information retained from past inputs and used for current predictions.

The model also addresses the Poverty of Stimulus problem, which posits that the linguistic input available to children is insufficient to explain their rapid language acquisition. Despite being trained on a much smaller dataset compared to LLMs, the PV-RNN achieves compositionality, suggesting that anchoring language to behavior may play a key role in children’s exceptional language-learning abilities.

Additionally, this embodied learning approach enhances transparency and provides a deeper understanding of the consequences of actions. This could guide the development of AI systems that are both safer and more ethical. For instance, a PV-RNN learns the concept of "suffering" through embodied experiences, which may imbue the term with a richer contextual and emotional understanding than learning it solely through linguistic data, as LLMs do.

We are continuing our work to enhance the capabilities of this model and are using it to explore various domains of developmental neuroscience. We are excited to see what future insights into cognitive development and language learning processes we can uncover.

Jun Tani, Study Senior Author and Professor, Okinawa Institute of Science and Technology

One of the fundamental scientific questions is how humans develop the intelligence required to build and sustain a civilization. While the PV-RNN does not provide a definitive explanation, it offers new avenues for exploring the brain's information-processing mechanisms.

“By observing how the model learns to combine language and action, we gain insights into the fundamental processes that underlie human cognition. It has already taught us a lot about compositionality in language acquisition, and it showcases the potential for more efficient, transparent, and safe models,” summarized Dr. Vijayaraghavan.

Journal Reference:

Vijayaraghavan, P., et al. (2025). Development of compositionality through interactive learning of language and action of robots. Science Robotics. https://doi.org/10.1126/scirobotics.adp0751
