In a significant step toward intelligent robotics, researchers have developed ELLMER—an AI-powered robot that combines GPT-4 with real-time visual and force feedback to perform complex, multi-step tasks in unpredictable environments.
Published in Nature Machine Intelligence, the study introduces ELLMER (Embodied Large-Language-Model-Enabled Robot), a framework that fuses a large language model with sensorimotor capabilities, allowing the system to adapt dynamically by retrieving contextual knowledge and responding to real-world conditions as they unfold.
Tested on tasks like making coffee and decorating plates—activities that demand multiple, nuanced actions—ELLMER completed each successfully. The work highlights promising progress toward more adaptable, capable robots that can operate effectively outside controlled lab settings.
Why Embodiment Matters in Machine Intelligence
The nature of intelligence—artificial or human—is still widely debated, but there's a growing consensus that cognition is inherently embodied and shaped by interactions between sensory input and motor actions. This challenges traditional AI approaches that treat perception, decision-making, and action as separate processes.
Without physical embodiment, machines may lack key dimensions of intelligence. Integrating robotics with AI could unlock the ability to perform adaptive, context-sensitive tasks—like brewing coffee—that require flexible thinking and real-time responsiveness.
While reinforcement and imitation learning have helped robots acquire individual skills, they often struggle with generalization. Large language models (LLMs), on the other hand, excel at processing abstract instructions, but their use in robotics has been limited by static knowledge bases and weak feedback loops. ELLMER addresses these limitations directly.
ELLMER: Bridging High-Level Reasoning and Physical Execution
The researchers developed ELLMER as an embodied AI system capable of executing long-horizon tasks in changing environments. At its core, the system integrates GPT-4 with retrieval-augmented generation (RAG) and multimodal feedback from vision and force sensors, all within a Robot Operating System (ROS) pipeline.
A Kinova robotic arm equipped with an Azure Kinect camera and ATI force sensor forms the physical platform. ELLMER translates high-level commands—like “Make coffee”—into step-by-step instructions. RAG is used to fetch relevant code snippets from a curated knowledge base, helping sequence actions based on context and probability.
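The paper does not reproduce its retrieval code here, but the idea of fetching the most relevant action snippet for a given instruction can be sketched with a toy vector lookup. Everything below is illustrative: the knowledge-base entries, function names, and the bag-of-words embedding are stand-ins for ELLMER's curated snippet library and learned embeddings.

```python
import math
from collections import Counter

# Toy "knowledge base": natural-language keys mapped to action snippets.
# In ELLMER the values are executable code templates; these are placeholders.
KNOWLEDGE_BASE = {
    "pour liquid from kettle into mug": "pour(source='kettle', target='mug')",
    "open the drawer to find a spoon": "open_drawer(); locate('spoon')",
    "draw a design on the plate with a pen": "draw(path=design, tool='pen')",
}

def embed(text):
    """Bag-of-words vector (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, k=1):
    """Return the k snippets whose keys best match the query."""
    q = embed(query)
    ranked = sorted(KNOWLEDGE_BASE,
                    key=lambda key: cosine(q, embed(key)),
                    reverse=True)
    return [KNOWLEDGE_BASE[key] for key in ranked[:k]]
```

For example, `retrieve("pour the coffee into the mug")` would surface the pouring snippet; in the real system the retrieved template is then grounded with the objects the vision pipeline has located.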
For vision, Grounded-Segment-Anything generates 3D voxel maps to identify object positions, while force sensing enables precise movements, such as pouring, guided by global force vectors. The robot operates under velocity (±0.05 m/s), force (20 N), and workspace constraints, with control nodes updating at 40–100 Hz for real-time responsiveness.
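The stated limits can be pictured as a safety filter applied to every command before it reaches the arm. The sketch below is an assumption about how such clamping might be structured, not the study's actual ROS code; only the numeric limits (±0.05 m/s, 20 N) come from the article.

```python
# Illustrative safety filter using the constraints reported in the study.
VEL_LIMIT = 0.05    # m/s, per axis
FORCE_LIMIT = 20.0  # N

def clamp(value, limit):
    """Restrict a value to the symmetric range [-limit, +limit]."""
    return max(-limit, min(limit, value))

def safe_command(velocity_xyz, measured_force_n):
    """Clamp a Cartesian velocity command; halt if contact force is excessive.

    Called once per control cycle (40-100 Hz in the paper's setup).
    """
    if abs(measured_force_n) >= FORCE_LIMIT:
        return (0.0, 0.0, 0.0)  # stop on excessive contact force
    return tuple(clamp(v, VEL_LIMIT) for v in velocity_xyz)
```

A command of (0.1, -0.2, 0.01) m/s would be clipped to (0.05, -0.05, 0.01), and any reading at or above 20 N would stop the motion entirely.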
In kitchen-like scenarios, ELLMER successfully handled tasks, including opening drawers, grasping mugs, and pouring liquids—all while adapting to unexpected changes in the environment. Its modular architecture supports various retrieval methods, such as vector and hybrid RAG, making it flexible and scalable.
The system is also energy-efficient, with each task generating roughly 7 grams of CO2—low for robotics. By linking high-level reasoning with responsive physical control, ELLMER takes on challenges that traditional systems have struggled to meet: real-time adaptability, feedback integration, and practical safety.
Results: From Abstract Commands to Precision Tasks
ELLMER’s real-world performance shows the strength of this integrated approach. When given a general command like “decorate a plate with a random animal,” the system broke it down into sub-tasks—selecting tools, generating an image using DALL·E, and executing the design with consistent pen pressure.
It accessed a motion library via RAG to apply actions like scooping, pouring, and drawing based on the situation. The vision system identified objects with 100% accuracy in ideal conditions (e.g., a white mug on a clear surface), though its performance dipped in cluttered scenes or with unfamiliar items. On the force side, the robot poured with an average error of 5.4 grams per 100 grams at low speeds.
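Read as an error rate, the pouring figure corresponds to a simple normalized metric; the function below is an illustrative reading of how such a number is computed, not code from the study.

```python
def pouring_error_per_100g(target_g, poured_g):
    """Absolute pouring error, normalized to grams per 100 g of target mass."""
    return abs(poured_g - target_g) / target_g * 100.0
```

Under this reading, pouring 94.6 g when 100 g was requested yields an error of 5.4 g per 100 g, matching the reported low-speed performance.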
In comparative tests, ELLMER significantly outperformed VoxPoser, a baseline system without RAG or force feedback. GPT-4’s task fidelity rose from 0.74 to 0.88 with RAG integration, showing better alignment between plans and real-world execution. Some challenges remain—like vision issues in complex scenes and force inaccuracies at high velocities—but the path forward is clear.
Toward Smarter, More Resilient Robots
This study introduced ELLMER, an embodied AI framework that merges GPT-4, multimodal sensor feedback, and a retrieval-augmented knowledge base to enable flexible task execution in real-world settings. The system demonstrated robust performance in dynamic environments, tackling long-horizon tasks like coffee-making and plate design with adaptive precision.
Though improvements in proactive planning and force modeling are needed, ELLMER represents a meaningful leap toward intelligent, autonomous robots. Its modular, hardware-agnostic design offers a foundation for scalable deployment across industries—bridging the gap between abstract cognition and physical action.
Journal Reference
Mon-Williams, R., Li, G., Long, R., Du, W., & Lucas, C. G. (2025). Embodied large language models enable robots to complete complex tasks in unpredictable environments. Nature Machine Intelligence. DOI: 10.1038/s42256-025-01005-x