MIT’s Masked IRL Uses LLMs to Teach Robots from Vague Instructions
Teaching a robot to perform a task like placing a coffee mug on a desk without disturbing a video call usually requires extensive physical demonstrations or detailed written instructions. Producing either is labor-intensive for humans, and without both types of data, robots often misinterpret what is needed. Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed a new method called “Masked Inverse Reinforcement Learning” (Masked IRL) that automates this teaching process. It uses large language models (LLMs) to clarify ambiguous prompts and requires nearly five times less demonstration data than conventional approaches.
How Masked IRL Clarifies and Prioritizes Instructions
Masked IRL leverages two LLMs in sequence. First, during kinesthetic demonstrations—where a human physically moves a robot to show a task—the system records the motion trajectory. An LLM compares this trajectory to the shortest possible path and elaborates on vague requests, turning “stay close” into “stay close to the surface of the table.” A second LLM then evaluates environmental details such as obstacle positions and object shapes. It assigns each element a mask—a “1” if important for the task or a “0” if irrelevant (e.g., whether the demonstrator leaned on a table). Only elements marked “1” are incorporated into the robot’s final motion plan, allowing the machine to focus on what truly matters for safe and efficient behavior.
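The masking step can be illustrated with a minimal Python sketch. It is not the CSAIL implementation: the feature names, the linear reward, the weights, and the hard-coded mask (standing in for the second LLM's output) are all assumptions made for illustration. The sketch only shows how a 0/1 mask keeps task-relevant elements and removes irrelevant ones from a reward computation.

```python
from typing import Dict

# Hypothetical state elements; the article does not list the real ones.
State = Dict[str, float]


def apply_mask(state: State, mask: Dict[str, int]) -> State:
    """Keep elements masked 1; zero out elements masked 0 (or missing from the mask)."""
    return {name: value * mask.get(name, 0) for name, value in state.items()}


def masked_reward(state: State, mask: Dict[str, int], weights: Dict[str, float]) -> float:
    """Assumed linear reward computed only over the elements the mask keeps."""
    masked = apply_mask(state, mask)
    return sum(weights.get(name, 0.0) * value for name, value in masked.items())


# Stand-in for the second LLM's judgment on "stay close to the surface of the table":
# table and laptop distances matter, the other elements do not.
mask = {
    "dist_to_table": 1,
    "dist_to_laptop": 1,
    "dist_to_human": 0,
    "demonstrator_leaning": 0,
}

# Hypothetical reward weights, as if learned from demonstrations.
weights = {
    "dist_to_table": -1.0,   # penalize drifting away from the table
    "dist_to_laptop": 0.5,   # reward keeping clear of the laptop
    "dist_to_human": 0.25,
    "demonstrator_leaning": 2.0,
}

state = {
    "dist_to_table": 0.5,
    "dist_to_laptop": 0.25,
    "dist_to_human": 1.0,
    "demonstrator_leaning": 1.0,
}

reward = masked_reward(state, mask, weights)
print(reward)
```

In this sketch, changing a masked-out element such as `demonstrator_leaning` leaves the reward unchanged, which mirrors the idea that only elements marked “1” influence the robot's behavior.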

Superior Performance in Simulation and Real World
In both 3D simulations and real-world tests, Masked IRL outperformed comparable baselines. The system enabled virtual and physical robots to maneuver objects around obstacles, for example moving a coffee mug around a laptop to different spots on a table. It correctly identified users’ unstated preferences up to 15 percent more often than other methods. A real robotic arm trained on just 50 demonstrations successfully executed unseen prompts: it avoided a computer while moving a cup toward a person, wiped a table while “staying close” to it, and handed over a bag of chips while “staying away” from both a human and the table. Masked IRL also learned these behaviors from fewer demonstrations than the baseline approaches required.
Future Vision: Adding Visual Understanding
The current system relies on sensor data and motion logging, but the CSAIL team plans to make Masked IRL more dynamic by integrating cameras. With visual input, the robot could highlight and focus on specific objects in its surroundings—for instance, ignoring nearby bananas when asked to pick up a toy. This advancement aims to further reduce the need for explicit human guidance. The research, presented at the 2026 IEEE International Conference on Robotics and Automation, was supported in part by the MIT Generative AI Impact Consortium Award and the Department of Defense.
The source for this article is https://news.mit.edu/2026/llms-help-robots-understand-vague-instructions-and-focus-key-details-0626.