People very often learn new skills by watching others do them first; a familiar example is following a YouTube tutorial to pick up or improve a skill. What if robots could learn the same way? Today, however, the predominant paradigm for teaching robots is to remote-control them with specialized teleoperation hardware and then train them to imitate pre-recorded demonstrations.
If robots could instead learn new tasks on their own by watching humans, they could be deployed in more unstructured settings like the home, and it would become dramatically easier for anyone, expert or otherwise, to teach or communicate with them. Perhaps one day they might even be able to use YouTube videos to grow their collection of skills over time.
The biggest impediment is obvious but often overlooked: a robot is physically different from a human, so it often completes tasks differently than we do. The images below illustrate this with a pen manipulation task, where a human hand outperforms a gripper robot simply because it can grip all the pens at the same time. The gap is not only one of performance; it is also unclear how exactly a robot should approach the task if it is to mimic the human's approach.
Left: The hand grabs all pens and quickly transfers them between containers. Right: The two-fingered gripper transports one pen at a time. (Image source)
Cross-Embodiment Inverse Reinforcement Learning (XIRL)
At the Conference on Robot Learning (CoRL) 2021, Kevin Zakka, Andy Zeng, Pete Florence, Jonathan Tompson, Jeannette Bohg, and Debidatta Dwibedi presented XIRL as an oral paper. In their words: "We explore these challenges further and introduce a self-supervised method for Cross-embodiment Inverse Reinforcement Learning (XIRL). Rather than focusing on how individual human actions should correspond to robot actions, XIRL learns the high-level task objective from videos, and summarizes that knowledge in the form of a reward function that is invariant to embodiment differences, such as shape, actions and end-effector dynamics. The learned rewards can then be used together with reinforcement learning to teach the task to agents with new physical embodiments through trial and error. Our approach is general and scales autonomously with data — the more embodiment diversity presented in the videos, the more invariant and robust the reward functions become. Experiments show that our learned reward functions lead to significantly more sample efficient (roughly 2 to 4 times) reinforcement learning on new embodiments compared to alternative methods. To extend and build on our work, we are releasing an accompanying open-source implementation of our method along with X-MAGICAL, our new simulated benchmark for cross-embodiment imitation.
The underlying observation in this work is that in spite of the many differences induced by different embodiments, there still exist visual cues that reflect progression towards a common task objective. For example, in the pen manipulation task above, the presence of pens in the cup but not the mug, or the absence of pens on the table, are key frames that are common to different embodiments and indirectly provide cues for how close to being complete a task is. The key idea behind XIRL is to automatically discover these key moments in videos of different length and cluster them meaningfully to encode task progression. This motivation shares many similarities with unsupervised video alignment research, from which we can leverage a method called Temporal Cycle Consistency (TCC), which aligns videos accurately while learning useful visual representations for fine-grained video understanding without requiring any ground-truth correspondences."
XIRL self-supervises reward functions from expert demonstrations using temporal cycle consistency (TCC), then uses them for downstream reinforcement learning to learn new skills from third-person demonstrations.
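Concretely, the method distills this idea into a simple reward: a frame encoder trained with TCC maps each video frame to an embedding, the embeddings of the final frames of the demonstration videos are averaged into a single goal embedding, and the reward for a new observation is the negative distance to that goal. Below is a minimal sketch of that computation; the `encoder` function and the `scale` constant are placeholders standing in for the trained TCC network and the distance-normalization term described in the paper.

```python
import numpy as np

def goal_embedding(encoder, demo_videos):
    """Average the TCC embeddings of the final frame of each demonstration video."""
    final_embeddings = [encoder(video[-1]) for video in demo_videos]
    return np.mean(final_embeddings, axis=0)

def xirl_reward(encoder, frame, goal_emb, scale=1.0):
    """Reward = negative (scaled) Euclidean distance to the goal embedding.

    `encoder` is assumed to map a single image frame to an embedding vector;
    `scale` stands in for the normalization constant used in the paper.
    """
    return -np.linalg.norm(encoder(frame) - goal_emb) / scale
```

Because the embedding space is trained to align task progress across videos, the same reward signal applies whether the observation comes from a human hand or a two-fingered gripper.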
Results and highlights
To evaluate XIRL and baseline alternatives (e.g., TCN, LIFS, Goal Classifier) in a consistent environment, the authors created X-MAGICAL, a simulated benchmark for cross-embodiment imitation.
The task: A simplified 2D equivalent of a common household robotic sweeping task, where an agent has to push three objects into a goal zone in the environment.
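For reference, X-MAGICAL exposes this task as standard Gym environments, one per embodiment. The snippet below is a minimal sketch of loading one of them; the exact environment ID string is an assumption and may differ between versions of the benchmark.

```python
import gym
import xmagical  # https://github.com/kevinzakka/x-magical

# Register the X-MAGICAL tasks with Gym; must be called before gym.make.
xmagical.register_envs()

# Environment ID is illustrative: sweeping task, gripper embodiment,
# state observations, allocentric (third-person) view.
env = gym.make("SweepToTop-Gripper-State-Allo-Demo-v0")

obs = env.reset()
done = False
while not done:
    obs, reward, done, info = env.step(env.action_space.sample())
env.close()
```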
In their first set of experiments, the authors checked whether the learned embodiment-invariant reward function can enable successful reinforcement learning when the expert demonstrations come from the agent's own embodiment. The result? XIRL significantly outperforms the alternative methods, especially on the harder embodiments (e.g., short-stick and gripper).
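To picture how these rewards drive the trial-and-error stage, one way to wire them in is as a reward-relabeling wrapper: the environment's own reward is discarded and replaced by the embedding-distance reward, and a standard RL algorithm is then trained in the wrapped environment. The sketch below reuses the hypothetical embedding-distance computation from earlier and assumes pixel frames are available via `render`; it is an illustration under those assumptions, not the released implementation.

```python
import gym
import numpy as np

class LearnedRewardWrapper(gym.Wrapper):
    """Replace the environment reward with the learned embedding-distance reward."""

    def __init__(self, env, encoder, goal_emb, scale=1.0):
        super().__init__(env)
        self.encoder = encoder    # TCC-trained frame encoder (assumed)
        self.goal_emb = goal_emb  # mean embedding of demonstration goal frames
        self.scale = scale        # distance normalization constant

    def step(self, action):
        obs, _, done, info = self.env.step(action)
        frame = self.env.render(mode="rgb_array")  # reward is computed from pixels
        reward = -np.linalg.norm(self.encoder(frame) - self.goal_emb) / self.scale
        return obs, reward, done, info
```

Training a standard RL agent inside such a wrapped environment is what lets an embodiment that never appears in the demonstrations learn the task purely from the learned reward signal.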
For more details and experiments, check out their paper and download the code from the GitHub repository.
Conclusion
XIRL learns an embodiment-invariant reward function that encodes task progress using a temporal cycle-consistency objective. Policies learned using the reward functions are significantly more sample-efficient than baseline alternatives. Furthermore, the reward functions do not require manually paired video frames between the demonstrator and the learner, giving them the ability to scale to an arbitrary number of embodiments or experts with varying skill levels.
Sources:
- https://arxiv.org/abs/2106.03911
- https://github.com/google-research/google-research/tree/master/xirl
- https://github.com/kevinzakka/x-magical