Figure 1: Overview of VIDAR.
VIDAR couples a foundation video diffusion model for video prediction with a masked inverse dynamics model (MIDM) for action regression. The diffusion model is trained on 750K multi-view bimanual videos; with test-time scaling (TTS) applied at inference, VIDAR adapts to new robot platforms from only 20 minutes of demonstrations, achieving state-of-the-art performance and generalizing to unseen tasks with strong semantic understanding.
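To make the two-stage design concrete, below is a minimal sketch of a VIDAR-style inference step: the diffusion model proposes several candidate future videos, a TTS scorer picks the best one, and the MIDM regresses actions from the chosen frames. All class names, tensor shapes, the 14-DoF bimanual action dimension, and the variance-based scorer are illustrative assumptions, not the released implementation.

import torch

class VideoDiffusion(torch.nn.Module):
    """Stand-in for the foundation video diffusion model."""
    def forward(self, obs: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # The real model denoises video latents conditioned on the observed
        # frames and the language instruction; here we return random frames.
        b, t, c, h, w = obs.shape
        return torch.randn(b, 16, c, h, w)  # 16 predicted future frames

class MaskedIDM(torch.nn.Module):
    """Stand-in for the masked inverse dynamics model (MIDM)."""
    def __init__(self, action_dim: int = 14):  # 14-DoF bimanual action (assumed)
        super().__init__()
        self.head = torch.nn.LazyLinear(action_dim)
    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # The real MIDM masks task-irrelevant pixels before regressing actions;
        # here, a single linear head over flattened frames fixes the data flow.
        b, t = frames.shape[:2]
        return self.head(frames.reshape(b, t, -1))

def score(video: torch.Tensor) -> float:
    # Placeholder for the test-time-scaling scorer that ranks candidates.
    return -video.var().item()

@torch.no_grad()
def vidar_step(obs, text_emb, diffusion, midm, n_candidates: int = 4):
    # Test-time scaling: sample several candidate futures, keep the best,
    # then regress the action sequence from the selected frames.
    candidates = [diffusion(obs, text_emb) for _ in range(n_candidates)]
    best = max(candidates, key=score)
    return midm(best)  # (batch, horizon, action_dim)

obs = torch.randn(1, 2, 3, 224, 224)   # two conditioning frames (assumed shape)
text = torch.randn(1, 512)             # instruction embedding (assumed size)
actions = vidar_step(obs, text, VideoDiffusion(), MaskedIDM())
print(actions.shape)  # torch.Size([1, 16, 14])

In the real system the scorer and both models are learned; the stubs above only fix the prediction-then-regression data flow.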
Figure 2: Method overview of VIDAR.
Figure 3: Overview of AnyPos.
AnyPos is a robot-specific image-to-action model trained entirely on task-agnostic trajectories sampled by ATARA. It integrates two key techniques to enhance performance: Arm-Decoupled Estimation and a Direction-Aware Decoder (DAD). Together, ATARA and AnyPos constitute a fully task-agnostic framework for training inverse dynamics models (IDMs) without goal supervision. By combining scalable unsupervised data collection with physically informed learning architectures, our approach demonstrates that task-agnostic action data can serve as a practical and powerful foundation for generalizable manipulation.
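As a concrete illustration of the arm-decoupled idea, the sketch below regresses each arm's actions from an image in which the other arm is masked out, so the two estimates cannot interfere. The module name, mask source, and 7-DoF per-arm action dimension are assumptions for illustration; this is not the paper's exact architecture, and the Direction-Aware Decoder is omitted.

import torch

class ArmDecoupledIDM(torch.nn.Module):
    def __init__(self, per_arm_dim: int = 7):  # 7 DoF per arm (assumed)
        super().__init__()
        self.left_head = torch.nn.LazyLinear(per_arm_dim)
        self.right_head = torch.nn.LazyLinear(per_arm_dim)

    def forward(self, image, left_mask, right_mask):
        # Zero out pixels belonging to the other arm before each head,
        # decoupling the left and right action estimates.
        left_view = (image * left_mask).flatten(1)
        right_view = (image * right_mask).flatten(1)
        return torch.cat([self.left_head(left_view),
                          self.right_head(right_view)], dim=-1)

img = torch.randn(1, 3, 224, 224)
left_m = torch.ones(1, 1, 224, 224)   # in practice, per-arm segmentation
right_m = torch.ones(1, 1, 224, 224)  # masks from an off-the-shelf segmenter
print(ArmDecoupledIDM()(img, left_m, right_m).shape)  # torch.Size([1, 14])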
Figure 4: Method overview of AnyPos. Using ATARA, we collect a task-agnostic training dataset covering the entire cubic workspace of the dual robot arms.
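Below is a minimal sketch of ATARA-style task-agnostic collection, assuming a hypothetical bimanual-environment interface: end-effector targets are sampled uniformly over a cubic workspace and executed, and (image, action) pairs are recorded with no task or goal labels. Workspace bounds and all names are illustrative, not taken from the released code.

import random
from dataclasses import dataclass

@dataclass
class Sample:
    image: list    # camera observation (placeholder type)
    action: tuple  # commanded end-effector positions for both arms

# Per-arm cubic workspace bounds in metres (illustrative values).
CUBE = {"x": (-0.3, 0.3), "y": (0.1, 0.5), "z": (0.0, 0.4)}

def sample_target() -> tuple:
    # Uniform sampling covers the whole cube without task supervision.
    return tuple(random.uniform(lo, hi) for lo, hi in CUBE.values())

class DummyEnv:
    """Stub standing in for a real bimanual robot interface."""
    def render(self):
        return []
    def move_arms(self, left, right):
        pass

def collect(env, n: int) -> list:
    data = []
    for _ in range(n):
        left, right = sample_target(), sample_target()
        image = env.render()         # observe the scene before the motion
        env.move_arms(left, right)   # execute the sampled target positions
        data.append(Sample(image, left + right))
    return data

dataset = collect(DummyEnv(), n=100)
print(len(dataset))  # 100 task-agnostic (image, action) pairs

Because the sampled positions themselves are the action labels, no human demonstrations or goal annotations are needed to supervise the IDM.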
@misc{feng2025generalistbimanualmanipulationfoundation,
      title={Generalist Bimanual Manipulation via Foundation Video Diffusion Models},
      author={Yao Feng and Hengkai Tan and Xinyi Mao and Guodong Liu and Shuhe Huang and Chendong Xiang and Hang Su and Jun Zhu},
      year={2025},
      eprint={2507.12898},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.12898},
}
@misc{tan2025anyposautomatedtaskagnosticactions,
      title={AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation},
      author={Hengkai Tan and Yao Feng and Xinyi Mao and Shuhe Huang and Guodong Liu and Zhongkai Hao and Hang Su and Jun Zhu},
      year={2025},
      eprint={2507.12768},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.12768},
}
Thank you!