Figure: Overview of Vidar.
Vidar includes a video diffusion foundation model for video prediction and a masked inverse dynamics model (MIDM) for action regression. The video diffusion model is trained on 750K multi-view bimanual videos, with test-time scaling (TTS) applied at inference. It adapts to new robot platforms with only 20 minutes of demonstrations, achieving state-of-the-art performance, and generalizes to unseen tasks with strong semantic understanding.
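To make the two-stage pipeline concrete, below is a minimal PyTorch sketch of the inference flow. It is a sketch under stated assumptions: VideoDiffusionModel, MaskedIDM, the 14-dimensional bimanual action space, and the candidate-sampling form of test-time scaling are all illustrative placeholders, not Vidar's released interfaces.

# Minimal sketch of Vidar's two-stage inference (assumed interfaces, not the
# released API): the diffusion model "dreams" a future multi-view video from
# the current observation and a language command, and the masked inverse
# dynamics model regresses the actions that realize it.
import torch
import torch.nn as nn


class VideoDiffusionModel(nn.Module):
    """Placeholder for the video diffusion foundation model."""

    def forward(self, obs_frames: torch.Tensor, text_emb: torch.Tensor,
                horizon: int) -> torch.Tensor:
        # Real model: iterative denoising conditioned on observation + language.
        b, v, c, h, w = obs_frames.shape
        return torch.randn(b, horizon, v, c, h, w)


class MaskedIDM(nn.Module):
    """Placeholder masked inverse dynamics model (MIDM)."""

    def __init__(self, action_dim: int = 14):  # 14-DoF bimanual setup (assumed)
        super().__init__()
        self.action_dim = action_dim

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # Real model: masks task-irrelevant pixels, then regresses the actions
        # between consecutive predicted frames.
        b, t = video.shape[:2]
        return torch.zeros(b, t, self.action_dim)


def vidar_step(obs, text_emb, vdm, midm, horizon=16, n_samples=4):
    """Predict candidate futures (a naive stand-in for test-time scaling),
    then decode actions from the selected candidate."""
    candidates = [vdm(obs, text_emb, horizon) for _ in range(n_samples)]
    video = candidates[0]  # real TTS would score and select among candidates
    return midm(video)


obs = torch.randn(1, 2, 3, 64, 64)   # batch, views, C, H, W
text = torch.randn(1, 512)           # language embedding
actions = vidar_step(obs, text, VideoDiffusionModel(), MaskedIDM())
print(actions.shape)                 # (1, 16, 14)

Note the division of labor the sketch preserves: language conditioning touches only the video model, while the inverse dynamics model maps pixels to actions, which is what allows the IDM to be trained task-agnostically (as in AnyPos below).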
Figure: Methods of Vidar.
Figure: Examples of Vidar.
Figure: From dream-world prediction to real-world execution.
Figure: Overview of AnyPos.
AnyPos is a robot-specific image-to-action model trained entirely on task-agnostic trajectories sampled by ATARA. It integrates two key techniques to enhance performance: Arm-Decoupled Estimation and a Direction-Aware Decoder (DAD). Together, ATARA and AnyPos constitute a fully task-agnostic framework for training IDMs without goal supervision. By combining scalable unsupervised data collection with physically informed learning architectures, our approach demonstrates that task-agnostic action data can serve as a practical and powerful foundation for generalizable manipulation.
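The sketch below gestures at the arm-decoupled idea: instead of regressing one joint vector for both arms from the full image, attend to each arm separately and predict its half of the action with its own head. All module names, crops, and shapes here are illustrative assumptions; the direction/magnitude split only hints at what a direction-aware decoder might do and is not the paper's DAD.

# Hedged sketch of arm-decoupled estimation (assumed architecture, not AnyPos's).
import torch
import torch.nn as nn


class ArmDecoupledHead(nn.Module):
    def __init__(self, feat_dim: int = 256, joints_per_arm: int = 7):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )
        # Stand-in for a direction-aware decoder: predict a signed direction
        # and a non-negative magnitude per joint, then combine them.
        self.direction = nn.Linear(feat_dim, joints_per_arm)
        self.magnitude = nn.Linear(feat_dim, joints_per_arm)

    def forward(self, arm_crop: torch.Tensor) -> torch.Tensor:
        f = self.backbone(arm_crop)
        return torch.tanh(self.direction(f)) * torch.relu(self.magnitude(f))


class AnyPosLike(nn.Module):
    def __init__(self):
        super().__init__()
        self.left, self.right = ArmDecoupledHead(), ArmDecoupledHead()

    def forward(self, left_crop, right_crop):
        # Decoupled estimation: each arm's action is inferred independently,
        # avoiding cross-arm confounding in the image-to-action mapping.
        return torch.cat([self.left(left_crop), self.right(right_crop)], dim=-1)


model = AnyPosLike()
a = model(torch.randn(1, 3, 128, 128), torch.randn(1, 3, 128, 128))
print(a.shape)  # (1, 14)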
Our AnyPos method achieves 57.13% action-prediction accuracy on the test set, which includes unseen skills and objects, surpassing previous approaches (the naïve ResNet+MLP used in UniPi, UniSim, RoboDreamer, and SuSIE) by 51%. In real-world robot replay tests, AnyPos-ATARA achieves a 92.59% task success rate (as shown in the figure below), a 33% improvement over the human-collected dataset and a 44% improvement over previous approaches (as shown in the figure "Overview of AnyPos").
Figure: AnyPos-ATARA accomplishing various manipulation tasks via video replay.
Standard VLA models learn temporally extended policies $\pi_\theta(\mathbf{a} \mid \mathbf{o}_{T-H:T}, \ell)$, where $\theta$ represents the parameters of the VLA policy, $T$ is the current timestep, and $H$ denotes the history window size; the policy maps observation histories $\mathbf{o}_{T-H:T}$ and language commands $\ell$ to action sequences $\mathbf{a}$. Given an expert dataset $\mathcal{D}$, the training objective of VLAs is to maximize the likelihood:

$$\max_{\theta} \ \mathbb{E}_{(\mathbf{o}_{T-H:T},\, \ell,\, \mathbf{a}) \sim \mathcal{D}} \big[ \log \pi_\theta(\mathbf{a} \mid \mathbf{o}_{T-H:T}, \ell) \big]$$
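As a concrete instance of this objective, here is a minimal training-step sketch assuming a unit-variance Gaussian policy head, under which maximizing log-likelihood reduces to minimizing MSE against expert actions (a common choice; the specific VLA head is not specified here, and the linear policy is a stand-in).

# Behavior-cloning step implementing the maximum-likelihood objective above.
import torch
import torch.nn as nn

policy = nn.Linear(512, 14)  # stand-in for pi_theta: fused obs+language features -> action
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

def train_step(obs_lang_features: torch.Tensor, expert_actions: torch.Tensor):
    # -log pi_theta(a | o, l) for a unit-variance Gaussian = 0.5 * ||a_hat - a||^2 + const
    pred = policy(obs_lang_features)
    loss = 0.5 * (pred - expert_actions).pow(2).sum(-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

print(train_step(torch.randn(8, 512), torch.randn(8, 14)))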
Figure 4: Decomposing task-specific actions.
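The decomposition behind this figure, and behind the Vidar/AnyPos split, is that a task-specific policy factors into language-conditioned future prediction and task-agnostic inverse dynamics. A hedged LaTeX sketch of that factorization follows (notation assumed; the paper's exact formulation may differ):

% Sketch of the decomposition (notation assumed, not the paper's exact formula).
% o_T: current observation, \ell: language command, o': predicted future observation.
\pi_\theta(\mathbf{a} \mid \mathbf{o}_T, \ell)
  = \int \underbrace{p(\mathbf{a} \mid \mathbf{o}_T, \mathbf{o}')}_{\text{task-agnostic IDM (AnyPos)}}
         \, \underbrace{p(\mathbf{o}' \mid \mathbf{o}_T, \ell)}_{\text{video prediction (Vidar)}}
         \, \mathrm{d}\mathbf{o}'

Under this factorization, only the video predictor needs language (task) supervision; the IDM can be trained on task-agnostic trajectories, which is exactly what ATARA supplies.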
Figure: Methods of AnyPos. Using ATARA, we obtain a task-agnostic training dataset that covers the entire cubic workspace of the dual robotic arms.
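A hedged sketch of ATARA-style task-agnostic collection: sample end-effector targets uniformly over each arm's cubic workspace, move there, and record (image, action) pairs with no task labels. The workspace bounds and the solve_ik / robot / camera objects are hypothetical stand-ins, not a real API.

# Task-agnostic data collection over a cubic workspace (illustrative only).
import random

WORKSPACE = {"x": (0.2, 0.6), "y": (-0.3, 0.3), "z": (0.05, 0.4)}  # meters (assumed)

def sample_target():
    """Uniformly sample an end-effector position inside the cubic workspace."""
    return tuple(random.uniform(*WORKSPACE[k]) for k in ("x", "y", "z"))

def collect(robot, camera, solve_ik, n_samples=10_000):
    dataset = []
    for _ in range(n_samples):
        left, right = sample_target(), sample_target()
        joints = solve_ik(left, right)   # joint-space action for both arms
        if joints is None:               # skip unreachable or colliding poses
            continue
        robot.move_to(joints)
        dataset.append({"image": camera.read(), "action": joints})
    return dataset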
@misc{feng2025vidarembodiedvideodiffusion,
  title={Vidar: Embodied Video Diffusion Model for Generalist Bimanual Manipulation},
  author={Yao Feng and Hengkai Tan and Xinyi Mao and Guodong Liu and Shuhe Huang and Chendong Xiang and Hang Su and Jun Zhu},
  year={2025},
  eprint={2507.12898},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2507.12898},
}
@misc{tan2025anyposautomatedtaskagnosticactions,
  title={AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation},
  author={Hengkai Tan and Yao Feng and Xinyi Mao and Shuhe Huang and Guodong Liu and Zhongkai Hao and Hang Su and Jun Zhu},
  year={2025},
  eprint={2507.12768},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.12768},
}