Figure: Overview of Vidar.
Vidar includes a video diffusion foundation model for video prediction and a masked inverse dynamics model (MIDM) for action regression. The video diffusion model is trained on 750K multi-view bimanual videos and uses test-time scaling (TTS) at inference. With only 20 minutes of demonstrations, Vidar adapts to new robot platforms with state-of-the-art performance and generalizes to unseen tasks with strong semantic understanding.
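To make the pipeline concrete, below is a minimal sketch of the predict-then-regress inference loop, with test-time scaling rendered as simple best-of-N candidate selection. All interfaces here (`video_model`, `midm`, `sample`, `score`, `predict_actions`) are illustrative assumptions, not the released API.

```python
# Minimal sketch of Vidar-style predict-then-regress inference.
# All interfaces are illustrative placeholders, not the released API;
# TTS is shown here as best-of-N candidate selection.
import numpy as np

def vidar_step(video_model, midm, obs_frames, instruction, num_candidates=4):
    """One control step: predict future video, then regress the actions.

    video_model: language- and image-conditioned video diffusion model (assumed)
    midm:        masked inverse dynamics model mapping predicted frames to actions
    obs_frames:  recent multi-view observations, e.g. shape (views, T, H, W, 3)
    """
    rng = np.random.default_rng()
    # Test-time scaling as best-of-N: sample several candidate futures and
    # keep the one the model scores highest (a stand-in for a learned verifier).
    candidates = [
        video_model.sample(obs_frames, instruction, seed=int(rng.integers(1 << 31)))
        for _ in range(num_candidates)
    ]
    scores = [video_model.score(v, instruction) for v in candidates]
    best_video = candidates[int(np.argmax(scores))]
    # The masked IDM attends to task-relevant regions of consecutive predicted
    # frames and regresses the bimanual action chunk that realizes them.
    actions = midm.predict_actions(best_video)  # shape (horizon, action_dim)
    return actions
```

Best-of-N is only one way to spend test-time compute; the key point is that the extra sampling happens in video space, before any action is committed to the robot.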
Figure: Methods of Vidar.
Figure: Examples of Vidar.
Lift Dish
Place Apple
Trash Paper Ball
Wipe Table
Task | Clean (Standard) | Randomized (Standard) | Clean (Low-Data) | Randomized (Low-Data) |
---|---|---|---|---|
Adjust Bottle | 63.0% | 34.0% | 100.0% | 65.0% |
Beat Block Hammer | 93.0% | 11.0% | 85.0% | 10.0% |
Blocks Ranking RGB | 52.0% | 2.0% | 55.0% | 0.0% |
Blocks Ranking Size | 21.0% | 0.0% | 35.0% | 0.0% |
Click Alarmclock | 95.0% | 58.0% | 100.0% | 35.0% |
Click Bell | 100.0% | 54.0% | 95.0% | 25.0% |
Dump Bin Bigbin | 72.0% | 6.0% | 50.0% | 10.0% |
Grab Roller | 96.0% | 28.0% | 100.0% | 30.0% |
Handover Block | 2.0% | 0.0% | 5.0% | 0.0% |
Handover Mic | 24.0% | 6.0% | 0.0% | 0.0% |
Hanging Mug | 1.0% | 0.0% | 0.0% | 0.0% |
Lift Pot | 93.0% | 6.0% | 90.0% | 10.0% |
Move Can Pot | 48.0% | 0.0% | 60.0% | 0.0% |
Move Pillbottle Pad | 72.0% | 3.0% | 70.0% | 20.0% |
Move Playingcard Away | 97.0% | 17.0% | 100.0% | 40.0% |
Move Stapler Pad | 28.0% | 4.0% | 35.0% | 0.0% |
Open Laptop | 73.0% | 21.0% | 50.0% | 30.0% |
Open Microwave | 43.0% | 3.0% | 20.0% | 0.0% |
Pick Diverse Bottles | 67.0% | 5.0% | 55.0% | 0.0% |
Pick Dual Bottles | 87.0% | 17.0% | 85.0% | 15.0% |
Place A2B Left | 86.0% | 10.0% | 45.0% | 10.0% |
Place A2B Right | 91.0% | 11.0% | 55.0% | 15.0% |
Place Bread Basket | 82.0% | 7.0% | 75.0% | 15.0% |
Place Bread Skillet | 79.0% | 6.0% | 85.0% | 10.0% |
Place Burger Fries | 93.0% | 13.0% | 80.0% | 5.0% |
Place Can Basket | 38.0% | 3.0% | 50.0% | 0.0% |
Place Cans Plasticbox | 69.0% | 13.0% | 0.0% | 0.0% |
Place Container Plate | 98.0% | 21.0% | 100.0% | 55.0% |
Place Dual Shoes | 9.0% | 1.0% | 0.0% | 0.0% |
Place Empty Cup | 92.0% | 22.0% | 100.0% | 20.0% |
Place Fan | 55.0% | 7.0% | 45.0% | 0.0% |
Place Mouse Pad | 74.0% | 11.0% | 60.0% | 10.0% |
Place Object Basket | 55.0% | 3.0% | 35.0% | 10.0% |
Place Object Scale | 75.0% | 13.0% | 85.0% | 0.0% |
Place Object Stand | 90.0% | 12.0% | 95.0% | 35.0% |
Place Phone Stand | 82.0% | 16.0% | 75.0% | 25.0% |
Place Shoe | 89.0% | 20.0% | 80.0% | 40.0% |
Press Stapler | 98.0% | 49.0% | 90.0% | 40.0% |
Put Bottles Dustbin | 3.0% | 0.0% | 0.0% | 0.0% |
Put Object Cabinet | 22.0% | 1.0% | 0.0% | 0.0% |
Rotate QRcode | 65.0% | 0.0% | 65.0% | 10.0% |
Scan Object | 47.0% | 5.0% | 45.0% | 5.0% |
Shake Bottle | 99.0% | 64.0% | 100.0% | 65.0% |
Shake Bottle Horizontally | 99.0% | 58.0% | 100.0% | 60.0% |
Stack Blocks Three | 25.0% | 2.0% | 15.0% | 0.0% |
Stack Blocks Two | 90.0% | 10.0% | 80.0% | 5.0% |
Stack Bowls Three | 39.0% | 3.0% | 45.0% | 15.0% |
Stack Bowls Two | 92.0% | 22.0% | 95.0% | 35.0% |
Stamp Seal | 68.0% | 7.0% | 50.0% | 0.0% |
Turn Switch | 60.0% | 30.0% | 60.0% | 10.0% |
Average | 65.8% | 14.3% | 60.0% | 15.7% |
Success rates of Vidar on the RoboTwin 2.0 benchmark. Standard columns: the standard leaderboard setting, trained in the clean scenario with 50 episodes per task; success rates are averaged over 100 episodes. Low-Data columns: trained in the clean scenario with 20 episodes and adjusted camera views per task; success rates are averaged over 20 episodes. "Clean" and "Randomized" refer to evaluation in the clean and domain-randomized scenarios, respectively.
Figure: From dream world prediction to real world execution.
Figure: Overview of AnyPos.
AnyPos is a robot-specific image-to-action model trained entirely on task-agnostic trajectories sampled by ATARA (Automated Task-Agnostic Random Actions). It integrates two key techniques to enhance performance: Arm-Decoupled Estimation and a Direction-Aware Decoder (DAD). Together, ATARA and AnyPos constitute a fully task-agnostic framework for training IDMs without goal supervision. By combining scalable unsupervised data collection with physically informed learning architectures, our approach demonstrates that task-agnostic action data can serve as a practical and powerful foundation for generalizable manipulation.
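For intuition, here is a rough sketch of what task-agnostic collection in the spirit of ATARA can look like: sample random end-effector targets inside a box-shaped workspace and record image-action pairs with no task labels. The `robot` and `camera` interfaces and the workspace bounds are assumptions for illustration, not the paper's implementation.

```python
# Illustrative task-agnostic data collection in the spirit of ATARA.
# The `robot` and `camera` interfaces are hypothetical, not the paper's code.
import numpy as np

WORKSPACE_LOW = np.array([-0.3, -0.4, 0.0])   # assumed workspace bounds (m)
WORKSPACE_HIGH = np.array([0.3, 0.4, 0.5])

def collect_task_agnostic(robot, camera, num_samples=10_000, seed=0):
    rng = np.random.default_rng(seed)
    dataset = []
    for _ in range(num_samples):
        # Sample a reachable target for each arm uniformly over the workspace.
        targets = {arm: rng.uniform(WORKSPACE_LOW, WORKSPACE_HIGH)
                   for arm in ("left", "right")}
        for arm, target in targets.items():
            robot.move_to(arm, target)    # assumed position-control API
        image = camera.capture()          # multi-view RGB observation
        action = robot.joint_positions()  # supervision: the executed action
        dataset.append((image, action))   # image-to-action pair, no task labels
    return dataset
```

Because no task or goal labels are required, a loop like this scales with robot time rather than teleoperator time, which is what makes whole-workspace coverage feasible.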
Our AnyPos method achieves 57.13% action-prediction accuracy on a test set that includes unseen skills and objects, surpassing previous approaches (the naïve ResNet+MLP used in UniPi, UniSim, RoboDreamer, and SuSIE) by 51%. In real-world robot replay tests, AnyPos-ATARA achieves a 92.59% task success rate (as shown in the figure below), a 33% improvement over training on a human-collected dataset and a 44% improvement over previous approaches (see the figure "Overview of AnyPos").
Figure: AnyPos-ATARA accomplishing various manipulation tasks via video replay.
Standard VLA models learn temporally extended policies $\pi_\theta(a_{T:T+H'} \mid o_{T-H:T}, \ell)$, where $\theta$ represents the parameters of the VLA policy, $T$ is the current timestep, $H$ denotes the history window size, and $H'$ is the length of the predicted action sequence, mapping observation histories $o_{T-H:T}$ and language commands $\ell$ to action sequences $a_{T:T+H'}$. Given an expert dataset $\mathcal{D}$, the training objective of VLAs is to maximize the likelihood:
$$\theta^{*} = \arg\max_{\theta}\; \mathbb{E}_{(o_{T-H:T},\, \ell,\, a_{T:T+H'}) \sim \mathcal{D}}\left[\log \pi_{\theta}\!\left(a_{T:T+H'} \mid o_{T-H:T}, \ell\right)\right]$$
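In code, this objective is ordinary maximum-likelihood behavior cloning. The sketch below assumes a policy network that returns a `torch.distributions.Distribution` over action chunks; the names are illustrative, not an existing codebase.

```python
# Minimal behavior-cloning step for the VLA objective above (illustrative).
import torch

def vla_training_step(policy, batch, optimizer):
    """policy(obs, lang) is assumed to return a torch.distributions.Distribution
    over action chunks; batch holds expert (obs, lang, actions) tensors."""
    dist = policy(batch["obs"], batch["lang"])
    loss = -dist.log_prob(batch["actions"]).mean()  # maximize log-likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```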
Figure: Decomposing task-specific action prediction.
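A plausible way to write this decomposition, assuming a UniPi-style factorization through a predicted video $v$ (our notation; the paper's exact equation may differ):

```latex
% Task-specific action prediction factored through task-agnostic video
% prediction and a robot-specific inverse dynamics model (illustrative):
\[
\pi_\theta\!\left(a_{T:T+H'} \mid o_{T-H:T}, \ell\right)
  = \int \underbrace{p_\phi\!\left(v_{T:T+H'} \mid o_{T-H:T}, \ell\right)}_{\text{video diffusion (task-agnostic)}}
    \; \underbrace{p_\psi\!\left(a_{T:T+H'} \mid v_{T:T+H'}\right)}_{\text{inverse dynamics (robot-specific)}}
    \, \mathrm{d}v_{T:T+H'}
\]
```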
Figure: Methods of AnyPos. We obtain a task-agnostic training dataset covering the entire cubic workspace of dual robotic arms using ATARA.
@misc{feng2025vidarembodiedvideodiffusion,
title={Vidar: Embodied Video Diffusion Model for Generalist Bimanual Manipulation},
author={Yao Feng and Hengkai Tan and Xinyi Mao and Guodong Liu and Shuhe Huang and Chendong Xiang and Hang Su and Jun Zhu},
year={2025},
eprint={2507.12898},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2507.12898},
}
@misc{tan2025anyposautomatedtaskagnosticactions,
title={AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation},
author={Hengkai Tan and Yao Feng and Xinyi Mao and Shuhe Huang and Guodong Liu and Zhongkai Hao and Hang Su and Jun Zhu},
year={2025},
eprint={2507.12768},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.12768},
}
Thank you!