Overview

  • Vision-language-action (VLA) models have shown promise on task-conditioned control in complex settings such as bimanual manipulation.
  • However, the heavy reliance on task-specific human demonstrations limits their generalization and incurs high data acquisition costs.
  • We introduce a task-agnostic action paradigm that decouples action execution from task-specific conditioning, effectively overcoming these limitations.
  • To address the data collection challenges posed by this paradigm, we introduce ATARA (Automated Task-Agnostic Random Actions), a data collection framework that automatically and efficiently generates large-scale task-agnostic action data for bimanual manipulation.
  • By pairing a video generation model for future observation prediction with a downstream inverse dynamics model (IDM) for action regression, we achieve strong generalization and high data efficiency.
  • Vidar (Video Diffusion for Action Reasoning) is a two-stage framework that leverages a large-scale, diffusion-based video pre-training model and a novel Masked Inverse Dynamics Model (MIDM) for action prediction, and generalizes to an unseen robot platform with only 20 minutes of human demonstrations (1/81 of the demonstrations used by RDT, 1/1200 of those used by π0.5).
  • AnyPos is an inverse dynamics model equipped with Arm-Decoupled Estimation and a Direction-Aware Decoder (DAD) that learns from ATARA-generated task-agnostic data. Together, ATARA and AnyPos constitute a fully task-agnostic framework for training IDMs without goal supervision.
  • Our experiments demonstrate that our Vidar framework can generalize to unseen tasks and backgrounds with strong semantic understanding, surpassing state-of-the-art methods like VPP and UniPi by over 40%.
  • We additionally integrate a video-conditioned action validation module to verify the feasibility of learned policies across diverse manipulation tasks, and we demonstrate that the AnyPos-ATARA pipeline yields a 51% improvement in test accuracy and achieves 30-40% higher success rates in downstream tasks such as lifting, pick-and-place, and clicking, using replay-based video validation.

Vidar: Video Diffusion for Action Reasoning

Vidar Architecture

Figure: Overview of Vidar.

Vidar comprises a video diffusion foundation model for video prediction and a masked inverse dynamics model (MIDM) for action regression. The video diffusion model is pre-trained on 750K multi-view bimanual videos and uses test-time scaling (TTS) at inference. It adapts to a new robot platform with only 20 minutes of demonstrations while achieving state-of-the-art performance, and it generalizes to unseen tasks with strong semantic understanding.
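
To make the two-stage design concrete, below is a minimal sketch of how inference could be wired together: the video model predicts future frames and the MIDM regresses an action between consecutive frames. The class names, interfaces, and tensor shapes are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of Vidar-style two-stage inference (illustrative only).
# `VideoDiffusionModel` and `MaskedIDM` are placeholders standing in for the
# pre-trained video diffusion model and the MIDM; their interfaces are assumptions.
import torch
import torch.nn as nn


class VideoDiffusionModel(nn.Module):
    """Placeholder: predicts future frames from past frames + instruction."""
    def forward(self, past_frames: torch.Tensor, instruction: str) -> torch.Tensor:
        B, H, C, height, width = past_frames.shape
        t = 8  # number of predicted future frames
        return torch.rand(B, t, C, height, width)  # stand-in for diffusion sampling


class MaskedIDM(nn.Module):
    """Placeholder: regresses an action from a pair of consecutive frames."""
    def __init__(self, action_dim: int = 14):  # e.g., 2 x (6-DoF arm + gripper)
        super().__init__()
        self.head = nn.Linear(2 * 3 * 64 * 64, action_dim)

    def forward(self, frame_t: torch.Tensor, frame_t1: torch.Tensor) -> torch.Tensor:
        x = torch.cat([frame_t, frame_t1], dim=1).flatten(1)
        return self.head(x)


def vidar_step(video_model, idm, past_frames, instruction):
    """Predict a future video, then decode it into an action sequence."""
    future = video_model(past_frames, instruction)            # (B, t, C, H, W)
    frames = torch.cat([past_frames[:, -1:], future], dim=1)  # prepend last observation
    actions = [idm(frames[:, i], frames[:, i + 1]) for i in range(frames.shape[1] - 1)]
    return torch.stack(actions, dim=1)                        # (B, t, action_dim)


past = torch.rand(1, 4, 3, 64, 64)  # 4-frame observation history
acts = vidar_step(VideoDiffusionModel(), MaskedIDM(), past, "lift the dish")
print(acts.shape)  # torch.Size([1, 8, 14])
```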

Vidar Methods

Figure: Methods of Vidar.

Vidar Examples

Figure: Examples of Vidar.

Key Techniques

  • Video Generation Model: A rectified-flow video model with internet-video pre-training, embodied pre-training, and fine-tuning on a unified observation space.
  • Masked Inverse Dynamics Model (MIDM): Inverse dynamics models often suffer from poor generalization due to background noise, texture biases, and visual distractions in high-dimensional observations, whereas MIDM focuses on task-relevant regions of the input frame via implicit mask prediction.
  • Test-Time Scaling (TTS): We generate K candidate video trajectories using different random seeds, rank them with a pretrained evaluator (e.g., CLIP or a vision-language model), and select the highest-scoring one (see the sketch below).
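
Below is a minimal TTS sketch under stated assumptions: `sample_video` is a hypothetical stand-in for Vidar's video diffusion sampler, and scoring uses a CLIP model via the Hugging Face transformers API; the real evaluator and scoring rule may differ.

```python
# Minimal sketch of test-time scaling (TTS): sample K candidate videos with
# different seeds, score each against the instruction with CLIP, keep the best.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def sample_video(instruction: str, seed: int) -> list[Image.Image]:
    """Hypothetical sampler: returns the predicted frames for one random seed."""
    torch.manual_seed(seed)
    frames = torch.randint(0, 256, (8, 64, 64, 3), dtype=torch.uint8)
    return [Image.fromarray(f.numpy()) for f in frames]


def clip_score(frames: list[Image.Image], instruction: str) -> float:
    """Average image-text similarity over the frames of one candidate video."""
    inputs = processor(text=[instruction], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = model(**inputs).logits_per_image  # (num_frames, 1)
    return sims.mean().item()


def tts_select(instruction: str, k: int = 8) -> list[Image.Image]:
    candidates = [sample_video(instruction, seed) for seed in range(k)]
    scores = [clip_score(c, instruction) for c in candidates]
    return candidates[max(range(k), key=scores.__getitem__)]


best = tts_select("place the apple on the plate", k=4)
```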

Vidar Predicted Video Samples

Lift Dish

Place Apple

Trash Paper Ball

Wipe Table

Evaluation of Vidar on the RoboTwin 2.0 benchmark

Task | Clean (Std.) | Rand. (Std.) | Clean (L.D.) | Rand. (L.D.)
Adjust Bottle | 63.0% | 34.0% | 100.0% | 65.0%
Beat Block Hammer | 93.0% | 11.0% | 85.0% | 10.0%
Blocks Ranking RGB | 52.0% | 2.0% | 55.0% | 0.0%
Blocks Ranking Size | 21.0% | 0.0% | 35.0% | 0.0%
Click Alarmclock | 95.0% | 58.0% | 100.0% | 35.0%
Click Bell | 100.0% | 54.0% | 95.0% | 25.0%
Dump Bin Bigbin | 72.0% | 6.0% | 50.0% | 10.0%
Grab Roller | 96.0% | 28.0% | 100.0% | 30.0%
Handover Block | 2.0% | 0.0% | 5.0% | 0.0%
Handover Mic | 24.0% | 6.0% | 0.0% | 0.0%
Hanging Mug | 1.0% | 0.0% | 0.0% | 0.0%
Lift Pot | 93.0% | 6.0% | 90.0% | 10.0%
Move Can Pot | 48.0% | 0.0% | 60.0% | 0.0%
Move Pillbottle Pad | 72.0% | 3.0% | 70.0% | 20.0%
Move Playingcard Away | 97.0% | 17.0% | 100.0% | 40.0%
Move Stapler Pad | 28.0% | 4.0% | 35.0% | 0.0%
Open Laptop | 73.0% | 21.0% | 50.0% | 30.0%
Open Microwave | 43.0% | 3.0% | 20.0% | 0.0%
Pick Diverse Bottles | 67.0% | 5.0% | 55.0% | 0.0%
Pick Dual Bottles | 87.0% | 17.0% | 85.0% | 15.0%
Place A2B Left | 86.0% | 10.0% | 45.0% | 10.0%
Place A2B Right | 91.0% | 11.0% | 55.0% | 15.0%
Place Bread Basket | 82.0% | 7.0% | 75.0% | 15.0%
Place Bread Skillet | 79.0% | 6.0% | 85.0% | 10.0%
Place Burger Fries | 93.0% | 13.0% | 80.0% | 5.0%
Place Can Basket | 38.0% | 3.0% | 50.0% | 0.0%
Place Cans Plasticbox | 69.0% | 13.0% | 0.0% | 0.0%
Place Container Plate | 98.0% | 21.0% | 100.0% | 55.0%
Place Dual Shoes | 9.0% | 1.0% | 0.0% | 0.0%
Place Empty Cup | 92.0% | 22.0% | 100.0% | 20.0%
Place Fan | 55.0% | 7.0% | 45.0% | 0.0%
Place Mouse Pad | 74.0% | 11.0% | 60.0% | 10.0%
Place Object Basket | 55.0% | 3.0% | 35.0% | 10.0%
Place Object Scale | 75.0% | 13.0% | 85.0% | 0.0%
Place Object Stand | 90.0% | 12.0% | 95.0% | 35.0%
Place Phone Stand | 82.0% | 16.0% | 75.0% | 25.0%
Place Shoe | 89.0% | 20.0% | 80.0% | 40.0%
Press Stapler | 98.0% | 49.0% | 90.0% | 40.0%
Put Bottles Dustbin | 3.0% | 0.0% | 0.0% | 0.0%
Put Object Cabinet | 22.0% | 1.0% | 0.0% | 0.0%
Rotate QRcode | 65.0% | 0.0% | 65.0% | 10.0%
Scan Object | 47.0% | 5.0% | 45.0% | 5.0%
Shake Bottle | 99.0% | 64.0% | 100.0% | 65.0%
Shake Bottle Horizontally | 99.0% | 58.0% | 100.0% | 60.0%
Stack Blocks Three | 25.0% | 2.0% | 15.0% | 0.0%
Stack Blocks Two | 90.0% | 10.0% | 80.0% | 5.0%
Stack Bowls Three | 39.0% | 3.0% | 45.0% | 15.0%
Stack Bowls Two | 92.0% | 22.0% | 95.0% | 35.0%
Stamp Seal | 68.0% | 7.0% | 50.0% | 0.0%
Turn Switch | 60.0% | 30.0% | 60.0% | 10.0%
Average | 65.8% | 14.3% | 60.0% | 15.7%

Success rates of Vidar on the RoboTwin 2.0 benchmark. Std. columns: standard leaderboard setting (trained in the clean scenario with 50 episodes per task), success rates averaged over 100 episodes. L.D. columns: low-data setting (trained in the clean scenario with 20 episodes and adjusted camera views per task), success rates averaged over 20 episodes.

AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation

From Dream World Prediction to Real World Execution

Figure: From dream world prediction to real world execution.

Overview of AnyPos

Figure: Overview of AnyPos.

AnyPos is a robot-specific image-to-action model trained entirely on task-agnostic trajectories sampled by ATARA. It integrates two key techniques to enhance performance: Arm-Decoupled Estimation and Direction-Aware Decoder (DAD). Together, ATARA and AnyPos constitute a fully task-agnostic framework for training IDMs without goal supervision. By combining scalable unsupervised data collection with physically informed learning architectures, our approach demonstrates that task-agnostic action data can serve as a practical and powerful foundation for generalizable manipulation.

AnyPos achieves 57.13% action prediction accuracy on a test set that includes unseen skills and objects, surpassing previous approaches (the naïve ResNet+MLP used in UniPi, UniSim, RoboDreamer, and SuSIE) by 51%. In real-world robot replay tests, AnyPos-ATARA achieves a 92.59% task success rate (see the figure below), a 33% improvement over the human-collected dataset baseline, and surpasses previous approaches by 44% (see the figure in "Overview of AnyPos").

AnyPos-ATARA Video-Replay Results

Figure: The results of AnyPos-ATARA with video replay to accomplish various manipulation tasks.

ATARA: Automatically collecting data on target robots! (4x speed)

Task-Agnostic Action

Standard VLA models learn temporally extended policies $p_\theta(\bm{a}_{T+1:T+t} \mid \bm{x}_{T-H+1:T}, \mathbf{l})$, where $\theta$ denotes the parameters of the VLA policy, $T$ is the current timestep, and $H$ is the history window size, mapping observation histories and language commands to action sequences. Given an expert dataset $D_{\text{expert}}$, the training objective of VLAs is to maximize the likelihood:

The objective of VLAs

Figure 3: The objective of VLAs.
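
Figure 3 gives the exact objective; a standard maximum-likelihood form consistent with the notation above (stated here as an assumption, since the figure is not reproduced in text) is:

```latex
\max_{\theta}\;
\mathbb{E}_{(\bm{x}_{T-H+1:T},\, \mathbf{l},\, \bm{a}_{T+1:T+t}) \sim D_{\text{expert}}}
\Big[ \log p_{\theta}\big(\bm{a}_{T+1:T+t} \mid \bm{x}_{T-H+1:T}, \mathbf{l}\big) \Big]
```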

Here, all actions are task-dependent, i.e., Task-Specific Actions. The vastness of task language instructions and action spaces creates an enormous demand for action data in Vision-Language-Action (VLA) models.
In scenarios where robotic actions are position-controlled, the above formulation can be decomposed into a "future video prediction problem" and an "action execution problem." This enables the decoupling of the action modality from the embodied foundation model, shifting the requirement for high generalizability to the data-rich vision-language modality. Through our derivation, we propose the concept of Task-Agnostic Action, which significantly simplifies the learning of the action modality:
Decomposing task-specific actions.

Figure 4: Decomposing task-specific actions.
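
Figure 4 contains the precise derivation; a decomposition consistent with the surrounding text (our paraphrase, introducing the parameters $\phi$ and $\psi$ for clarity and using the predicted future observations as the intermediate variable) would factor the policy into a video prediction term and a task-agnostic inverse dynamics term:

```latex
p_{\theta}\big(\bm{a}_{T+1:T+t} \mid \bm{x}_{T-H+1:T}, \mathbf{l}\big)
\;\approx\;
\int p_{\psi}\big(\bm{a}_{T+1:T+t} \mid \bm{x}_{T+1:T+t}\big)\,
     p_{\phi}\big(\bm{x}_{T+1:T+t} \mid \bm{x}_{T-H+1:T}, \mathbf{l}\big)\,
     \mathrm{d}\bm{x}_{T+1:T+t}
```

Here $p_\phi$ is the video generation model and $p_\psi$ is the inverse dynamics model; only the video prediction term depends on the instruction $\mathbf{l}$, so the action model itself needs no task supervision.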

Benefits of task-agnostic action paradigm

  1. Data Efficiency and Reusable Motor Skills: Task-agnostic training avoids costly task-specific demonstrations, enabling large-scale unsupervised data collection. The inverse dynamics model (IDM), which learns a universal action prior $p(\bm{a}_i \mid \bm{x}_i)$, acts as a shared motor skill library for diverse tasks; a minimal training sketch follows this list.
  2. Zero-Shot Task Generalization: The IDM models task-independent action priors, allowing generalization to new tasks by adapting only the video generation model (e.g., via language prompts) without IDM retraining.
  3. Decoupled Planning and Low-Level Control: High-level planning (e.g., "open the drawer") is handled by a video generation model, while the IDM executes the visual trajectories. This modular approach simplifies policy design by framing manipulation as a vision-space prediction problem.
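
As referenced in item 1, here is a minimal sketch of training an IDM purely on task-agnostic data, assuming a simple single-frame regression of the action prior $p(\bm{a}_i \mid \bm{x}_i)$; the dataset, network, and shapes are illustrative placeholders, not the actual training pipeline.

```python
# Minimal sketch of IDM training on task-agnostic data: each sample is just an
# observation and the action (joint positions) that produced it, with no task
# label or language instruction.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for ATARA-style task-agnostic data: random frames and 14-D actions
# (e.g., joint positions of two 6-DoF arms plus two grippers).
frames = torch.rand(1024, 3, 64, 64)
actions = torch.rand(1024, 14)
loader = DataLoader(TensorDataset(frames, actions), batch_size=64, shuffle=True)

idm = nn.Sequential(  # simple CNN regressor standing in for the IDM
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 14),
)
opt = torch.optim.Adam(idm.parameters(), lr=1e-4)

for epoch in range(2):
    for x, a in loader:
        loss = nn.functional.mse_loss(idm(x), a)  # regress the action as a point estimate
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```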

Key Techniques

Methods of AnyPos

Figure: Methods of AnyPos. We obtain a task-agnostic training dataset covering the entire cubic workspace of dual robotic arms using ATARA.

  • ATARA: As shown in the Figure of "Methods of AnyPos", naïve joint-space sampling often results in inefficient coverage of reachable states, redundant or degenerate motions (e.g., arms exiting the field of view), and frequent self-collisions. To address these limitations, we propose ATARA, a reinforcement learning framework that constructs a coverage-aware mapping from end-effector space to joint space. This enables efficient task-agnostic data generation that preserves inherently encoded robot embodiment information and broad behavioral coverage, serving as a reusable prior for downstream policy learning.
  • Arm-Decoupled Estimation: We observe that when estimating left-arm joints, the model often attends to visual features of the right arm, and vice versa. To mitigate this, we isolate the input features per arm: we (1) split the left and right arms in the observation x, and (2) use two sub-networks to estimate joint positions for each arm independently. Gripper poses are estimated by specialized networks. This decoupling reduces the visual hypothesis space, improves estimation accuracy, and enables specialization per arm (see the sketch after this list).
  • Direction-Aware Decoder (DAD): For the action estimation network, we choose DINOv2 with registers (DINOv2-Reg) as the visual encoder and use three core components to meet the high-precision requirements of action prediction: (1) Multi-Scale Dilated Convolutions, (2) Deformable Convolutions, and (3) Angle-Sensitive Pooling.
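
As a rough illustration of Arm-Decoupled Estimation (referenced in the second bullet above), the sketch below splits the observation into per-arm regions and regresses each arm's joints with a dedicated sub-network. The left/right image-half splitting rule, head sizes, and 6-DoF-plus-gripper layout are assumptions, not the released architecture.

```python
# Minimal sketch of Arm-Decoupled Estimation: each arm's joint positions are
# regressed by its own sub-network from an arm-specific region of the image.
import torch
import torch.nn as nn


def make_branch(out_dim: int) -> nn.Module:
    """Small CNN regressor used as a per-arm (or per-gripper) sub-network."""
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, out_dim),
    )


class ArmDecoupledEstimator(nn.Module):
    def __init__(self, joints_per_arm: int = 6):
        super().__init__()
        self.left_arm = make_branch(joints_per_arm)
        self.right_arm = make_branch(joints_per_arm)
        self.left_gripper = make_branch(1)   # specialized gripper heads
        self.right_gripper = make_branch(1)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Crude per-arm isolation: left half of the image for one arm, right
        # half for the other (a stand-in for the paper's splitting step).
        w = obs.shape[-1] // 2
        left, right = obs[..., :w], obs[..., w:]
        return torch.cat([
            self.left_arm(left), self.left_gripper(left),
            self.right_arm(right), self.right_gripper(right),
        ], dim=-1)  # (B, 14): 6 joints + 1 gripper value per arm


est = ArmDecoupledEstimator()
print(est(torch.rand(2, 3, 224, 224)).shape)  # torch.Size([2, 14])
```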

BibTeX

If you find our work helpful, please cite us:

@misc{feng2025vidarembodiedvideodiffusion,
    title={Vidar: Embodied Video Diffusion Model for Generalist Bimanual Manipulation},
    author={Yao Feng and Hengkai Tan and Xinyi Mao and Guodong Liu and Shuhe Huang and Chendong Xiang and Hang Su and Jun Zhu},
    year={2025},
    eprint={2507.12898},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2507.12898}, 
}
@misc{tan2025anyposautomatedtaskagnosticactions,
    title={AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation}, 
    author={Hengkai Tan and Yao Feng and Xinyi Mao and Shuhe Huang and Guodong Liu and Zhongkai Hao and Hang Su and Jun Zhu},
    year={2025},
    eprint={2507.12768},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2507.12768}, 
}
Thank you!