Overview

  • Vision-language-action (VLA) models have shown promise on task-conditioned control in complex settings such as bimanual manipulation.
  • However, the heavy reliance on task-specific human demonstrations limits their generalization and incurs high data acquisition costs.
  • We introduce a task-agnostic action paradigm that decouples action execution from task-specific conditioning, effectively overcoming these limitations.
  • To address the data collection challenges posed by this paradigm, we introduce ATARA (Automated Task-Agnostic Random Actions), a novel data collection framework that automatically and efficiently generates large-scale task-agnostic action data for bimanual manipulation.
  • By using a video generation model for future observation prediction and a downstream inverse dynamics model (IDM) for action regression, we achieve exceptional generalization and remarkable data efficiency.
  • Vidar (Video Diffusion for Action Reasoning) is a two-stage framework that leverages a large-scale, diffusion-based video pre-training model and a novel Masked Inverse Dynamics Model (MIDM) for action prediction, and it generalizes to an unseen robot platform with only 20 minutes of human demonstrations (1/81 of the demonstrations required by RDT, 1/1200 of those required by π0.5).
  • AnyPos is an inverse dynamics model equipped with Arm-Decoupled Estimation and a Direction-Aware Decoder (DAD) that learns from ATARA-generated task-agnostic data. Together, ATARA and AnyPos constitute a fully task-agnostic framework for training IDMs without goal supervision.
  • Our experiments demonstrate that our Vidar framework can generalize to unseen tasks and backgrounds with strong semantic understanding, surpassing state-of-the-art methods like VPP and UniPi by over 40%.
  • We additionally integrate a video-conditioned action validation module to verify the feasibility of learned policies across diverse manipulation tasks. With replay-based video validation, the AnyPos-ATARA pipeline yields a 51% improvement in test accuracy and 30-40% higher success rates in downstream tasks such as lifting, pick-and-place, and clicking.

Vidar: Video Diffusion for Action Reasoning

Vidar Architecture

Figure: Overview of Vidar.

Vidar includes a video diffusion foundation model for video prediction and a masked inverse dynamics model (MIDM) for action regression. The video diffusion model is trained on 750K multi-view bimanual videos, with test-time scaling (TTS) applied at inference. It can adapt to new robot platforms with only 20 minutes of demonstrations while achieving state-of-the-art performance, and it generalizes to unseen tasks with strong semantic understanding.
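
As a rough illustration of the second stage, the sketch below shows one way a masked inverse dynamics model could regress position-controlled actions from a single frame, using an implicitly predicted soft mask to focus on task-relevant regions (a minimal PyTorch sketch under our own assumptions, e.g., a 14-dimensional bimanual action space and a placeholder backbone; it is not the exact Vidar architecture):

import torch
import torch.nn as nn

class MaskedIDM(nn.Module):
    """Sketch of a masked inverse dynamics model: an implicit soft mask over
    visual features steers action regression toward task-relevant regions."""

    def __init__(self, feat_dim=256, action_dim=14):
        super().__init__()
        # Placeholder visual backbone (the real model uses a stronger encoder).
        self.encoder = nn.Sequential(nn.Conv2d(3, feat_dim, kernel_size=8, stride=8), nn.GELU())
        self.mask_head = nn.Conv2d(feat_dim, 1, kernel_size=1)  # implicit mask logits
        self.action_head = nn.Linear(feat_dim, action_dim)      # joint-position regression

    def forward(self, frame):
        feats = self.encoder(frame)                  # (B, C, h, w)
        mask = torch.sigmoid(self.mask_head(feats))  # (B, 1, h, w) soft mask
        # Mask-weighted spatial pooling keeps only the masked-in features.
        pooled = (feats * mask).flatten(2).sum(-1) / mask.flatten(2).sum(-1).clamp(min=1e-6)
        return self.action_head(pooled)              # predicted action for this frame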

Vidar Methods

Figure: Methods of Vidar.

Vidar Examples

Figure: Examples of Vidar.

Key Techniques

  • Video Generation Model: Rectified flow models with internet-video pre-training, embodied pre-training, and fine-tuning on a unified observation space.
  • Masked Inverse Dynamics Model (MIDM): Inverse dynamics models often generalize poorly because of background noise, texture biases, and visual distractions in high-dimensional observations; MIDM focuses on task-relevant regions of the input frame via implicit mask prediction.
  • Test-Time Scaling (TTS): We generate K candidate video trajectories using different random seeds, rank them with a pretrained evaluator (e.g., CLIP or a vision-language model), and select the highest-scoring one (see the sketch below).
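
A minimal sketch of this selection loop, with hypothetical video_model.sample and scorer.score interfaces (the scorer could be CLIP-style image-text similarity averaged over frames, or a vision-language model judging task progress):

import torch

@torch.no_grad()
def test_time_scaling(video_model, scorer, obs_history, instruction, k=8):
    # Sample K candidate future videos with different random seeds.
    candidates = [video_model.sample(obs_history, instruction, seed=i) for i in range(k)]
    # Score each candidate against the language instruction and keep the best one.
    scores = [scorer.score(video, instruction) for video in candidates]
    best = max(range(k), key=lambda i: scores[i])
    return candidates[best]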

AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation

From Dream World Prediction to Real World Execution

Figure: From dream world prediction to real world execution.

Overview of AnyPos

Figure: Overview of AnyPos.

AnyPos is a robot-specific image-to-action model trained entirely on task-agnostic trajectories sampled by ATARA. It integrates two key techniques to enhance performance: Arm-Decoupled Estimation and Direction-Aware Decoder (DAD). Together, ATARA and AnyPos constitute a fully task-agnostic framework for training IDMs without goal supervision. By combining scalable unsupervised data collection with physically informed learning architectures, our approach demonstrates that task-agnostic action data can serve as a practical and powerful foundation for generalizable manipulation.

Our AnyPos method achieves 57.13% action prediction accuracy on a test set that includes unseen skills and objects, surpassing previous approaches (the naïve ResNet+MLP used in UniPi, UniSim, RoboDreamer, and SuSIE) by 51%. In real-world robot replay tests, AnyPos-ATARA achieves a 92.59% task success rate (as shown in the figure below), a 33% improvement over the human-collected dataset and a 44% improvement over previous approaches (as shown in the Figure of "Overview of AnyPos").

Results of AnyPos-ATARA with Video Replay

Figure: The results of AnyPos-ATARA with video replay to accomplish various manipulation tasks.

ATARA: Automated data collection on target robots! (4x speed)

Task-Agnostic Action

Standard VLA models learn temporally extended policies $p_{\theta}(\bm{a}_{T+1:T+t} \mid \bm{x}_{T-H+1:T}, \mathbf{l})$, where $\theta$ denotes the parameters of the VLA policy, $T$ is the current timestep, and $H$ is the history window size, mapping observation histories and language commands to action sequences. Given an expert dataset $D_{\text{expert}}$, the training objective of VLAs is to maximize the likelihood:

The objective of VLAs

Figure 3: The objective of VLAs.
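
Written out in the notation above, the likelihood objective shown in Figure 3 should take roughly the following form (our reconstruction of the displayed equation):

\max_{\theta} \; \mathbb{E}_{(\bm{x}_{T-H+1:T},\, \mathbf{l},\, \bm{a}_{T+1:T+t}) \sim D_{\text{expert}}} \Big[ \log p_{\theta}\big(\bm{a}_{T+1:T+t} \mid \bm{x}_{T-H+1:T}, \mathbf{l}\big) \Big]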

Here, all actions are task-dependent, i.e., Task-Specific Actions. The vastness of task language instructions and action spaces creates an enormous demand for action data in Vision-Language-Action (VLA) models.
In scenarios where robotic actions are position-controlled, the above formulation can be decomposed into a "future video prediction problem" and an "action execution problem." This enables the decoupling of the action modality from the embodied foundation model, shifting the requirement for high generalizability to the data-rich vision-language modality. Through our derivation, we propose the concept of Task-Agnostic Action, which significantly simplifies the learning of the action modality:
Decomposing task-specific action.

Figure 4: Decomposing task-specific action.
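
One way to write the decomposition suggested by Figure 4 (the parameter names $\phi$ and $\psi$ and the exact factorization are our notation for illustration, not necessarily the paper's):

p_{\theta}\big(\bm{a}_{T+1:T+t} \mid \bm{x}_{T-H+1:T}, \mathbf{l}\big)
\;\approx\; \int p_{\psi}\big(\bm{a}_{T+1:T+t} \mid \bm{x}_{T+1:T+t}\big)\,
p_{\phi}\big(\bm{x}_{T+1:T+t} \mid \bm{x}_{T-H+1:T}, \mathbf{l}\big)\,
\mathrm{d}\bm{x}_{T+1:T+t},
\qquad
p_{\psi}\big(\bm{a}_{T+1:T+t} \mid \bm{x}_{T+1:T+t}\big)
\;=\; \prod_{i=T+1}^{T+t} p_{\psi}\big(\bm{a}_{i} \mid \bm{x}_{i}\big),

where $p_{\phi}$ is the task-conditioned video generation model and $p_{\psi}$ is the task-agnostic inverse dynamics model (the Task-Agnostic Action prior).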

Benefits of task-agnostic action paradigm

  1. Data Efficiency and Reusable Motor Skills: Task-agnostic training avoids costly task-specific demonstrations, enabling large-scale unsupervised data collection. The inverse dynamics model (IDM), which learns a universal action prior $p(\bm{a}_i \mid \bm{x}_i)$, acts as a shared motor-skill library for diverse tasks.
  2. Zero-Shot Task Generalization: The IDM models task-independent action priors, allowing generalization to new tasks by adapting only the video generation model (e.g., via language prompts) without IDM retraining.
  3. Decoupled Planning and Low-Level Control: High-level planning (e.g., "open the drawer") is handled by a video generation model, while the IDM executes the visual trajectories. This modular approach simplifies policy design by framing manipulation as a vision-space prediction problem (sketched below).
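
A minimal sketch of how the two modules compose at deployment time (all interfaces here — predict, predict_action, execute — are hypothetical placeholders, not the released code):

def rollout(video_model, idm, obs_history, instruction, robot):
    # High-level planning: the task-conditioned video model imagines future frames.
    future_frames = video_model.predict(obs_history, instruction)
    # Low-level control: the frozen, task-agnostic IDM maps each frame to an action.
    for frame in future_frames:
        action = idm.predict_action(frame)  # universal prior p(a_i | x_i)
        robot.execute(action)

# Generalizing to a new task only changes the instruction; the IDM is reused as-is:
# rollout(video_model, idm, obs, "open the drawer", robot)
# rollout(video_model, idm, obs, "stack the cups", robot)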

Key Techniques

Methods of AnyPos

Figure: Methods of AnyPos. We obtain a task-agnostic training dataset covering the entire cubic workspace of dual robotic arms using ATARA.

  • ATARA: As shown in the Figure of "Methods of AnyPos", naïve joint-space sampling often results in inefficient coverage of reachable states, redundant or degenerate motions (e.g., arms exiting the field of view), and frequent self-collisions. To address these limitations, we propose ATARA, a reinforcement learning framework that constructs a coverage-aware mapping from end-effector space to joint space. This enables efficient task-agnostic data generation that preserves inherently encoded robot embodiment information and broad behavioral coverage, serving as a reusable prior for downstream policy learning.
  • Arm-Decoupled Estimation: We observe that when estimating left-arm joints, the model often attends to visual features of the right arm, and vice versa. To mitigate this, we isolate the input features per arm: we (1) split the left and right arms in the observation x, then (2) use two sub-networks to estimate joint positions for each arm independently; gripper poses are estimated by specialized networks. This decoupling reduces the visual hypothesis space, improves estimation accuracy, and enables per-arm specialization (a simplified sketch follows this list).
  • Direction-Aware Decoder (DAD): For the action estimation network, we choose DINOv2 with registers (DINOv2-Reg) as the visual encoder and use three core components to meet the high-precision requirements of action prediction: (1) Multi-Scale Dilated Convolutions, (2) Deformable Convolutions, and (3) Angle-Sensitive Pooling.
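
The sketch below illustrates the arm-decoupling idea in PyTorch (a simplified illustration under our own assumptions — a crude left/right image split, a placeholder encoder standing in for DINOv2-Reg, and 6 joints plus one gripper value per arm — not the exact AnyPos architecture, and without the DAD components):

import torch
import torch.nn as nn

class ArmDecoupledEstimator(nn.Module):
    """Sketch of arm-decoupled estimation: separate sub-networks regress each
    arm's joint positions from per-arm features, with dedicated gripper heads."""

    def __init__(self, feat_dim=768, joints_per_arm=6):
        super().__init__()
        # Placeholder patch encoder standing in for DINOv2-Reg.
        self.encoder = nn.Conv2d(3, feat_dim, kernel_size=14, stride=14)

        def head(out_dim):
            return nn.Sequential(nn.Linear(feat_dim, 256), nn.GELU(), nn.Linear(256, out_dim))

        self.left_head, self.right_head = head(joints_per_arm), head(joints_per_arm)
        self.left_grip, self.right_grip = head(1), head(1)

    def forward(self, image):
        # Crude spatial split into left/right halves; the real method isolates
        # per-arm features more carefully before feeding each sub-network.
        left_img, right_img = image.chunk(2, dim=-1)
        f_l = self.encoder(left_img).flatten(2).mean(-1)   # (B, feat_dim)
        f_r = self.encoder(right_img).flatten(2).mean(-1)  # (B, feat_dim)
        return torch.cat([self.left_head(f_l), self.left_grip(f_l),
                          self.right_head(f_r), self.right_grip(f_r)], dim=-1)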

BibTeX

If you find our work helpful, please cite us:

@misc{feng2025vidarembodiedvideodiffusion,
    title={Vidar: Embodied Video Diffusion Model for Generalist Bimanual Manipulation},
    author={Yao Feng and Hengkai Tan and Xinyi Mao and Guodong Liu and Shuhe Huang and Chendong Xiang and Hang Su and Jun Zhu},
    year={2025},
    eprint={2507.12898},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2507.12898}, 
}
@misc{tan2025anyposautomatedtaskagnosticactions,
    title={AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation}, 
    author={Hengkai Tan and Yao Feng and Xinyi Mao and Shuhe Huang and Guodong Liu and Zhongkai Hao and Hang Su and Jun Zhu},
    year={2025},
    eprint={2507.12768},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2507.12768}, 
}
Thank you!