Overview

  • Vision-language-action (VLA) models have shown promise on task-conditioned control in complex settings such as bimanual manipulation.
  • However, the heavy reliance on task-specific human demonstrations limits their generalization and incurs high data acquisition costs.
  • We introduce a task-agnostic action paradigm that decouples action execution from task-specific conditioning, effectively overcoming these limitations.
  • To address the data collection challenges posed by this paradigm, we introduce ATARA (Automated Task-Agnostic Random Actions), a novel data collection framework that automatically and efficiently generates large-scale task-agnostic action data for bimanual manipulation.
  • By using a video generation model for future observation prediction and a downstream inverse dynamics model (IDM) for action regression, we achieve exceptional generalization and remarkable data efficiency.
  • Vidar (Video Diffusion for Action Reasoning) is a two-stage framework that leverages a large-scale, diffusion-based video pre-training model and a novel Masked Inverse Dynamics Model (MIDM) for action prediction, and it generalizes to an unseen robot platform with only 20 minutes of human demonstrations (1/81 of the demonstrations required by RDT, 1/1200 of those required by π0.5).
  • AnyPos is an inverse dynamics model equipped with Arm-Decoupled Estimation and a Direction-Aware Decoder (DAD) that learns from ATARA-generated task-agnostic data. Together, ATARA and AnyPos constitute a fully task-agnostic framework for training IDMs without goal supervision.
  • Our experiments demonstrate that our Vidar framework can generalize to unseen tasks and backgrounds with strong semantic understanding, surpassing state-of-the-art methods like VPP and UniPi by over 40%.
  • We additionally integrate a video-conditioned action validation module to verify the feasibility of learned policies across diverse manipulation tasks. With replay-based video validation, the AnyPos-ATARA pipeline yields a 51% improvement in test accuracy and 30-40% higher success rates in downstream tasks such as lifting, pick-and-place, and clicking.

Vidar: Video Diffusion for Action Reasoning

Vidar Architecture

Figure: Overview of Vidar.

Vidar includes a video diffusion foundation model for video prediction and a masked inverse dynamics model (MIDM) for action regression. The video diffusion model is trained on 750K multi-view bimanual videos, with test-time scaling (TTS) applied at inference. It can adapt to new robot platforms with only 20 minutes of demonstrations while achieving state-of-the-art performance, and it generalizes to unseen tasks with strong semantic understanding.
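
As a rough illustration of the second stage, the sketch below shows one way a masked inverse dynamics model could regress position-controlled actions from a single frame, using an implicitly predicted soft mask to focus on task-relevant regions (a minimal PyTorch sketch under our own assumptions, e.g., a 14-dimensional bimanual action space and a placeholder backbone; it is not the exact Vidar architecture):

import torch
import torch.nn as nn

class MaskedIDM(nn.Module):
    """Sketch of a masked inverse dynamics model: an implicit soft mask over
    visual features steers action regression toward task-relevant regions."""

    def __init__(self, feat_dim=256, action_dim=14):
        super().__init__()
        # Placeholder visual backbone (the real model uses a stronger encoder).
        self.encoder = nn.Sequential(nn.Conv2d(3, feat_dim, kernel_size=8, stride=8), nn.GELU())
        self.mask_head = nn.Conv2d(feat_dim, 1, kernel_size=1)  # implicit mask logits
        self.action_head = nn.Linear(feat_dim, action_dim)      # joint-position regression

    def forward(self, frame):
        feats = self.encoder(frame)                  # (B, C, h, w)
        mask = torch.sigmoid(self.mask_head(feats))  # (B, 1, h, w) soft mask
        # Mask-weighted spatial pooling keeps only the masked-in features.
        pooled = (feats * mask).flatten(2).sum(-1) / mask.flatten(2).sum(-1).clamp(min=1e-6)
        return self.action_head(pooled)              # predicted action for this frame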

Vidar Methods

Figure: Methods of Vidar.

Vidar Examples

Figure: Examples of Vidar.

Key Techniques

  • Video Generation Model: Rectified flow models with internet-video pre-training, embodied pre-training, and fine-tuning on a unified observation space.
  • Masked Inverse Dynamics Model (MIDM): Inverse dynamics models often generalize poorly because of background noise, texture biases, and visual distractions in high-dimensional observations; MIDM focuses on task-relevant regions of the input frame via implicit mask prediction.
  • Test-Time Scaling (TTS): We generate K candidate video trajectories using different random seeds, rank them with a pretrained evaluator (e.g., CLIP or a vision-language model), and select the highest-scoring one (see the sketch below).
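
A minimal sketch of this selection loop, with hypothetical video_model.sample and scorer.score interfaces (the scorer could be CLIP-style image-text similarity averaged over frames, or a vision-language model judging task progress):

import torch

@torch.no_grad()
def test_time_scaling(video_model, scorer, obs_history, instruction, k=8):
    # Sample K candidate future videos with different random seeds.
    candidates = [video_model.sample(obs_history, instruction, seed=i) for i in range(k)]
    # Score each candidate against the language instruction and keep the best one.
    scores = [scorer.score(video, instruction) for video in candidates]
    best = max(range(k), key=lambda i: scores[i])
    return candidates[best]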

AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation

From Dream World Prediction to Real World Execution

Figure: From dream world prediction to real world execution.

Overview of AnyPos

Figure: Overview of AnyPos.

AnyPos is a robot-specific image-to-action model trained entirely on task-agnostic trajectories sampled by ATARA. It integrates two key techniques to enhance performance: Arm-Decoupled Estimation and Direction-Aware Decoder (DAD). Together, ATARA and AnyPos constitute a fully task-agnostic framework for training IDMs without goal supervision. By combining scalable unsupervised data collection with physically informed learning architectures, our approach demonstrates that task-agnostic action data can serve as a practical and powerful foundation for generalizable manipulation.

Our AnyPos method achieves 57.13% action prediction accuracy on a test set that includes unseen skills and objects, surpassing previous approaches (the naïve ResNet+MLP used in UniPi, UniSim, RoboDreamer, and SuSIE) by 51%. In real-world robot replay tests, AnyPos-ATARA achieves a 92.59% task success rate (as shown in the figure below), a 33% improvement over the human-collected dataset and a 44% improvement over previous approaches (as shown in the Figure of "Overview of AnyPos").

Results of AnyPos-ATARA with Video Replay

Figure: The results of AnyPos-ATARA with video replay to accomplish various manipulation tasks.

ATARA: Automated data collection on target robots! (4x speed)

Task-Agnostic Action

Standard VLA models learn temporally extended policies $p_{\theta}(\bm{a}_{T+1:T+t} \mid \bm{x}_{T-H+1:T}, \mathbf{l})$, where $\theta$ denotes the parameters of the VLA policy, $T$ is the current timestep, and $H$ is the history window size, mapping observation histories and language commands to action sequences. Given an expert dataset $D_{\text{expert}}$, the training objective of VLAs is to maximize the likelihood:

The objective of VLAs

Figure 3: The objective of VLAs.
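
Written out in the notation above, the likelihood objective shown in Figure 3 should take roughly the following form (our reconstruction of the displayed equation):

\max_{\theta} \; \mathbb{E}_{(\bm{x}_{T-H+1:T},\, \mathbf{l},\, \bm{a}_{T+1:T+t}) \sim D_{\text{expert}}} \Big[ \log p_{\theta}\big(\bm{a}_{T+1:T+t} \mid \bm{x}_{T-H+1:T}, \mathbf{l}\big) \Big]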

Here, all actions are task-dependent, i.e., Task-Specific Actions. The vastness of task language instructions and action spaces creates an enormous demand for action data in Vision-Language-Action (VLA) models.
In scenarios where robotic actions are position-controlled, the above formulation can be decomposed into a "future video prediction problem" and an "action execution problem." This enables the decoupling of the action modality from the embodied foundation model, shifting the requirement for high generalizability to the data-rich vision-language modality. Through our derivation, we propose the concept of Task-Agnostic Action, which significantly simplifies the learning of the action modality:
Decomposing task-specific action.

Figure 4: Decomposing task-specific action.
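
One way to write the decomposition suggested by Figure 4 (the parameter names $\phi$ and $\psi$ and the exact factorization are our notation for illustration, not necessarily the paper's):

p_{\theta}\big(\bm{a}_{T+1:T+t} \mid \bm{x}_{T-H+1:T}, \mathbf{l}\big)
\;\approx\; \int p_{\psi}\big(\bm{a}_{T+1:T+t} \mid \bm{x}_{T+1:T+t}\big)\,
p_{\phi}\big(\bm{x}_{T+1:T+t} \mid \bm{x}_{T-H+1:T}, \mathbf{l}\big)\,
\mathrm{d}\bm{x}_{T+1:T+t},
\qquad
p_{\psi}\big(\bm{a}_{T+1:T+t} \mid \bm{x}_{T+1:T+t}\big)
\;=\; \prod_{i=T+1}^{T+t} p_{\psi}\big(\bm{a}_{i} \mid \bm{x}_{i}\big),

where $p_{\phi}$ is the task-conditioned video generation model and $p_{\psi}$ is the task-agnostic inverse dynamics model (the Task-Agnostic Action prior).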

Benefits of task-agnostic action paradigm

  1. Data Efficiency and Reusable Motor Skills: Task-agnostic training avoids costly task-specific demonstrations, enabling large-scale unsupervised data collection. The inverse dynamics model (IDM), which learns a universal action prior $p(\bm{a}_i \mid \bm{x}_i)$, acts as a shared motor-skill library for diverse tasks.
  2. Zero-Shot Task Generalization: The IDM models task-independent action priors, allowing generalization to new tasks by adapting only the video generation model (e.g., via language prompts) without IDM retraining.
  3. Decoupled Planning and Low-Level Control: High-level planning (e.g., "open the drawer") is handled by a video generation model, while the IDM executes the visual trajectories. This modular approach simplifies policy design by framing manipulation as a vision-space prediction problem (sketched below).
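
A minimal sketch of how the two modules compose at deployment time (all interfaces here — predict, predict_action, execute — are hypothetical placeholders, not the released code):

def rollout(video_model, idm, obs_history, instruction, robot):
    # High-level planning: the task-conditioned video model imagines future frames.
    future_frames = video_model.predict(obs_history, instruction)
    # Low-level control: the frozen, task-agnostic IDM maps each frame to an action.
    for frame in future_frames:
        action = idm.predict_action(frame)  # universal prior p(a_i | x_i)
        robot.execute(action)

# Generalizing to a new task only changes the instruction; the IDM is reused as-is:
# rollout(video_model, idm, obs, "open the drawer", robot)
# rollout(video_model, idm, obs, "stack the cups", robot)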

Key Techniques

Methods of AnyPos

Figure: Methods of AnyPos. We obtain a task-agnostic training dataset covering the entire cubic workspace of dual robotic arms using ATARA.

  • ATARA: As shown in the Figure of "Methods of AnyPos", naïve joint-space sampling often results in inefficient coverage of reachable states, redundant or degenerate motions (e.g., arms exiting the field of view), and frequent self-collisions. To address these limitations, we propose ATARA, a reinforcement learning framework that constructs a coverage-aware mapping from end-effector space to joint space. This enables efficient task-agnostic data generation that preserves inherently encoded robot embodiment information and broad behavioral coverage, serving as a reusable prior for downstream policy learning.
  • Arm-Decoupled Estimation: We observe that when estimating left-arm joints, the model often attends to visual features of the right arm, and vice versa. To mitigate this, we isolate the input features per arm: we (1) split the left and right arms in the observation x, then (2) use two sub-networks to estimate joint positions for each arm independently; gripper poses are estimated by specialized networks. This decoupling reduces the visual hypothesis space, improves estimation accuracy, and enables per-arm specialization (a simplified sketch follows this list).
  • Direction-Aware Decoder (DAD): For the action estimation network, we choose DINOv2 with registers (DINOv2-Reg) as the visual encoder and use three core components to meet the high-precision requirements of action prediction: (1) Multi-Scale Dilated Convolutions, (2) Deformable Convolutions, and (3) Angle-Sensitive Pooling.
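
The sketch below illustrates the arm-decoupling idea in PyTorch (a simplified illustration under our own assumptions — a crude left/right image split, a placeholder encoder standing in for DINOv2-Reg, and 6 joints plus one gripper value per arm — not the exact AnyPos architecture, and without the DAD components):

import torch
import torch.nn as nn

class ArmDecoupledEstimator(nn.Module):
    """Sketch of arm-decoupled estimation: separate sub-networks regress each
    arm's joint positions from per-arm features, with dedicated gripper heads."""

    def __init__(self, feat_dim=768, joints_per_arm=6):
        super().__init__()
        # Placeholder patch encoder standing in for DINOv2-Reg.
        self.encoder = nn.Conv2d(3, feat_dim, kernel_size=14, stride=14)

        def head(out_dim):
            return nn.Sequential(nn.Linear(feat_dim, 256), nn.GELU(), nn.Linear(256, out_dim))

        self.left_head, self.right_head = head(joints_per_arm), head(joints_per_arm)
        self.left_grip, self.right_grip = head(1), head(1)

    def forward(self, image):
        # Crude spatial split into left/right halves; the real method isolates
        # per-arm features more carefully before feeding each sub-network.
        left_img, right_img = image.chunk(2, dim=-1)
        f_l = self.encoder(left_img).flatten(2).mean(-1)   # (B, feat_dim)
        f_r = self.encoder(right_img).flatten(2).mean(-1)  # (B, feat_dim)
        return torch.cat([self.left_head(f_l), self.left_grip(f_l),
                          self.right_head(f_r), self.right_grip(f_r)], dim=-1)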

BibTeX

If you find our work helpful, please cite us:

@misc{feng2025vidarembodiedvideodiffusion,
    title={Vidar: Embodied Video Diffusion Model for Generalist Bimanual Manipulation},
    author={Yao Feng and Hengkai Tan and Xinyi Mao and Guodong Liu and Shuhe Huang and Chendong Xiang and Hang Su and Jun Zhu},
    year={2025},
    eprint={2507.12898},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2507.12898}, 
}
@misc{tan2025anyposautomatedtaskagnosticactions,
    title={AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation}, 
    author={Hengkai Tan and Yao Feng and Xinyi Mao and Shuhe Huang and Guodong Liu and Zhongkai Hao and Hang Su and Jun Zhu},
    year={2025},
    eprint={2507.12768},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2507.12768}, 
}
Thank you!