H-RDT: Human Manipulation Enhanced Bimanual Robotic Manipulation

Tsinghua University
Horizon Robotics

Overview

Figure 1: Overview of H-RDT. A human-to-robotics diffusion transformer with two-stage training.

  • Imitation learning for robotic manipulation faces a fundamental challenge: the scarcity of large-scale, high-quality robot demonstration data. Recent robotic foundation models often pre-train on cross-embodiment robot datasets to increase data scale, but they face a significant limitation: the diverse morphologies and action spaces of different robot embodiments make unified training challenging.
  • We present H-RDT (Human to Robotics Diffusion Transformer), a novel approach that leverages human manipulation data to enhance robot manipulation capabilities. Our key insight is that large-scale egocentric human manipulation videos with paired 3D hand pose annotations provide rich behavioral priors that capture natural manipulation strategies and can benefit robotic policy learning.
  • We introduce a two-stage training paradigm: (1) pre-training on large-scale egocentric human manipulation data, and (2) cross-embodiment fine-tuning on robot-specific data with modular action encoders and decoders. Built on a diffusion transformer architecture with 2B parameters, H-RDT uses flow matching to model complex action distributions (a minimal sketch of this objective follows the list).
  • Extensive evaluations spanning simulation and real-world experiments, single-task and multi-task scenarios, and few-shot learning and robustness assessments demonstrate that H-RDT outperforms both training from scratch and existing state-of-the-art methods, including π₀ and RDT, with improvements over training from scratch of 13.9% in simulation and 40.5% in real-world experiments.
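To make the flow-matching objective mentioned above concrete, below is a minimal PyTorch-style sketch of one training step. The policy call signature, tensor shapes, and conditioning are illustrative assumptions on our part, not the released H-RDT implementation.

    import torch
    import torch.nn.functional as F

    def flow_matching_loss(policy, actions, obs_cond):
        """One flow-matching training step (illustrative sketch, not H-RDT's code).

        actions:  (B, T, D) ground-truth action chunk
        obs_cond: conditioning features (e.g., vision/language embeddings)
        """
        noise = torch.randn_like(actions)  # x_0 ~ N(0, I)
        t = torch.rand(actions.shape[0], 1, 1, device=actions.device)
        # Linear path between noise and data; the target velocity is constant.
        x_t = (1.0 - t) * noise + t * actions
        target_velocity = actions - noise
        pred_velocity = policy(x_t, t.squeeze(), obs_cond)  # transformer predicts velocity
        return F.mse_loss(pred_velocity, target_velocity)

At inference time, an action chunk would then be generated by starting from Gaussian noise and integrating the predicted velocity field, e.g., with a few Euler steps.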

H-RDT: Human to Robotics Diffusion Transformer

Figure 2: H-RDT framework.

Key Techniques

  • Human Action Representation Design: H-RDT uses a unified 48-dimensional action space capturing bilateral wrist poses (3-D position + 6D rotation) and the 3-D positions of all five fingertips on each hand: (3 + 6 + 5 × 3) × 2 hands = 48 dimensions, compactly encoding the essential patterns of bimanual dexterous manipulation (see the packing sketch after this list).
  • Two-Stage Training Paradigm: Our approach consists of two main stages: (1) pre-training on large-scale human manipulation data with 48-dimensional hand pose representations using the complete EgoDex dataset (338K+ trajectories, 194 manipulation tasks, 829 hours), and (2) cross-embodiment fine-tuning with modular action encoders and decoders adapted to specific robot action spaces.
  • Diffusion Transformer Architecture: Built on a diffusion transformer with 2B parameters, H-RDT uses flow matching to model complex action distributions and achieve robust manipulation capabilities.
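The following is a minimal sketch of how such a 48-dimensional human action vector could be assembled, assuming 3-D wrist positions, a 6D rotation representation (the first two rotation-matrix columns), and 3-D fingertip positions per hand; the function names and field ordering are our own illustrative choices, not necessarily the paper's.

    import numpy as np

    # Per hand: wrist position (3) + wrist 6D rotation (6) + 5 fingertips x 3 (15) = 24.
    # Two hands -> 48 dimensions total.
    def pack_hand(wrist_pos, wrist_rot6d, fingertips):
        """Pack one hand into a 24-D sub-vector.

        wrist_pos:   (3,)   wrist position
        wrist_rot6d: (6,)   6D rotation representation
        fingertips:  (5, 3) fingertip positions (thumb..pinky)
        """
        return np.concatenate([wrist_pos, wrist_rot6d, fingertips.reshape(-1)])

    def pack_bimanual_action(left, right):
        """Concatenate left- and right-hand sub-vectors into the unified 48-D action."""
        action = np.concatenate([pack_hand(*left), pack_hand(*right)])
        assert action.shape == (48,)
        return action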

H-RDT Inference Video Samples

Towel Folding

H-RDT

H-RDT w/o Human

This task involves manipulating a deformable towel with two sequential folds; the first fold requires bimanual coordination to grasp the towel's bottom edges simultaneously.

Cup to Coaster Placement

H-RDT

H-RDT w/o Human

This task requires spatial reasoning to select the appropriate arm (left or right) based on the cup's position relative to the coaster.

Other Demos

Water Pouring

Plates Stacking

Pen Capping

Note: More demos coming soon

Real-world Task Definitions

Figure 4: Task definition of real-world experiments.

Our real-world experiments encompass diverse bimanual manipulation tasks across multiple robotic platforms:

  • Dual-Arm Piper: Towel folding and cup-to-coaster placement tasks requiring deformable object manipulation and spatial reasoning
  • Dual-Arm UR5: Takeout bag placement tasks with sequential bimanual coordination
  • Dual-Arm ARX5: Pick-and-place manipulation tasks including stacking bowls and object placement

These tasks validate H-RDT's ability to handle complex real-world scenarios with varying degrees of dexterity and coordination. Because each platform exposes a different native action space, fine-tuning relies on the modular action encoders and decoders described above; a minimal sketch follows.
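As a rough illustration of the modular encoders and decoders, the sketch below swaps in a small projection pair per embodiment while the transformer backbone stays shared. The class, the hidden width, and the per-robot action dimensions are hypothetical placeholders, not values from the paper.

    import torch.nn as nn

    class EmbodimentAdapter(nn.Module):
        """Per-embodiment action encoder/decoder pair (illustrative placeholder)."""

        def __init__(self, action_dim: int, hidden_dim: int = 2048):
            super().__init__()
            self.encode = nn.Linear(action_dim, hidden_dim)  # robot action -> backbone token
            self.decode = nn.Linear(hidden_dim, action_dim)  # backbone token -> robot action

    # One adapter per action space; the robot dims here are guesses for dual-arm setups.
    adapters = {
        "human":      EmbodimentAdapter(action_dim=48),  # 48-D hand representation
        "piper_dual": EmbodimentAdapter(action_dim=14),  # e.g., 2 x (6 joints + gripper)
        "ur5_dual":   EmbodimentAdapter(action_dim=14),
        "arx5_dual":  EmbodimentAdapter(action_dim=14),
    }

Under this sketch, switching embodiments would mean re-initializing only the matching adapter pair, with the pre-trained backbone weights carried over from the human-data stage.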

BibTeX

If you find our work helpful, please cite us:

@misc{bi2025hrdt,
    title={H-RDT: Human Manipulation Enhanced Bimanual Robotic Manipulation}, 
    author={Hongzhe Bi and Lingxuan Wu and Tianwei Lin and Hengkai Tan and Zhizhong Su and Hang Su and Jun Zhu},
    year={2025},
    eprint={2507.23523},
    archivePrefix={arXiv},
    primaryClass={cs.RO},
    url={https://embodiedfoundation.github.io/hrdt}, 
}

Thank you!