URDF-Anything+ : Autoregressive Articulated 3D Models Generation for Physical Simulation

1Peking University 2University of Oxford 3University of Washington

Abstract

Articulated objects are fundamental to robotics, physics simulation, and interactive virtual environments. However, reconstructing them from visual input remains challenging, as it requires jointly inferring both part geometry and kinematic structure.

We present URDF-Anything+, an end-to-end autoregressive framework that directly generates executable articulated object models from visual observations. Given an image and object-level 3D cues, our method sequentially produces part geometries and their associated joint parameters, yielding complete URDF models without relying on multi-stage pipelines. Generation proceeds until the model determines that all parts have been produced, automatically inferring the complete geometry and kinematics.

Building on this capability, we enable a new Real-Follow-Sim paradigm, where high-fidelity digital twins constructed from visual observations allow policies trained and tested purely in simulation to transfer to real robots without online adaptation.

Experiments on large-scale articulated object benchmarks and real-world robotic tasks demonstrate that URDF-Anything+ outperforms prior methods in geometric reconstruction quality, joint parameter accuracy, and physical executability.

Method

URDF-Anything+ is an end-to-end autoregressive framework that generates executable articulated object models from a single RGB image. We first extract global visual features to reconstruct a holistic 3D representation, then sequentially generate parts and joint parameters conditioned on this context.

At each step, a diffusion transformer produces a shared latent that encodes both geometry and articulation cues; new parts are merged and re-encoded to keep the context consistent. The process stops when an end token is predicted, yielding a complete URDF model ready for standard physics simulators.
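The generation loop described above can be sketched as follows. This is a toy illustration: `ToyGenerator`, `Part`, and the end-token handling are hypothetical stand-ins for the actual diffusion-transformer interface, which is not specified here.

```python
# Toy sketch of the autoregressive part-by-part generation loop.
# The model interface (ToyGenerator, Part, END) is a hypothetical stand-in;
# the real system uses a diffusion transformer over shared latents.
from dataclasses import dataclass, field

END = None  # stand-in for the learned end-of-sequence token

@dataclass
class Part:
    name: str
    joint_type: str      # e.g. "revolute", "prismatic", "fixed"
    joint_params: dict   # axis, origin, limits, ...

@dataclass
class ToyGenerator:
    """Emits a fixed list of parts, then the end token."""
    script: list = field(default_factory=list)
    step: int = 0

    def next_part(self, context):
        # A real model would condition on visual features plus the
        # re-encoded geometry of all parts generated so far (context).
        if self.step >= len(self.script):
            return END
        part = self.script[self.step]
        self.step += 1
        return part

def generate_urdf_parts(model):
    context = []  # parts generated so far; re-encoded each step in the real system
    while True:
        part = model.next_part(context)
        if part is END:        # stop when the end token is predicted
            break
        context.append(part)   # merge the new part into the context
    return context

model = ToyGenerator(script=[
    Part("base", "fixed", {}),
    Part("door", "revolute", {"axis": (0, 0, 1), "limit": (0.0, 1.57)}),
])
parts = generate_urdf_parts(model)
print([p.name for p in parts])  # -> ['base', 'door']
```

The essential property is that the stopping condition is predicted by the model itself rather than fixed in advance, so objects with different part counts are handled uniformly.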

Reconstructions

From the Test Dataset

Reconstruction results on held-out test objects (IDs 1029, 1974, 4094, 11304, 10279, 12345, 1496, 1564, 4583, 7139, 7282, 10980), each shown alongside its articulation animation.

In-the-Wild

In-the-wild reconstructions (display, faucet, laptop, microwave), each shown alongside its articulation animation.

Real-Follow-Sim Experiments

We propose Real-Follow-Sim, a paradigm that shifts the primary challenge from policy generalization to simulation fidelity. We construct a dynamic digital twin of the real environment by using URDF-Anything+ to stream real-world visual observations into the simulator, continuously aligning the virtual scene with its physical counterpart in both geometry and appearance. Crucially, the policy is trained and executed exclusively in simulation, operating solely on synthetic data. The real robot acts merely as a faithful "follower", executing the actions generated by the simulated agent without any online adaptation.

Overall, our framework consists of three stages: (1) Digital Twin Asset Construction; (2) Policy Learning in Simulation; (3) Real-World Execution via Simulated Trajectories.
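The three stages can be sketched as a single loop; every function below is a hypothetical placeholder for the corresponding stage, not the actual implementation.

```python
# Toy sketch of the Real-Follow-Sim loop: the policy acts only in the
# simulated digital twin; the real robot replays the resulting actions.
# All functions are hypothetical stand-ins for the three stages.

def build_digital_twin(observation):
    """Stage 1: URDF-Anything+ turns a visual observation into sim assets."""
    return {"state": observation, "assets": ["door.urdf"]}

def policy(twin):
    """Stage 2: a policy trained purely in simulation picks an action."""
    return {"action": "pull_handle", "target": twin["state"]}

def step_sim(twin, action):
    """Advance the twin so it tracks the action's predicted effect."""
    twin["state"] = action["action"]
    return twin

def execute_on_robot(action):
    """Stage 3: the real robot follows the simulated trajectory verbatim."""
    return f"executed:{action['action']}"

obs = "rgb_frame_0"
twin = build_digital_twin(obs)   # align the sim with the real scene
action = policy(twin)            # decide entirely in simulation
twin = step_sim(twin, action)    # roll the action out in the twin first
log = execute_on_robot(action)   # real robot acts as a faithful follower
print(log)  # -> executed:pull_handle
```

In the real system this loop repeats as new observations stream in, which is what keeps the twin aligned with the physical scene.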

We load the generated assets in both the Isaac Sim and SAPIEN simulators.
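For reference, a minimal URDF of the kind such simulators consume can be assembled with the Python standard library. The link and joint names below ("base", "door", "door_hinge") and all numeric values are illustrative placeholders, not output of our model; real assets would additionally reference generated mesh files.

```python
# Minimal URDF with two links and one revolute joint, built with the
# standard library. Names and values are illustrative placeholders.
import xml.etree.ElementTree as ET

robot = ET.Element("robot", name="cabinet")
for link_name in ("base", "door"):
    ET.SubElement(robot, "link", name=link_name)

# A revolute joint: parent/child links, frame origin, rotation axis, limits.
joint = ET.SubElement(robot, "joint", name="door_hinge", type="revolute")
ET.SubElement(joint, "parent", link="base")
ET.SubElement(joint, "child", link="door")
ET.SubElement(joint, "origin", xyz="0.2 0 0", rpy="0 0 0")
ET.SubElement(joint, "axis", xyz="0 0 1")
ET.SubElement(joint, "limit", lower="0", upper="1.57", effort="10", velocity="1")

urdf = ET.tostring(robot, encoding="unicode")
print(urdf)
```

Written to disk, such a file can be loaded through each simulator's own URDF loader (e.g. a SAPIEN scene's `create_urdf_loader()`), which is how we bring the generated assets into simulation.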

Real-Follow-Sim experiments

BibTeX

If you find this work useful, please consider citing:

@misc{wu2026urdfanythingautoregressivearticulated3d,
      title={URDF-Anything+: Autoregressive Articulated 3D Models Generation for Physical Simulation},
      author={Zhuangzhe Wu and Yue Xin and Chengkai Hou and Minghao Chen and Yaoxu Lyu and Jieyu Zhang and Shanghang Zhang},
      year={2026},
      eprint={2603.14010},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2603.14010},
}