Articulated objects are fundamental to robotics, physics simulation, and interactive virtual environments. However, reconstructing them from visual input remains challenging, as it requires jointly inferring both part geometry and kinematic structure.
We present URDF-Anything+, an end-to-end autoregressive framework that directly generates executable articulated object models from visual observations. Given image and object-level 3D cues, our method sequentially produces part geometries and their associated joint parameters, yielding complete URDF models without relying on multi-stage pipelines. Generation proceeds until the model determines that all parts have been produced, automatically inferring the complete geometry and kinematics.
Building on this capability, we enable a new Real-Follow-Sim paradigm, where high-fidelity digital twins constructed from visual observations allow policies trained and tested purely in simulation to transfer to real robots without online adaptation.
Experiments on large-scale articulated object benchmarks and real-world robotic tasks demonstrate that URDF-Anything+ outperforms prior methods in geometric reconstruction quality, joint parameter accuracy, and physical executability.
URDF-Anything+ is an end-to-end autoregressive framework that generates executable articulated object models from a single RGB image. We first extract global visual features to reconstruct a holistic 3D representation, then sequentially generate parts and joint parameters conditioned on this context.
At each step, a diffusion transformer produces a shared latent that encodes both geometry and articulation cues; new parts are merged and re-encoded to keep the context consistent. The process stops when an end token is predicted, yielding a complete URDF model ready for standard physics simulators.
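The part-by-part generation loop described above can be sketched as follows. This is an illustrative stand-in, not the authors' actual interface: `PartGenerator` substitutes for the diffusion transformer, and the context "re-encoding" is stubbed as a simple append.

```python
# Hypothetical sketch of the autoregressive generation loop; all names
# (PartGenerator, next_part, END) are illustrative assumptions.
from dataclasses import dataclass, field

END = "<end>"  # sentinel token that terminates generation

@dataclass
class Part:
    name: str
    joint_type: str                       # e.g. "revolute", "prismatic", "fixed"
    joint_params: dict = field(default_factory=dict)

class PartGenerator:
    """Stub model: emits a fixed sequence of parts, then the end token."""
    def __init__(self, parts):
        self._queue = list(parts)

    def next_part(self, context):
        # A real model would condition on the re-encoded context here.
        return self._queue.pop(0) if self._queue else END

def generate_urdf_parts(model):
    context, parts = [], []
    while True:
        out = model.next_part(context)
        if out == END:                    # model decides all parts are produced
            break
        parts.append(out)
        context.append(out)               # merge + re-encode (stubbed as append)
    return parts
```

The key property this sketch captures is that termination is decided by the model itself via the end token, rather than by a fixed part count.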
We propose Real-Follow-Sim, a paradigm that shifts the primary challenge from policy generalization to simulation fidelity. We construct a dynamic digital twin of the real environment by leveraging URDF-Anything+ to stream real-world visual observations into the simulator, thereby continuously aligning the virtual scene with its physical counterpart in both geometry and appearance. Crucially, the policy is trained and executed exclusively in simulation, operating solely on synthetic data. The real robot acts merely as a faithful “follower”, executing the actions generated by the simulated agent without any online adaptation.
Overall, our framework consists of three stages: (1) Digital Twin Asset Construction; (2) Policy Learning in Simulation; (3) Real-World Execution via Simulated Trajectories.
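The follower relationship between the simulated agent and the real robot can be sketched as a simple control loop. All interfaces here (`SimEnv`, `RealRobot`, the policy callable) are toy stand-ins we invent for illustration, not the paper's code.

```python
# Illustrative Real-Follow-Sim loop: the policy acts only in simulation,
# and the real robot replays the same actions without online adaptation.
class SimEnv:
    """Toy stand-in simulator: episode ends after a fixed number of steps."""
    def __init__(self, steps_to_done=3):
        self._left = steps_to_done
    def reset(self):
        return 0.0                         # synthetic observation
    def step(self, action):
        self._left -= 1
        return action, self._left <= 0     # (next obs, done)

class RealRobot:
    """Toy follower: records every action it replays on hardware."""
    def __init__(self):
        self.executed = []
    def execute(self, action):
        self.executed.append(action)

def real_follow_sim(sim_env, real_robot, policy, horizon=100):
    obs = sim_env.reset()                  # policy sees only simulated observations
    for _ in range(horizon):
        action = policy(obs)               # decided entirely in simulation
        obs, done = sim_env.step(action)   # advance the digital twin
        real_robot.execute(action)         # real robot merely "follows"
        if done:
            break
    return real_robot.executed
```

Because the real robot never feeds observations back into the policy, transfer quality rests entirely on how faithfully the digital twin tracks the physical scene, which is exactly the fidelity burden the paradigm names.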
We use both the Isaac Sim and SAPIEN simulators to load the generated assets.
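Before handing a generated URDF to either simulator, it can be useful to sanity-check it with the standard library. The check below is our own illustrative addition, not part of the SAPIEN or Isaac Sim APIs; it verifies that every joint in the URDF references links that actually exist.

```python
# Minimal stdlib sketch (our assumption, not simulator API): validate a
# generated URDF's kinematic structure before loading it.
import xml.etree.ElementTree as ET

def check_urdf(urdf_text):
    """Return True iff every joint's parent and child link are defined."""
    root = ET.fromstring(urdf_text)
    links = {link.get("name") for link in root.findall("link")}
    for joint in root.findall("joint"):
        parent = joint.find("parent").get("link")
        child = joint.find("child").get("link")
        if parent not in links or child not in links:
            return False                  # dangling joint reference
    return True
```

A URDF that passes this check is well-formed at the kinematic-graph level, which is the property "physical executability" depends on first.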
If you find this work useful, please consider citing:
@misc{wu2026urdfanythingautoregressivearticulated3d,
  title={URDF-Anything+: Autoregressive Articulated 3D Models Generation for Physical Simulation},
  author={Zhuangzhe Wu and Yue Xin and Chengkai Hou and Minghao Chen and Yaoxu Lyu and Jieyu Zhang and Shanghang Zhang},
  year={2026},
  eprint={2603.14010},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2603.14010},
}