Interactive 3D simulated objects are crucial in AR/VR, animations, and robotics, serving as the foundational elements that drive immersive experiences and advanced automation. However, creating these interactable objects (i.e., articulation) requires extensive human effort and expertise, limiting their broader applications. To overcome this challenge, we present Articulate-Anything, a system that automates the articulation of diverse, complex objects from different input modalities, including text, images, and videos. Articulate-Anything leverages Vision-Language Models (VLMs) to generate Python programs compilable into articulated Unified Robot Description Format (URDF) files. Our system consists of a mesh retrieval mechanism and dual actor-critic closed-loop systems. The agentic system is capable of automatically generating articulation, inspecting simulated predictions against ground-truths, and self-correcting errors. Qualitative evaluations demonstrate Articulate-Anything's capability to articulate complex and even ambiguous object affordances by leveraging rich grounded inputs. In extensive quantitative experiments on the standard PartNet-Mobility dataset, Articulate-Anything substantially outperforms prior work in automatic articulation, increasing the success rate from 8.7-12.2% to 75%, setting a new bar for state-of-art performance. Lastly, to showcase an application, we train multiple robotic policies using generated assets, demonstrating the utility of articulation for finer-grained manipulation beyond pick and place.
Articulate-Anything is a state-of-the-art method for articulating
diverse
in-the-wild objects from diverse input modalities, including
text, images, and videos. We can create high-quality digital twins in simulation that can be
used to
train robotic skills among
other applications in VR/AR and animation.
Listen to a podcast-style audio description of our work (generated using NotebookLM)!
Articulate-Anything uses a series of specialized VLM systems to automatically generate digital twins from any human-API inputs: text, images or videos.
In this demo, we visualize the articulation results for in-the-wild videos. These videos were
casually
captured on an iPhone (e.g., titled angles) in cluttered environments, showing Articulate-Anything's ability to handle
diverse, complex, and realistic object affordances.
Note that Articulate-Anything can resolve ambiguous affordance by
leveraging grounded
video inputs. For example, double-hung windows can potentially either slide or tip to open. This
affordance is not clear
from static image. When Articulate-Anything is shown a demonstration video, it
can accurately produce
the desired model in simulation.
Below, we show the Python output generated by our system and the corresponding digital twin rendered in Sapien simulator.
Articulate-anything response shown within code block.
Without articulation, objects can only afford trivial interaction such as pick and place. We show that articulate Articulate-Anything's output can be used to create digital twins in simulation for robotic training of finer-grained manipulation skills, from closing a toilet lid to closing a cabinet drawer. These skills can be then transferred to a real-world robotic system in a zeroshot manner. Here, we visualize the policies trained using PPO in a simulator using Robosuite library for 2 million episodes over 3 random seeds per task.
We execute the best trained policies on a Franka Emika Panda robot arm and show the results below. Note that all training is done in parallel using GPUs in the simulation.
Currently, Articulate-Anything employs a mesh retrieval mechanism to exploit existing 3D datasets. Open repositories such as Objaverse contain more than 10 million static objects. Our system can potentially bring those objects to life via articulation. However, as stated, a promising future direction to enable the generation of even more customized assets is to leverage large-scale mesh generation models. Below, we show a few examples of meshes generated by Rodin using in-the-wild images. The static meshes are then articulated by Articulate-Anything.
Prior works such as URDFormer and Real2code rely on impoverished inputs such as cropped images or text
bounding box coordinates. Impoverished inputs also mean that their articulation had to be done on an open
loop.
On the standard PartNet-Mobility dataset, Articulate-Anything, leveraging
grounded video inputs and closed-loop actor-critic system,
substantially outperforms these prior works in automatic articulation, setting a new bar
in
state-of-the-art performance.
In an ablation experiment, we found that indeed richer and more grounded modalities such as videos enable higher articulation accuracy in our own system.
We also evaluate the effect of in-context examples on success rates for link placement and joint prediction tasks. The results show that providing more in-context examples significantly improves performance.
Finally, by leveraging a visual critic, Articulate-Anything can self-evaluate its predictions and self-improve over subsequent iterations.
@article{le2024articulate,
title={Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model},
author={Le, Long and Xie, Jason and Liang, William and Wang, Hung-Ju and Yang, Yue and Ma, Yecheng Jason and Vedder, Kyle and Krishna, Arjun and Jayaraman, Dinesh and Eaton, Eric},
journal={arXiv preprint arXiv:2410.13882},
year={2024}
}