Action-guided generation of 3D functionality segmentation data

arXiv:2511.23230v2

Abstract: 3D functionality segmentation aims to identify the interactive element in a 3D scene required to perform an action described in free-form language (e.g., the handle to "Open the second drawer of the cabinet near the bed"). Progress has been constrained by the scarcity of annotated real-world data, as collecting and labeling fine-grained 3D masks is prohibitively expensive. To address this limitation, we introduce SynthFun3D, the first method for generating 3D functionality segmentation data directly from action descriptions. Given an action description, SynthFun3D constructs a plausible 3D scene by retrieving objects with part-level annotations from a large-scale asset repository and arranging them under spatial and semantic constraints. It then renders multi-view images and automatically identifies the target functional element, producing precise ground-truth masks without manual annotation. We demonstrate the effectiveness of the generated data by training a VLM-based 3D functionality segmentation model. Augmenting real-world data with our synthetic data consistently improves performance, with gains of +2.2 mAP, +6.3 mAR, and +5.7 mIoU over real-only training. This shows that action-guided synthetic data generation provides a scalable and effective complement to manual annotation for 3D functionality understanding. Project page: tev-fbk.github.io/synthfun3d.

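The abstract describes the generation pipeline only at a high level (parse the action, retrieve part-annotated assets, arrange a scene under constraints, render, and recover the target part's mask). The sketch below is a minimal, hypothetical illustration of that loop under assumed data structures; all class and function names (Part, Asset, parse_action, retrieve_assets, arrange_scene, generate_sample) are invented for illustration and do not come from the paper or any released code. Rendering and the actual spatial/semantic constraint solving are omitted; the key point shown is that the ground-truth functional element is known by construction, so no manual mask annotation is needed.

```python
import random
from dataclasses import dataclass, field

# Hypothetical, simplified stand-ins for part-annotated 3D assets.
@dataclass
class Part:
    name: str      # e.g., "handle"
    mask_id: int   # id that a renderer could turn into a per-part pixel mask

@dataclass
class Asset:
    category: str                              # e.g., "cabinet"
    parts: list[Part] = field(default_factory=list)

def parse_action(description: str) -> tuple[str, str]:
    """Toy stand-in for language parsing: map the action to (object category, functional part).
    A real system would use an LLM/VLM or a grammar; here we keyword-match."""
    desc = description.lower()
    category = "cabinet" if "cabinet" in desc else "object"
    part = "handle" if ("drawer" in desc or "open" in desc) else "unknown"
    return category, part

def retrieve_assets(repository: list[Asset], category: str) -> list[Asset]:
    """Retrieve part-annotated assets whose category matches the action."""
    return [a for a in repository if a.category == category]

def arrange_scene(target: Asset, distractors: list[Asset]) -> list[Asset]:
    """Place the target among distractors; real collision/spatial constraints
    (e.g., 'near the bed') are omitted in this sketch."""
    layout = [target] + distractors
    random.shuffle(layout)
    return layout

def generate_sample(description: str, repository: list[Asset]) -> dict:
    """Produce one synthetic training sample: scene layout plus the mask id
    of the target functional element, known by construction."""
    category, part_name = parse_action(description)
    target = retrieve_assets(repository, category)[0]
    scene = arrange_scene(target, [a for a in repository if a is not target])
    gt_part = next(p for p in target.parts if p.name == part_name)
    return {
        "description": description,
        "scene": [a.category for a in scene],
        "target_part_mask_id": gt_part.mask_id,
    }

if __name__ == "__main__":
    repo = [
        Asset("cabinet", [Part("handle", 7), Part("door", 8)]),
        Asset("bed", [Part("frame", 3)]),
    ]
    print(generate_sample("Open the second drawer of the cabinet near the bed", repo))
```

In an actual pipeline the returned mask id would be resolved into pixel-accurate ground-truth masks by rendering the arranged scene from multiple viewpoints, which is the step that replaces manual annotation.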