

Part-X-MLLM: Part-aware 3D Multimodal Large Language Model

November 17, 2025
Authors: Chunshi Wang, Junliang Ye, Yunhan Yang, Yang Li, Zizhuo Lin, Jun Zhu, Zhuo Chen, Yawei Luo, Chunchao Guo
cs.AI

Abstract

We introduce Part-X-MLLM, a native 3D multimodal large language model that unifies diverse 3D tasks by formulating them as programs in a structured, executable grammar. Given an RGB point cloud and a natural language prompt, our model autoregressively generates a single, coherent token sequence encoding part-level bounding boxes, semantic descriptions, and edit commands. This structured output serves as a versatile interface to drive downstream geometry-aware modules for part-based generation and editing. By decoupling the symbolic planning from the geometric synthesis, our approach allows any compatible geometry engine to be controlled through a single, language-native frontend. We pre-train a dual-encoder architecture to disentangle structure from semantics and instruction-tune the model on a large-scale, part-centric dataset. Experiments demonstrate that our model excels at producing high-quality, structured plans, enabling state-of-the-art performance in grounded Q&A, compositional generation, and localized editing through one unified interface. Project page: https://chunshi.wang/Part-X-MLLM/
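To make the "structured, executable grammar" idea concrete, the sketch below parses a hypothetical serialization of such a plan into part records (bounding box + semantic description) and edit commands. The tag names (`<part>`, `<bbox>`, `<desc>`, `<edit>`) and the overall format are illustrative assumptions, not the actual token vocabulary used by Part-X-MLLM:

```python
import re

# Hypothetical serialized plan in the style the abstract describes: a single
# token sequence encoding part-level bounding boxes, semantic descriptions,
# and edit commands. The tag grammar here is an assumption for illustration.
PLAN = (
    '<part><bbox>0.1 0.0 0.1 0.2 0.5 0.2</bbox><desc>chair leg</desc></part>'
    '<part><bbox>0.0 0.5 0.0 0.6 0.6 0.6</bbox><desc>seat</desc></part>'
    '<edit><target>chair leg</target><op>scale 1.2</op></edit>'
)

def parse_plan(plan: str):
    """Split a serialized plan into part records and edit commands.

    Each part becomes {"bbox": [x1, y1, z1, x2, y2, z2], "desc": label};
    each edit becomes {"target": label, "op": command}. A downstream
    geometry engine would consume these records.
    """
    parts = [
        {"bbox": [float(v) for v in box.split()], "desc": desc}
        for box, desc in re.findall(
            r"<part><bbox>(.*?)</bbox><desc>(.*?)</desc></part>", plan)
    ]
    edits = [
        {"target": target, "op": op}
        for target, op in re.findall(
            r"<edit><target>(.*?)</target><op>(.*?)</op></edit>", plan)
    ]
    return parts, edits

parts, edits = parse_plan(PLAN)
```

Separating the plan (symbolic, text-like) from its execution (geometric) is what lets any compatible geometry engine sit behind the same language-native frontend.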
PDF · December 1, 2025