FORGE: Fine-Grained Multimodal Evaluation for Manufacturing Scenarios

Abstract

The manufacturing sector is increasingly adopting Multimodal Large Language Models (MLLMs) to move from simple perception toward autonomous execution, yet current evaluations fail to reflect the rigorous demands of real-world manufacturing environments. Progress is hindered by data scarcity and by a lack of fine-grained domain semantics in existing datasets. To bridge this gap, we introduce FORGE. We first construct a high-quality multimodal dataset that combines real-world 2D images and 3D point clouds, annotated with fine-grained domain semantics (e.g., exact model numbers). We then evaluate 18 state-of-the-art MLLMs across three manufacturing tasks, namely workpiece verification, structural surface inspection, and assembly verification, revealing significant performance gaps. Contrary to conventional understanding, our bottleneck analysis shows that visual grounding is not the primary limiting factor. Instead, insufficient domain-specific knowledge is the key bottleneck, which sets a clear direction for future research. Beyond evaluation, we show that our structured annotations can serve as an actionable training resource: supervised fine-tuning of a compact 3B-parameter model on our data yields up to 90.8% relative improvement in accuracy on held-out manufacturing scenarios, providing preliminary evidence for a practical pathway toward domain-adapted manufacturing MLLMs. The code and datasets are available at https://ai4manufacturing.github.io/forge-web.
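
The fine-tuning result is reported as a relative improvement, which is easy to misread as an absolute gain in percentage points. The sketch below shows the distinction under the usual definition, (tuned - baseline) / baseline. The function name and the accuracy values are hypothetical, chosen only for illustration; they are not results from FORGE.

```python
def relative_improvement(baseline_acc: float, tuned_acc: float) -> float:
    """Gain of tuned_acc over baseline_acc, expressed as a fraction of the baseline."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (tuned_acc - baseline_acc) / baseline_acc


# Hypothetical accuracies for illustration only (not FORGE results).
baseline, tuned = 0.25, 0.45
gain = relative_improvement(baseline, tuned)
print(f"absolute gain: {tuned - baseline:.2f}, relative gain: {gain:.1%}")
```

With these illustrative numbers, accuracy rises by 20 percentage points in absolute terms, which is an 80% relative improvement. A low baseline can therefore produce a large relative figure from a modest absolute gain, so the two should be read together.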