FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios

Abstract

The manufacturing sector is increasingly adopting Multimodal Large Language Models (MLLMs) to move from simple perception to autonomous execution, yet current evaluations fail to reflect the rigorous demands of real-world manufacturing environments. Progress is hindered by data scarcity and by a lack of fine-grained domain semantics in existing datasets. To bridge this gap, we introduce FORGE. We first construct a high-quality multimodal dataset that combines real-world 2D images and 3D point clouds, annotated with fine-grained domain semantics (e.g., exact model numbers). We then evaluate 18 state-of-the-art MLLMs across three manufacturing tasks, namely workpiece verification, structural surface inspection, and assembly verification, revealing significant performance gaps. Contrary to conventional understanding, our bottleneck analysis shows that visual grounding is not the primary limiting factor. Instead, insufficient domain-specific knowledge is the key bottleneck, which sets a clear direction for future research. Beyond evaluation, we show that our structured annotations can serve as an actionable training resource: supervised fine-tuning of a compact 3B-parameter model on our data yields up to 90.8% relative improvement in accuracy on held-out manufacturing scenarios, providing preliminary evidence for a practical pathway toward domain-adapted manufacturing MLLMs. The code and datasets are available at https://ai4manufacturing.github.io/forge-web.
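As a reading aid for the "relative improvement in accuracy" figure, the sketch below shows the usual definition of a relative gain, (fine-tuned - baseline) / baseline. It assumes the abstract uses this standard definition. The function name and the accuracy values are hypothetical placeholders chosen for illustration; they are not results from the paper.

```python
def relative_improvement(baseline_acc: float, finetuned_acc: float) -> float:
    """Return the gain of finetuned_acc over baseline_acc, in percent of the baseline."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (finetuned_acc - baseline_acc) / baseline_acc * 100.0


# Placeholder accuracies for illustration only; NOT figures from the paper.
baseline = 0.40
finetuned = 0.60

gain = relative_improvement(baseline, finetuned)
print(f"{gain:.1f}% relative improvement")
```

A relative gain is measured against the baseline, so it can be much larger than the absolute change in accuracy points. In the placeholder case, a 20-point absolute gain corresponds to a 50% relative gain.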
