FORGE: Fine-grained Multimodal Evaluation for Manufacturing Scenarios

Abstract

The manufacturing sector is increasingly adopting Multimodal Large Language Models (MLLMs) to move from simple perception to autonomous execution, yet current evaluations fail to reflect the rigorous demands of real-world manufacturing environments. Progress is hindered by data scarcity and by a lack of fine-grained domain semantics in existing datasets. To bridge this gap, we introduce FORGE. We first construct a high-quality multimodal dataset that combines real-world 2D images and 3D point clouds, annotated with fine-grained domain semantics (e.g., exact model numbers). We then evaluate 18 state-of-the-art MLLMs across three manufacturing tasks, namely workpiece verification, structural surface inspection, and assembly verification, revealing significant performance gaps. Counter to conventional understanding, our bottleneck analysis shows that visual grounding is not the primary limiting factor. Instead, insufficient domain-specific knowledge is the key bottleneck, which sets a clear direction for future research. Beyond evaluation, we show that our structured annotations can serve as an actionable training resource: supervised fine-tuning of a compact 3B-parameter model on our data yields up to a 90.8% relative improvement in accuracy on held-out manufacturing scenarios, providing preliminary evidence for a practical pathway toward domain-adapted manufacturing MLLMs. The code and datasets are available at https://ai4manufacturing.github.io/forge-web.
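As a reading aid for the 90.8% figure: the abstract does not define "relative improvement", so the expression below assumes the usual definition, the accuracy gain divided by the baseline accuracy. The symbols $\mathrm{Acc}_{\text{base}}$ and $\mathrm{Acc}_{\text{SFT}}$ are introduced here for illustration only and do not come from the paper.

```latex
% Assumed (standard) definition of relative improvement; not taken from the paper.
\[
  \Delta_{\text{rel}}
  = \frac{\mathrm{Acc}_{\text{SFT}} - \mathrm{Acc}_{\text{base}}}{\mathrm{Acc}_{\text{base}}}
  \times 100\%
\]
% Purely hypothetical illustration (these accuracies are NOT reported results):
% a baseline of 30.00% rising to 57.24% would give
% (57.24 - 30.00) / 30.00 x 100% = 90.8%.
```

Under this reading, a 90.8% relative improvement means the fine-tuned accuracy is about 1.9 times the baseline, not that the accuracy rises by 90.8 percentage points.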
