MechVQA: Benchmarking and Enhancing Multimodal LLMs for Comprehensive Mechanical Drawing Understanding

Abstract

Multimodal Large Language Models (MLLMs) have achieved strong results on general visual question answering (VQA) tasks. However, they remain brittle on mechanical engineering drawings. High annotation density and weak domain knowledge, compounded by unreliable spatial relation reasoning under strict projection rules and geometric constraints, make decisive cues easy to miss and frequently lead to wrong answers. To bridge this gap, we introduce MechVQA, the first comprehensive dataset for mechanical drawing understanding, created through a semi-automated construction and quality-control pipeline. MechVQA contains 3.3K high-density images and 21K question-answer pairs, spanning 10 fine-grained tasks across three capability levels: Recognition, Reasoning, and Judging. It provides a testbed for evaluating and improving MLLM understanding of real-world mechanical drawings. Building on MechVQA, we develop the MechVL model through a multi-stage training paradigm, establishing a strong domain-specialized baseline. Extensive experiments show that MechVL outperforms the strongest closed-source baseline by 7.57 percentage points on the MechVQA total score, substantially improving mechanical drawing understanding and providing a reusable foundation for deploying MLLMs in mechanical design and inspection scenarios.