MechVQA: Benchmarking and Enhancing Multimodal Large Language Models for Comprehensive Mechanical Drawing Understanding

Abstract

Multimodal Large Language Models (MLLMs) have achieved strong results on general visual question answering (VQA) tasks. However, they remain brittle on mechanical engineering drawings: annotation density is high, the models' domain knowledge is weak, and spatial relation reasoning is unreliable under strict projection rules and geometric constraints, so decisive cues are easily missed and wrong answers are frequent. To bridge this gap, we introduce MechVQA, the first comprehensive dataset for mechanical drawing understanding, built through a semi-automated construction and quality-control pipeline. MechVQA contains 3.3K high-density images with 21K question-answer pairs, spanning 10 fine-grained tasks across three capability levels (Recognition, Reasoning, and Judging), and provides a testbed for evaluating and improving MLLM understanding of real-world mechanical drawings. On top of MechVQA, we develop the MechVL model through a multi-stage training paradigm, establishing a strong domain-specialized baseline. Extensive experiments show that MechVL outperforms the strongest closed-source baseline by 7.57 percentage points on the MechVQA total score, significantly improving mechanical drawing understanding and providing a reusable foundation for deploying MLLMs in mechanical design and inspection scenarios.