MechVQA: Benchmarking and Enhancing Multimodal Large Language Models for Comprehensive Mechanical Drawing Understanding

Abstract

Multimodal large language models (MLLMs) perform strongly on general visual question answering (VQA) tasks. However, they remain brittle on mechanical engineering drawings: high annotation density, weak domain knowledge, and unreliable spatial-relation reasoning under strict projection rules and geometric constraints make decisive cues easy to miss and frequently lead to wrong answers. To bridge this gap, we introduce MechVQA, the first comprehensive dataset for mechanical drawing understanding, built with a semi-automated construction and quality-control pipeline. MechVQA contains 3.3K high-density images and 21K question-answer pairs spanning 10 fine-grained tasks across three capability levels (Recognition, Reasoning, and Judging), providing a testbed for evaluating and improving MLLM understanding of real-world mechanical drawings. Building on MechVQA, we develop MechVL through a multi-stage training paradigm, establishing a strong domain-specialized baseline. Extensive experiments show that MechVL outperforms the strongest closed-source baseline by 7.57 percentage points on the MechVQA total score, substantially improving mechanical drawing understanding and providing a reusable foundation for deploying MLLMs in mechanical design and inspection scenarios.