

Benchmarking and Mechanistic Analysis of Vision-Language Models for Cross-Depiction Assembly Instruction Alignment

April 1, 2026
Authors: Zhuchenyang Liu, Yao Zhang, Yu Xiao
cs.AI

Abstract

2D assembly diagrams are often abstract and hard to follow, creating a need for intelligent assistants that can monitor progress, detect errors, and provide step-by-step guidance. In mixed reality settings, such systems must recognize completed and ongoing steps from the camera feed and align them with the diagram instructions. Vision Language Models (VLMs) show promise for this task, but face a depiction gap because assembly diagrams and video frames share few visual features. To systematically assess this gap, we construct IKEA-Bench, a benchmark of 1,623 questions across 6 task types on 29 IKEA furniture products, and evaluate 19 VLMs (2B-38B) under three alignment strategies. Our key findings: (1) assembly instruction understanding is recoverable via text, but text simultaneously degrades diagram-to-video alignment; (2) architecture family predicts alignment accuracy more strongly than parameter count; (3) video understanding remains a hard bottleneck unaffected by strategy. A three-level mechanistic analysis further reveals that diagrams and video occupy disjoint ViT subspaces, and that adding text shifts models from visual to text-driven reasoning. These results identify visual encoding as the primary target for improving cross-depiction robustness. Project page: https://ryenhails.github.io/IKEA-Bench/
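The claim that diagrams and video frames occupy disjoint ViT subspaces can be made concrete with principal angles between the subspaces spanned by the two sets of embeddings. Below is a minimal sketch of that measurement; the function name, the toy stand-in embeddings, and the subspace dimension `k` are illustrative assumptions, not details from the paper.

```python
import numpy as np

def principal_angles(A, B, k=8):
    """Principal angles (radians) between the top-k PCA subspaces of two
    embedding sets A and B, each of shape (n_samples, dim)."""
    # Orthonormal bases from the top-k right singular vectors of centered data
    Ua = np.linalg.svd(A - A.mean(0), full_matrices=False)[2][:k].T
    Ub = np.linalg.svd(B - B.mean(0), full_matrices=False)[2][:k].T
    # Singular values of Ua^T Ub are the cosines of the principal angles
    s = np.linalg.svd(Ua.T @ Ub, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))

# Toy stand-ins for ViT embeddings of diagrams vs. video frames
rng = np.random.default_rng(0)
diagram_emb = rng.normal(size=(200, 64))
video_emb = rng.normal(size=(200, 64))

angles = principal_angles(diagram_emb, video_emb)
print(np.degrees(angles))  # angles near 90 degrees indicate nearly disjoint subspaces
```

With real ViT features in place of the toy arrays, consistently large angles between diagram and video subspaces would correspond to the representation gap the abstract describes.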
PDF · April 3, 2026