Benchmarking and Mechanistic Analysis of Vision-Language Models for Cross-Depiction Assembly Instruction Alignment
April 1, 2026
Authors: Zhuchenyang Liu, Yao Zhang, Yu Xiao
cs.AI
Abstract
2D assembly diagrams are often abstract and hard to follow, creating a need for intelligent assistants that can monitor progress, detect errors, and provide step-by-step guidance. In mixed-reality settings, such systems must recognize completed and ongoing steps from the camera feed and align them with the diagram instructions. Vision-Language Models (VLMs) show promise for this task but face a depiction gap, because assembly diagrams and video frames share few visual features. To systematically assess this gap, we construct IKEA-Bench, a benchmark of 1,623 questions across 6 task types on 29 IKEA furniture products, and evaluate 19 VLMs (2B-38B parameters) under three alignment strategies. Our key findings are: (1) assembly instruction understanding is recoverable via text, but text simultaneously degrades diagram-to-video alignment; (2) architecture family predicts alignment accuracy more strongly than parameter count; (3) video understanding remains a hard bottleneck unaffected by strategy. A three-level mechanistic analysis further reveals that diagrams and video occupy disjoint ViT subspaces, and that adding text shifts models from visual to text-driven reasoning. These results identify visual encoding as the primary target for improving cross-depiction robustness. Project page: https://ryenhails.github.io/IKEA-Bench/