Visually Interpretable Subtask Reasoning for Visual Question Answering

May 12, 2025
作者: Yu Cheng, Arushi Goel, Hakan Bilen
cs.AI

Abstract

Answering complex visual questions like "Which red furniture can be used for sitting?" requires multi-step reasoning, including object recognition, attribute filtering, and relational understanding. Recent work improves interpretability in multimodal large language models (MLLMs) by decomposing tasks into sub-task programs, but these methods are computationally expensive and less accurate due to poor adaptation to target data. To address this, we introduce VISTAR (Visually Interpretable Subtask-Aware Reasoning Model), a subtask-driven training framework that enhances both interpretability and reasoning by generating textual and visual explanations within MLLMs. Instead of relying on external models, VISTAR fine-tunes MLLMs to produce structured Subtask-of-Thought rationales (step-by-step reasoning sequences). Experiments on two benchmarks show that VISTAR consistently improves reasoning accuracy while maintaining interpretability. Our code and dataset will be available at https://github.com/ChengJade/VISTAR.
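
As an illustration only, the sketch below shows one plausible way a structured "Subtask-of-Thought" rationale for the example question could be represented as a step-by-step reasoning sequence. The abstract does not specify VISTAR's actual rationale schema; the subtask names and fields here are hypothetical.

```python
# Hypothetical sketch of a structured Subtask-of-Thought rationale for the
# question "Which red furniture can be used for sitting?". The schema below
# (subtask names, fields) is illustrative and not taken from the VISTAR paper.
rationale = [
    {"step": 1, "subtask": "select",      "arg": "furniture",
     "note": "locate all furniture objects in the image"},
    {"step": 2, "subtask": "filter_attr", "arg": "red",
     "note": "keep only the furniture instances that are red"},
    {"step": 3, "subtask": "filter_rel",  "arg": "used for sitting",
     "note": "keep objects that afford sitting (e.g., chair, sofa)"},
    {"step": 4, "subtask": "answer",      "arg": None,
     "note": "name the remaining objects as the final answer"},
]

def print_rationale(steps):
    """Print the step-by-step reasoning sequence in a readable form."""
    for s in steps:
        print(f"Step {s['step']}: {s['subtask']}({s['arg']!r}) - {s['note']}")

if __name__ == "__main__":
    print_rationale(rationale)
```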
