Visually Interpretable Subtask Reasoning for Visual Question Answering

May 12, 2025
Authors: Yu Cheng, Arushi Goel, Hakan Bilen
cs.AI

Abstract

Answering complex visual questions like "Which red furniture can be used for sitting?" requires multi-step reasoning, including object recognition, attribute filtering, and relational understanding. Recent work improves interpretability in multimodal large language models (MLLMs) by decomposing tasks into sub-task programs, but these methods are computationally expensive and less accurate due to poor adaptation to target data. To address this, we introduce VISTAR (Visually Interpretable Subtask-Aware Reasoning Model), a subtask-driven training framework that enhances both interpretability and reasoning by generating textual and visual explanations within MLLMs. Instead of relying on external models, VISTAR fine-tunes MLLMs to produce structured Subtask-of-Thought rationales (step-by-step reasoning sequences). Experiments on two benchmarks show that VISTAR consistently improves reasoning accuracy while maintaining interpretability. Our code and dataset will be available at https://github.com/ChengJade/VISTAR.
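
The abstract only names the idea of structured "Subtask-of-Thought" rationales. As a rough illustration, the Python sketch below mocks up what a step-by-step, visually grounded decomposition of the running example question could look like; the `Subtask` fields, the operation names, and the `obj_*` region identifiers are hypothetical assumptions made for illustration and are not taken from the paper or its dataset.

```python
from dataclasses import dataclass, field


@dataclass
class Subtask:
    """One step of a reasoning chain, paired with the image regions it grounds to."""
    operation: str                                   # e.g. "select", "filter_attribute" (assumed names)
    argument: str                                    # what the operation applies to
    grounded_objects: list[str] = field(default_factory=list)  # illustrative region ids


def decompose_question(question: str) -> list[Subtask]:
    """Hand-written decomposition for the running example only; in VISTAR the
    fine-tuned MLLM would generate such steps itself rather than via rules."""
    if "red furniture" in question and "sitting" in question:
        return [
            Subtask("select", "furniture", ["obj_3", "obj_7", "obj_9"]),
            Subtask("filter_attribute", "color == red", ["obj_3", "obj_9"]),
            Subtask("filter_affordance", "usable for sitting", ["obj_3"]),
            Subtask("answer", "name the remaining object(s)", ["obj_3"]),
        ]
    raise NotImplementedError("Only the running example is covered in this sketch.")


if __name__ == "__main__":
    question = "Which red furniture can be used for sitting?"
    for i, step in enumerate(decompose_question(question), start=1):
        print(f"Step {i}: {step.operation}({step.argument}) -> {step.grounded_objects}")
```

In the paper's framing, each such step would be produced by the fine-tuned MLLM together with its textual and visual explanation, rather than by the hand-written rules used in this mock-up.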
