Visually Interpretable Subtask Reasoning for Visual Question Answering
May 12, 2025
Authors: Yu Cheng, Arushi Goel, Hakan Bilen
cs.AI
Abstract
Answering complex visual questions like "Which red furniture can be used for
sitting?" requires multi-step reasoning, including object recognition,
attribute filtering, and relational understanding. Recent work improves
interpretability in multimodal large language models (MLLMs) by decomposing
tasks into sub-task programs, but these methods are computationally expensive
and less accurate due to poor adaptation to target data. To address this, we
introduce VISTAR (Visually Interpretable Subtask-Aware Reasoning Model), a
subtask-driven training framework that enhances both interpretability and
reasoning by generating textual and visual explanations within MLLMs. Instead
of relying on external models, VISTAR fine-tunes MLLMs to produce structured
Subtask-of-Thought rationales (step-by-step reasoning sequences). Experiments
on two benchmarks show that VISTAR consistently improves reasoning accuracy
while maintaining interpretability. Our code and dataset will be available at
https://github.com/ChengJade/VISTAR.
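The abstract does not specify the exact format of a Subtask-of-Thought rationale. Purely as an illustration of the idea of step-by-step subtask reasoning with textual and visual evidence, the sketch below shows one plausible structure for the example question; the operation names, fields, and bounding-box coordinates are hypothetical and not taken from the paper.

```python
# Illustrative sketch (assumptions, not the paper's format): a possible shape for a
# structured "Subtask-of-Thought" rationale that an MLLM could be fine-tuned to emit
# for "Which red furniture can be used for sitting?".
from dataclasses import dataclass, field

@dataclass
class Subtask:
    operation: str   # hypothetical step type, e.g. "select", "filter_attribute", "relate"
    argument: str    # operand of the step
    rationale: str   # textual explanation of the step
    regions: list = field(default_factory=list)  # visual evidence (bounding boxes), if any

rationale_chain = [
    Subtask("select", "furniture", "Find all furniture items in the image.",
            regions=[(40, 60, 210, 300), (250, 80, 420, 310)]),   # coordinates are made up
    Subtask("filter_attribute", "red", "Keep only the furniture that is red."),
    Subtask("relate", "used for sitting", "Among the red furniture, keep items one can sit on."),
    Subtask("answer", "chair", "The remaining object is a red chair."),
]

# Print the reasoning chain step by step.
for step in rationale_chain:
    print(f"{step.operation}({step.argument}) -> {step.rationale}")
```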