Visually Interpretable Subtask Reasoning for Visual Question Answering

May 12, 2025
Authors: Yu Cheng, Arushi Goel, Hakan Bilen
cs.AI

Abstract

Answering complex visual questions like "Which red furniture can be used for sitting?" requires multi-step reasoning, including object recognition, attribute filtering, and relational understanding. Recent work improves interpretability in multimodal large language models (MLLMs) by decomposing tasks into sub-task programs, but these methods are computationally expensive and less accurate due to poor adaptation to target data. To address this, we introduce VISTAR (Visually Interpretable Subtask-Aware Reasoning Model), a subtask-driven training framework that enhances both interpretability and reasoning by generating textual and visual explanations within MLLMs. Instead of relying on external models, VISTAR fine-tunes MLLMs to produce structured Subtask-of-Thought rationales (step-by-step reasoning sequences). Experiments on two benchmarks show that VISTAR consistently improves reasoning accuracy while maintaining interpretability. Our code and dataset will be available at https://github.com/ChengJade/VISTAR.
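
The abstract only names the idea of structured "Subtask-of-Thought" rationales. As a rough illustration, the Python sketch below mocks up what a step-by-step, visually grounded decomposition of the running example question could look like; the `Subtask` fields, the operation names, and the `obj_*` region identifiers are hypothetical assumptions made for illustration and are not taken from the paper or its dataset.

```python
from dataclasses import dataclass, field


@dataclass
class Subtask:
    """One step of a reasoning chain, paired with the image regions it grounds to."""
    operation: str                                   # e.g. "select", "filter_attribute" (assumed names)
    argument: str                                    # what the operation applies to
    grounded_objects: list[str] = field(default_factory=list)  # illustrative region ids


def decompose_question(question: str) -> list[Subtask]:
    """Hand-written decomposition for the running example only; in VISTAR the
    fine-tuned MLLM would generate such steps itself rather than via rules."""
    if "red furniture" in question and "sitting" in question:
        return [
            Subtask("select", "furniture", ["obj_3", "obj_7", "obj_9"]),
            Subtask("filter_attribute", "color == red", ["obj_3", "obj_9"]),
            Subtask("filter_affordance", "usable for sitting", ["obj_3"]),
            Subtask("answer", "name the remaining object(s)", ["obj_3"]),
        ]
    raise NotImplementedError("Only the running example is covered in this sketch.")


if __name__ == "__main__":
    question = "Which red furniture can be used for sitting?"
    for i, step in enumerate(decompose_question(question), start=1):
        print(f"Step {i}: {step.operation}({step.argument}) -> {step.grounded_objects}")
```

In the paper's framing, each such step would be produced by the fine-tuned MLLM together with its textual and visual explanation, rather than by the hand-written rules used in this mock-up.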
