Visually Interpretable Subtask Reasoning for Visual Question Answering

Abstract

Answering complex visual questions such as "Which red furniture can be used for sitting?" requires multi-step reasoning, including object recognition, attribute filtering, and relational understanding. Recent work improves the interpretability of multimodal large language models (MLLMs) by decomposing tasks into subtask programs, but these methods are computationally expensive and less accurate because they adapt poorly to the target data. To address this, we introduce VISTAR (Visually Interpretable Subtask-Aware Reasoning Model), a subtask-driven training framework that enhances both interpretability and reasoning by generating textual and visual explanations within MLLMs. Instead of relying on external models, VISTAR fine-tunes MLLMs to produce structured Subtask-of-Thought rationales (step-by-step reasoning sequences). Experiments on two benchmarks show that VISTAR consistently improves reasoning accuracy while maintaining interpretability. Our code and dataset will be available at https://github.com/ChengJade/VISTAR.
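
To make the idea of a step-by-step rationale concrete, the following is a purely illustrative Python sketch of how the example question could be broken into ordered subtasks over a toy, hand-written scene. It is not VISTAR's rationale format, training code, or data: the `SceneObject` class, the `SCENE` list, the `affordances` field, and the `answer_with_rationale` function are all hypothetical names introduced here, and the final step uses a simple affordance lookup as a stand-in for relational understanding.

```python
from dataclasses import dataclass


@dataclass
class SceneObject:
    """Hypothetical hand-written annotation for one object in an image."""
    name: str
    color: str
    category: str
    affordances: tuple


# Toy scene (an assumption for illustration; not taken from any benchmark).
SCENE = [
    SceneObject("chair", "red", "furniture", ("sit",)),
    SceneObject("table", "red", "furniture", ("place",)),
    SceneObject("sofa", "blue", "furniture", ("sit",)),
    SceneObject("apple", "red", "food", ("eat",)),
]


def answer_with_rationale(scene):
    """Run three subtasks in order and record each intermediate result."""
    rationale = []

    # Step 1: object recognition, reduced here to a category lookup.
    furniture = [o for o in scene if o.category == "furniture"]
    rationale.append(("select furniture", [o.name for o in furniture]))

    # Step 2: attribute filtering on color.
    red = [o for o in furniture if o.color == "red"]
    rationale.append(("filter color == red", [o.name for o in red]))

    # Step 3: keep objects usable for sitting (stand-in for the relational step).
    sittable = [o for o in red if "sit" in o.affordances]
    rationale.append(("filter usable for sitting", [o.name for o in sittable]))

    return [o.name for o in sittable], rationale


answer, steps = answer_with_rationale(SCENE)
for description, result in steps:
    print(f"{description}: {result}")
print("answer:", answer)
```

Each recorded step pairs a textual description with its intermediate result, which is the sense in which a subtask sequence can be inspected; in the framework described above, such rationales are generated by the fine-tuned MLLM itself rather than computed from annotations.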
