LLaVA-o1: Let Vision Language Models Reason Step-by-Step

Abstract

Large language models have demonstrated substantial advances in reasoning capabilities, particularly through inference-time scaling, as illustrated by models such as OpenAI's o1. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks. In this work, we introduce LLaVA-o1, a novel VLM designed to conduct autonomous multistage reasoning. Unlike chain-of-thought prompting, LLaVA-o1 independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. This structured approach enables LLaVA-o1 to achieve marked improvements in precision on reasoning-intensive tasks. To accomplish this, we compile the LLaVA-o1-100k dataset, integrating samples from various visual question-answering sources and providing structured reasoning annotations. In addition, we propose an inference-time stage-level beam search method, which enables effective inference-time scaling. Remarkably, with only 100k training samples and a simple yet effective inference-time scaling method, LLaVA-o1 not only outperforms its base model by 8.9% on a wide range of multimodal reasoning benchmarks, but also surpasses the performance of larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct.
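The stage-level beam search mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how candidates are generated or compared, so everything below is an assumption made for illustration: the stage names are paraphrased from the abstract, and `generate`, `score`, `num_candidates`, and the toy helpers are hypothetical stand-ins, not the authors' implementation. The idea shown is that several candidates are sampled for each stage, one is kept, and the next stage is conditioned on it.

```python
import random
from typing import Callable, List

# Stage names paraphrased from the abstract; the exact labels are an assumption.
STAGES = ["summary", "visual_interpretation", "reasoning", "conclusion"]


def stage_level_beam_search(
    question: str,
    generate: Callable[[str, str], str],
    score: Callable[[str, str, str], float],
    num_candidates: int = 4,
) -> List[str]:
    """Pick one output per stage, conditioning each stage on earlier picks.

    `generate(context, stage)` and `score(context, stage, candidate)` are
    hypothetical callables standing in for a VLM sampler and a selection rule.
    """
    context = question
    chosen: List[str] = []
    for stage in STAGES:
        # Sample several candidate outputs for the current stage only.
        candidates = [generate(context, stage) for _ in range(num_candidates)]
        # Keep the highest-scoring candidate (assumed selection rule).
        best = max(candidates, key=lambda c: score(context, stage, c))
        chosen.append(best)
        # Later stages see the question plus every stage chosen so far.
        context = context + "\n" + best
    return chosen


def make_toy_generator(seed: int = 0) -> Callable[[str, str], str]:
    """Toy stand-in for a model: emits a tagged draft with a random quality."""
    rng = random.Random(seed)

    def generate(context: str, stage: str) -> str:
        return f"[{stage}] draft quality={rng.randint(0, 99)}"

    return generate


def toy_score(context: str, stage: str, candidate: str) -> float:
    """Toy stand-in for a selection rule: reads the quality back out."""
    return float(candidate.rsplit("=", 1)[1])
```

Calling `stage_level_beam_search("What is shown?", make_toy_generator(), toy_score)` returns four strings, one per stage. Under this sketch, increasing `num_candidates` spends more inference-time compute per stage, which is the scaling knob the abstract alludes to.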
