Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

Abstract

Large Language Models (LLMs) demonstrate enhanced capabilities and reliability by reasoning more, evolving from Chain-of-Thought prompting to product-level solutions like OpenAI o1. Despite various efforts to improve LLM reasoning, high-quality long-chain reasoning data and optimized training pipelines remain underexplored in vision-language tasks. In this paper, we present Insight-V, an early effort to 1) scalably produce long and robust reasoning data for complex multi-modal tasks, and 2) build an effective training pipeline that enhances the reasoning capabilities of multi-modal large language models (MLLMs). Specifically, to create long and structured reasoning data without human labor, we design a two-step pipeline with a progressive strategy to generate sufficiently long and diverse reasoning paths and a multi-granularity assessment method to ensure data quality. We observe that directly supervising MLLMs with such long and complex reasoning data does not yield ideal reasoning ability. To tackle this problem, we design a multi-agent system consisting of a reasoning agent dedicated to performing long-chain reasoning and a summary agent trained to judge and summarize reasoning results. We further incorporate an iterative DPO algorithm to enhance the reasoning agent's generation stability and quality. Based on the popular LLaVA-NeXT model and our stronger base MLLM, we demonstrate significant performance gains across challenging multi-modal benchmarks requiring visual reasoning. Benefiting from our multi-agent system, Insight-V can also easily maintain or improve performance on perception-focused multi-modal tasks.
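The abstract names an iterative DPO algorithm without giving details. As a rough point of reference only, the sketch below computes the standard Direct Preference Optimization loss for a single preference pair. It assumes sequence-level log-probabilities are already available from the policy and from a frozen reference model. The function name, argument names, and the default `beta` are illustrative assumptions, and whether Insight-V uses exactly this form in each iteration is not stated here.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    All inputs are summed log-probabilities of a full response.
    The names and the beta default are illustrative, not from the paper.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)), evaluated in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# The loss drops below log(2) once the policy prefers the chosen response
# more strongly than the reference model does.
loss = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

In an iterative setup, one would typically regenerate preference pairs with the updated model between rounds and reapply this loss; that loop is omitted because its specifics are not described in the abstract.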
