Insight-V++: Towards Advanced Long-Chain Visual Reasoning with Multimodal Large Language Models

Abstract

Large Language Models (LLMs) have achieved remarkable reliability and advanced capabilities through extended test-time reasoning. However, extending these capabilities to Multimodal Large Language Models (MLLMs) remains a significant challenge due to a critical scarcity of high-quality, long-chain reasoning data and optimized training pipelines. To bridge this gap, we present a unified multi-agent visual reasoning framework that systematically evolves from our foundational image-centric model, Insight-V, into a generalized spatial-temporal architecture, Insight-V++. We first propose a scalable data generation pipeline equipped with multi-granularity assessment that autonomously synthesizes structured, complex reasoning trajectories across image and video domains without human intervention. Recognizing that directly supervising MLLMs with such intricate data yields sub-optimal results, we design a dual-agent architecture comprising a reasoning agent that executes extensive analytical chains and a summary agent that critically evaluates and distills final outcomes. While our initial framework utilized Direct Preference Optimization (DPO), its off-policy nature fundamentally constrained reinforcement learning potential. To overcome these limitations, particularly for long-horizon video understanding, Insight-V++ introduces two novel algorithms, ST-GRPO and J-GRPO, which enhance spatial-temporal reasoning and improve evaluative robustness. Crucially, by leveraging reliable feedback from the summary agent, we guide an iterative reasoning path generation process, retraining the entire multi-agent system in a continuous, self-improving loop. Extensive experiments on base models such as LLaVA-NeXT and Qwen2.5-VL demonstrate significant performance gains across challenging image and video reasoning benchmarks while preserving strong capabilities on traditional perception-focused tasks.
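The abstract does not spell out how ST-GRPO and J-GRPO work, so the following is only a minimal sketch of the group-relative advantage step that standard GRPO-style training is built on, not the authors' algorithms. The function name `group_relative_advantages`, the use of the population standard deviation, the `eps` value, and the example rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled response relative to its own group.

    Generic GRPO-style normalization (an illustrative assumption, not the
    paper's ST-GRPO or J-GRPO): subtract the group mean and divide by the
    group standard deviation, so no separate value network is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; the choice of estimator is an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four reasoning paths sampled for one question,
# e.g. 1.0 if the final answer was judged correct and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Paths that beat the group average receive a positive advantage and the rest a negative one. Because the group is sampled from the current policy, the update is on-policy, in contrast to the off-policy DPO setup that the abstract describes as a limitation.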
