

Insight-V++: Towards Advanced Long-Chain Visual Reasoning with Multimodal Large Language Models

March 18, 2026
Authors: Yuhao Dong, Zuyan Liu, Shulin Tian, Yongming Rao, Ziwei Liu
cs.AI

Abstract

Large Language Models (LLMs) have achieved remarkable reliability and advanced capabilities through extended test-time reasoning. However, extending these capabilities to Multi-modal Large Language Models (MLLMs) remains a significant challenge due to a critical scarcity of high-quality, long-chain reasoning data and optimized training pipelines. To bridge this gap, we present a unified multi-agent visual reasoning framework that systematically evolves from our foundational image-centric model, Insight-V, into a generalized spatial-temporal architecture, Insight-V++. We first propose a scalable data generation pipeline equipped with multi-granularity assessment that autonomously synthesizes structured, complex reasoning trajectories across image and video domains without human intervention. Recognizing that directly supervising MLLMs with such intricate data yields sub-optimal results, we design a dual-agent architecture comprising a reasoning agent to execute extensive analytical chains, and a summary agent to critically evaluate and distill final outcomes. While our initial framework utilized Direct Preference Optimization (DPO), its off-policy nature fundamentally constrained reinforcement learning potential. To overcome these limitations, particularly for long-horizon video understanding, Insight-V++ introduces two novel algorithms, ST-GRPO and J-GRPO, which enhance spatial-temporal reasoning and improve evaluative robustness. Crucially, by leveraging reliable feedback from the summary agent, we guide an iterative reasoning path generation process, retraining the entire multi-agent system in a continuous, self-improving loop. Extensive experiments on base models like LLaVA-NeXT and Qwen2.5-VL demonstrate significant performance gains across challenging image and video reasoning benchmarks while preserving strong capabilities on traditional perception-focused tasks.
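The dual-agent decomposition the abstract describes — a reasoning agent that produces an extensive analytical chain, and a summary agent that critically evaluates and distills it into the final answer — can be sketched as a simple control flow. This is a minimal illustration of the decomposition only; the agent callables below are hypothetical stand-ins, not the paper's actual models.

```python
# Minimal sketch of a dual-agent pipeline: one agent reasons at length,
# a second agent distills the chain into a final answer. The concrete
# `reason` / `summarize` callables are toy placeholders.
from typing import Callable

def dual_agent_answer(question: str,
                      reason: Callable[[str], str],
                      summarize: Callable[[str, str], str]) -> str:
    chain = reason(question)           # extensive analytical chain
    return summarize(question, chain)  # critical evaluation + distillation

# Toy stand-ins that only demonstrate the control flow:
toy_reason = lambda q: f"step-by-step analysis of: {q}"
toy_summarize = lambda q, chain: chain.split(": ", 1)[1]
```

In the paper's framework both roles are instances of the same base MLLM trained with different objectives; the split matters because, per the abstract, directly supervising one model on long reasoning traces yields sub-optimal results.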
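The ST-GRPO and J-GRPO algorithms named in the abstract are variants of GRPO (Group Relative Policy Optimization); their specifics are not given here, but the generic mechanism they build on is the group-relative advantage: sample several responses per prompt, score them, and normalize each reward against the group's mean and standard deviation. A minimal sketch of that shared mechanism, assuming standard per-group z-score normalization:

```python
# Group-relative advantage as used in GRPO-style objectives: each sampled
# response's reward is normalized against its own group, so no separate
# value/critic model is needed.
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """A_i = (r_i - mean(r)) / (std(r) + eps) over one group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four rollouts for one prompt, scored 1.0 (good) or 0.0 (bad),
# e.g. by the summary agent's feedback described in the abstract.
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

The advantages are then used to weight the policy-gradient update for each rollout; the `eps` term guards against zero variance when all rollouts receive the same reward.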
PDF · March 25, 2026