Insight-V++: Towards Advanced Long-Chain Visual Reasoning with Multimodal Large Language Models

Abstract

Large Language Models (LLMs) have achieved remarkable reliability and advanced capabilities through extended test-time reasoning. However, extending these capabilities to Multimodal Large Language Models (MLLMs) remains a significant challenge because high-quality long-chain reasoning data and optimized training pipelines are critically scarce. To bridge this gap, we present a unified multi-agent visual reasoning framework that systematically evolves from our foundational image-centric model, Insight-V, into a generalized spatial-temporal architecture, Insight-V++. We first propose a scalable data generation pipeline equipped with multi-granularity assessment that autonomously synthesizes structured, complex reasoning trajectories across image and video domains without human intervention. Recognizing that directly supervising MLLMs with such intricate data yields sub-optimal results, we design a dual-agent architecture comprising a reasoning agent that executes extensive analytical chains and a summary agent that critically evaluates and distills the final outcomes. Our initial framework used Direct Preference Optimization (DPO), but its off-policy nature fundamentally constrained the potential of reinforcement learning. To overcome this limitation, particularly for long-horizon video understanding, Insight-V++ introduces two novel algorithms, ST-GRPO and J-GRPO, which enhance spatial-temporal reasoning and improve evaluative robustness. Crucially, we use reliable feedback from the summary agent to guide an iterative reasoning-path generation process, retraining the entire multi-agent system in a continuous, self-improving loop. Extensive experiments on base models such as LLaVA-NeXT and Qwen2.5-VL demonstrate significant performance gains across challenging image and video reasoning benchmarks while preserving strong capabilities on traditional perception-focused tasks.
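
The abstract names ST-GRPO and J-GRPO but does not specify them. As a rough illustration only, the sketch below shows the group-relative advantage normalization used in standard GRPO, which both names suggest they build on: several responses are sampled on-policy for the same prompt, and each reward is normalized against the mean and standard deviation of its own group. The function name, the `eps` constant, the use of the population standard deviation, and the reward values are all assumptions made for this example, not details taken from Insight-V++.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each sampled response's reward against its own group.

    Generic GRPO-style normalization; NOT the ST-GRPO or J-GRPO objective,
    whose details the abstract does not give.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, implementations vary
    # eps (hypothetical value) avoids division by zero when all rewards are equal
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one visual question
# (1.0 = judged correct, 0.0 = judged incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's mean receive positive advantages and those below receive negative ones, so no separate value model is needed. If every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.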
