Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence
October 23, 2025
Authors: Kun Ouyang, Yuanxin Liu, Linli Yao, Yishuo Cai, Hao Zhou, Jie Zhou, Fandong Meng, Xu Sun
cs.AI
Abstract
Video reasoning, which requires multi-step deduction across frames, remains a
major challenge for multimodal large language models (MLLMs). While
reinforcement learning (RL)-based methods enhance reasoning capabilities, they
often rely on text-only chains that yield ungrounded or hallucinated
conclusions. Conversely, frame-retrieval approaches introduce visual grounding
but still struggle with inaccurate evidence localization. To address these
challenges, we present Conan, a framework for evidence-grounded multi-step
video reasoning. Conan identifies contextual and evidence frames, reasons over
cross-frame clues, and adaptively decides when to conclude or explore further.
To achieve this, we (1) construct Conan-91K, a large-scale dataset of
automatically generated reasoning traces that include frame identification,
evidence reasoning, and action decisions, and (2) design a multi-stage
progressive cold-start strategy combined with an
Identification-Reasoning-Action (AIR) RLVR training framework to jointly
enhance multi-step visual reasoning. Extensive experiments on six multi-step
reasoning benchmarks demonstrate that Conan surpasses the baseline
Qwen2.5-VL-7B-Instruct by an average of over 10% in accuracy, achieving
state-of-the-art performance. Furthermore, Conan generalizes effectively to
long-video understanding tasks, validating its strong scalability and
robustness.
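The identify-reason-act loop described above can be sketched as a simple iterative procedure. The following is a minimal illustrative sketch only: all function names, the keyword-matching heuristic, and the confidence rule are hypothetical stand-ins, not the paper's actual method.

```python
# Hypothetical sketch of the Identification-Reasoning-Action loop from the
# abstract. The helpers below are toy stand-ins, not the paper's components.

def identify_frames(frames, candidates, question):
    # Toy identification: keep candidate frames sharing a word with the question.
    words = question.split()
    return [i for i in candidates if any(w in frames[i] for w in words)][:2]

def reason_over(frames, selected):
    # Toy reasoning: join cross-frame clues; "confident" once two are gathered.
    clues = [frames[i] for i in selected]
    return " + ".join(clues), len(clues) >= 2

def conan_style_reasoning(frames, question, max_steps=5):
    """Iterate: identify evidence frames, reason over them, then decide
    whether to conclude or keep exploring (the action step)."""
    selected, conclusion = [], ""
    candidates = list(range(len(frames)))
    for _ in range(max_steps):
        selected += identify_frames(frames, candidates, question)
        conclusion, confident = reason_over(frames, selected)
        if confident:          # action: conclude
            return conclusion
        candidates = [i for i in candidates if i not in selected]  # explore
    return conclusion

answer = conan_style_reasoning(
    ["cat enters room", "cat knocks vase", "vase shatters"], "vase cat"
)
# answer == "cat enters room + cat knocks vase"
```

The key design point mirrored here is the adaptive stopping rule: the loop terminates early once the gathered evidence supports a conclusion, rather than always scanning every frame.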