ChatPaper.ai

Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence

October 23, 2025
Authors: Kun Ouyang, Yuanxin Liu, Linli Yao, Yishuo Cai, Hao Zhou, Jie Zhou, Fandong Meng, Xu Sun
cs.AI

Abstract
Video reasoning, which requires multi-step deduction across frames, remains a major challenge for multimodal large language models (MLLMs). While reinforcement learning (RL)-based methods enhance reasoning capabilities, they often rely on text-only chains that yield ungrounded or hallucinated conclusions. Conversely, frame-retrieval approaches introduce visual grounding but still struggle with inaccurate evidence localization. To address these challenges, we present Conan, a framework for evidence-grounded multi-step video reasoning. Conan identifies contextual and evidence frames, reasons over cross-frame clues, and adaptively decides when to conclude or explore further. To achieve this, we (1) construct Conan-91K, a large-scale dataset of automatically generated reasoning traces that includes frame identification, evidence reasoning, and action decision, and (2) design a multi-stage progressive cold-start strategy combined with an Identification-Reasoning-Action (AIR) RLVR training framework to jointly enhance multi-step visual reasoning. Extensive experiments on six multi-step reasoning benchmarks demonstrate that Conan surpasses the baseline Qwen2.5-VL-7B-Instruct by an average of over 10% in accuracy, achieving state-of-the-art performance. Furthermore, Conan generalizes effectively to long-video understanding tasks, validating its strong scalability and robustness.
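The abstract describes an iterative loop: identify candidate evidence frames, reason over cross-frame clues, then adaptively decide whether to conclude or keep exploring. A minimal sketch of that control flow is below; the `Step` type, `reason_over_video`, and the toy policy are hypothetical illustrations, not the paper's actual implementation, which would drive this loop with an MLLM's structured outputs.

```python
# Hypothetical sketch of an identify-reason-act loop over video frames.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    evidence: list[int]   # indices of frames judged to carry evidence
    reasoning: str        # cross-frame reasoning over the selected clues
    action: str           # "conclude" or "explore"

def reason_over_video(
    num_frames: int,
    step_fn: Callable[[list[int]], Step],
    max_steps: int = 8,
) -> tuple[str, list[Step]]:
    """Sample frames coarsely, then reason and decide to stop or zoom in."""
    visible = list(range(0, num_frames, max(1, num_frames // 8)))  # coarse pass
    trace: list[Step] = []
    for _ in range(max_steps):
        step = step_fn(visible)
        trace.append(step)
        if step.action == "conclude":
            break
        # "explore": look at frames adjacent to the current evidence frames
        visible = sorted({f + d for f in step.evidence for d in (-1, 0, 1)
                          if 0 <= f + d < num_frames})
    return trace[-1].reasoning, trace

# Toy policy: conclude once a specific target frame is in view, else zoom in.
def toy_policy(visible: list[int]) -> Step:
    if 42 in visible:
        return Step([42], "answer found at frame 42", "conclude")
    nearest = min(visible, key=lambda f: abs(f - 42))
    return Step([nearest], "zooming in", "explore")

answer, trace = reason_over_video(num_frames=128, step_fn=toy_policy)
```

The toy policy stands in for the model's per-step decision; the point is only the shape of the loop (frame identification, reasoning, and an explicit stop/explore action), mirroring the Identification-Reasoning-Action structure the abstract names.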