Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence
October 23, 2025
Authors: Kun Ouyang, Yuanxin Liu, Linli Yao, Yishuo Cai, Hao Zhou, Jie Zhou, Fandong Meng, Xu Sun
cs.AI
Abstract
Video reasoning, which requires multi-step deduction across frames, remains a
major challenge for multimodal large language models (MLLMs). While
reinforcement learning (RL)-based methods enhance reasoning capabilities, they
often rely on text-only chains that yield ungrounded or hallucinated
conclusions. Conversely, frame-retrieval approaches introduce visual grounding
but still struggle with inaccurate evidence localization. To address these
challenges, we present Conan, a framework for evidence-grounded multi-step
video reasoning. Conan identifies contextual and evidence frames, reasons over
cross-frame clues, and adaptively decides when to conclude or explore further.
To achieve this, we (1) construct Conan-91K, a large-scale dataset of
automatically generated reasoning traces spanning frame identification,
evidence reasoning, and action decisions, and (2) design a multi-stage
progressive cold-start strategy combined with an
Identification-Reasoning-Action (AIR) RLVR training framework to jointly
enhance multi-step visual reasoning. Extensive experiments on six multi-step
reasoning benchmarks demonstrate that Conan surpasses the baseline
Qwen2.5-VL-7B-Instruct by an average of over 10% in accuracy, achieving
state-of-the-art performance. Furthermore, Conan generalizes effectively to
long-video understanding tasks, validating its strong scalability and
robustness.
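The identify-reason-act loop that the abstract describes can be sketched roughly as follows. This is a minimal illustration under assumed interfaces: the function and method names (`conan_loop`, `identify`, `reason`, `act`), the step budget, and the toy stopping rule are all hypothetical and not the paper's actual implementation.

```python
def conan_loop(frames, question, model, max_steps=4):
    """Hypothetical sketch of Conan's identification-reasoning-action cycle:
    gather evidence frames, reason across them, then decide to answer or explore."""
    evidence = []
    for _ in range(max_steps):
        # 1. Identification: pick contextual/evidence frames relevant to the question.
        evidence += model.identify(frames, question, evidence)
        # 2. Reasoning: draw cross-frame inferences from the evidence so far.
        thought = model.reason(evidence, question)
        # 3. Action: adaptively conclude with an answer, or continue exploring.
        action, answer = model.act(thought, evidence)
        if action == "answer":
            return answer
    # Step budget exhausted: fall back to the best available conclusion.
    return model.act(model.reason(evidence, question), evidence)[1]


class ToyModel:
    """Illustrative stub standing in for an MLLM; here the 'evidence'
    frames are simply the even-numbered ones."""

    def identify(self, frames, question, evidence):
        remaining = [f for f in frames if f % 2 == 0 and f not in evidence]
        return remaining[:1]  # explore one new frame per step

    def reason(self, evidence, question):
        return sum(evidence)  # placeholder for cross-frame reasoning

    def act(self, thought, evidence):
        if len(evidence) >= 2:  # toy rule: enough evidence, so conclude
            return "answer", thought
        return "explore", None
```

With `ToyModel`, `conan_loop(list(range(6)), "q", ToyModel())` explores frames 0 and 2 over two steps and then concludes; the real system would instead train these three stages jointly via the cold-start strategy and RLVR objective described above.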