ChatPaper.ai

Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence

October 23, 2025
Authors: Kun Ouyang, Yuanxin Liu, Linli Yao, Yishuo Cai, Hao Zhou, Jie Zhou, Fandong Meng, Xu Sun
cs.AI

Abstract
Video reasoning, which requires multi-step deduction across frames, remains a major challenge for multimodal large language models (MLLMs). While reinforcement learning (RL)-based methods enhance reasoning capabilities, they often rely on text-only chains that yield ungrounded or hallucinated conclusions. Conversely, frame-retrieval approaches introduce visual grounding but still struggle with inaccurate evidence localization. To address these challenges, we present Conan, a framework for evidence-grounded multi-step video reasoning. Conan identifies contextual and evidence frames, reasons over cross-frame clues, and adaptively decides when to conclude or explore further. To achieve this, we (1) construct Conan-91K, a large-scale dataset of automatically generated reasoning traces that includes frame identification, evidence reasoning, and action decision, and (2) design a multi-stage progressive cold-start strategy combined with an Identification-Reasoning-Action (AIR) RLVR training framework to jointly enhance multi-step visual reasoning. Extensive experiments on six multi-step reasoning benchmarks demonstrate that Conan surpasses the baseline Qwen2.5-VL-7B-Instruct by an average of over 10% in accuracy, achieving state-of-the-art performance. Furthermore, Conan generalizes effectively to long-video understanding tasks, validating its strong scalability and robustness.
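The abstract describes an iterative loop: identify candidate evidence frames, reason over cross-frame clues, then adaptively decide whether to conclude or keep exploring. A minimal sketch of that control flow is below; the `Step` type, `reason_over_video`, and the toy policy are hypothetical illustrations, not the paper's actual implementation, which would drive this loop with an MLLM's structured outputs.

```python
# Hypothetical sketch of an identify-reason-act loop over video frames.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    evidence: list[int]   # indices of frames judged to carry evidence
    reasoning: str        # cross-frame reasoning over the selected clues
    action: str           # "conclude" or "explore"

def reason_over_video(
    num_frames: int,
    step_fn: Callable[[list[int]], Step],
    max_steps: int = 8,
) -> tuple[str, list[Step]]:
    """Sample frames coarsely, then reason and decide to stop or zoom in."""
    visible = list(range(0, num_frames, max(1, num_frames // 8)))  # coarse pass
    trace: list[Step] = []
    for _ in range(max_steps):
        step = step_fn(visible)
        trace.append(step)
        if step.action == "conclude":
            break
        # "explore": look at frames adjacent to the current evidence frames
        visible = sorted({f + d for f in step.evidence for d in (-1, 0, 1)
                          if 0 <= f + d < num_frames})
    return trace[-1].reasoning, trace

# Toy policy: conclude once a specific target frame is in view, else zoom in.
def toy_policy(visible: list[int]) -> Step:
    if 42 in visible:
        return Step([42], "answer found at frame 42", "conclude")
    nearest = min(visible, key=lambda f: abs(f - 42))
    return Step([nearest], "zooming in", "explore")

answer, trace = reason_over_video(num_frames=128, step_fn=toy_policy)
```

The toy policy stands in for the model's per-step decision; the point is only the shape of the loop (frame identification, reasoning, and an explicit stop/explore action), mirroring the Identification-Reasoning-Action structure the abstract names.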