Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning

June 16, 2025
Authors: Shulin Tian, Ruiqi Wang, Hongming Guo, Penghao Wu, Yuhao Dong, Xiuying Wang, Jingkang Yang, Hao Zhang, Hongyuan Zhu, Ziwei Liu
cs.AI

Abstract

We introduce Ego-R1, a novel framework for reasoning over ultra-long egocentric videos (spanning days to weeks). It leverages a structured Chain-of-Tool-Thought (CoTT) process, orchestrated by an Ego-R1 Agent trained via reinforcement learning (RL). Inspired by human problem-solving strategies, CoTT decomposes complex reasoning into modular steps: at each step the RL agent invokes one specific tool to answer a sub-question, iteratively and collaboratively tackling tasks such as temporal retrieval and multi-modal understanding. We design a two-stage training paradigm, involving supervised fine-tuning (SFT) of a pretrained language model on CoTT data and RL, that enables our agent to dynamically propose tools step by step for long-range reasoning. To facilitate training, we construct a dataset called Ego-R1 Data, which consists of Ego-CoTT-25K for SFT and Ego-QA-4.4K for RL. Furthermore, our Ego-R1 Agent is evaluated on a newly curated week-long video QA benchmark, Ego-R1 Bench, which contains human-verified QA pairs from hybrid sources. Extensive results demonstrate that the dynamic, tool-augmented chain-of-thought reasoning of our Ego-R1 Agent can effectively tackle the unique challenges of understanding ultra-long egocentric videos, significantly extending the time coverage from a few hours to a week.
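
The one-tool-per-step loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `run_cott`, `Step`, `toy_policy`, and the tool names `temporal_retrieval` and `clip_understanding` are hypothetical, the tools return canned strings, and the policy is a hand-written stand-in for the trained agent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical tool signature: takes a query string, returns an observation string.
Tool = Callable[[str], str]


@dataclass
class Step:
    thought: str      # the agent's reasoning for this step
    tool: str         # name of the single tool invoked at this step
    query: str        # argument passed to the tool
    observation: str  # what the tool returned


# Hypothetical policy signature: (question, trace so far) -> (thought, action, argument).
# The action is either a tool name or the special string "answer".
Policy = Callable[[str, List[Step]], Tuple[str, str, str]]


def run_cott(question: str, policy: Policy, tools: Dict[str, Tool],
             max_steps: int = 8) -> Tuple[str, List[Step]]:
    """Run a chain of tool calls, one tool per step, until the policy answers."""
    trace: List[Step] = []
    for _ in range(max_steps):
        thought, action, arg = policy(question, trace)
        if action == "answer":
            return arg, trace
        if action in tools:
            observation = tools[action](arg)
        else:
            observation = f"unknown tool: {action}"
        trace.append(Step(thought, action, arg, observation))
    return "", trace  # step budget exhausted without a final answer


# Toy stand-ins (assumptions for illustration only).
def toy_policy(question: str, trace: List[Step]) -> Tuple[str, str, str]:
    if not trace:
        return ("Find when the event happened.", "temporal_retrieval", question)
    if len(trace) == 1:
        return ("Inspect the retrieved moment.", "clip_understanding",
                trace[-1].observation)
    return ("Enough evidence gathered.", "answer", trace[-1].observation)


toy_tools: Dict[str, Tool] = {
    "temporal_retrieval": lambda q: "day3 14:05",
    "clip_understanding": lambda clip: f"at {clip}: the wearer put the keys in a drawer",
}

answer, trace = run_cott("Where did I leave my keys?", toy_policy, toy_tools)
```

In this sketch each step's observation is appended to the trace that the policy sees on the next step, which is one plausible way to let later sub-questions build on earlier tool results. In the framework described above, the policy would be the SFT- and RL-trained language model rather than a fixed rule.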