Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning
June 16, 2025
Authors: Shulin Tian, Ruiqi Wang, Hongming Guo, Penghao Wu, Yuhao Dong, Xiuying Wang, Jingkang Yang, Hao Zhang, Hongyuan Zhu, Ziwei Liu
cs.AI
Abstract
We introduce Ego-R1, a novel framework for reasoning over ultra-long (i.e., spanning days to weeks) egocentric videos. It leverages a structured Chain-of-Tool-Thought (CoTT) process orchestrated by an Ego-R1 Agent trained via reinforcement learning (RL). Inspired by human problem-solving strategies, CoTT decomposes complex reasoning into modular steps, with the RL agent invoking one specific tool per step to iteratively and collaboratively answer sub-questions for tasks such as temporal retrieval and multi-modal understanding. We design a two-stage training paradigm: supervised fine-tuning (SFT) of a pretrained language model on CoTT data, followed by RL, which enables our agent to dynamically propose step-by-step tool calls for long-range reasoning. To facilitate training, we construct a dataset called Ego-R1 Data, which consists of Ego-CoTT-25K for SFT and Ego-QA-4.4K for RL. Furthermore, our Ego-R1 Agent is evaluated on a newly curated week-long video QA benchmark, Ego-R1 Bench, which contains human-verified QA pairs from hybrid sources. Extensive results demonstrate that the dynamic, tool-augmented chain-of-thought reasoning of our Ego-R1 Agent effectively tackles the unique challenges of understanding ultra-long egocentric videos, significantly extending the time coverage from a few hours to a week.
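
To make the CoTT process concrete, below is a minimal, self-contained sketch of such a tool-invocation loop: the agent proposes one tool call per step, folds the tool's observation back into its context, and stops once it can answer. The tool names (temporal_retrieval, video_qa), the scripted policy stub, and the Step record are illustrative assumptions for this sketch, not the authors' actual interface or the RL-trained Ego-R1 Agent.

```python
# Sketch of a Chain-of-Tool-Thought (CoTT) loop: one tool per step,
# observations are accumulated, and the loop terminates with an answer.
# Tool names and the scripted policy are hypothetical placeholders.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    thought: str            # intermediate reasoning for this step
    tool: str               # tool to invoke ("answer" terminates the loop)
    argument: str           # query passed to the tool, or the final answer
    observation: str = ""   # tool output, filled in after execution


def temporal_retrieval(query: str) -> str:
    """Hypothetical tool: locate the relevant time span in a week-long video."""
    return f"[clip 14:05-14:20 on day 3 matches '{query}']"


def video_qa(query: str) -> str:
    """Hypothetical tool: answer a fine-grained question about a retrieved clip."""
    return f"[for '{query}', the clip shows keys placed on the kitchen counter]"


TOOLS: Dict[str, Callable[[str], str]] = {
    "temporal_retrieval": temporal_retrieval,
    "video_qa": video_qa,
}


def policy(question: str, history: List[Step]) -> Step:
    """Stand-in for the RL-trained agent: decides the next tool call.
    Here it is a fixed two-step script purely for illustration."""
    if not history:
        return Step("First locate when the event happened.", "temporal_retrieval", question)
    if len(history) == 1:
        return Step("Now inspect the retrieved clip.", "video_qa", question)
    return Step("Enough evidence gathered.", "answer", "The keys are on the kitchen counter.")


def cott_loop(question: str, max_steps: int = 8) -> str:
    """Run the chain-of-tool-thought loop until an answer or the step budget."""
    history: List[Step] = []
    for _ in range(max_steps):
        step = policy(question, history)
        if step.tool == "answer":
            return step.argument
        step.observation = TOOLS[step.tool](step.argument)
        history.append(step)
    return "No answer within the step budget."


if __name__ == "__main__":
    print(cott_loop("Where did I leave my keys on Tuesday afternoon?"))
```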