Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning
June 16, 2025
Authors: Shulin Tian, Ruiqi Wang, Hongming Guo, Penghao Wu, Yuhao Dong, Xiuying Wang, Jingkang Yang, Hao Zhang, Hongyuan Zhu, Ziwei Liu
cs.AI
Abstract
We introduce Ego-R1, a novel framework for reasoning over ultra-long (i.e., spanning days to weeks) egocentric videos. It leverages a structured Chain-of-Tool-Thought (CoTT) process orchestrated by an Ego-R1 Agent trained via reinforcement learning (RL). Inspired by human problem-solving strategies, CoTT decomposes complex reasoning into modular steps, with the RL agent invoking one specific tool per step to iteratively and collaboratively answer sub-questions for tasks such as temporal retrieval and multi-modal understanding. We design a two-stage training paradigm: supervised fine-tuning (SFT) of a pretrained language model on CoTT data, followed by RL, which enables our agent to dynamically propose step-by-step tool calls for long-range reasoning. To facilitate training, we construct a dataset called Ego-R1 Data, which consists of Ego-CoTT-25K for SFT and Ego-QA-4.4K for RL. Furthermore, our Ego-R1 Agent is evaluated on a newly curated week-long video QA benchmark, Ego-R1 Bench, which contains human-verified QA pairs from hybrid sources. Extensive results demonstrate that the dynamic, tool-augmented chain-of-thought reasoning of our Ego-R1 Agent effectively tackles the unique challenges of understanding ultra-long egocentric videos, significantly extending the time coverage from a few hours to a week.
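
To make the CoTT process concrete, below is a minimal, self-contained sketch of such a tool-invocation loop: the agent proposes one tool call per step, folds the tool's observation back into its context, and stops once it can answer. The tool names (temporal_retrieval, video_qa), the scripted policy stub, and the Step record are illustrative assumptions for this sketch, not the authors' actual interface or the RL-trained Ego-R1 Agent.

```python
# Sketch of a Chain-of-Tool-Thought (CoTT) loop: one tool per step,
# observations are accumulated, and the loop terminates with an answer.
# Tool names and the scripted policy are hypothetical placeholders.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    thought: str            # intermediate reasoning for this step
    tool: str               # tool to invoke ("answer" terminates the loop)
    argument: str           # query passed to the tool, or the final answer
    observation: str = ""   # tool output, filled in after execution


def temporal_retrieval(query: str) -> str:
    """Hypothetical tool: locate the relevant time span in a week-long video."""
    return f"[clip 14:05-14:20 on day 3 matches '{query}']"


def video_qa(query: str) -> str:
    """Hypothetical tool: answer a fine-grained question about a retrieved clip."""
    return f"[for '{query}', the clip shows keys placed on the kitchen counter]"


TOOLS: Dict[str, Callable[[str], str]] = {
    "temporal_retrieval": temporal_retrieval,
    "video_qa": video_qa,
}


def policy(question: str, history: List[Step]) -> Step:
    """Stand-in for the RL-trained agent: decides the next tool call.
    Here it is a fixed two-step script purely for illustration."""
    if not history:
        return Step("First locate when the event happened.", "temporal_retrieval", question)
    if len(history) == 1:
        return Step("Now inspect the retrieved clip.", "video_qa", question)
    return Step("Enough evidence gathered.", "answer", "The keys are on the kitchen counter.")


def cott_loop(question: str, max_steps: int = 8) -> str:
    """Run the chain-of-tool-thought loop until an answer or the step budget."""
    history: List[Step] = []
    for _ in range(max_steps):
        step = policy(question, history)
        if step.tool == "answer":
            return step.argument
        step.observation = TOOLS[step.tool](step.argument)
        history.append(step)
    return "No answer within the step budget."


if __name__ == "__main__":
    print(cott_loop("Where did I leave my keys on Tuesday afternoon?"))
```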