Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning
June 16, 2025
Authors: Shulin Tian, Ruiqi Wang, Hongming Guo, Penghao Wu, Yuhao Dong, Xiuying Wang, Jingkang Yang, Hao Zhang, Hongyuan Zhu, Ziwei Liu
cs.AI
Abstract
We introduce Ego-R1, a novel framework for reasoning over ultra-long (i.e.,
in days and weeks) egocentric videos, which leverages a structured
Chain-of-Tool-Thought (CoTT) process, orchestrated by an Ego-R1 Agent trained
via reinforcement learning (RL). Inspired by human problem-solving strategies,
CoTT decomposes complex reasoning into modular steps, with the RL agent
invoking specific tools, one per step, to iteratively and collaboratively
answer sub-questions tackling tasks such as temporal retrieval and multi-modal
understanding. We design a two-stage training paradigm: supervised finetuning
(SFT) of a pretrained language model on CoTT data, followed by RL, enabling our
agent to dynamically select tools step by step for long-range
reasoning. To facilitate training, we construct a dataset called Ego-R1 Data,
which consists of Ego-CoTT-25K for SFT and Ego-QA-4.4K for RL. Furthermore, our
Ego-R1 Agent is evaluated on a newly curated week-long video QA benchmark,
Ego-R1 Bench, which contains human-verified QA pairs from hybrid sources.
Extensive results demonstrate that the dynamic, tool-augmented chain-of-thought
reasoning by our Ego-R1 Agent can effectively tackle the unique challenges of
understanding ultra-long egocentric videos, significantly extending the time
coverage from a few hours to a week.
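
As a rough illustration of the CoTT loop described above, the Python sketch below shows an agent that, at each step, either invokes a single tool (here, temporal retrieval or clip-level video QA) or emits a final answer, with each observation feeding the next reasoning step. The tool names, the `Step` dataclass, and the `cott_answer`/`toy_policy` functions are illustrative assumptions for exposition only, not the paper's actual interface.

```python
# Minimal sketch of a Chain-of-Tool-Thought (CoTT) loop under assumed names;
# the real Ego-R1 Agent is an RL-trained policy operating on week-long video.
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    thought: str                  # the agent's reasoning at this step
    tool: Optional[str] = None    # tool to invoke; None means "answer now"
    tool_input: str = ""          # sub-question passed to the tool
    observation: str = ""         # evidence returned by the tool


def temporal_retrieval(query: str) -> str:
    """Stub: locate candidate time spans in the long recording."""
    return f"[retrieved clips for: {query}]"


def video_qa(query: str) -> str:
    """Stub: answer a fine-grained question over a short retrieved clip."""
    return f"[clip-level answer for: {query}]"


TOOLS: Dict[str, Callable[[str], str]] = {
    "temporal_retrieval": temporal_retrieval,
    "video_qa": video_qa,
}


def cott_answer(question: str,
                policy: Callable[[str, List[Step]], Step],
                max_steps: int = 8) -> str:
    """Run the thought -> tool call -> observation loop until the policy answers."""
    trace: List[Step] = []
    for _ in range(max_steps):
        step = policy(question, trace)     # at most one tool is proposed per step
        if step.tool is None:              # no tool call: treat the thought as the answer
            return step.thought
        tool_fn = TOOLS.get(step.tool)
        step.observation = tool_fn(step.tool_input) if tool_fn else "unknown tool"
        trace.append(step)                 # the observation conditions the next step
    return "no answer within the step budget"


def toy_policy(question: str, trace: List[Step]) -> Step:
    """Hypothetical hand-written policy standing in for the RL-trained agent."""
    if not trace:
        return Step(thought="First localize the relevant moment.",
                    tool="temporal_retrieval", tool_input=question)
    if len(trace) == 1:
        return Step(thought="Now inspect the retrieved clip.",
                    tool="video_qa", tool_input=question)
    return Step(thought="Final answer based on the gathered evidence.")


if __name__ == "__main__":
    print(cott_answer("When did I last water the plants?", toy_policy))
```

In the paper, the policy role is played by the SFT- then RL-trained Ego-R1 Agent and the tools operate over the actual egocentric recording; both are replaced by stubs here purely to show the step-by-step control flow.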