Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning
June 16, 2025
Authors: Shulin Tian, Ruiqi Wang, Hongming Guo, Penghao Wu, Yuhao Dong, Xiuying Wang, Jingkang Yang, Hao Zhang, Hongyuan Zhu, Ziwei Liu
cs.AI
Abstract
We introduce Ego-R1, a novel framework for reasoning over ultra-long (i.e.,
in days and weeks) egocentric videos, which leverages a structured
Chain-of-Tool-Thought (CoTT) process, orchestrated by an Ego-R1 Agent trained
via reinforcement learning (RL). Inspired by human problem-solving strategies,
CoTT decomposes complex reasoning into modular steps, with the RL agent
invoking specific tools, one per step, to iteratively and collaboratively
answer sub-questions tackling tasks such as temporal retrieval and multi-modal
understanding. We design a two-stage training paradigm: supervised finetuning
(SFT) of a pretrained language model on CoTT data, followed by RL, enabling our
agent to dynamically select tools step by step for long-range
reasoning. To facilitate training, we construct a dataset called Ego-R1 Data,
which consists of Ego-CoTT-25K for SFT and Ego-QA-4.4K for RL. Furthermore, our
Ego-R1 Agent is evaluated on a newly curated week-long video QA benchmark,
Ego-R1 Bench, which contains human-verified QA pairs from hybrid sources.
Extensive results demonstrate that the dynamic, tool-augmented chain-of-thought
reasoning by our Ego-R1 Agent can effectively tackle the unique challenges of
understanding ultra-long egocentric videos, significantly extending the time
coverage from a few hours to a week.
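
As a rough illustration of the CoTT loop described above, the Python sketch below shows an agent that, at each step, either invokes a single tool (here, temporal retrieval or clip-level video QA) or emits a final answer, with each observation feeding the next reasoning step. The tool names, the `Step` dataclass, and the `cott_answer`/`toy_policy` functions are illustrative assumptions for exposition only, not the paper's actual interface.

```python
# Minimal sketch of a Chain-of-Tool-Thought (CoTT) loop under assumed names;
# the real Ego-R1 Agent is an RL-trained policy operating on week-long video.
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    thought: str                  # the agent's reasoning at this step
    tool: Optional[str] = None    # tool to invoke; None means "answer now"
    tool_input: str = ""          # sub-question passed to the tool
    observation: str = ""         # evidence returned by the tool


def temporal_retrieval(query: str) -> str:
    """Stub: locate candidate time spans in the long recording."""
    return f"[retrieved clips for: {query}]"


def video_qa(query: str) -> str:
    """Stub: answer a fine-grained question over a short retrieved clip."""
    return f"[clip-level answer for: {query}]"


TOOLS: Dict[str, Callable[[str], str]] = {
    "temporal_retrieval": temporal_retrieval,
    "video_qa": video_qa,
}


def cott_answer(question: str,
                policy: Callable[[str, List[Step]], Step],
                max_steps: int = 8) -> str:
    """Run the thought -> tool call -> observation loop until the policy answers."""
    trace: List[Step] = []
    for _ in range(max_steps):
        step = policy(question, trace)     # at most one tool is proposed per step
        if step.tool is None:              # no tool call: treat the thought as the answer
            return step.thought
        tool_fn = TOOLS.get(step.tool)
        step.observation = tool_fn(step.tool_input) if tool_fn else "unknown tool"
        trace.append(step)                 # the observation conditions the next step
    return "no answer within the step budget"


def toy_policy(question: str, trace: List[Step]) -> Step:
    """Hypothetical hand-written policy standing in for the RL-trained agent."""
    if not trace:
        return Step(thought="First localize the relevant moment.",
                    tool="temporal_retrieval", tool_input=question)
    if len(trace) == 1:
        return Step(thought="Now inspect the retrieved clip.",
                    tool="video_qa", tool_input=question)
    return Step(thought="Final answer based on the gathered evidence.")


if __name__ == "__main__":
    print(cott_answer("When did I last water the plants?", toy_policy))
```

In the paper, the policy role is played by the SFT- then RL-trained Ego-R1 Agent and the tools operate over the actual egocentric recording; both are replaced by stubs here purely to show the step-by-step control flow.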