
EgoLCD: Egocentric Video Generation with Long Context Diffusion

December 4, 2025
Authors: Liuzhou Zhang, Jiarui Ye, Yuanlei Wang, Ming Zhong, Mingju Cao, Wanke Xia, Bowen Zeng, Zeyu Zhang, Hao Tang
cs.AI

Abstract

Generating long, coherent egocentric videos is difficult, as hand-object interactions and procedural tasks require reliable long-term memory. Existing autoregressive models suffer from content drift, where object identity and scene semantics degrade over time. To address this challenge, we introduce EgoLCD, an end-to-end framework for egocentric long-context video generation that treats long video synthesis as a problem of efficient and stable memory management. EgoLCD combines a Long-Term Sparse KV Cache for stable global context with an attention-based short-term memory, extended by LoRA for local adaptation. A Memory Regulation Loss enforces consistent memory usage, and Structured Narrative Prompting provides explicit temporal guidance. Extensive experiments on the EgoVid-5M benchmark demonstrate that EgoLCD achieves state-of-the-art performance in both perceptual quality and temporal consistency, effectively mitigating generative forgetting and representing a significant step toward building scalable world models for embodied AI. Code: https://github.com/AIGeeksGroup/EgoLCD. Website: https://aigeeksgroup.github.io/EgoLCD.
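The abstract names the memory components but not their mechanics. The sketch below is a minimal, hypothetical illustration (not the authors' code) of the general idea of pairing a long-term sparse KV cache with a dense short-term attention window during autoregressive generation; the class and function names, the budget sizes, and the key-norm saliency heuristic are all assumptions for illustration only.

```python
# Hypothetical sketch: a two-tier KV memory for cross-frame attention.
# Long-term memory is kept sparse under a fixed budget; short-term memory
# is a dense sliding window over recent tokens. None of this is EgoLCD's
# actual implementation; it only illustrates the memory-management framing.
import torch
import torch.nn.functional as F


class TwoTierKVCache:
    """Sparse long-term KV pairs plus a dense window of recent KV pairs."""

    def __init__(self, long_budget: int = 64, short_window: int = 256):
        self.long_budget = long_budget      # max tokens kept in long-term memory
        self.short_window = short_window    # recent tokens kept verbatim
        self.long_k = self.long_v = None    # sparse global context
        self.short_k = self.short_v = None  # dense local context

    def update(self, k: torch.Tensor, v: torch.Tensor) -> None:
        # k, v: (tokens, dim) for the newly generated frame/chunk.
        self.short_k = k if self.short_k is None else torch.cat([self.short_k, k])
        self.short_v = v if self.short_v is None else torch.cat([self.short_v, v])
        # Tokens falling out of the short window become long-term candidates.
        overflow = self.short_k.shape[0] - self.short_window
        if overflow > 0:
            old_k, self.short_k = self.short_k[:overflow], self.short_k[overflow:]
            old_v, self.short_v = self.short_v[:overflow], self.short_v[overflow:]
            self.long_k = old_k if self.long_k is None else torch.cat([self.long_k, old_k])
            self.long_v = old_v if self.long_v is None else torch.cat([self.long_v, old_v])
            # Sparsify: keep the highest-norm keys (a stand-in for any saliency score).
            if self.long_k.shape[0] > self.long_budget:
                keep = self.long_k.norm(dim=-1).topk(self.long_budget).indices.sort().values
                self.long_k, self.long_v = self.long_k[keep], self.long_v[keep]

    def memory(self):
        ks = [t for t in (self.long_k, self.short_k) if t is not None]
        vs = [t for t in (self.long_v, self.short_v) if t is not None]
        return torch.cat(ks), torch.cat(vs)


def attend_with_memory(q: torch.Tensor, cache: TwoTierKVCache) -> torch.Tensor:
    """Attend current-frame queries over the combined long + short memory."""
    k, v = cache.memory()
    attn = F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v


if __name__ == "__main__":
    torch.manual_seed(0)
    cache = TwoTierKVCache(long_budget=32, short_window=64)
    for _ in range(10):                 # simulate 10 autoregressive chunks
        cache.update(torch.randn(48, 128), torch.randn(48, 128))
    out = attend_with_memory(torch.randn(16, 128), cache)
    print(out.shape)                    # torch.Size([16, 128])
```

Under these assumptions, the long-term cache stays bounded regardless of video length while the short window preserves fine-grained recent detail, which is the trade-off the abstract frames as "efficient and stable memory management".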