CurveStream: Boosting Streaming Video Understanding in MLLMs via Curvature-Aware Hierarchical Visual Memory Management
March 20, 2026
Authors: Chao Wang, Xudong Tan, Jianjian Cao, Kangcong Li, Tao Chen
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have achieved significant success in offline video understanding, yet their application to streaming video is severely limited by the linear growth of visual tokens, which often leads to out-of-memory (OOM) errors or catastrophic forgetting. Existing visual retention and memory management methods typically rely on uniform sampling, low-level physical metrics, or passive cache eviction. These strategies often lack intrinsic semantic awareness, so they can disrupt contextual coherence and blur transient yet critical semantic transitions. To address these limitations, we propose CurveStream, a training-free, curvature-aware hierarchical visual memory management framework. Our approach is motivated by the key observation that high-curvature regions along continuous feature trajectories closely align with critical global semantic transitions. Based on this geometric insight, CurveStream evaluates real-time semantic intensity via a Curvature Score and integrates an online K-Sigma dynamic threshold to adaptively route frames into clear and fuzzy memory states under a strict token budget. Evaluations across diverse temporal scales show that this lightweight framework consistently yields absolute performance gains of more than 10% over the respective baselines (10.69% on StreamingBench and 13.58% on OVOBench), establishing new state-of-the-art results for streaming video perception. The code will be released at https://github.com/streamingvideos/CurveStream.
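The abstract does not give the exact definitions of the Curvature Score or the K-Sigma threshold, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes curvature can be approximated by the turning angle between consecutive displacements of per-frame feature vectors, and that the K-Sigma threshold is the running mean plus `k` standard deviations of past scores. The names `curvature_score`, `KSigmaRouter`, and `route_stream`, along with the `k` and `warmup` parameters, are hypothetical. The token budget and the actual clear/fuzzy memory representations are not modeled.

```python
import math
import numpy as np


def curvature_score(prev, curr, nxt, eps=1e-8):
    """Assumed proxy: turning angle (radians) between consecutive feature displacements."""
    v1 = curr - prev
    v2 = nxt - curr
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    if denom < eps:
        # A stationary trajectory has no defined direction; treat it as flat.
        return 0.0
    cos = float(np.dot(v1, v2) / denom)
    return math.acos(max(-1.0, min(1.0, cos)))


class KSigmaRouter:
    """Assumed rule: route to 'clear' when score > running mean + k * running std."""

    def __init__(self, k=1.0, warmup=3):
        self.k = k
        self.warmup = warmup
        self.n = 0        # number of scores seen so far
        self.mean = 0.0   # running mean (Welford's algorithm)
        self.m2 = 0.0     # running sum of squared deviations

    def route(self, score):
        # Decide using statistics of past scores only, then update them.
        if self.n < self.warmup:
            decision = "fuzzy"
        else:
            std = math.sqrt(self.m2 / self.n)
            decision = "clear" if score > self.mean + self.k * std else "fuzzy"
        self.n += 1
        delta = score - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (score - self.mean)
        return decision


def route_stream(features, k=1.0, warmup=3):
    """Return (frame_index, score, decision) for every interior frame."""
    router = KSigmaRouter(k=k, warmup=warmup)
    decisions = []
    for i in range(1, len(features) - 1):
        s = curvature_score(features[i - 1], features[i], features[i + 1])
        decisions.append((i, s, router.route(s)))
    return decisions
```

Because the turning angle at frame `i` needs frame `i + 1`, this sketch makes each decision with a one-frame delay. A strictly causal variant would have to score the previous frame when a new one arrives.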