CurveStream: Boosting Streaming Video Understanding in MLLMs via Curvature-Aware Hierarchical Visual Memory Management

Abstract

Multimodal Large Language Models (MLLMs) have achieved significant success in offline video understanding, yet their application to streaming video is severely limited by the linear growth of visual tokens, which often leads to Out-of-Memory (OOM) errors or catastrophic forgetting. Existing visual retention and memory management methods typically rely on uniform sampling, low-level physical metrics, or passive cache eviction. These strategies lack intrinsic semantic awareness, so they can disrupt contextual coherence and blur transient yet critical semantic transitions. To address these limitations, we propose CurveStream, a training-free, curvature-aware hierarchical visual memory management framework. Our approach is motivated by the key observation that high-curvature regions along continuous feature trajectories closely align with critical global semantic transitions. Based on this geometric insight, CurveStream evaluates real-time semantic intensity via a Curvature Score and integrates an online K-Sigma dynamic threshold to adaptively route frames into clear and fuzzy memory states under a strict token budget. Evaluations across diverse temporal scales confirm that this lightweight framework consistently yields absolute performance gains of over 10% over the respective baselines (e.g., 10.69% on StreamingBench and 13.58% on OVOBench), establishing new state-of-the-art results for streaming video perception. The code will be released at https://github.com/streamingvideos/CurveStream.
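The abstract names two components, a Curvature Score and an online K-Sigma dynamic threshold, without giving their formulas. The following is only a minimal sketch of how such routing could look, not the authors' implementation. It assumes the curvature score is the turning angle between consecutive displacement vectors of per-frame features, that the threshold is the running mean plus k running standard deviations of past scores, and that frames seen during a short warm-up period default to fuzzy memory. The names `curvature_score`, `KSigmaRouter`, `route_stream`, and the parameters `k` and `warmup` are hypothetical, and the token budget is not modeled.

```python
import math


def curvature_score(prev, curr, nxt):
    """Turning angle in radians between two successive feature displacements.

    Assumption: this discrete turning angle stands in for the paper's
    Curvature Score, whose exact definition the abstract does not give.
    """
    d1 = [c - p for p, c in zip(prev, curr)]
    d2 = [n - c for c, n in zip(curr, nxt)]
    n1 = math.sqrt(sum(x * x for x in d1))
    n2 = math.sqrt(sum(x * x for x in d2))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0  # no movement on one side: treat the trajectory as flat
    cos = sum(a * b for a, b in zip(d1, d2)) / (n1 * n2)
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp against rounding error


class KSigmaRouter:
    """Routes scores with a threshold of running mean + k * running std.

    Running statistics use Welford's online update, so no history is stored.
    """

    def __init__(self, k=2.0, warmup=3):
        self.k = k            # hypothetical default, not taken from the paper
        self.warmup = warmup  # scores to observe before any "clear" decision
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0         # sum of squared deviations from the mean

    def route(self, score):
        # Decide using statistics of past scores only, then fold in this one.
        label = "fuzzy"
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / self.n)
            if score > self.mean + self.k * std:
                label = "clear"
        self.n += 1
        delta = score - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (score - self.mean)
        return label


def route_stream(features, k=2.0, warmup=3):
    """Label each interior frame of a feature sequence as 'clear' or 'fuzzy'."""
    router = KSigmaRouter(k=k, warmup=warmup)
    labels = []
    for i in range(1, len(features) - 1):
        score = curvature_score(features[i - 1], features[i], features[i + 1])
        labels.append(router.route(score))
    return labels
```

In this sketch, a feature trajectory that moves in a straight line and then turns sharply produces a single high score at the turn, and only that frame is routed to clear memory. A real system would additionally need the eviction or compression policy that enforces the token budget, which the abstract mentions but does not describe.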
