

OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams

March 12, 2026
作者: Yibin Yan, Jilan Xu, Shangzhe Di, Haoning Wu, Weidi Xie
cs.AI

Abstract

Modern visual agents require representations that are general, causal, and physically structured to operate in real-time streaming environments. However, current vision foundation models remain fragmented, specializing narrowly in image semantic perception, offline temporal modeling, or spatial geometry. This paper introduces OmniStream, a unified streaming visual backbone that effectively perceives, reconstructs, and acts from diverse visual inputs. By incorporating causal spatiotemporal attention and 3D rotary positional embeddings (3D-RoPE), our model supports efficient, frame-by-frame online processing of video streams via a persistent KV-cache. We pre-train OmniStream using a synergistic multi-task framework coupling static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment on 29 datasets. Extensive evaluations show that, even with a strictly frozen backbone, OmniStream achieves consistently competitive performance with specialized experts across image and video probing, streaming geometric reconstruction, complex video and spatial reasoning, as well as robotic manipulation (unseen at training). Rather than pursuing benchmark-specific dominance, our work demonstrates the viability of training a single, versatile vision backbone that generalizes across semantic, spatial, and temporal reasoning, i.e., a more meaningful step toward general-purpose visual understanding for interactive and embodied agents.
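The abstract describes frame-by-frame online processing via causal spatiotemporal attention, a persistent KV-cache, and 3D rotary positional embeddings. A minimal sketch of how such a streaming step could look is below; all names, shapes, and the simplified single-head formulation are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def rope_3d(x, t, h, w, base=10000.0):
    """Simplified 3D rotary embedding: rotate feature pairs by angles
    derived from the (t, h, w) token coordinates. Each spatial/temporal
    axis rotates one third of the dimensions. d must be divisible by 6."""
    d = x.shape[0]
    assert d % 6 == 0
    out = x.copy()
    third = d // 3
    for axis_pos, start in zip((t, h, w), (0, third, 2 * third)):
        for i in range(0, third, 2):
            theta = axis_pos * base ** (-i / third)
            c, s = np.cos(theta), np.sin(theta)
            a, b = x[start + i], x[start + i + 1]
            out[start + i] = a * c - b * s
            out[start + i + 1] = a * s + b * c
    return out

class StreamingAttention:
    """One attention head that ingests video frames one at a time.
    Keys/values of past frames persist in a cache, so each new frame
    attends causally over everything seen so far without recomputing
    earlier frames (frame-level causality)."""
    def __init__(self, dim):
        self.dim = dim
        self.k_cache = []  # persistent KV-cache across frames
        self.v_cache = []

    def step(self, frame_tokens, t, grid_hw):
        """frame_tokens: (H*W, dim) token features for frame index t."""
        H, W = grid_hw
        qs = []
        for idx, tok in enumerate(frame_tokens):
            h, w = divmod(idx, W)
            # position-encode queries and keys with (t, h, w) coordinates
            qs.append(rope_3d(tok, t, h, w))
            self.k_cache.append(rope_3d(tok, t, h, w))
            self.v_cache.append(tok)
        K = np.stack(self.k_cache)  # (N_seen_so_far, dim)
        V = np.stack(self.v_cache)
        outs = []
        for q in qs:
            scores = K @ q / np.sqrt(self.dim)
            weights = np.exp(scores - scores.max())
            weights /= weights.sum()
            outs.append(weights @ V)
        return np.stack(outs)
```

Because the cache only grows and is never re-encoded, per-frame cost depends on the number of cached tokens rather than on re-running attention over the full clip, which is the property that makes online streaming inference feasible.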
PDF · March 15, 2026