OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams
March 12, 2026
Authors: Yibin Yan, Jilan Xu, Shangzhe Di, Haoning Wu, Weidi Xie
cs.AI
Abstract
Modern visual agents require representations that are general, causal, and physically structured to operate in real-time streaming environments. However, current vision foundation models remain fragmented, each specializing narrowly in image semantic perception, offline temporal modeling, or spatial geometry. This paper introduces OmniStream, a unified streaming visual backbone that effectively perceives, reconstructs, and acts from diverse visual inputs. By combining causal spatiotemporal attention with 3D rotary positional embeddings (3D-RoPE), our model supports efficient, frame-by-frame online processing of video streams via a persistent KV-cache. We pre-train OmniStream on 29 datasets with a synergistic multi-task framework that couples static and temporal representation learning, streaming geometric reconstruction, and vision-language alignment. Extensive evaluations show that, even with a strictly frozen backbone, OmniStream remains consistently competitive with specialized expert models across image and video probing, streaming geometric reconstruction, complex video and spatial reasoning, and robotic manipulation (unseen during training). Rather than pursuing benchmark-specific dominance, our work demonstrates the viability of training a single, versatile vision backbone that generalizes across semantic, spatial, and temporal reasoning, marking a more meaningful step toward general-purpose visual understanding for interactive, embodied agents.
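The combination the abstract describes, causal spatiotemporal attention with 3D-RoPE and a persistent KV-cache for frame-by-frame processing, can be made concrete with a small sketch. The following PyTorch snippet is a minimal, hypothetical illustration, not the authors' implementation: the function names (apply_3d_rope, streaming_attention_step), the equal three-way split of the head dimension across the time/height/width axes, and the unbounded per-frame caching scheme are all assumptions made for illustration.

```python
# Minimal sketch (hypothetical, not OmniStream's actual code): a causal,
# frame-by-frame attention step with a persistent KV-cache and a simple
# 3D rotary positional embedding over (time, height, width).
import torch
import torch.nn.functional as F


def rope_angles(positions, dim, base=10000.0):
    """Rotary angles for one positional axis; positions is (N,), dim is even."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return positions.float()[:, None] * inv_freq[None, :]            # (N, dim/2)


def apply_3d_rope(x, t_pos, h_pos, w_pos):
    """Split the head dim into three equal chunks and rotate each chunk by its
    own axis (time, height, width). x: (N, num_heads, head_dim)."""
    out = []
    for chunk, pos in zip(torch.chunk(x, 3, dim=-1), (t_pos, h_pos, w_pos)):
        ang = rope_angles(pos, chunk.shape[-1])                       # (N, d/2)
        cos, sin = ang.cos()[:, None, :], ang.sin()[:, None, :]
        x1, x2 = chunk[..., 0::2], chunk[..., 1::2]
        rot = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
        out.append(rot.flatten(-2))
    return torch.cat(out, dim=-1)


def streaming_attention_step(q, k, v, cache):
    """One online step: attend from the current frame's tokens to all cached
    (past) tokens plus the current frame, then grow the persistent cache."""
    cache["k"] = k if cache["k"] is None else torch.cat([cache["k"], k], dim=0)
    cache["v"] = v if cache["v"] is None else torch.cat([cache["v"], v], dim=0)
    out = F.scaled_dot_product_attention(        # (heads, q_len, d) attends to
        q.transpose(0, 1),                       # (heads, kv_len, d); causality
        cache["k"].transpose(0, 1),              # across frames is enforced by
        cache["v"].transpose(0, 1),              # the cache holding only past
    )                                            # and current tokens
    return out.transpose(0, 1), cache


if __name__ == "__main__":
    heads, head_dim, grid = 8, 48, 4              # 4x4 patch tokens per frame
    cache = {"k": None, "v": None}
    hh, ww = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
    h_pos, w_pos = hh.flatten(), ww.flatten()
    for t in range(3):                            # three streamed frames
        tokens = torch.randn(grid * grid, heads, head_dim)
        t_pos = torch.full((grid * grid,), t)
        q = apply_3d_rope(tokens, t_pos, h_pos, w_pos)
        k = apply_3d_rope(tokens, t_pos, h_pos, w_pos)
        out, cache = streaming_attention_step(q, k, tokens, cache)
        print(t, out.shape, cache["k"].shape)     # cache grows with each frame
```

In a real streaming backbone the cache would likely be bounded or compressed; this sketch simply appends every frame's keys and values to illustrate the causal, frame-by-frame flow the abstract describes.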