
# StreamingClaw Technical Report

March 23, 2026
Authors: Jiawei Chen, Zhe Chen, Chaoqun Du, Maokui He, Wei He, Hengtao Li, Qizhen Li, Zide Liu, Hao Ma, Xuhao Pan, Chang Ren, Xudong Rao, Xintian Shen, Chenfeng Wang, Tao Wei, Chengjun Yu, Pengfei Yu, Shengyu Yao, Chunpeng Zhou, Kun Zhan, Lihao Zheng, Pan Zhou, Xuhan Zhu, Yufei Zheng
cs.AI

Abstract

Applications such as embodied intelligence rely on a real-time perception-decision-action closed loop, placing stringent demands on streaming video understanding. However, current agents suffer from fragmented capabilities: they support only offline video understanding, lack long-term multimodal memory mechanisms, or struggle to achieve real-time reasoning and proactive interaction under streaming inputs. These shortcomings have become a key bottleneck preventing agents from sustaining perception, making real-time decisions, and executing actions in real-world environments. To address these issues, we propose StreamingClaw, a unified agent framework for streaming video understanding and embodied intelligence. It is also an OpenClaw-compatible framework that supports real-time, multimodal streaming interaction. StreamingClaw integrates five core capabilities: (1) real-time streaming reasoning; (2) reasoning about future events and proactive interaction under online-evolving interaction objectives; (3) multimodal long-term storage, hierarchical evolution, and efficient retrieval of memory shared across multiple agents; (4) a closed perception-decision-action loop that, beyond conventional tools and skills, provides streaming tools and action-centric skills tailored for real-world physical environments; (5) compatibility with the OpenClaw framework, allowing it to fully leverage the resources and support of the open-source community. With these designs, StreamingClaw unifies online real-time reasoning, multimodal long-term memory, and proactive interaction in a single framework, and, by translating decisions into executable actions, it enables direct control of the physical world, supporting practical deployment of embodied interaction.
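The perception-decision-action closed loop described above can be illustrated with a minimal sketch. Note this is purely illustrative: the `StreamingAgent` class, its methods, and the toy event-driven policy are hypothetical stand-ins, not the StreamingClaw API, which is not detailed in this abstract.

```python
from collections import deque

class StreamingAgent:
    """Toy sketch of a perception-decision-action loop with a
    bounded memory store. All names here are illustrative, not
    part of the actual StreamingClaw framework."""

    def __init__(self, memory_size=100):
        # Bounded store standing in for a long-term memory module.
        self.memory = deque(maxlen=memory_size)

    def perceive(self, frame):
        # Record each incoming observation for later retrieval.
        self.memory.append(frame)
        return frame

    def decide(self, observation):
        # Trivial stand-in policy: react only when an event appears.
        return "alert" if observation.get("event") else "wait"

    def act(self, decision):
        # Decisions are translated into executable actions.
        return {"action": decision}

    def run(self, stream):
        # Closed loop: perceive -> decide -> act, once per frame.
        return [self.act(self.decide(self.perceive(f))) for f in stream]

agent = StreamingAgent()
frames = [{"event": None}, {"event": "door_open"}, {"event": None}]
results = agent.run(frames)
```

In a real streaming setting the loop would consume frames as they arrive rather than a finished list, and the decision step would be driven by a multimodal model and memory retrieval rather than a single-field check.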