
# StreamingClaw Technical Report

March 23, 2026
Authors: Jiawei Chen, Zhe Chen, Chaoqun Du, Maokui He, Wei He, Hengtao Li, Qizhen Li, Zide Liu, Hao Ma, Xuhao Pan, Chang Ren, Xudong Rao, Xintian Shen, Chenfeng Wang, Tao Wei, Chengjun Yu, Pengfei Yu, Shengyu Yao, Chunpeng Zhou, Kun Zhan, Lihao Zheng, Pan Zhou, Xuhan Zhu, Yufei Zheng
cs.AI

Abstract

Applications such as embodied intelligence rely on a real-time perception-decision-action closed loop, placing stringent demands on streaming video understanding. However, current agents suffer from fragmented capabilities: they support only offline video understanding, lack long-term multimodal memory mechanisms, or struggle to achieve real-time reasoning and proactive interaction under streaming inputs. These shortcomings have become a key bottleneck preventing agents from sustaining perception, making real-time decisions, and executing actions in real-world environments. To address these issues, we propose StreamingClaw, a unified agent framework for streaming video understanding and embodied intelligence. It is also an OpenClaw-compatible framework that supports real-time, multimodal streaming interaction. StreamingClaw integrates five core capabilities: (1) real-time streaming reasoning; (2) reasoning about future events and proactive interaction under online-evolving interaction objectives; (3) multimodal long-term storage, hierarchical evolution, and efficient retrieval of memory shared across multiple agents; (4) a closed perception-decision-action loop that, beyond conventional tools and skills, provides streaming tools and action-centric skills tailored for real-world physical environments; (5) compatibility with the OpenClaw framework, allowing it to fully leverage the resources and support of the open-source community. With these designs, StreamingClaw unifies online real-time reasoning, multimodal long-term memory, and proactive interaction in a single framework, and, by translating decisions into executable actions, it enables direct control of the physical world, supporting practical deployment of embodied interaction.
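The perception-decision-action closed loop described above can be illustrated with a minimal sketch. Note this is purely illustrative: the `StreamingAgent` class, its methods, and the toy event-driven policy are hypothetical stand-ins, not the StreamingClaw API, which is not detailed in this abstract.

```python
from collections import deque

class StreamingAgent:
    """Toy sketch of a perception-decision-action loop with a
    bounded memory store. All names here are illustrative, not
    part of the actual StreamingClaw framework."""

    def __init__(self, memory_size=100):
        # Bounded store standing in for a long-term memory module.
        self.memory = deque(maxlen=memory_size)

    def perceive(self, frame):
        # Record each incoming observation for later retrieval.
        self.memory.append(frame)
        return frame

    def decide(self, observation):
        # Trivial stand-in policy: react only when an event appears.
        return "alert" if observation.get("event") else "wait"

    def act(self, decision):
        # Decisions are translated into executable actions.
        return {"action": decision}

    def run(self, stream):
        # Closed loop: perceive -> decide -> act, once per frame.
        return [self.act(self.decide(self.perceive(f))) for f in stream]

agent = StreamingAgent()
frames = [{"event": None}, {"event": "door_open"}, {"event": None}]
results = agent.run(frames)
```

In a real streaming setting the loop would consume frames as they arrive rather than a finished list, and the decision step would be driven by a multimodal model and memory retrieval rather than a single-field check.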