StreamingClaw Technical Report

Abstract

Applications such as embodied intelligence rely on a real-time perception-decision-action closed loop, which poses stringent challenges for streaming video understanding. However, current agents suffer from fragmented capabilities: some support only offline video understanding, some lack long-term multimodal memory mechanisms, and others struggle to achieve real-time reasoning and proactive interaction under streaming inputs. These shortcomings have become a key bottleneck that prevents agents from sustaining perception, making real-time decisions, and executing actions in real-world environments. To alleviate these issues, we propose StreamingClaw, a unified agent framework for streaming video understanding and embodied intelligence. It is also an OpenClaw-compatible framework that supports real-time, multimodal streaming interaction. StreamingClaw integrates five core capabilities: (1) It supports real-time streaming reasoning. (2) It supports reasoning about future events and proactive interaction while interaction objectives evolve online. (3) It supports long-term storage of multimodal memory, hierarchical evolution of that memory, and efficient retrieval of memory shared across multiple agents. (4) It supports a perception-decision-action closed loop. In addition to conventional tools and skills, it provides streaming tools and action-centric skills tailored for real-world physical environments. (5) It is compatible with the OpenClaw framework, allowing it to fully leverage the resources and support of the open-source community. With these designs, StreamingClaw integrates online real-time reasoning, multimodal long-term memory, and proactive interaction within a unified framework. Moreover, by translating decisions into executable actions, it enables direct control of the physical world, supporting practical deployment of embodied interaction.
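
To make the perception-decision-action closed loop concrete, the following is a minimal toy sketch in Python. It is not StreamingClaw's architecture or API: the names `Frame`, `ToyStreamingAgent`, `watch_for`, `act`, and `demo` are all hypothetical, a text label stands in for video content, and a small bounded window stands in for long-term memory. It only illustrates the general pattern of consuming a stream frame by frame, deciding from the current frame plus remembered context, and proactively issuing an action when a watched event first appears.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Iterable, List, Optional


@dataclass
class Frame:
    """One unit of streaming input (hypothetical: a timestamp plus a label standing in for pixels)."""
    t: float
    label: str


@dataclass
class ToyStreamingAgent:
    """Illustrative only; an assumption-laden toy, not StreamingClaw's design."""
    watch_for: str                      # interaction objective: the event to react to
    act: Callable[[str], None]          # action executor, e.g. a robot command sink
    memory: Deque[Frame] = field(default_factory=lambda: deque(maxlen=4))

    def decide(self, frame: Frame) -> Optional[str]:
        # Proactive trigger: respond only when the watched event first appears,
        # i.e. it is in the current frame but not in the most recently remembered one.
        previously_seen = bool(self.memory) and self.memory[-1].label == self.watch_for
        if frame.label == self.watch_for and not previously_seen:
            return f"respond:{frame.label}@{frame.t}"
        return None

    def run(self, stream: Iterable[Frame]) -> None:
        for frame in stream:                # perception: consume frames as they arrive
            action = self.decide(frame)     # decision: current frame plus memory
            self.memory.append(frame)       # memory: bounded rolling window
            if action is not None:
                self.act(action)            # action: hand the decision to an executor


def demo() -> List[str]:
    executed: List[str] = []
    agent = ToyStreamingAgent(watch_for="door_open", act=executed.append)
    labels = ["idle", "door_open", "door_open", "idle", "door_open"]
    agent.run(Frame(t=float(i), label=s) for i, s in enumerate(labels))
    return executed
```

Calling `demo()` returns `["respond:door_open@1.0", "respond:door_open@4.0"]`: the agent acts when the event starts, stays silent while it persists, and acts again when it recurs. A real system would replace the label comparison with model inference and the rolling window with a retrievable long-term memory.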
