SARAH: Spatially Aware Real-time Agentic Humans

Abstract

As embodied agents become central to VR, telepresence, and digital human applications, their motion must go beyond speech-aligned gestures: agents should turn toward users, respond to their movement, and maintain natural gaze. Current methods lack this spatial awareness. We close this gap with the first real-time, fully causal method for spatially aware conversational motion, deployable on a streaming VR headset. Given a user's position and dyadic audio, our approach produces full-body motion that aligns gestures with speech while orienting the agent according to the user. Our architecture combines a causal transformer-based VAE with interleaved latent tokens for streaming inference and a flow matching model conditioned on user trajectory and audio. To support varying gaze preferences, we introduce a gaze scoring mechanism with classifier-free guidance to decouple learning from control: the model captures natural spatial alignment from data, while users can adjust eye contact intensity at inference time. On the Embody 3D dataset, our method achieves state-of-the-art motion quality at over 300 FPS, 3x faster than non-causal baselines, while capturing the subtle spatial dynamics of natural conversation. We validate our approach on a live VR system, bringing spatially aware conversational agents to real-time deployment. Please see https://evonneng.github.io/sarah/ for details.
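The abstract does not spell out how classifier-free guidance is combined with the gaze score, so the following is only a minimal sketch of the generic technique: blending a conditional and an unconditional velocity prediction inside an Euler sampler for a flow matching model. All names here (`guided_velocity`, `sample`, `toy_velocity`, `gaze_score`, `guidance_scale`) are hypothetical and are not taken from the SARAH implementation; the toy velocity field stands in for a learned network.

```python
from typing import Callable, List, Optional

Vector = List[float]
# Hypothetical signature for a velocity field: (state, time, condition) -> velocity.
# A condition of None means "unconditional" (the condition was dropped).
VelocityFn = Callable[[Vector, float, Optional[float]], Vector]


def guided_velocity(
    v_fn: VelocityFn,
    x: Vector,
    t: float,
    gaze_score: float,
    guidance_scale: float,
) -> Vector:
    """Standard classifier-free guidance blend of two velocity predictions.

    guidance_scale = 0 gives the unconditional prediction, 1 gives the
    conditional one, and values above 1 extrapolate toward the condition.
    """
    v_uncond = v_fn(x, t, None)
    v_cond = v_fn(x, t, gaze_score)
    return [u + guidance_scale * (c - u) for u, c in zip(v_uncond, v_cond)]


def sample(
    v_fn: VelocityFn,
    x0: Vector,
    gaze_score: float,
    guidance_scale: float,
    num_steps: int = 10,
) -> Vector:
    """Integrate the guided velocity field from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = guided_velocity(v_fn, x, t, gaze_score, guidance_scale)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x: Vector, t: float, gaze_score: Optional[float]) -> Vector:
    """Toy stand-in for a learned model: constant velocity set by the condition."""
    target = 0.0 if gaze_score is None else gaze_score
    return [target for _ in x]


if __name__ == "__main__":
    # Raising guidance_scale pushes the sample further toward the conditioned value.
    for scale in (0.0, 1.0, 2.0):
        print(scale, sample(toy_velocity, [0.0], gaze_score=0.5, guidance_scale=scale))
```

In this sketch the guidance scale plays the role of the inference-time control described above: the model is trained once, and the strength of the conditioning is chosen at sampling time without retraining.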
