HomeSafe-Bench: Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios

Abstract

The rapid evolution of embodied agents has accelerated the deployment of household robots in real-world environments. However, unlike structured industrial settings, household spaces introduce unpredictable safety risks, where system limitations such as perception latency and a lack of commonsense knowledge can lead to dangerous errors. Current safety evaluations, often restricted to static images, text, or general hazards, fail to adequately benchmark dynamic unsafe action detection in these specific contexts. To bridge this gap, we introduce HomeSafe-Bench, a challenging benchmark designed to evaluate Vision-Language Models (VLMs) on unsafe action detection in household scenarios. HomeSafe-Bench is constructed via a hybrid pipeline that combines physical simulation with advanced video generation, and it features 438 diverse cases across six functional areas with fine-grained multidimensional annotations. Beyond benchmarking, we propose Hierarchical Dual-Brain Guard for Household Safety (HD-Guard), a hierarchical streaming architecture for real-time safety monitoring. HD-Guard coordinates a lightweight FastBrain for continuous high-frequency screening with an asynchronous large-scale SlowBrain for deep multimodal reasoning, balancing inference efficiency with detection accuracy. Evaluations demonstrate that HD-Guard achieves a superior trade-off between latency and performance, while our analysis identifies critical bottlenecks in current VLM-based safety detection.
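The fast/slow coordination described in the abstract can be illustrated with a minimal sketch. This is not the HD-Guard implementation: the function name `monitor`, the `threshold` parameter, the dictionary-based frames, and the two callables standing in for FastBrain and SlowBrain are all hypothetical, chosen only to show how a cheap per-frame screen might hand suspicious frames to a slower checker without blocking the screening loop.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

Frame = dict  # hypothetical stand-in for one video frame


def monitor(
    frames: Sequence[Frame],
    fast_score: Callable[[Frame], float],   # assumed FastBrain-like screener
    slow_verdict: Callable[[Frame], bool],  # assumed SlowBrain-like reasoner
    threshold: float = 0.5,                 # assumed escalation threshold
) -> List[int]:
    """Return indices of frames that the slow checker confirms as unsafe."""
    pending: List[Tuple[int, Future]] = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        for i, frame in enumerate(frames):
            # Fast path: cheap screening runs on every frame.
            if fast_score(frame) >= threshold:
                # Slow path: submit and move on, so screening is not blocked.
                pending.append((i, pool.submit(slow_verdict, frame)))
    # Leaving the `with` block waits for all submitted checks to finish.
    return [i for i, fut in pending if fut.result()]


# Toy frames: "risk" feeds the fast screen, "unsafe" is the slow verdict.
frames = [
    {"risk": 0.1, "unsafe": False},
    {"risk": 0.9, "unsafe": True},
    {"risk": 0.7, "unsafe": False},
    {"risk": 0.2, "unsafe": True},
]
alerts = monitor(frames, lambda f: f["risk"], lambda f: f["unsafe"])
print(alerts)  # [1]
```

In this toy run, frame 2 is escalated but cleared by the slow check, while frame 3 is unsafe yet never escalated because the fast screen scores it below the threshold. That second case illustrates, under these assumptions, why the quality of the lightweight screener bounds what the slower model can catch.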
