HomeSafe-Bench: A Benchmark for Evaluating Vision-Language Models on Unsafe Action Detection for Embodied Agents in Household Scenarios

Abstract

The rapid evolution of embodied agents has accelerated the deployment of household robots in real-world environments. However, unlike structured industrial settings, household spaces introduce unpredictable safety risks, where system limitations such as perception latency and a lack of commonsense knowledge can lead to dangerous errors. Current safety evaluations are often restricted to static images, text, or general hazards, and so fail to adequately benchmark dynamic unsafe action detection in these specific contexts. To bridge this gap, we introduce HomeSafe-Bench, a challenging benchmark designed to evaluate Vision-Language Models (VLMs) on unsafe action detection in household scenarios. HomeSafe-Bench is constructed via a hybrid pipeline that combines physical simulation with advanced video generation, and it features 438 diverse cases across six functional areas with fine-grained, multidimensional annotations. Beyond benchmarking, we propose Hierarchical Dual-Brain Guard for Household Safety (HD-Guard), a hierarchical streaming architecture for real-time safety monitoring. HD-Guard coordinates a lightweight FastBrain for continuous high-frequency screening with an asynchronous large-scale SlowBrain for deep multimodal reasoning, balancing inference efficiency with detection accuracy. Evaluations demonstrate that HD-Guard achieves a superior trade-off between latency and performance, while our analysis identifies critical bottlenecks in current VLM-based safety detection.
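The FastBrain/SlowBrain coordination described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `fast_brain`, `slow_brain`, and `monitor`, the two thresholds, and the use of a plain float as a stand-in for a video frame are all assumptions made for illustration. It shows only the general pattern, in which a cheap screener runs on every frame and hands suspicious frames to a slower check that runs asynchronously, so the stream is not blocked.

```python
import asyncio
from typing import List


def fast_brain(frame: float) -> float:
    """Hypothetical lightweight screener: returns a risk score in [0, 1].

    A real system would run a small model on the frame; here the frame
    is already a float that stands in for its risk score.
    """
    return frame


async def slow_brain(frame: float) -> bool:
    """Hypothetical slow reasoner: confirms or rejects a flagged frame."""
    await asyncio.sleep(0.01)  # simulate slower, deeper inference
    return frame >= 0.8  # assumed confirmation threshold


async def monitor(frames: List[float], threshold: float = 0.5) -> List[int]:
    """Screen every frame; escalate flagged frames without blocking the stream."""
    confirmed: List[int] = []
    pending = []

    async def escalate(index: int, frame: float) -> None:
        if await slow_brain(frame):
            confirmed.append(index)

    for index, frame in enumerate(frames):
        if fast_brain(frame) >= threshold:
            # Start the slow check in the background and keep screening.
            pending.append(asyncio.create_task(escalate(index, frame)))
        await asyncio.sleep(0)  # yield control so background tasks can progress

    await asyncio.gather(*pending)  # wait for outstanding slow checks
    return sorted(confirmed)


if __name__ == "__main__":
    print(asyncio.run(monitor([0.1, 0.6, 0.9, 0.3, 0.85])))
```

In this toy run the screener flags the frames at indices 1, 2, and 4, and the slow check confirms only indices 2 and 4. Keeping the expensive call off the per-frame path is the design choice that lets the screening rate stay high regardless of how long the slow check takes.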