RealHarm: A Collection of Real-World Language Model Application Failures

Abstract

Language model deployments in consumer-facing applications introduce numerous risks. While existing research on the harms and hazards of such applications follows top-down approaches derived from regulatory frameworks and theoretical analyses, empirical evidence of real-world failure modes remains underexplored. In this work, we introduce RealHarm, a dataset of annotated problematic interactions with AI agents built from a systematic review of publicly reported incidents. Analyzing harms, causes, and hazards specifically from the deployer's perspective, we find that reputational damage constitutes the predominant organizational harm, while misinformation emerges as the most common hazard category. We empirically evaluate state-of-the-art guardrails and content moderation systems to probe whether such systems would have prevented the incidents, revealing a significant gap in the protection of AI applications.