LegalHalluLens: Typed Hallucination Auditing and Calibrated Multi-Agent Debate for Trustworthy Legal AI

Abstract

AI systems deployed in legal workflows hallucinate at a rate that aggregate metrics put at roughly 52%, but this average conceals where errors concentrate and in which direction they run, leaving compliance officers without an actionable signal for trustworthy deployment. We present LegalHalluLens, an auditing framework with three components: typed hallucination profiles across four legally motivated claim categories (numeric, temporal, obligation/entitlement, factual) over CUAD (Hendrycks et al., 2021); a Risk Direction Index (RDI) that reduces omission-versus-invention bias to a single scalar comparable across deployments; and a typed debate pipeline calibrated to both the magnitude and the direction of the measured errors. Across 510 contracts and 249,252 clause-level instances, we measure a within-model gap of approximately 38 to 40 percentage points between obligation/entitlement and numeric claims on the one hand and temporal claims on the other, a gap that aggregate reporting hides, and we show that two systems with matched 52% rates can carry opposite RDIs. The debate pipeline reduces fabricated detections by 45%, with per-category gains that track the diagnosis, and matches the performance of commercial APIs with a substantially smaller backbone (4B active parameters). Typed profiles and RDI thus surface failure modes that aggregate metrics hide. We further show that these diagnostics serve as calibration inputs for multi-agent debate pipelines, in which Skeptic challenges and asymmetric gates targeted at measured failure modes outperform generically tuned debate. The framework supports direction-aware procurement, accountability, and agent design for legal AI deployed in the wild.
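
To make concrete the claim that two systems with the same aggregate rate can err in opposite directions, the following is a minimal sketch under stated assumptions. The abstract does not give the RDI formula, so `direction_index` is a hypothetical stand-in (a signed, normalized difference between invention and omission counts), not the paper's definition. The `Judgment` type, the short category labels, and all counts are invented for illustration and are not results from the evaluation.

```python
from collections import Counter
from dataclasses import dataclass

# Short labels for the four claim categories named in the abstract.
CATEGORIES = ("numeric", "temporal", "obligation", "factual")


@dataclass(frozen=True)
class Judgment:
    """One audited claim (hypothetical record layout)."""
    category: str  # one of CATEGORIES
    outcome: str   # "correct", "omission", or "invention"


def aggregate_rate(judgments):
    """Share of all judgments that are not correct."""
    return sum(j.outcome != "correct" for j in judgments) / len(judgments)


def typed_profile(judgments):
    """Per-category error rate, i.e. the aggregate rate split by claim type."""
    totals, errors = Counter(), Counter()
    for j in judgments:
        totals[j.category] += 1
        if j.outcome != "correct":
            errors[j.category] += 1
    return {c: errors[c] / totals[c] for c in CATEGORIES if totals[c]}


def direction_index(judgments):
    """Assumed stand-in for RDI: value in [-1, 1].

    +1 means every error is an invention, -1 means every error is an omission.
    """
    counts = Counter(j.outcome for j in judgments)
    inventions, omissions = counts["invention"], counts["omission"]
    if inventions + omissions == 0:
        return 0.0
    return (inventions - omissions) / (inventions + omissions)


def build(counts):
    """Expand {category: (n_correct, n_omission, n_invention)} into judgments."""
    outcomes = ("correct", "omission", "invention")
    return [
        Judgment(category, outcome)
        for category, triple in counts.items()
        for outcome, n in zip(outcomes, triple)
        for _ in range(n)
    ]


# Made-up counts: 100 judgments and 52 errors per system.
system_a = build({
    "numeric": (10, 12, 3),
    "temporal": (20, 4, 1),
    "obligation": (8, 14, 3),
    "factual": (10, 10, 5),
})
# System B has the same error totals with omissions and inventions swapped.
system_b = build({
    "numeric": (10, 3, 12),
    "temporal": (20, 1, 4),
    "obligation": (8, 3, 14),
    "factual": (10, 5, 10),
})

for name, system in (("A", system_a), ("B", system_b)):
    print(name, aggregate_rate(system), typed_profile(system),
          round(direction_index(system), 3))
```

In this toy data both systems share the same aggregate rate and the same typed profile, yet the index is negative for system A (mostly omissions) and positive for system B (mostly inventions). That is the kind of difference a single averaged rate cannot show.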