LegalHalluLens: Typed Hallucination Auditing and Calibrated Multi-Agent Debate for Trustworthy Legal AI

Abstract

AI systems deployed in legal workflows hallucinate at rates that aggregate metrics put at roughly 52%, but this average conceals where errors concentrate and in which direction they run, leaving compliance officers without an actionable signal for trustworthy deployment. We present LegalHalluLens, an auditing framework with three components: typed hallucination profiles over CUAD (Hendrycks et al., 2021) across four legally motivated claim categories (numeric, temporal, obligation/entitlement, factual); a Risk Direction Index (RDI) that reduces omission-versus-invention bias to a single scalar comparable across deployments; and a typed debate pipeline calibrated to both the magnitude and the direction of the measured errors. Across 510 contracts and 249,252 clause-level instances, we measure a within-model gap of approximately 38 to 40 percentage points between obligation/numeric and temporal claims that aggregate reporting hides, and we show that two systems with matched 52% rates can carry opposite RDIs. The debate pipeline reduces fabricated detections by 45%, with per-category gains that track the diagnosis, and matches commercial APIs using a substantially smaller backbone (4B active parameters). Typed profiles and RDI surface failure modes that aggregate metrics hide; we further show that these diagnostics serve as calibration inputs for multi-agent debate pipelines, where Skeptic challenges and asymmetric gates targeted at measured failure modes outperform generically tuned debate. The framework supports direction-aware procurement, accountability, and agent design for legal AI deployed in the wild.
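
To make concrete the claim that two systems with the same aggregate rate can carry opposite RDIs, the following is a minimal sketch. The abstract does not give the RDI formula, so the normalized difference used here, (inventions - omissions) / (inventions + omissions), is an assumption for illustration only, as are the function names and the error counts, which are made up and are not results from the paper.

```python
def hallucination_rate(omissions: int, inventions: int, n: int) -> float:
    """Aggregate rate: share of instances with any hallucination."""
    return (omissions + inventions) / n


def risk_direction_index(omissions: int, inventions: int) -> float:
    """Hypothetical RDI (assumed form, not the paper's definition).

    Returns a value in [-1, 1]: negative when errors lean toward
    omission, positive when they lean toward invention.
    """
    total = omissions + inventions
    if total == 0:
        return 0.0  # no errors, so no direction to report
    return (inventions - omissions) / total


# Illustrative, made-up counts over 100 instances per system.
system_a = {"n": 100, "omissions": 40, "inventions": 12}
system_b = {"n": 100, "omissions": 12, "inventions": 40}

rate_a = hallucination_rate(system_a["omissions"], system_a["inventions"], system_a["n"])
rate_b = hallucination_rate(system_b["omissions"], system_b["inventions"], system_b["n"])
rdi_a = risk_direction_index(system_a["omissions"], system_a["inventions"])
rdi_b = risk_direction_index(system_b["omissions"], system_b["inventions"])

print(f"A: rate={rate_a:.2f}, RDI={rdi_a:+.2f}")
print(f"B: rate={rate_b:.2f}, RDI={rdi_b:+.2f}")
```

Under this assumed definition, both systems report the same 52% aggregate rate, yet system A leans toward omission and system B toward invention, which is the kind of difference an aggregate metric alone would not show.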