DFIR-Metric: A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response

Abstract

Digital Forensics and Incident Response (DFIR) involves analyzing digital evidence to support legal investigations. Large Language Models (LLMs) offer new opportunities in DFIR tasks such as log analysis and memory forensics, but their susceptibility to errors and hallucinations raises concerns in high-stakes contexts. Despite growing interest, there is no comprehensive benchmark for evaluating LLMs across both theoretical and practical DFIR domains. To address this gap, we present DFIR-Metric, a benchmark with three components: (1) Knowledge Assessment: a set of 700 expert-reviewed multiple-choice questions sourced from industry-standard certifications and official documentation; (2) Realistic Forensic Challenges: 150 CTF-style (capture-the-flag) tasks testing multi-step reasoning and evidence correlation; and (3) Practical Analysis: 500 disk and memory forensics cases from the NIST Computer Forensics Tool Testing Program (CFTT). We evaluated 14 LLMs using DFIR-Metric, analyzing both their accuracy and their consistency across trials. We also introduce a new metric, the Task Understanding Score (TUS), designed to evaluate models more effectively in scenarios where they achieve near-zero accuracy. This benchmark offers a rigorous, reproducible foundation for advancing AI in digital forensics. All scripts, artifacts, and results are available on the project website at https://github.com/DFIR-Metric.
