DFIR-Metric: A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response
May 26, 2025
Authors: Bilel Cherif, Tamas Bisztray, Richard A. Dubniczky, Aaesha Aldahmani, Saeed Alshehhi, Norbert Tihanyi
cs.AI
Abstract
Digital Forensics and Incident Response (DFIR) involves analyzing digital
evidence to support legal investigations. Large Language Models (LLMs) offer
new opportunities in DFIR tasks such as log analysis and memory forensics, but
their susceptibility to errors and hallucinations raises concerns in
high-stakes contexts. Despite growing interest, there is no comprehensive
benchmark to evaluate LLMs across both theoretical and practical DFIR domains.
To address this gap, we present DFIR-Metric, a benchmark with three components:
(1) Knowledge Assessment: a set of 700 expert-reviewed multiple-choice
questions sourced from industry-standard certifications and official
documentation; (2) Realistic Forensic Challenges: 150 CTF-style tasks testing
multi-step reasoning and evidence correlation; and (3) Practical Analysis: 500
disk and memory forensics cases from the NIST Computer Forensics Tool Testing
Program (CFTT). We evaluated 14 LLMs using DFIR-Metric, analyzing both their
accuracy and consistency across trials. We also introduce a new metric, the
Task Understanding Score (TUS), designed to more effectively evaluate models in
scenarios where they achieve near-zero accuracy. This benchmark offers a
rigorous, reproducible foundation for advancing AI in digital forensics. All
scripts, artifacts, and results are available on the project website at
https://github.com/DFIR-Metric.
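The evaluation described above scores models on both accuracy and consistency across repeated trials. A minimal sketch of how such scoring might look for the multiple-choice component is shown below; the question IDs, answer letters, and helper names are illustrative assumptions, not the benchmark's actual implementation.

```python
# Hypothetical sketch: scoring a model's multiple-choice answers over
# repeated trials, measuring accuracy and cross-trial consistency.
from collections import Counter

def accuracy(answers, key):
    """Fraction of questions answered correctly in a single trial."""
    return sum(answers[q] == key[q] for q in key) / len(key)

def consistency(trials):
    """Mean per-question agreement: how often trials give the modal answer."""
    per_question = []
    for q in trials[0]:
        votes = Counter(t[q] for t in trials)
        per_question.append(votes.most_common(1)[0][1] / len(trials))
    return sum(per_question) / len(per_question)

# Illustrative data: an answer key and two trials by the same model.
key = {"q1": "A", "q2": "C"}
trials = [{"q1": "A", "q2": "B"}, {"q1": "A", "q2": "C"}]
print(accuracy(trials[0], key))  # 0.5
print(consistency(trials))       # 0.75
```

Separating accuracy from consistency lets a benchmark distinguish a model that is reliably wrong from one that answers at random, which is relevant when raw accuracy approaches zero.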