DFIR-Metric: A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response
May 26, 2025
Authors: Bilel Cherif, Tamas Bisztray, Richard A. Dubniczky, Aaesha Aldahmani, Saeed Alshehhi, Norbert Tihanyi
cs.AI
Abstract
Digital Forensics and Incident Response (DFIR) involves analyzing digital
evidence to support legal investigations. Large Language Models (LLMs) offer
new opportunities in DFIR tasks such as log analysis and memory forensics, but
their susceptibility to errors and hallucinations raises concerns in
high-stakes contexts. Despite growing interest, there is no comprehensive
benchmark to evaluate LLMs across both theoretical and practical DFIR domains.
To address this gap, we present DFIR-Metric, a benchmark with three components:
(1) Knowledge Assessment: a set of 700 expert-reviewed multiple-choice
questions sourced from industry-standard certifications and official
documentation; (2) Realistic Forensic Challenges: 150 CTF-style tasks testing
multi-step reasoning and evidence correlation; and (3) Practical Analysis: 500
disk and memory forensics cases from the NIST Computer Forensics Tool Testing
Program (CFTT). We evaluated 14 LLMs using DFIR-Metric, analyzing both their
accuracy and consistency across trials. We also introduce a new metric, the
Task Understanding Score (TUS), designed to more effectively evaluate models in
scenarios where they achieve near-zero accuracy. This benchmark offers a
rigorous, reproducible foundation for advancing AI in digital forensics. All
scripts, artifacts, and results are available on the project website at
https://github.com/DFIR-Metric.
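The evaluation described above scores models on both accuracy and consistency across repeated trials. A minimal sketch of how such scoring might look for the multiple-choice component is shown below; the question IDs, answer letters, and helper names are illustrative assumptions, not the benchmark's actual implementation.

```python
# Hypothetical sketch: scoring a model's multiple-choice answers over
# repeated trials, measuring accuracy and cross-trial consistency.
from collections import Counter

def accuracy(answers, key):
    """Fraction of questions answered correctly in a single trial."""
    return sum(answers[q] == key[q] for q in key) / len(key)

def consistency(trials):
    """Mean per-question agreement: how often trials give the modal answer."""
    per_question = []
    for q in trials[0]:
        votes = Counter(t[q] for t in trials)
        per_question.append(votes.most_common(1)[0][1] / len(trials))
    return sum(per_question) / len(per_question)

# Illustrative data: an answer key and two trials by the same model.
key = {"q1": "A", "q2": "C"}
trials = [{"q1": "A", "q2": "B"}, {"q1": "A", "q2": "C"}]
print(accuracy(trials[0], key))  # 0.5
print(consistency(trials))       # 0.75
```

Separating accuracy from consistency lets a benchmark distinguish a model that is reliably wrong from one that answers at random, which is relevant when raw accuracy approaches zero.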