DFIR-Metric: A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response
May 26, 2025
Authors: Bilel Cherif, Tamas Bisztray, Richard A. Dubniczky, Aaesha Aldahmani, Saeed Alshehhi, Norbert Tihanyi
cs.AI
Abstract
Digital Forensics and Incident Response (DFIR) involves analyzing digital
evidence to support legal investigations. Large Language Models (LLMs) offer
new opportunities in DFIR tasks such as log analysis and memory forensics, but
their susceptibility to errors and hallucinations raises concerns in
high-stakes contexts. Despite growing interest, there is no comprehensive
benchmark to evaluate LLMs across both theoretical and practical DFIR domains.
To address this gap, we present DFIR-Metric, a benchmark with three components:
(1) Knowledge Assessment: a set of 700 expert-reviewed multiple-choice
questions sourced from industry-standard certifications and official
documentation; (2) Realistic Forensic Challenges: 150 CTF-style tasks testing
multi-step reasoning and evidence correlation; and (3) Practical Analysis: 500
disk and memory forensics cases from the NIST Computer Forensics Tool Testing
Program (CFTT). We evaluated 14 LLMs using DFIR-Metric, analyzing both their
accuracy and consistency across trials. We also introduce a new metric, the
Task Understanding Score (TUS), designed to more effectively evaluate models in
scenarios where they achieve near-zero accuracy. This benchmark offers a
rigorous, reproducible foundation for advancing AI in digital forensics. All
scripts, artifacts, and results are available on the project website at
https://github.com/DFIR-Metric.
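
To make the reported metrics concrete, the following is a minimal Python sketch, not the authors' released code, of how accuracy and answer consistency might be computed over repeated multiple-choice trials. The exact Task Understanding Score (TUS) formula is not given in the abstract, so only plain accuracy and a simple identical-answer consistency measure are shown; all function and variable names are illustrative.

```python
# Hypothetical scoring sketch for repeated multiple-choice trials.
# trials[t][q] is the model's choice on question q in trial t.

def accuracy(trials: list[list[str]], key: list[str]) -> float:
    """Mean accuracy over all trials against the answer key."""
    correct = sum(a == k for trial in trials for a, k in zip(trial, key))
    return correct / (len(trials) * len(key))

def consistency(trials: list[list[str]]) -> float:
    """Fraction of questions answered identically in every trial."""
    stable = sum(len(set(choices)) == 1 for choices in zip(*trials))
    return stable / len(trials[0])

# Toy example: three trials over four questions.
key = ["B", "D", "A", "C"]
trials = [["B", "D", "A", "A"],
          ["B", "D", "C", "A"],
          ["B", "D", "A", "A"]]
print(f"accuracy={accuracy(trials, key):.2f}")      # 0.67
print(f"consistency={consistency(trials):.2f}")     # 0.75
```

Separating the two measures matters because a model can be consistently wrong (high consistency, low accuracy) or erratically right; the benchmark's per-trial evaluation makes both failure modes visible.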