Favia: A Forensic Tool for Vulnerability-Fix Identification and Analysis

Abstract

Identifying vulnerability-fixing commits corresponding to disclosed CVEs is essential for secure software maintenance but remains challenging at scale, as large repositories contain millions of commits of which only a small fraction address security issues. Existing automated approaches, including traditional machine learning techniques and recent large language model (LLM)-based methods, often suffer from poor precision-recall trade-offs. We find that evaluations of these approaches, which frequently rely on randomly sampled commits, substantially underestimate real-world difficulty, where candidate commits are already security-relevant and highly similar. We propose Favia, a forensic, agent-based framework for vulnerability-fix identification that combines scalable candidate ranking with deep, iterative semantic reasoning. Favia first employs an efficient ranking stage to narrow the search space of commits. Each remaining commit is then rigorously evaluated by a ReAct-based LLM agent. Given the pre-commit repository as its environment, along with specialized tools, the agent attempts to localize vulnerable components, navigate the codebase, and establish causal alignment between code changes and vulnerability root causes. This evidence-driven process enables robust identification of indirect, multi-file, and non-trivial fixes that elude single-pass or similarity-based methods. We evaluate Favia on CVEVC, a large-scale dataset we constructed that comprises over 8 million commits from 3,708 real-world repositories, and show that it consistently outperforms state-of-the-art traditional and LLM-based baselines under realistic candidate selection, achieving the strongest precision-recall trade-offs and the highest F1-scores.
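
The two-stage structure described above (a cheap ranking pass followed by a per-commit verification pass) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Commit` type, the function names, and the Jaccard token-overlap score are hypothetical stand-ins, not Favia's actual ranker, and the `verify` callable merely marks the place where the ReAct-based agent would inspect the pre-commit repository.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Commit:
    # Hypothetical minimal commit record used only for this sketch.
    sha: str
    message: str
    diff: str


def tokenize(text: str) -> Set[str]:
    # Lowercased identifier-like tokens of length >= 2.
    return set(re.findall(r"[a-z_][a-z0-9_]+", text.lower()))


def rank_candidates(cve_description: str, commits: List[Commit], top_k: int) -> List[Commit]:
    # Stage 1 (assumed stand-in): score each commit by Jaccard overlap
    # between its text and the CVE description, then keep the top_k.
    cve_tokens = tokenize(cve_description)

    def score(commit: Commit) -> float:
        tokens = tokenize(commit.message + " " + commit.diff)
        union = cve_tokens | tokens
        return len(cve_tokens & tokens) / len(union) if union else 0.0

    return sorted(commits, key=score, reverse=True)[:top_k]


def identify_fixes(
    cve_description: str,
    commits: List[Commit],
    verify: Callable[[str, Commit], bool],
    top_k: int = 10,
) -> List[Commit]:
    # Stage 2: run the expensive verifier only on the shortlist.
    # In the framework described above this role is played by an LLM agent;
    # here it is any callable returning True for an accepted commit.
    shortlist = rank_candidates(cve_description, commits, top_k)
    return [c for c in shortlist if verify(cve_description, c)]
```

The design point the sketch tries to capture is the cost split: the ranking stage touches every commit but does only cheap work, while the verifier, which is expensive per call, sees at most `top_k` candidates.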