AthenaBench: A Dynamic Benchmark for Evaluating LLMs in Cyber Threat Intelligence
November 3, 2025
Authors: Md Tanvirul Alam, Dipkamal Bhusal, Salman Ahmad, Nidhi Rastogi, Peter Worth
cs.AI
Abstract
Large Language Models (LLMs) have demonstrated strong capabilities in natural language reasoning, yet their application to Cyber Threat Intelligence (CTI) remains limited. CTI analysis involves distilling large volumes of unstructured reports into actionable knowledge, a process where LLMs could substantially reduce analyst workload. CTIBench introduced a comprehensive benchmark for evaluating LLMs across multiple CTI tasks. In this work, we extend CTIBench by developing AthenaBench, an enhanced benchmark that includes an improved dataset creation pipeline, duplicate removal, refined evaluation metrics, and a new task focused on risk mitigation strategies. We evaluate twelve LLMs, including state-of-the-art proprietary models such as GPT-5 and Gemini-2.5 Pro, alongside seven open-source models from the LLaMA and Qwen families. While proprietary LLMs achieve stronger results overall, their performance remains subpar on reasoning-intensive tasks, such as threat actor attribution and risk mitigation, with open-source models trailing even further behind. These findings highlight fundamental limitations in the reasoning capabilities of current LLMs and underscore the need for models explicitly tailored to CTI workflows and automation.
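
The abstract mentions duplicate removal in the dataset creation pipeline without specifying a method. As a minimal illustrative sketch only (the paper may use a different technique), one common approach is to normalize each benchmark question and drop exact matches after normalization; the function names and normalization rules below are assumptions, not AthenaBench's implementation.

import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially reworded copies match."""
    return re.sub(r"\s+", " ", text.strip().lower())

def deduplicate(questions: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized question."""
    seen: set[str] = set()
    unique: list[str] = []
    for q in questions:
        # Hash the normalized form; identical keys indicate duplicates.
        key = hashlib.sha256(normalize(q).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(q)
    return unique

if __name__ == "__main__":
    # Hypothetical CTI-style questions; the second differs only in spacing/case.
    qs = [
        "Which MITRE ATT&CK tactic does T1566 belong to?",
        "Which  MITRE ATT&CK tactic does T1566 belong to? ",
        "Name a mitigation for credential dumping.",
    ]
    print(deduplicate(qs))  # the near-duplicate collapses to one entry

A real pipeline would likely go further (e.g., embedding-based near-duplicate detection), but exact-match hashing after normalization is a reasonable first pass.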