

AthenaBench: A Dynamic Benchmark for Evaluating LLMs in Cyber Threat Intelligence

November 3, 2025
Authors: Md Tanvirul Alam, Dipkamal Bhusal, Salman Ahmad, Nidhi Rastogi, Peter Worth
cs.AI

Abstract

Large Language Models (LLMs) have demonstrated strong capabilities in natural language reasoning, yet their application to Cyber Threat Intelligence (CTI) remains limited. CTI analysis involves distilling large volumes of unstructured reports into actionable knowledge, a process where LLMs could substantially reduce analyst workload. CTIBench introduced a comprehensive benchmark for evaluating LLMs across multiple CTI tasks. In this work, we extend CTIBench by developing AthenaBench, an enhanced benchmark that includes an improved dataset creation pipeline, duplicate removal, refined evaluation metrics, and a new task focused on risk mitigation strategies. We evaluate twelve LLMs, including state-of-the-art proprietary models such as GPT-5 and Gemini-2.5 Pro, alongside seven open-source models from the LLaMA and Qwen families. While proprietary LLMs achieve stronger results overall, their performance remains subpar on reasoning-intensive tasks, such as threat actor attribution and risk mitigation, with open-source models trailing even further behind. These findings highlight fundamental limitations in the reasoning capabilities of current LLMs and underscore the need for models explicitly tailored to CTI workflows and automation.
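The abstract names duplicate removal as one of AthenaBench's improvements over CTIBench but does not describe the mechanism. The sketch below is a minimal, hypothetical illustration of how benchmark entries might be deduplicated: exact-match removal via hashing after text normalization, followed by a pairwise near-duplicate check. The function names, the similarity threshold, and the use of `difflib` are illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch of benchmark deduplication; AthenaBench's real
# pipeline is not specified on this page. Shows exact-duplicate removal
# (hash of normalized text) plus a near-duplicate similarity filter.
import hashlib
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def dedupe(questions: list[str], near_threshold: float = 0.9) -> list[str]:
    """Drop exact duplicates by hash, then near-duplicates by string similarity."""
    seen_hashes: set[str] = set()
    kept: list[str] = []
    for q in questions:
        norm = normalize(q)
        digest = hashlib.sha256(norm.encode()).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        # O(n^2) pairwise check; acceptable at benchmark scale.
        if any(
            SequenceMatcher(None, norm, normalize(k)).ratio() >= near_threshold
            for k in kept
        ):
            continue  # near duplicate of an already-kept entry
        seen_hashes.add(digest)
        kept.append(q)
    return kept


if __name__ == "__main__":
    items = [
        "Which threat actor is known for using spear-phishing?",
        "Which threat actor is known for using  spear-phishing?",  # extra space
        "Name the CVE associated with this exploit.",
    ]
    # The whitespace variant collapses to the same hash and is dropped.
    print(dedupe(items))
```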