TechniqueRAG: Retrieval Augmented Generation for Adversarial Technique Annotation in Cyber Threat Intelligence Text
May 17, 2025
Authors: Ahmed Lekssays, Utsav Shukla, Husrev Taha Sencar, Md Rizwan Parvez
cs.AI
Abstract
Accurately identifying adversarial techniques in security texts is critical
for effective cyber defense. However, existing methods face a fundamental
trade-off: they either rely on generic models with limited domain precision or
require resource-intensive pipelines that depend on large labeled datasets and
task-specific optimizations, such as custom hard-negative mining and denoising,
resources rarely available in specialized domains.
We propose TechniqueRAG, a domain-specific retrieval-augmented generation
(RAG) framework that bridges this gap by integrating off-the-shelf retrievers,
instruction-tuned LLMs, and minimal text-technique pairs. Our approach
addresses data scarcity by fine-tuning only the generation component on limited
in-domain examples, circumventing the need for resource-intensive retrieval
training. While conventional RAG mitigates hallucination by coupling retrieval
and generation, its reliance on generic retrievers often introduces noisy
candidates, limiting domain-specific precision. To address this, we enhance
retrieval quality and domain specificity through zero-shot LLM re-ranking,
which explicitly aligns retrieved candidates with adversarial techniques.
Experiments on multiple security benchmarks demonstrate that TechniqueRAG
achieves state-of-the-art performance without extensive task-specific
optimizations or labeled data, while comprehensive analysis provides further
insights.
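
The retrieve, re-rank, generate flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the bag-of-words retriever stands in for an off-the-shelf retriever, `score_fn` stands in for a zero-shot LLM re-ranker, and the example pairs, technique labels, function names, and prompt wording are all hypothetical.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (security text, technique label)

# Hypothetical text-technique pairs for illustration; not data from the paper.
EXAMPLES: List[Pair] = [
    ("The attacker sent a spearphishing email with a malicious attachment.",
     "Phishing"),
    ("The malware ran PowerShell commands to download a payload.",
     "Command and Scripting Interpreter"),
    ("The tool dumped credentials from LSASS memory.",
     "OS Credential Dumping"),
]


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, k: int = 2) -> List[Pair]:
    """Stand-in for an off-the-shelf retriever: top-k pairs by cosine similarity."""
    q = Counter(tokenize(query))
    ranked = sorted(EXAMPLES,
                    key=lambda ex: cosine(q, Counter(tokenize(ex[0]))),
                    reverse=True)
    return ranked[:k]


def rerank(query: str, candidates: List[Pair],
           score_fn: Callable[[str, Pair], float]) -> List[Pair]:
    """Reorder candidates by score_fn, a placeholder for a zero-shot LLM judge."""
    return sorted(candidates, key=lambda ex: score_fn(query, ex), reverse=True)


def build_prompt(query: str, candidates: List[Pair]) -> str:
    """Assemble a prompt for the generator from the re-ranked candidates."""
    lines = ["Label the adversarial technique(s) described in the text.", ""]
    for text, technique in candidates:
        lines.append(f"Example: {text}\nTechnique: {technique}\n")
    lines.append(f"Text: {query}\nTechnique:")
    return "\n".join(lines)


if __name__ == "__main__":
    query = "Adversaries used PowerShell to download and run a second-stage payload."
    candidates = retrieve(query, k=2)
    # Toy scoring rule standing in for an LLM relevance judgment (assumption).
    toy_score = lambda q, ex: float("powershell" in q.lower() and "Command" in ex[1])
    print(build_prompt(query, rerank(query, candidates, toy_score)))
```

In this sketch only the component that consumes the prompt would be fine-tuned; the retriever and re-ranker are used as-is, which mirrors the division of labor the abstract describes.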