揭露开源权重模型对预填充攻击的系统性脆弱性
Exposing the Systematic Vulnerability of Open-Weight Models to Prefill Attacks
February 16, 2026
作者: Lukas Struppek, Adam Gleave, Kellin Pelrine
cs.AI
摘要
随着大语言模型能力的持续进步,其被滥用的风险也同步增长。闭源模型通常依赖外部防御机制,而开源权重模型则主要需依靠内部安全措施来抑制有害行为。现有的红队测试研究多集中于基于输入的越狱攻击和参数级操控,但开源权重模型本身支持预填充功能,这使得攻击者能在生成开始前预定义初始响应标记。尽管存在潜在威胁,此类攻击向量却鲜少获得系统性关注。我们开展了迄今规模最大的预填充攻击实证研究,在多个模型系列及前沿开源权重模型上评估了20余种现有及新型攻击策略。研究结果表明,预填充攻击对所有主流当代开源权重模型均持续有效,揭示出一个关键且此前未被充分探索的部署安全隐患。虽然某些大型推理模型对通用预填充表现出一定抗性,但仍无法抵御针对性设计的模型专属策略。我们的发现强调,模型开发者亟需将防御预填充攻击列为开源大语言模型的重点安防任务。
English
As the capabilities of large language models continue to advance, so does their potential for misuse. While closed-source models typically rely on external defenses, open-weight models must primarily depend on internal safeguards to mitigate harmful behavior. Prior red-teaming research has largely focused on input-based jailbreaking and parameter-level manipulations. However, open-weight models also natively support prefilling, which allows an attacker to predefine initial response tokens before generation begins. Despite its potential, this attack vector has received little systematic attention. We present the largest empirical study to date of prefill attacks, evaluating over 20 existing and novel strategies across multiple model families and state-of-the-art open-weight models. Our results show that prefill attacks are consistently effective against all major contemporary open-weight models, revealing a critical and previously underexplored vulnerability with significant implications for deployment. While certain large reasoning models exhibit some robustness against generic prefilling, they remain vulnerable to tailored, model-specific strategies. Our findings underscore the urgent need for model developers to prioritize defenses against prefill attacks in open-weight LLMs.