揭露开源权重模型在预填充攻击下的系统性漏洞

摘要

随着大型语言模型能力的持续增强，其被滥用的风险也相应提升。闭源模型通常依赖外部防御机制，而开源权重模型则主要需依靠内部防护措施来抑制有害行为。现有的红队测试研究多集中于基于输入的越狱攻击和参数级操控，然而开源权重模型天然支持预填充功能，这使得攻击者能在生成开始前预定义初始响应标记。尽管存在这种潜在威胁，该攻击向量却鲜少获得系统性关注。我们开展了迄今规模最大的预填充攻击实证研究，在多个模型系列和前沿开源权重模型上评估了20余种现有及新型攻击策略。研究结果表明，预填充攻击对所有主流当代开源权重模型均具有持续有效性，揭示出这一关键且此前未被充分探索的脆弱性对模型部署具有重大影响。虽然某些大型推理模型对通用预填充表现出一定鲁棒性，但它们仍无法抵御针对性设计的模型专属策略。我们的发现强调，模型开发者亟需将防御预填充攻击作为开源大语言模型的重点防护方向。

English

As the capabilities of large language models continue to advance, so does their potential for misuse. While closed-source models typically rely on external defenses, open-weight models must primarily depend on internal safeguards to mitigate harmful behavior. Prior red-teaming research has largely focused on input-based jailbreaking and parameter-level manipulations. However, open-weight models also natively support prefilling, which allows an attacker to predefine initial response tokens before generation begins. Despite its potential, this attack vector has received little systematic attention. We present the largest empirical study to date of prefill attacks, evaluating over 20 existing and novel strategies across multiple model families and state-of-the-art open-weight models. Our results show that prefill attacks are consistently effective against all major contemporary open-weight models, revealing a critical and previously underexplored vulnerability with significant implications for deployment. While certain large reasoning models exhibit some robustness against generic prefilling, they remain vulnerable to tailored, model-specific strategies. Our findings underscore the urgent need for model developers to prioritize defenses against prefill attacks in open-weight LLMs.

揭露开源权重模型在预填充攻击下的系统性漏洞

Exposing the Systematic Vulnerability of Open-Weight Models to Prefill Attacks

摘要

Support