A Single Character can Make or Break Your LLM Evals

October 2, 2025
Authors: Jingtong Su, Jianyu Zhang, Karen Ullrich, Léon Bottou, Mark Ibrahim
cs.AI

Abstract
Common Large Language Model (LLM) evaluations rely on demonstration examples to steer models' responses toward the desired style. While the number of examples used has been studied and standardized, the choice of how to format those examples is less investigated. In evaluation protocols and real-world usage, users face the choice of how to separate in-context examples: use a comma? A new line? A semicolon? A hashtag? Surprisingly, we find this seemingly minor choice can dramatically alter model response quality. Across leading model families (Llama, Qwen, Gemma), performance on MMLU, for example, can vary by ±23% depending on the choice of delimiter. In fact, one can manipulate model rankings to put any model in the lead by modifying only the single character separating examples. We find this brittleness persists across topics and model families, and does not improve with scale. By probing attention head scores, we find that well-performing delimiters steer attention toward key tokens in the input. Finally, we explore methods to improve LLMs' robustness to the choice of delimiter. We find that specifying the selected delimiter in the prompt boosts robustness, and we offer practical recommendations for the best-performing delimiters to select.
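The delimiter choice the abstract describes arises wherever an evaluation harness joins few-shot demonstrations into a single prompt string. A minimal sketch of that joining step is below; the function name, example strings, and delimiter set are illustrative assumptions, not taken from the paper.

```python
# Illustrative few-shot examples (not from the paper's benchmarks).
EXAMPLES = [
    "Q: What is 2 + 2? A: 4",
    "Q: What is the capital of France? A: Paris",
]

def build_prompt(examples, query, delimiter):
    """Join demonstration examples with the given delimiter, then append the query."""
    return delimiter.join(examples) + delimiter + query

query = "Q: What is 3 * 3? A:"

# The single separating character is the only thing that varies between runs;
# an evaluation would send each `prompt` to the model and score the response.
for delim in ["\n", ", ", "; ", " # "]:
    prompt = build_prompt(EXAMPLES, query, delim)
    print(repr(delim), len(prompt))
```

Under the paper's finding, prompts that differ only in `delim` can produce markedly different benchmark scores for the same model.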