

Compression Favors Consistency, Not Truth: When and Why Language Models Prefer Correct Information

March 12, 2026
作者: Konstantin Krestnikov
cs.AI

Abstract

Why do language models sometimes prefer correct statements even when trained on mixed-quality data? We introduce the Compression–Consistency Principle: next-token prediction favors hypotheses that allow shorter and more internally consistent descriptions of the training data. Truth bias emerges only when false alternatives are structurally harder to compress. We test this using small GPT-2-style character-level transformers (3.5M–86M parameters) on synthetic math corpora with controlled mixtures of correct and incorrect rules. In the random-error setting, models strongly prefer correct completions in paired evaluation: 83.1% accuracy on balanced data and 67.0% even when correct rules appear in only 10% of the corpus. Replacing random errors with a coherent but mathematically incorrect rule system largely eliminates the preference (near-chance accuracy). In a more natural-language-like synthetic world, the effect is weaker but still present (57.7%). Additional experiments show that embedding verification steps can restore the preference for correctness even at small scale, while increasing the number of consistent rules produces a graded improvement in accuracy. Our results suggest that what appears as a "truth bias" is largely a side effect of compression pressure and a preference for internal consistency, rather than an intrinsic drive toward truth. Full code and data are available at https://github.com/Rai220/compression-drives-truth.
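The experimental setup described above (a corpus with a controlled fraction of correct arithmetic rules, plus paired evaluation of correct vs. incorrect completions) can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the corpus format (`a+b=c` strings), the `make_corpus` and `paired_accuracy` helpers, and the placeholder scoring function are all assumptions for demonstration.

```python
import random

def make_corpus(n_examples, correct_frac, max_val=9, seed=0):
    """Build a toy character-level math corpus in which a controlled
    fraction of examples states the true sum; the rest carry a random
    wrong answer (analogous to the paper's random-error setting)."""
    rng = random.Random(seed)
    lines = []
    for _ in range(n_examples):
        a, b = rng.randint(0, max_val), rng.randint(0, max_val)
        true = a + b
        if rng.random() < correct_frac:
            ans = true
        else:
            # Random error: any wrong value in the reachable range.
            ans = rng.choice([v for v in range(2 * max_val + 1) if v != true])
        lines.append(f"{a}+{b}={ans}")
    return lines

def is_correct(line):
    """Check whether a corpus line states the true sum."""
    lhs, ans = line.split("=")
    a, b = lhs.split("+")
    return int(a) + int(b) == int(ans)

def paired_accuracy(score, pairs):
    """Paired evaluation: the model 'prefers' the correct completion
    when it assigns it a higher score (e.g. sequence log-probability).
    `score` is any callable mapping a string to a number."""
    wins = sum(score(correct) > score(wrong) for correct, wrong in pairs)
    return wins / len(pairs)
```

With a trained model, `score` would be the model's log-probability of the full string; here any scoring callable can be plugged in, which keeps the evaluation logic independent of the architecture.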