압축은 정확성보다 일관성을 선호한다: 언어 모델이 언제, 왜 올바른 정보를 선호하는가

초록

언어 모델이 혼합 품질의 데이터로 훈련되었음에도 때로 올바른 진술을 선호하는 이유는 무엇일까? 우리는 압축-일관성 원칙을 제안한다: 다음 토큰 예측은 훈련 데이터를 더 짧고 내부적으로 일관된 방식으로 설명할 수 있는 가설을 선호한다. 진실 편향은 오직 거짓 대안들이 구조적으로 압축하기 더 어려울 때만 나타난다. 우리는 이 원리를 검증하기 위해 합성 수학 코퍼스(올바른 규칙과 잘못된 규칙이 통제된 비율로 혼합됨)에서 소규모 GPT-2 스타일 문자 수준 트랜스포머(3.5M~86M 매개변수)를 사용했다. 무작위 오류 설정에서 모델은 짝지어진 평가에서 올바른 완성을 강력히 선호했다: 데이터 균형이 맞을 때 83.1% 정확도, 올바른 규칙이 코퍼스의 10%에만 등장할 때도 67.0% 정확도를 보였다. 무작위 오류를 일관적이지만 수학적으로 틀린 규칙 체계로 대체하면 이 선호도가 크게 사라졌다(거의 무작위 수준의 정확도). 더 자연어에 가까운 합성 세계에서는 효과가 약했지만 여전히 존재했다(57.7%). 추가 실험은 임베딩 검증 단계를 통해 소규모에서도 올바름에 대한 선호를 회복할 수 있음을 보여주며, 일관된 규칙의 수를 증가시키면 정확도가 점진적으로 향상됨을 확인했다. 우리의 결과는 "진실 편향"으로 보이는 현상이 진실을 향한 내재적 동기보다는 압축 압력과 내부 일관성 선호의 부수적 결과임을 시사한다. 전체 코드와 데이터는 https://github.com/Rai220/compression-drives-truth 에서 이용할 수 있다.

English

Why do language models sometimes prefer correct statements even when trained on mixed-quality data? We introduce the Compression--Consistency Principle: next-token prediction favors hypotheses that allow shorter and more internally consistent descriptions of the training data. Truth bias emerges only when false alternatives are structurally harder to compress. We test this using small GPT-2-style character-level transformers (3.5M--86M parameters) on synthetic math corpora with controlled mixtures of correct and incorrect rules. In the random-error setting, models strongly prefer correct completions in paired evaluation: 83.1% accuracy at balanced data and 67.0% even when correct rules appear in only 10% of the corpus. Replacing random errors with a coherent but mathematically incorrect rule system largely eliminates the preference (near-chance accuracy). In a more natural-language-like synthetic world, the effect is weaker but still present (57.7%). Additional experiments show that embedding verification steps can restore preference for correctness even at small scale, while increasing the number of consistent rules produces a graded improvement in accuracy. Our results suggest that what appears as a "truth bias" is largely a side effect of compression pressure and preference for internal consistency, rather than an intrinsic drive toward truth. Full code and data are available at https://github.com/Rai220/compression-drives-truth.

압축은 정확성보다 일관성을 선호한다: 언어 모델이 언제, 왜 올바른 정보를 선호하는가

Compression Favors Consistency, Not Truth: When and Why Language Models Prefer Correct Information

초록

Support