CausalLM is not optimal for in-context learning

Abstract

Recent empirical evidence indicates that transformer-based in-context learning performs better with a prefix language model (prefixLM), in which in-context samples can all attend to each other, than with a causal language model (causalLM), which uses auto-regressive attention that prohibits in-context samples from attending to future samples. While this result is intuitive, it is not understood from a theoretical perspective. In this paper, we take a theoretical approach and analyze the convergence behavior of prefixLM and causalLM under a certain parameter construction. Our analysis shows that both LM types converge to their stationary points at a linear rate. However, prefixLM converges to the optimal solution of linear regression, whereas the convergence dynamics of causalLM follow those of an online gradient descent algorithm, which is not guaranteed to be optimal even as the number of samples grows infinitely. We supplement our theoretical claims with empirical experiments on synthetic and real tasks, using various types of transformers. Our experiments verify that causalLM consistently underperforms prefixLM in all settings.
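The difference between the two attention patterns described in the abstract can be illustrated with a small sketch. This is not code from the paper; it is a minimal NumPy illustration, and the function names `causal_mask` and `prefix_mask`, as well as the convention that `mask[i, j]` being true means position `i` may attend to position `j`, are assumptions made here for clarity.

```python
import numpy as np


def causal_mask(seq_len: int) -> np.ndarray:
    """Auto-regressive mask: position i may attend only to positions j <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))


def prefix_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    """PrefixLM-style mask (hypothetical helper for illustration).

    The first `prefix_len` positions (the in-context samples) attend to each
    other in both directions; the remaining positions stay causal.
    """
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True  # full attention inside the prefix
    return mask


if __name__ == "__main__":
    # Three in-context samples followed by one query position.
    print(causal_mask(4).astype(int))
    print(prefix_mask(4, prefix_len=3).astype(int))
```

Under this convention, the first in-context sample in the causal mask sees only itself, while in the prefix mask it sees all three in-context samples. The query position sees every in-context sample in both cases.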
