CausalLM is not optimal for in-context learning
August 14, 2023
Authors: Nan Ding, Tomer Levinboim, Jialin Wu, Sebastian Goodman, Radu Soricut
cs.AI
Abstract
Recent empirical evidence indicates that transformer-based in-context learning performs better when using a prefix language model (prefixLM), in which all in-context samples can attend to each other, than when using a causal language model (causalLM), whose auto-regressive attention prohibits in-context samples from attending to future samples. While this result is intuitive,
it is not understood from a theoretical perspective. In this paper we take a
theoretical approach and analyze the convergence behavior of prefixLM and
causalLM under a certain parameter construction. Our analysis shows that both
LM types converge to their stationary points at a linear rate, but that while
prefixLM converges to the optimal solution of linear regression, the convergence dynamics of causalLM follow those of an online gradient descent algorithm, which is not guaranteed to reach the optimum even as the number of samples grows to infinity. We supplement our theoretical claims with experiments on synthetic and real tasks, using various types of transformers. Our
experiments verify that causalLM consistently underperforms prefixLM in all
settings.
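As a rough illustration of the two mechanisms the abstract contrasts, and not the paper's actual construction, the sketch below builds the two attention masks over a handful of in-context samples and then compares the closed-form least-squares solution with a single pass of online gradient descent on a toy linear-regression problem. The sample count, dimensions, learning rate, and names such as prefix_mask and w_ogd are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's setup): prefixLM vs. causalLM
# attention masks, and least-squares vs. one pass of online gradient descent.
import numpy as np

rng = np.random.default_rng(0)

# --- Attention masks over n in-context samples plus one query position -------
n = 4          # number of in-context samples (hypothetical)
T = n + 1      # total positions: the samples followed by the query

# prefixLM: every position may attend to all n context samples; the query
# additionally attends to itself.
prefix_mask = np.zeros((T, T), dtype=bool)
prefix_mask[:, :n] = True
prefix_mask[n, :] = True

# causalLM: auto-regressive mask; position i attends only to positions <= i,
# so earlier samples never see later ("future") samples.
causal_mask = np.tril(np.ones((T, T), dtype=bool))

print("sample 0 attends to sample 2?  prefixLM:", prefix_mask[0, 2],
      " causalLM:", causal_mask[0, 2])

# --- Convergence contrast on a toy regression task ---------------------------
d, m = 3, 50                          # feature dim, number of samples
w_true = rng.normal(size=d)
X = rng.normal(size=(m, d))
y = X @ w_true + 0.1 * rng.normal(size=m)

# Optimal least-squares solution (the target the abstract associates with prefixLM).
w_opt = np.linalg.lstsq(X, y, rcond=None)[0]

# One pass of online gradient descent (the dynamics the abstract associates
# with causalLM); a single pass need not reach the least-squares optimum.
w_ogd = np.zeros(d)
lr = 0.1
for x_i, y_i in zip(X, y):
    w_ogd -= lr * (x_i @ w_ogd - y_i) * x_i

print("||w_opt - w_true||:", np.linalg.norm(w_opt - w_true))
print("||w_ogd - w_true||:", np.linalg.norm(w_ogd - w_true))
```

On this toy setup the single-pass online iterate typically lands farther from w_true than the least-squares solution, mirroring the gap the abstract describes; this is only an analogy for intuition, not a reproduction of the paper's analysis or experiments.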