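How Many Parameters Does it Take to Change a Light Bulb? Evaluating Performance in Self-Play of Conversational Games as a Function of Model Characteristics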

Abstract

What makes a good Large Language Model (LLM)? That it performs well on the relevant benchmarks, which hopefully measure, with some validity, the presence of capabilities that are also challenged in real applications. But what makes the model perform well? What gives a model its abilities? We take a recently introduced type of benchmark that is meant to challenge capabilities in a goal-directed, agentive context through self-play of conversational games, and analyse how performance develops as a function of model characteristics such as number of parameters or type of training. We find that while there is a clear relationship between number of parameters and performance, there is still a wide spread of performance points within a given size bracket, which is to be accounted for by training parameters such as fine-tuning data quality and method. From a more practical angle, we also find a certain degree of unpredictability in performance across access methods, possibly due to unexposed sampling parameters, and a very welcome stability of performance against at least moderate weight quantisation during inference.
