
How Many Parameters Does it Take to Change a Light Bulb? Evaluating Performance in Self-Play of Conversational Games as a Function of Model Characteristics

June 20, 2024
作者: Nidhir Bhavsar, Jonathan Jordan, Sherzod Hakimov, David Schlangen
cs.AI

Abstract

What makes a good Large Language Model (LLM)? That it performs well on the relevant benchmarks -- which hopefully measure, with some validity, the presence of capabilities that are also challenged in real application. But what makes the model perform well? What gives a model its abilities? We take a recently introduced type of benchmark that is meant to challenge capabilities in a goal-directed, agentive context through self-play of conversational games, and analyse how performance develops as a function of model characteristics like number of parameters, or type of training. We find that while there is a clear relationship between number of parameters and performance, there is still a wide spread of performance points within a given size bracket, which is to be accounted for by training parameters such as fine-tuning data quality and method. From a more practical angle, we also find a certain degree of unpredictability about performance across access methods, possibly due to unexposed sampling parameters, and a very welcome performance stability against at least moderate weight quantisation during inference.
