How Many Parameters Does it Take to Change a Light Bulb? Evaluating Performance in Self-Play of Conversational Games as a Function of Model Characteristics
June 20, 2024
Authors: Nidhir Bhavsar, Jonathan Jordan, Sherzod Hakimov, David Schlangen
cs.AI
Abstract
What makes a good Large Language Model (LLM)? That it performs well on the relevant benchmarks -- which hopefully measure, with some validity, the presence of capabilities that are also challenged in real application. But what makes the model perform well? What gives a model its abilities? We take a recently introduced type of benchmark that is meant to challenge capabilities in a goal-directed, agentive context through self-play of conversational games, and analyse how performance develops as a function of model characteristics like number of parameters, or type of training. We find that while there is a clear relationship between number of parameters and performance, there is still a wide spread of performance points within a given size bracket, which is to be accounted for by training parameters such as fine-tuning data quality and method. From a more practical angle, we also find a certain degree of unpredictability about performance across access methods, possibly due to unexposed sampling parameters, and a very welcome stability of performance against at least moderate weight quantisation during inference.
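To make the quantisation point concrete, the sketch below shows one way a full-precision and a moderately quantised variant of the same model could be prepared for benchmark inference. This is an illustration only, not the authors' actual pipeline; the checkpoint name and the tooling (Hugging Face transformers with bitsandbytes) are assumptions.

```python
# Minimal sketch, assuming a Hugging Face transformers + bitsandbytes setup.
# The checkpoint name is a hypothetical stand-in, not taken from the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-13b-chat-hf"  # hypothetical example model

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Full-precision (or native bf16) reference model.
model_full = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# 8-bit quantised variant; per the abstract, this kind of moderate weight
# quantisation left self-play performance largely stable.
model_int8 = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
```

Running the same conversational self-play benchmark against both variants and comparing the scores is the kind of comparison the last sentence of the abstract refers to.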