

How Many Parameters Does it Take to Change a Light Bulb? Evaluating Performance in Self-Play of Conversational Games as a Function of Model Characteristics

June 20, 2024
Authors: Nidhir Bhavsar, Jonathan Jordan, Sherzod Hakimov, David Schlangen
cs.AI

Abstract

What makes a good Large Language Model (LLM)? That it performs well on the relevant benchmarks -- which hopefully measure, with some validity, the presence of capabilities that are also challenged in real application. But what makes the model perform well? What gives a model its abilities? We take a recently introduced type of benchmark that is meant to challenge capabilities in a goal-directed, agentive context through self-play of conversational games, and analyse how performance develops as a function of model characteristics like number of parameters, or type of training. We find that while there is a clear relationship between number of parameters and performance, there is still a wide spread of performance points within a given size bracket, which is to be accounted for by training parameters such as fine-tuning data quality and method. From a more practical angle, we also find a certain degree of unpredictability about performance across access methods, possibly due to unexposed sampling parameters, and a very welcome performance stability against at least moderate weight quantisation during inference.
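To make the kind of analysis described here concrete, below is a minimal, hypothetical sketch of regressing benchmark score on log-scaled parameter count and inspecting the residual spread within size brackets. The model names and scores are placeholders for illustration only, not results from the paper, and the paper's own analysis is not limited to this simple fit.

```python
# Minimal illustrative sketch (not from the paper): fit benchmark score against
# log10(parameter count) and look at how much spread remains within sizes.
# All model names and score values below are hypothetical placeholders.
import numpy as np

# (parameters in billions, benchmark score in [0, 100]) -- hypothetical values
models = {
    "model-7b":  (7,  22.0),
    "model-13b": (13, 31.0),
    "model-34b": (34, 40.0),
    "model-70b": (70, 55.0),
}

sizes = np.array([v[0] for v in models.values()], dtype=float)
scores = np.array([v[1] for v in models.values()], dtype=float)

# Least-squares fit: score ~ a * log10(params) + b
X = np.column_stack([np.log10(sizes), np.ones_like(sizes)])
coeffs, *_ = np.linalg.lstsq(X, scores, rcond=None)
residuals = scores - X @ coeffs

print(f"slope={coeffs[0]:.2f}, intercept={coeffs[1]:.2f}")
for name, resid in zip(models, residuals):
    # Large residuals within a size bracket point to factors other than size,
    # e.g. fine-tuning data quality and method, as the abstract argues.
    print(f"{name}: residual {resid:+.2f}")
```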
