How Many Parameters Does it Take to Change a Light Bulb? Evaluating Performance in Self-Play of Conversational Games as a Function of Model Characteristics
June 20, 2024
Authors: Nidhir Bhavsar, Jonathan Jordan, Sherzod Hakimov, David Schlangen
cs.AI
Abstract
What makes a good Large Language Model (LLM)? That it performs well on the
relevant benchmarks -- which hopefully measure, with some validity, the
presence of capabilities that are also challenged in real application. But what
makes the model perform well? What gives a model its abilities? We take a
recently introduced type of benchmark that is meant to challenge capabilities
in a goal-directed, agentive context through self-play of conversational games,
and analyse how performance develops as a function of model characteristics
like number of parameters, or type of training. We find that while there is a
clear relationship between number of parameters and performance, there is still
a wide spread of performance points within a given size bracket, which is to be
accounted for by training parameters such as fine-tuning data quality and
method. From a more practical angle, we also find a certain degree of
unpredictability about performance across access methods, possibly due to
unexposed sampling parameters, and a very welcome performance stability
under at least moderate weight quantisation during inference.
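The benchmark type described above pits an LLM against itself in role-based dialogue games. As a rough illustration of that setup (not the paper's actual benchmark), the following minimal Python sketch shows a two-agent self-play loop; the `model_reply` function is a hypothetical stub standing in for a real chat-completion call, where both roles would be served by the same model under different role prompts.

```python
# Minimal sketch of a two-agent self-play loop for a conversational game.
# `model_reply` is a toy stand-in for an LLM call so the sketch runs
# without a model; in practice both roles are played by the same LLM.

def model_reply(role_prompt: str, history: list[str]) -> str:
    """Stand-in for an LLM call; any chat-completion API could go here."""
    # Toy deterministic behaviour: echo the role name and turn index.
    turn = len(history)
    return f"{role_prompt.split(':')[0]} move {turn}"

def self_play(max_turns: int = 4) -> list[str]:
    """Alternate two role-prompted instances of the same model."""
    roles = [
        "Describer: describe the target word without naming it",
        "Guesser: guess the target word from the description",
    ]
    history: list[str] = []
    for turn in range(max_turns):
        speaker = roles[turn % 2]          # alternate the two roles
        history.append(model_reply(speaker, history))
    return history

if __name__ == "__main__":
    for line in self_play():
        print(line)
```

In an actual evaluation, the transcript would then be scored against the game's goal (e.g. whether the guesser recovered the target), and aggregate scores compared across models of different sizes and training regimes.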