SimpleStrat: Diversifying Language Model Generation with Stratification

Abstract

Generating diverse responses from large language models (LLMs) is crucial for applications such as planning/search and synthetic data generation, where diversity provides distinct answers across generations. Prior approaches rely on increasing temperature to increase diversity. However, contrary to popular belief, we show that this approach not only produces lower-quality individual generations as temperature increases, but also depends on the model's next-token probabilities being similar to the true distribution of answers. We propose SimpleStrat, an alternative approach that uses the language model itself to partition the space into strata. At inference, a random stratum is selected and a sample is drawn from within that stratum. To measure diversity, we introduce CoverageQA, a dataset of underspecified questions with multiple equally plausible answers, and assess diversity by measuring the KL divergence between the output distribution and the uniform distribution over valid ground-truth answers. Because computing the probability of each response/solution is infeasible for proprietary models, we measure recall on ground-truth solutions instead. Our evaluation shows that SimpleStrat achieves 0.05 higher recall compared to GPT-4o and a 0.36 average reduction in KL divergence compared to Llama 3.
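The sampling procedure and the two diversity metrics described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the strata are hard-coded for a hypothetical question ("Name a US state") rather than proposed by a language model, the function names are invented for illustration, and the KL direction (empirical output distribution relative to uniform) is an assumption, since the abstract does not specify it.

```python
import math
import random
from collections import Counter

# Hypothetical strata for the underspecified question "Name a US state."
# In SimpleStrat the language model proposes the partition; here it is
# hard-coded so the sketch runs without any model access.
STRATA = {
    "west": ["California", "Oregon", "Washington"],
    "south": ["Texas", "Florida", "Georgia"],
    "northeast": ["New York", "Maine", "Vermont"],
}


def stratified_sample(strata, rng):
    """Select a stratum uniformly at random, then draw one answer from it."""
    stratum = rng.choice(sorted(strata))
    # Stand-in for prompting the model with the stratum as an extra constraint.
    return rng.choice(strata[stratum])


def kl_from_uniform(samples, valid_answers):
    """KL(empirical || uniform) over the valid answers, in nats (assumed direction)."""
    counts = Counter(s for s in samples if s in valid_answers)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no sample matches a valid answer")
    uniform = 1.0 / len(valid_answers)
    kl = 0.0
    for count in counts.values():
        p = count / total
        kl += p * math.log(p / uniform)  # answers never sampled contribute 0
    return kl


def recall(samples, valid_answers):
    """Fraction of valid answers that appear at least once among the samples."""
    return len(set(samples) & set(valid_answers)) / len(valid_answers)


if __name__ == "__main__":
    rng = random.Random(0)
    valid = [answer for group in STRATA.values() for answer in group]
    samples = [stratified_sample(STRATA, rng) for _ in range(2000)]
    print("KL from uniform:", kl_from_uniform(samples, valid))
    print("Recall:", recall(samples, valid))
```

Recall needs only the sampled outputs, which is why it remains usable when per-response probabilities are unavailable, as with proprietary models. A sampler that collapses onto a single answer gives a KL of log(number of valid answers) under this definition, while a perfectly uniform sampler gives 0.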
