Lost in Sampling: Evaluating the Lexical Reachability of LLMs with the Word Coverage Score (WCS)

Abstract

Modern large language models (LLMs) are often criticized for producing repetitive, homogeneous text despite possessing vast latent vocabularies. While previous research has focused on model knowledge and training data, we investigate the role of decoding mechanics in suppressing linguistic diversity. We introduce the Word Coverage Score (WCS), a metric that quantifies the extent to which contextually appropriate human vocabulary is mathematically pruned by standard sampling filters (e.g., Top-p, Top-k, and Min-p). Rather than assessing static knowledge, the WCS measures the lexical survival rate of low-frequency, high-information human words as a function of sampling parameters. By auditing open-weight models on human-authored corpus fragments, we identify which logical lexical choices the decoder renders unreachable even though they lie within the model's probability space. Our results provide quantitative evidence that industry-standard sampling defaults act as unintended censorship mechanisms, smoothing the unique textures of human expression into homogenized discourse. The WCS offers a rigorous framework for optimizing the trade-off between text coherence and lexical richness, and serves as a diagnostic tool for preserving the diversity of human language in generative models.
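
To make the idea of "lexical survival" concrete, the following is a minimal sketch of how one might test whether a human-chosen token survives each sampling filter and then aggregate the outcomes into a coverage fraction. It is an illustration under stated assumptions, not the WCS definition itself, which the abstract does not spell out. The function names (`survives_top_k`, `survives_top_p`, `survives_min_p`, `coverage_score`) are hypothetical, each filter is checked independently against the unfiltered next-token distribution (real decoders often chain filters and renormalize between them), and the weighting toward low-frequency, high-information words is not modeled.

```python
from typing import Iterable, Optional, Sequence, Tuple


def survives_top_k(probs: Sequence[float], token: int, k: int) -> bool:
    """True if `token` is among the k most probable tokens (ties are kept)."""
    # Rank = number of tokens strictly more probable than the human token.
    rank = sum(1 for p in probs if p > probs[token])
    return rank < k


def survives_top_p(probs: Sequence[float], token: int, top_p: float) -> bool:
    """True if `token` is in the nucleus: the smallest set of most probable
    tokens whose cumulative probability reaches `top_p`."""
    # The token is kept if the mass of strictly more probable tokens
    # has not yet reached the nucleus threshold.
    mass_above = sum(p for p in probs if p > probs[token])
    return mass_above < top_p


def survives_min_p(probs: Sequence[float], token: int, min_p: float) -> bool:
    """True if `token` has at least `min_p` times the top token's probability."""
    return probs[token] >= min_p * max(probs)


def coverage_score(
    steps: Iterable[Tuple[Sequence[float], int]],
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    min_p: Optional[float] = None,
) -> float:
    """Fraction of human tokens that survive every enabled filter.

    `steps` yields (next-token probabilities, index of the token the human
    author actually wrote) for each position in a corpus fragment.
    Assumption: filters are evaluated independently, without renormalization.
    """
    kept = 0
    total = 0
    for probs, token in steps:
        total += 1
        ok = True
        if top_k is not None:
            ok = ok and survives_top_k(probs, token, top_k)
        if top_p is not None:
            ok = ok and survives_top_p(probs, token, top_p)
        if min_p is not None:
            ok = ok and survives_min_p(probs, token, min_p)
        kept += int(ok)
    if total == 0:
        raise ValueError("coverage_score needs at least one step")
    return kept / total


if __name__ == "__main__":
    # Toy distribution over a four-token vocabulary; the human token varies.
    dist = [0.5, 0.3, 0.15, 0.05]
    toy_steps = [(dist, 0), (dist, 1), (dist, 2), (dist, 3)]
    for name, kwargs in [("top_k=2", {"top_k": 2}),
                         ("top_p=0.9", {"top_p": 0.9}),
                         ("min_p=0.2", {"min_p": 0.2})]:
        print(name, coverage_score(toy_steps, **kwargs))
```

In practice the per-step distributions would come from an open-weight model scored on human-authored text, and sweeping `top_k`, `top_p`, or `min_p` over a grid would trace survival as a function of the sampling parameters.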