The Sampling Predicament: Assessing Lexical Reachability in Large Language Models via the Word Coverage Score (WCS)

Abstract

Modern Large Language Models (LLMs) are often criticized for producing repetitive and homogeneous text, despite possessing vast latent vocabularies. While previous research has focused on model knowledge and training data, we investigate the role of decoding mechanics in suppressing linguistic diversity. We introduce the Word Coverage Score (WCS), a metric that quantifies the extent to which contextually appropriate human vocabulary is mathematically pruned by standard sampling filters (e.g., Top-p, Top-k, and Min-p). Rather than assessing static knowledge, the WCS measures the lexical survival rate of low-frequency, high-information human words as a function of sampling parameters. By auditing open-weight models on human-authored corpus fragments, we identify which logical lexical choices are rendered unreachable by the decoder, even when they reside within the probability space. Our results provide quantitative evidence that industry-standard sampling defaults act as unintended censorship mechanisms, smoothing the unique textures of human expression into a homogenized discourse. The WCS offers a rigorous framework for optimizing the trade-off between text coherence and lexical richness, providing a diagnostic tool for preserving the diversity of human language in generative models.
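
To make the notion of lexical survival concrete, the following is a minimal sketch rather than the paper's implementation. It assumes a simplified reading of the WCS: the fraction of positions at which the human-chosen token remains inside the candidate set left by a sampling filter. The function names, the toy distributions, and the parameter values (k = 2, p = 0.9, min-p = 0.2) are all hypothetical, and the sketch ignores the weighting toward low-frequency, high-information words described above.

```python
# Illustrative sketch only: the exact WCS definition is not given in the abstract.
# Assumption: coverage = share of positions where the human token survives the filter.

def top_k_survivors(probs, k):
    """Indices of the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])

def top_p_survivors(probs, p):
    """Smallest set of most probable tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept

def min_p_survivors(probs, min_p):
    """Tokens whose probability is at least min_p times the top probability."""
    threshold = min_p * max(probs)
    return {i for i, q in enumerate(probs) if q >= threshold}

def coverage(steps, survivors_fn):
    """Fraction of (distribution, human_token) pairs where the human token survives."""
    hits = sum(1 for probs, human_token in steps if human_token in survivors_fn(probs))
    return hits / len(steps)

# Toy next-token distributions (hypothetical), each paired with the index
# of the token the human author actually wrote at that position.
STEPS = [
    ([0.50, 0.30, 0.15, 0.05], 2),
    ([0.60, 0.25, 0.10, 0.05], 3),
    ([0.40, 0.30, 0.20, 0.10], 0),
    ([0.45, 0.25, 0.20, 0.10], 1),
]

top_k_score = coverage(STEPS, lambda probs: top_k_survivors(probs, 2))
top_p_score = coverage(STEPS, lambda probs: top_p_survivors(probs, 0.9))
min_p_score = coverage(STEPS, lambda probs: min_p_survivors(probs, 0.2))

print(f"top-k=2   coverage: {top_k_score:.2f}")
print(f"top-p=0.9 coverage: {top_p_score:.2f}")
print(f"min-p=0.2 coverage: {min_p_score:.2f}")
```

In this toy setting the human token at the second position has nonzero probability (0.05) yet is excluded by all three filters, which is the kind of unreachable-but-plausible choice the metric is meant to surface. In an actual audit the distributions would come from a model's softmax output over human-authored text rather than from hand-written lists.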