Entropy Sentinel: Continuous LLM Accuracy Monitoring from Decoding Entropy Traces in STEM
January 13, 2026
Authors: Pedro Memoli Buffa, Luciano Del Corro
cs.AI
Abstract
Deploying LLMs raises two coupled challenges: (1) monitoring, i.e., estimating where a model underperforms as traffic and domains drift; and (2) improvement, i.e., prioritizing data acquisition to close the largest performance gaps. We test whether an inference-time signal can estimate slice-level accuracy under domain shift. For each response, we compute an output-entropy profile from final-layer next-token probabilities (derived from top-k logprobs) and summarize it with eleven statistics. A lightweight classifier predicts instance correctness, and averaging the predicted probabilities yields a domain-level accuracy estimate. We evaluate on ten STEM reasoning benchmarks with exhaustive train/test compositions (k in {1, 2, 3, 4}; all "10 choose k" combinations), across nine LLMs from six families (3B-20B parameters). Estimates often track held-out benchmark accuracy, and several models show near-monotonic ordering of domains. Output-entropy profiles are thus an accessible signal for scalable monitoring and for targeting data acquisition.