

Entropy Sentinel: Continuous LLM Accuracy Monitoring from Decoding Entropy Traces in STEM

January 13, 2026
Authors: Pedro Memoli Buffa, Luciano Del Corro
cs.AI

Abstract

Deploying LLMs raises two coupled challenges: (1) monitoring - estimating where a model underperforms as traffic and domains drift - and (2) improvement - prioritizing data acquisition to close the largest performance gaps. We test whether an inference-time signal can estimate slice-level accuracy under domain shift. For each response, we compute an output-entropy profile from final-layer next-token probabilities (from top-k logprobs) and summarize it with eleven statistics. A lightweight classifier predicts instance correctness, and averaging predicted probabilities yields a domain-level accuracy estimate. We evaluate on ten STEM reasoning benchmarks with exhaustive train/test compositions (k in {1,2,3,4}; all "10 choose k" combinations), across nine LLMs from six families (3B-20B). Estimates often track held-out benchmark accuracy, and several models show near-monotonic ordering of domains. Output-entropy profiles are thus an accessible signal for scalable monitoring and for targeting data acquisition.
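The abstract outlines a concrete pipeline: per-token entropies computed from top-k logprobs, an eleven-statistic summary of each response's entropy trace, a lightweight correctness classifier, and domain-level accuracy obtained by averaging predicted probabilities. Below is a minimal sketch of that pipeline under stated assumptions: the paper does not specify which eleven statistics or which classifier are used, so the feature set, the `entropy_profile_features` helper, and the logistic-regression choice here are illustrative, not the authors' implementation.

```python
# Sketch of the monitoring pipeline described in the abstract.
# Assumptions: the eleven summary statistics and the logistic-regression
# classifier below are illustrative stand-ins, not the paper's exact choices.
import numpy as np
from sklearn.linear_model import LogisticRegression


def token_entropy_from_topk_logprobs(topk_logprobs):
    """Approximate next-token entropy from the top-k log-probabilities
    returned at one decoding step (a truncated-distribution estimate)."""
    p = np.exp(np.asarray(topk_logprobs, dtype=float))
    p = p / p.sum()  # renormalize over the top-k tokens only
    return float(-(p * np.log(p + 1e-12)).sum())


def entropy_profile_features(per_step_topk_logprobs):
    """Summarize one response's per-token entropy trace with eleven
    statistics (an assumed, illustrative set)."""
    h = np.array([token_entropy_from_topk_logprobs(step)
                  for step in per_step_topk_logprobs])
    quarter = max(1, len(h) // 4)
    return np.array([
        h.mean(), h.std(), h.min(), h.max(),
        np.percentile(h, 25), np.percentile(h, 50), np.percentile(h, 75),
        h[:quarter].mean(),                       # early-trace mean
        h[-quarter:].mean(),                      # late-trace mean
        float((h > h.mean() + h.std()).mean()),   # fraction of entropy spikes
        float(len(h)),                            # response length in tokens
    ])


def fit_correctness_classifier(train_profiles, train_correct):
    """Fit a lightweight classifier that maps entropy-trace features to
    instance correctness on held-in benchmarks."""
    X = np.stack([entropy_profile_features(p) for p in train_profiles])
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, np.asarray(train_correct, dtype=int))
    return clf


def estimate_domain_accuracy(clf, domain_profiles):
    """Estimate slice-level accuracy on a held-out domain by averaging
    the classifier's predicted correctness probabilities."""
    X = np.stack([entropy_profile_features(p) for p in domain_profiles])
    return float(clf.predict_proba(X)[:, 1].mean())
```

In use, `fit_correctness_classifier` would be trained on the k held-in benchmarks of a given train/test composition, and `estimate_domain_accuracy` would then be compared against the true accuracy of each held-out benchmark, matching the evaluation protocol the abstract describes.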