EvasionBench: Detecting Evasive Answers in Financial Q&A via Multi-Model Consensus and LLM-as-Judge
January 14, 2026
Authors: Shijian Ma, Yan Lin, Yi Yang
cs.AI
Abstract
Detecting evasive answers in earnings calls is critical for financial transparency, yet progress is hindered by the lack of large-scale benchmarks. We introduce EvasionBench, comprising 30,000 training samples and 1,000 human-annotated test samples (Cohen's Kappa 0.835) across three evasion levels. Our key contribution is a multi-model annotation framework built on a core insight: disagreement between frontier LLMs signals the hard examples most valuable for training. We mine boundary cases where two strong annotator models conflict and use a third judge model to resolve the final label. This approach outperforms single-model distillation by 2.4 percent, and judge-resolved samples improve generalization despite a higher training loss (0.421 vs. 0.393), evidence that disagreement mining acts as implicit regularization. Our trained model, Eva-4B (4B parameters), achieves 81.3 percent accuracy, outperforming its base model by 25 percentage points and approaching frontier-LLM performance at a fraction of the inference cost.
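The annotation framework reduces to a simple three-model protocol: two strong annotators label each Q&A pair, and only on disagreement is a judge model invoked. Below is a minimal sketch in Python, assuming hypothetical callables (annotate_a, annotate_b, judge) standing in for the frontier LLMs; the paper's actual models, prompts, and level names are not specified here.

```python
# Minimal sketch of the disagreement-mining annotation pipeline described
# in the abstract. annotate_a, annotate_b, and judge are hypothetical
# stand-ins for calls to three frontier LLMs.

# Three evasion levels (names assumed for illustration).
LEVELS = ["direct", "partial", "evasive"]

def label_sample(question: str, answer: str,
                 annotate_a, annotate_b, judge) -> tuple[str, bool]:
    """Label one Q&A pair; return (label, is_judge_resolved)."""
    label_a = annotate_a(question, answer)  # first strong annotator
    label_b = annotate_b(question, answer)  # second strong annotator
    if label_a == label_b:
        # Consensus: keep the agreed label; this is an "easy" sample.
        return label_a, False
    # Disagreement flags a hard boundary case: a third judge model
    # resolves the conflict, and the sample is marked judge-resolved.
    return judge(question, answer, label_a, label_b), True
```

Consensus samples keep the agreed label cheaply; only the disagreements, the boundary cases the paper mines, incur the extra judge call and form the harder subset credited with the generalization gain.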