What are the Desired Characteristics of Calibration Sets? Identifying Correlates on Long Form Scientific Summarization
May 12, 2023
Authors: Griffin Adams, Bichlien H Nguyen, Jake Smith, Yingce Xia, Shufang Xie, Anna Ostropolets, Budhaditya Deb, Yuan-Jyue Chen, Tristan Naumann, Noémie Elhadad
cs.AI
Abstract
Summarization models often generate text that is poorly calibrated to quality metrics because they are trained to maximize the likelihood of a single reference (MLE). To address this, recent work has added a calibration step, which exposes a model to its own ranked outputs to improve relevance or, in a separate line of work, contrasts positive and negative sets to improve faithfulness. While effective, much of this work has focused on how to generate and optimize these sets. Less is known about why one setup is more effective than another. In this work, we uncover the underlying characteristics of effective sets. For each training instance, we form a large, diverse pool of candidates and systematically vary the subsets used for calibration fine-tuning. Each selection strategy targets distinct aspects of the sets, such as lexical diversity or the size of the gap between positives and negatives. On three diverse scientific long-form summarization datasets (spanning the biomedical, clinical, and chemical domains), we find, among other results, that faithfulness calibration is optimal when the negative sets are extractive and more likely to be generated, whereas for relevance calibration, the metric margin between candidates should be maximized and surprise (the disagreement between model-defined and metric-defined candidate rankings) should be minimized. Code to create, select, and optimize calibration sets is available at https://github.com/griff4692/calibrating-summaries
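The two relevance-calibration quantities named in the abstract, the metric margin between candidates and surprise (disagreement between model- and metric-defined rankings), can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the candidate structure, field names (`metric`, `logprob`), and the pairwise-disagreement definition of surprise are assumptions for the example.

```python
from itertools import combinations

def metric_margin_pair(candidates):
    """Pick the (positive, negative) candidate pair with the largest
    gap in the quality metric, i.e., a margin-maximizing pair."""
    best = max(combinations(candidates, 2),
               key=lambda pair: abs(pair[0]["metric"] - pair[1]["metric"]))
    pos, neg = sorted(best, key=lambda c: c["metric"], reverse=True)
    return pos, neg

def surprise(candidates):
    """Fraction of candidate pairs on which the model's likelihood and
    the quality metric disagree about the ordering (0 = rankings agree
    on every pair, 1 = they disagree on every pair)."""
    pairs = list(combinations(candidates, 2))
    disagreements = sum(
        1 for a, b in pairs
        if (a["logprob"] - b["logprob"]) * (a["metric"] - b["metric"]) < 0
    )
    return disagreements / len(pairs)

# Toy candidate pool: each summary carries a metric score and a
# model log-likelihood (illustrative numbers only).
candidates = [
    {"text": "summary A", "metric": 0.82, "logprob": -12.1},
    {"text": "summary B", "metric": 0.55, "logprob": -15.7},
    {"text": "summary C", "metric": 0.31, "logprob": -11.4},
]

pos, neg = metric_margin_pair(candidates)
print(pos["text"], neg["text"])  # the pair with the widest metric gap
print(surprise(candidates))      # share of pairs ranked inconsistently
```

Under the paper's findings, a relevance-calibration set would favor pools like the first pair (large metric margin) while preferring pools whose `surprise` value is low, so that the model's own ranking already roughly tracks the metric.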