

What are the Desired Characteristics of Calibration Sets? Identifying Correlates on Long Form Scientific Summarization

May 12, 2023
作者: Griffin Adams, Bichlien H Nguyen, Jake Smith, Yingce Xia, Shufang Xie, Anna Ostropolets, Budhaditya Deb, Yuan-Jyue Chen, Tristan Naumann, Noémie Elhadad
cs.AI

Abstract

Summarization models often generate text that is poorly calibrated to quality metrics because they are trained to maximize the likelihood of a single reference (MLE). To address this, recent work has added a calibration step, which exposes a model to its own ranked outputs to improve relevance or, in a separate line of work, contrasts positive and negative sets to improve faithfulness. While effective, much of this work has focused on how to generate and optimize these sets. Less is known about why one setup is more effective than another. In this work, we uncover the underlying characteristics of effective sets. For each training instance, we form a large, diverse pool of candidates and systematically vary the subsets used for calibration fine-tuning. Each selection strategy targets distinct aspects of the sets, such as lexical diversity or the size of the gap between positives and negatives. On three diverse scientific long-form summarization datasets (spanning biomedical, clinical, and chemical domains), we find, among other things, that faithfulness calibration is optimal when the negative sets are extractive and more likely to be generated, whereas for relevance calibration, the metric margin between candidates should be maximized and surprise (the disagreement between model-defined and metric-defined candidate rankings) minimized. Code to create, select, and optimize calibration sets is available at https://github.com/griff4692/calibrating-summaries
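The two relevance-calibration criteria the abstract highlights — maximizing the metric margin between candidates and minimizing surprise (rank disagreement between model likelihood and metric score) — can be illustrated with a small sketch. This is a hypothetical illustration, not code from the paper's repository; the candidate representation (dicts with `model_ll` and `metric` fields) and the brute-force subset search are assumptions for clarity.

```python
from itertools import combinations

def metric_margin(candidates):
    """Gap between the best and worst metric scores in a candidate set."""
    scores = [c["metric"] for c in candidates]
    return max(scores) - min(scores)

def surprise(candidates):
    """Fraction of candidate pairs on which the model's likelihood ranking
    and the metric's ranking disagree (a normalized rank-disagreement count,
    akin to Kendall tau distance)."""
    pairs = list(combinations(candidates, 2))
    discordant = sum(
        1 for a, b in pairs
        if (a["model_ll"] - b["model_ll"]) * (a["metric"] - b["metric"]) < 0
    )
    return discordant / len(pairs)

def select_relevance_subset(pool, k):
    """Pick the size-k subset that maximizes metric margin while minimizing
    surprise -- the combination the abstract reports works best for
    relevance calibration. Brute force, so only suitable for small pools."""
    return max(
        combinations(pool, k),
        key=lambda subset: metric_margin(subset) - surprise(subset),
    )
```

For example, over a pool of four candidates, `select_relevance_subset(pool, 3)` would favor a triple whose metric scores are spread far apart and whose model-likelihood ordering agrees with the metric ordering. In practice the paper's repository builds candidate pools from model generations and scores them with summarization metrics; this sketch only captures the selection logic at a conceptual level.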