Selecting Influential Samples for Long Context Alignment via Homologous Models' Guidance and Contextual Awareness Measurement
October 21, 2024
Authors: Shuzheng Si, Haozhe Zhao, Gang Chen, Yunshui Li, Kangyang Luo, Chuancheng Lv, Kaikai An, Fanchao Qi, Baobao Chang, Maosong Sun
cs.AI
Abstract
The expansion of large language models to effectively handle instructions
with extremely long contexts has yet to be fully investigated. The primary
obstacle lies in constructing a high-quality long instruction-following dataset
devised for long context alignment. Existing studies have attempted to scale up
the available data volume by synthesizing long instruction-following samples.
However, indiscriminately increasing the quantity of data without a
well-defined strategy for ensuring data quality may introduce low-quality
samples and restrict the final performance. To bridge this gap, we aim to
address the unique challenge of long-context alignment, i.e., modeling the
long-range dependencies for handling instructions and lengthy input contexts.
We propose GATEAU, a novel framework designed to identify the influential and
high-quality samples enriched with long-range dependency relations by utilizing
crafted Homologous Models' Guidance (HMG) and Contextual Awareness Measurement
(CAM). Specifically, HMG attempts to measure the difficulty of generating
corresponding responses due to the long-range dependencies, using the
perplexity scores of the response from two homologous models with different
context windows. Also, the role of CAM is to measure the difficulty of
understanding the long input contexts due to long-range dependencies by
evaluating whether the model's attention is focused on important segments.
Built upon both proposed methods, we select the most challenging samples as the
influential data to effectively frame the long-range dependencies, thereby
achieving better performance of LLMs. Comprehensive experiments indicate that
GATEAU effectively identifies samples enriched with long-range dependency
relations and the model trained on these selected samples exhibits better
instruction-following and long-context understanding capabilities.
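The two scoring ideas in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of a simple perplexity difference for HMG, and the attention-mass ratio for CAM are assumptions made for illustration. HMG compares how perplexed a short-context homologous model is by a response versus a long-context one; CAM checks what fraction of attention mass falls on segments marked as important.

```python
import math

def perplexity(token_logprobs):
    # Standard perplexity: exp of the negative mean token log-probability.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def hmg_score(logprobs_short_ctx, logprobs_long_ctx):
    # Hypothetical HMG signal: if the long-context model assigns the
    # response a much lower perplexity than its short-context homologue,
    # the response plausibly depends on long-range context.
    # (Assumption: a raw perplexity difference; the paper may normalize
    # or combine the scores differently.)
    return perplexity(logprobs_short_ctx) - perplexity(logprobs_long_ctx)

def cam_score(attention_weights, important_indices):
    # Hypothetical CAM signal: fraction of total attention mass that the
    # model places on tokens belonging to the important segments.
    total = sum(attention_weights)
    focused = sum(attention_weights[i] for i in important_indices)
    return focused / total

# Toy example with synthetic log-probabilities and attention weights:
short_ctx = [math.log(0.25)] * 4   # short-context model: PPL = 4.0
long_ctx = [math.log(0.5)] * 4     # long-context model:  PPL = 2.0
print(hmg_score(short_ctx, long_ctx))          # large gap => influential sample
print(cam_score([0.1, 0.2, 0.3, 0.4], [2, 3])) # 0.7 of attention on key tokens
```

Samples scoring high under both signals would be the "most challenging" ones the abstract describes selecting for training.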