
Extracting alignment data in open models

October 21, 2025
Authors: Federico Barbero, Xiangming Gu, Christopher A. Choquette-Choo, Chawin Sitawarin, Matthew Jagielski, Itay Yona, Petar Veličković, Ilia Shumailov, Jamie Hayes
cs.AI

Abstract

In this work, we show that it is possible to extract significant amounts of alignment training data from a post-trained model, data that is useful for steering the model to improve certain capabilities such as long-context reasoning, safety, instruction following, and maths. While the majority of related work on memorisation has focused on measuring the success of training data extraction through string matching, we argue that embedding models are better suited to our specific goals. Distances measured through a high-quality embedding model can identify semantic similarities between strings that a metric such as edit distance will struggle to capture. In fact, in our investigation, approximate string matching would have severely undercounted (by a conservative estimate of 10x) the amount of data that can be extracted, because trivial artifacts deflate the metric. Interestingly, we find that models readily regurgitate training data that was used in post-training phases such as SFT or RL. We show that this data can then be used to train a base model, recovering a meaningful amount of the original performance. We believe our work exposes a possibly overlooked risk of alignment data extraction. Finally, our work opens up an interesting discussion on the downstream effects of distillation practices: since models seem to regurgitate aspects of their training set, distillation can be thought of as indirectly training on the model's original dataset.
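
To make the gap between string matching and embedding-based matching concrete, here is a minimal sketch that scores a hypothetical model generation against a reference training example in two ways: a character-level edit-similarity ratio and the cosine similarity between sentence embeddings. The embedding model (all-MiniLM-L6-v2), the example strings, and the helper function names are illustrative assumptions for this sketch, not the paper's actual setup.

```python
# Sketch: edit-distance vs. embedding-based similarity for deciding whether
# a model generation "matches" a training example. Illustrative only; the
# choice of embedding model and the example strings are assumptions.

from difflib import SequenceMatcher
from sentence_transformers import SentenceTransformer, util


def edit_similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1] (1.0 = identical strings)."""
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical embedding model; any high-quality sentence embedder could be used.
embedder = SentenceTransformer("all-MiniLM-L6-v2")


def embedding_similarity(a: str, b: str) -> float:
    """Cosine similarity between normalised sentence embeddings."""
    emb = embedder.encode([a, b], convert_to_tensor=True, normalize_embeddings=True)
    return util.cos_sim(emb[0], emb[1]).item()


# A trivial artifact (here, an added conversational preamble) deflates the
# string-matching score while leaving the semantic content, and hence the
# embedding score, largely intact.
reference = "Solve for x: 2x + 6 = 10. Subtracting 6 gives 2x = 4, so x = 2."
generation = (
    "Sure, here is the solution.\n"
    "Solve for x: 2x + 6 = 10. Subtracting 6 gives 2x = 4, so x = 2."
)

print(f"edit similarity:      {edit_similarity(reference, generation):.2f}")
print(f"embedding similarity: {embedding_similarity(reference, generation):.2f}")
```

In a case like this, an approximate string-matching threshold would discard the generation as a non-match, whereas the embedding distance still flags it as semantically close to the training example, which is the kind of undercounting the abstract describes.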