
Extracting alignment data in open models

October 21, 2025
Authors: Federico Barbero, Xiangming Gu, Christopher A. Choquette-Choo, Chawin Sitawarin, Matthew Jagielski, Itay Yona, Petar Veličković, Ilia Shumailov, Jamie Hayes
cs.AI

Abstract

In this work, we show that it is possible to extract significant amounts of alignment training data from a post-trained model; this data is useful for steering the model to improve capabilities such as long-context reasoning, safety, instruction following, and maths. While the majority of related work on memorisation has focused on measuring the success of training data extraction through string matching, we argue that embedding models are better suited to our specific goals. Distances measured through a high-quality embedding model can identify semantic similarities between strings that other metrics, such as edit distance, struggle to capture. In fact, in our investigation, approximate string matching would have severely undercounted (by a conservative estimate of 10×) the amount of extractable data, because trivial artifacts deflate the metric. Interestingly, we find that models readily regurgitate training data that was used in post-training phases such as SFT or RL. We show that this data can then be used to train a base model, recovering a meaningful amount of the original performance. We believe our work exposes a possibly overlooked risk of alignment data extraction. Finally, our work opens up an interesting discussion on the downstream effects of distillation practices: since models appear to regurgitate aspects of their training set, distillation can be thought of as indirectly training on the model's original dataset.
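As a rough illustration of the measurement choice discussed in the abstract, the sketch below contrasts approximate string matching with embedding-based similarity when deciding whether a generated sample recovers a training example. It is a minimal sketch under stated assumptions: the embedding model name, the 0.9 cosine-similarity threshold, and the helper function names are illustrative and are not taken from the paper.

```python
# Hypothetical sketch: string-level vs. embedding-level matching for
# counting extracted alignment data. Model name and threshold are
# illustrative assumptions, not the paper's exact setup.
from difflib import SequenceMatcher
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model


def string_match_score(generated: str, reference: str) -> float:
    # Approximate string matching: sensitive to trivial surface artifacts
    # (formatting, light paraphrase) that deflate the score.
    return SequenceMatcher(None, generated, reference).ratio()


def embedding_match_score(generated: str, reference: str) -> float:
    # Semantic similarity via cosine distance between embeddings:
    # robust to surface-level differences between the two strings.
    emb = embedder.encode([generated, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()


def is_extracted(generated: str, reference: str, threshold: float = 0.9) -> bool:
    # Count a sample as extracted when embedding similarity is high,
    # even if the string-level overlap looks low.
    return embedding_match_score(generated, reference) >= threshold
```

On this view, a sample that paraphrases or reformats a training example can score poorly under edit-distance-style matching while still being flagged as extracted by the embedding metric, which is the kind of undercounting the abstract describes.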