Treasure Hunt: Real-time Targeting of the Long Tail using Training-Time Markers
June 17, 2025
Authors: Daniel D'souza, Julia Kreutzer, Adrien Morisot, Ahmet Üstün, Sara Hooker
cs.AI
Abstract
One of the most profound challenges of modern machine learning is performing
well on the long-tail of rare and underrepresented features. Large
general-purpose models are trained for many tasks, but work best on
high-frequency use cases. After training, it is hard to adapt a model to
perform well on specific use cases underrepresented in the training corpus.
Relying on prompt engineering or few-shot examples to maximize the output
quality on a particular test case can be frustrating, as models can be highly
sensitive to small changes, react in unpredictable ways, or rely on a fixed
system prompt to maintain performance. In this work, we ask: "Can we optimize our
training protocols to both improve controllability and performance on
underrepresented use cases at inference time?" We revisit the divide between
training and inference techniques to improve long-tail performance while
providing users with a set of control levers the model is trained to be
responsive to. We create a detailed taxonomy of data characteristics and task
provenance to explicitly control generation attributes and implicitly condition
generations at inference time. We fine-tune a base model to infer these markers
automatically, which makes them optional at inference time. This principled and
flexible approach yields pronounced improvements in performance, especially on
examples from the long tail of the training distribution. While we observe an
average lift of 5.7% in win rates for open-ended generation quality with our
markers, we see gains of over 9.1% in underrepresented domains. We also observe
relative lifts of up to 14.1% on underrepresented tasks like CodeRepair and
absolute improvements of 35.3% on length instruction following evaluations.
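
To make the marker mechanism concrete, below is a minimal Python sketch of how taxonomy markers could be serialized into fine-tuning examples. The tag schema, the tag names, and the `format_example` helper are illustrative assumptions for this sketch, not the paper's published format; the only properties taken from the abstract are that markers encode data characteristics and task provenance, that they condition generation, and that the model is trained to infer them so they are optional at inference time.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Markers:
    """Data-characteristic and task-provenance tags for one example
    (field names are hypothetical)."""
    task: str         # e.g. "code_repair"
    domain: str       # e.g. "software_engineering"
    length: str       # explicit generation attribute, e.g. "short"
    provenance: str   # implicit conditioning signal, e.g. "synthetic"

def format_example(prompt: str, response: str,
                   markers: Optional[Markers] = None) -> str:
    """Serialize one fine-tuning example. When markers are present, they
    sit between the prompt and the response, so the model learns to
    predict them first and can therefore infer them when omitted."""
    tag_str = ""
    if markers is not None:
        tag_str = (f"<task:{markers.task}><domain:{markers.domain}>"
                   f"<length:{markers.length}>"
                   f"<provenance:{markers.provenance}>\n")
    return f"{prompt}\n{tag_str}{response}"

# Training: markers attached, teaching the model the association.
print(format_example(
    "Fix the off-by-one error in this loop.",
    "for i in range(n):  # was range(n + 1)",
    Markers(task="code_repair", domain="software_engineering",
            length="short", provenance="human"),
))

# Inference: the same markers act as optional control levers; omitting
# them leaves the fine-tuned model to infer them on its own.
print(format_example("Fix the off-by-one error in this loop.", ""))
```

Placing the markers between prompt and response reflects the abstract's design: users who supply markers get explicit control levers, while users who omit them fall back to the model's own inferred conditioning.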