Treasure Hunt: Real-time Targeting of the Long Tail using Training-Time Markers

June 17, 2025
Authors: Daniel D'souza, Julia Kreutzer, Adrien Morisot, Ahmet Üstün, Sara Hooker
cs.AI

Abstract

One of the most profound challenges of modern machine learning is performing well on the long tail of rare and underrepresented features. Large general-purpose models are trained for many tasks, but work best on high-frequency use cases. After training, it is hard to adapt a model to perform well on specific use cases underrepresented in the training corpus. Relying on prompt engineering or few-shot examples to maximize output quality on a particular test case can be frustrating, as models can be highly sensitive to small changes, react in unpredictable ways, or rely on a fixed system prompt to maintain performance. In this work, we ask: "Can we optimize our training protocols to improve both controllability and performance on underrepresented use cases at inference time?" We revisit the divide between training and inference techniques to improve long-tail performance while providing users with a set of control levers the model is trained to be responsive to. We create a detailed taxonomy of data characteristics and task provenance to explicitly control generation attributes and implicitly condition generations at inference time. We fine-tune a base model to infer these markers automatically, which makes them optional at inference time. This principled and flexible approach yields pronounced improvements in performance, especially on examples from the long tail of the training distribution. While we observe an average lift of 5.7% in win rates for open-ended generation quality with our markers, we see gains of over 9.1% in underrepresented domains. We also observe relative lifts of up to 14.1% on underrepresented tasks like CodeRepair and absolute improvements of 35.3% on length instruction-following evaluations.
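
As a rough illustration of the training-time marker idea described in the abstract, the sketch below prepends metadata tags to fine-tuning targets so a model learns both to condition on them and to infer them when they are omitted. The tag syntax (`<domain=...>`, `<length=...>`), the field names, and the chat formatting are assumptions made for this example; the paper's actual taxonomy and serialization may differ.

```python
# Illustrative sketch (not the paper's actual format): attach training-time
# markers (data characteristics and task provenance) to fine-tuning targets
# so the model learns to condition on them, and to emit them itself when
# the user omits them. Tag syntax and field names here are hypothetical.

from dataclasses import dataclass
from typing import Optional


@dataclass
class Example:
    prompt: str
    response: str
    domain: Optional[str] = None   # task provenance, e.g. "code_repair"
    length: Optional[str] = None   # generation attribute, e.g. "short"


def format_markers(domain: Optional[str], length: Optional[str]) -> str:
    """Serialize whichever markers are present into a tag prefix."""
    tags = []
    if domain:
        tags.append(f"<domain={domain}>")
    if length:
        tags.append(f"<length={length}>")
    return "".join(tags)


def to_training_text(ex: Example) -> str:
    # Markers sit at the start of the *target*, so the training loss also
    # teaches the model to predict them first; this is what makes them
    # optional (inferable) at inference time.
    markers = format_markers(ex.domain, ex.length)
    return f"User: {ex.prompt}\nAssistant: {markers}{ex.response}"


if __name__ == "__main__":
    ex = Example(
        prompt="Fix this bug: ...",
        response="Here is the corrected function: ...",
        domain="code_repair",
        length="short",
    )
    print(to_training_text(ex))

    # At inference, a user may steer generation by supplying markers as an
    # assistant prefix, or omit them and let the model infer its own:
    steered = f"User: Fix this bug: ...\nAssistant: {format_markers('code_repair', None)}"
    unsteered = "User: Fix this bug: ...\nAssistant: "
    print(steered)
    print(unsteered)
```

Placing the markers inside the target sequence, rather than in the prompt, is one plausible way to realize the "optional at inference time" property: a user can supply them as a prefix to act as explicit control levers, or leave them out and let the fine-tuned model predict them before generating the response.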