Treasure Hunt: Real-time Targeting of the Long Tail using Training-Time Markers

June 17, 2025
Authors: Daniel D'souza, Julia Kreutzer, Adrien Morisot, Ahmet Üstün, Sara Hooker
cs.AI

Abstract

One of the most profound challenges of modern machine learning is performing well on the long-tail of rare and underrepresented features. Large general-purpose models are trained for many tasks, but work best on high-frequency use cases. After training, it is hard to adapt a model to perform well on specific use cases underrepresented in the training corpus. Relying on prompt engineering or few-shot examples to maximize the output quality on a particular test case can be frustrating, as models can be highly sensitive to small changes, react in unpredicted ways or rely on a fixed system prompt for maintaining performance. In this work, we ask: "Can we optimize our training protocols to both improve controllability and performance on underrepresented use cases at inference time?" We revisit the divide between training and inference techniques to improve long-tail performance while providing users with a set of control levers the model is trained to be responsive to. We create a detailed taxonomy of data characteristics and task provenance to explicitly control generation attributes and implicitly condition generations at inference time. We fine-tune a base model to infer these markers automatically, which makes them optional at inference time. This principled and flexible approach yields pronounced improvements in performance, especially on examples from the long tail of the training distribution. While we observe an average lift of 5.7% win rates in open-ended generation quality with our markers, we see over 9.1% gains in underrepresented domains. We also observe relative lifts of up to 14.1% on underrepresented tasks like CodeRepair and absolute improvements of 35.3% on length instruction following evaluations.
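The abstract describes attaching taxonomy markers (data characteristics and task provenance) to training examples, and fine-tuning the model to infer those markers itself so that they become optional control levers at inference time. The following is a minimal sketch of what such marker conditioning could look like; the marker names ("domain", "length", "provenance"), the `<key=value>` token format, and the `Example` helper are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed, not the paper's code) of training-time markers:
# taxonomy labels are serialized as control tokens and prepended to prompts.
from dataclasses import dataclass


@dataclass
class Example:
    prompt: str
    response: str
    markers: dict  # taxonomy labels, e.g. {"domain": "code_repair", "length": "short"}


def format_markers(markers: dict) -> str:
    # Serialize taxonomy labels as control tokens, e.g. "<domain=code_repair> <length=short>".
    return " ".join(f"<{key}={value}>" for key, value in sorted(markers.items()))


def build_training_text(ex: Example, drop_markers: bool = False) -> str:
    # When markers are withheld for a fraction of examples, the model must
    # learn to infer them itself, which is what makes them optional at inference.
    prefix = "" if drop_markers else format_markers(ex.markers) + "\n"
    return f"{prefix}{ex.prompt}\n{ex.response}"


if __name__ == "__main__":
    ex = Example(
        prompt="Fix the off-by-one error in this loop.",
        response="Change `range(n + 1)` to `range(n)`.",
        markers={"domain": "code_repair", "length": "short", "provenance": "synthetic"},
    )
    print(build_training_text(ex))        # explicitly conditioned on markers
    print(build_training_text(ex, True))  # markers withheld; model infers them
```

At inference, a user could prepend the same control tokens to steer generation toward an underrepresented domain or a target length, or omit them and let the fine-tuned model infer suitable markers on its own.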