FlexOlmo: Open Language Models for Flexible Data Use
July 9, 2025
Authors: Weijia Shi, Akshita Bhagia, Kevin Farhat, Niklas Muennighoff, Pete Walsh, Jacob Morrison, Dustin Schwenk, Shayne Longpre, Jake Poznanski, Allyson Ettinger, Daogao Liu, Margaret Li, Dirk Groeneveld, Mike Lewis, Wen-tau Yih, Luca Soldaini, Kyle Lo, Noah A. Smith, Luke Zettlemoyer, Pang Wei Koh, Hannaneh Hajishirzi, Ali Farhadi, Sewon Min
cs.AI
Abstract
We introduce FlexOlmo, a new class of language models (LMs) that supports (1)
distributed training without data sharing, where different model parameters are
independently trained on closed datasets, and (2) data-flexible inference,
where these parameters along with their associated data can be flexibly
included or excluded from model inferences with no further training. FlexOlmo
employs a mixture-of-experts (MoE) architecture where each expert is trained
independently on closed datasets and later integrated through a new
domain-informed routing without any joint training. FlexOlmo is trained on
FlexMix, a corpus we curate comprising publicly available datasets alongside
seven domain-specific sets, representing realistic approximations of closed
sets. We evaluate models with up to 37 billion parameters (20 billion active)
on 31 diverse downstream tasks. We show that a general expert trained on public
data can be effectively combined with independently trained experts from other
data owners, leading to an average 41% relative improvement while allowing
users to opt out of certain data based on data licensing or permission
requirements. Our approach also outperforms prior model merging methods by
10.1% on average and surpasses the standard MoE trained without data
restrictions using the same training FLOPs. Altogether, this research presents
a solution for both data owners and researchers in regulated industries with
sensitive or protected data. FlexOlmo enables benefiting from closed data while
respecting data owners' preferences by keeping their data local and supporting
fine-grained control of data access during inference.
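
To make the mechanism described above concrete, the following is a minimal sketch (not the authors' released implementation) of an MoE layer in which independently trained experts are combined through fixed, domain-informed router embeddings, and any expert, together with the data it was trained on, can be excluded at inference time without retraining. All names here (`FlexMoELayer`, `ExpertFFN`, the `active` argument, and the placeholder domain embeddings) are illustrative assumptions.

```python
# Minimal sketch, assuming a PyTorch-style MoE feed-forward layer.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertFFN(nn.Module):
    """One feed-forward expert; in FlexOlmo each would be trained independently on a closed dataset."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class FlexMoELayer(nn.Module):
    """MoE layer whose router uses fixed per-expert embeddings (domain-informed),
    so experts can be added or removed at inference with no joint training."""

    def __init__(self, experts, expert_embeddings: torch.Tensor, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(experts)
        # One embedding per expert, e.g. derived from its training domain (assumption).
        self.register_buffer("expert_embeddings", expert_embeddings)  # [n_experts, d_model]
        self.top_k = top_k

    def forward(self, x: torch.Tensor, active=None) -> torch.Tensor:
        # x: [batch, d_model]; `active` lists the expert indices the user opts in to.
        n_experts = len(self.experts)
        active = list(range(n_experts)) if active is None else active

        # Route each token by similarity to the fixed expert embeddings.
        logits = x @ self.expert_embeddings.T  # [batch, n_experts]

        # Exclude opted-out experts by masking their logits before the softmax.
        mask = torch.full((n_experts,), float("-inf"), device=x.device)
        mask[active] = 0.0
        logits = logits + mask

        k = min(self.top_k, len(active))
        weights, idx = torch.topk(F.softmax(logits, dim=-1), k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over kept experts

        out = torch.zeros_like(x)
        for slot in range(k):
            for e in active:
                sel = idx[:, slot] == e
                if sel.any():
                    out[sel] += weights[sel, slot:slot + 1] * self.experts[e](x[sel])
        return out


# Usage: combine a "public" expert with two independently trained domain experts,
# then exclude expert 2 at inference (e.g. its data license does not permit this use).
d_model = 16
experts = [ExpertFFN(d_model, 4 * d_model) for _ in range(3)]
domain_embs = torch.randn(3, d_model)  # placeholder domain embeddings (assumption)
layer = FlexMoELayer(experts, domain_embs, top_k=2)
x = torch.randn(4, d_model)
y_all = layer(x)                     # all experts and their data are available
y_opt_out = layer(x, active=[0, 1])  # expert 2 is excluded, with no further training
```

The key design choice this sketch illustrates is that routing depends only on fixed per-expert embeddings rather than jointly trained router weights, so dropping an expert amounts to masking its logit and renormalizing over the remaining experts.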