Apertus: Democratizing Open and Compliant LLMs for Global Language Environments
September 17, 2025
Authors: Alejandro Hernández-Cano, Alexander Hägele, Allen Hao Huang, Angelika Romanou, Antoni-Joan Solergibert, Barna Pasztor, Bettina Messmer, Dhia Garbaya, Eduard Frank Ďurech, Ido Hakimi, Juan García Giraldo, Mete Ismayilzada, Negar Foroutan, Skander Moalla, Tiancheng Chen, Vinko Sabolčec, Yixuan Xu, Michael Aerni, Badr AlKhamissi, Ines Altemir Marinas, Mohammad Hossein Amani, Matin Ansaripour, Ilia Badanin, Harold Benoit, Emanuela Boros, Nicholas Browning, Fabian Bösch, Maximilian Böther, Niklas Canova, Camille Challier, Clement Charmillot, Jonathan Coles, Jan Deriu, Arnout Devos, Lukas Drescher, Daniil Dzenhaliou, Maud Ehrmann, Dongyang Fan, Simin Fan, Silin Gao, Miguel Gila, María Grandury, Diba Hashemi, Alexander Hoyle, Jiaming Jiang, Mark Klein, Andrei Kucharavy, Anastasiia Kucherenko, Frederike Lübeck, Roman Machacek, Theofilos Manitaras, Andreas Marfurt, Kyle Matoba, Simon Matrenok, Henrique Mendonça, Fawzi Roberto Mohamed, Syrielle Montariol, Luca Mouchel, Sven Najem-Meyer, Jingwei Ni, Gennaro Oliva, Matteo Pagliardini, Elia Palme, Andrei Panferov, Léo Paoletti, Marco Passerini, Ivan Pavlov, Auguste Poiroux, Kaustubh Ponkshe, Nathan Ranchin, Javi Rando, Mathieu Sauser, Jakhongir Saydaliev, Muhammad Ali Sayfiddinov, Marian Schneider, Stefano Schuppli, Marco Scialanga, Andrei Semenov, Kumar Shridhar, Raghav Singhal, Anna Sotnikova, Alexander Sternfeld, Ayush Kumar Tarun, Paul Teiletche, Jannis Vamvas, Xiaozhe Yao, Hao Zhao, Alexander Ilic, Ana Klimovic, Andreas Krause, Caglar Gulcehre, David Rosenthal, Elliott Ash, Florian Tramèr, Joost VandeVondele, Livio Veraldi, Martin Rajman, Thomas Schulthess, Torsten Hoefler, Antoine Bosselut, Martin Jaggi, Imanol Schlag
cs.AI
Abstract
We present Apertus, a fully open suite of large language models (LLMs)
designed to address two systemic shortcomings in today's open model ecosystem:
data compliance and multilingual representation. Unlike many prior models that
release weights without reproducible data pipelines or regard for content-owner
rights, Apertus models are pretrained exclusively on openly available data,
retroactively respecting robots.txt exclusions and filtering for
non-permissive, toxic, and personally identifiable content. To mitigate risks
of memorization, we adopt the Goldfish objective during pretraining, strongly
suppressing verbatim recall of data while retaining downstream task
performance. The Apertus models also expand multilingual coverage, training on
15T tokens from over 1800 languages, with ~40% of pretraining data allocated to
non-English content. Released at 8B and 70B scales, Apertus approaches
state-of-the-art results among fully open models on multilingual benchmarks,
rivalling or surpassing open-weight counterparts. Beyond model weights, we
release all scientific artifacts from our development cycle with a permissive
license, including data preparation scripts, checkpoints, evaluation suites,
and training code, enabling transparent audit and extension.
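The Goldfish objective mentioned above (the Goldfish loss of Hans et al., 2024) works by pseudorandomly excluding a fixed fraction of token positions from the next-token loss, so the model never receives gradient on every token of any given passage and cannot reliably reproduce it verbatim. Below is a minimal PyTorch sketch of the hashed variant, in which the drop decision for a position depends deterministically on the preceding tokens, so repeated occurrences of the same text always mask the same positions. The function names, the toy hash, the drop rate k, and the context width h are illustrative assumptions, not the hyperparameters used for Apertus.

```python
import torch
import torch.nn.functional as F

def goldfish_mask(labels: torch.Tensor, k: int = 4, h: int = 13) -> torch.Tensor:
    """Boolean mask over (batch, seq): False = position excluded from the loss.

    A position is dropped when a hash of its h preceding tokens is 0 mod k,
    so ~1/k of tokens are excluded, deterministically per text.
    """
    B, T = labels.shape
    mask = torch.ones(B, T, dtype=torch.bool, device=labels.device)
    weights = torch.arange(1, h + 1, device=labels.device)
    for t in range(h, T):
        # Toy position-weighted hash of the local context window.
        digest = (labels[:, t - h:t] * weights).sum(dim=-1)
        mask[:, t] = (digest % k) != 0
    return mask

def goldfish_loss(logits: torch.Tensor, labels: torch.Tensor,
                  k: int = 4, h: int = 13) -> torch.Tensor:
    """Cross entropy averaged over only the unmasked positions.

    Assumes logits of shape (B, T, V) and labels of shape (B, T),
    already shifted for next-token prediction.
    """
    mask = goldfish_mask(labels, k=k, h=h)
    per_token = F.cross_entropy(
        logits.flatten(0, 1), labels.flatten(), reduction="none"
    ).view_as(labels)
    return (per_token * mask).sum() / mask.sum()
```

Because the dropped positions are tied to the text itself rather than resampled each epoch, repeated exposures to a document still leave the same gaps, which is what suppresses verbatim recall while ordinary language-modeling signal is preserved on the remaining ~(k-1)/k of tokens.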
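Retroactively respecting robots.txt exclusions means re-checking each already-crawled URL against the site's current exclusion rules and dropping pages whose owners have since opted out. The sketch below shows the idea using only Python's standard library; the crawler name, URLs, and filtering loop are placeholders, and the actual Apertus pipeline is the one published in the released data preparation scripts.

```python
from urllib import robotparser
from urllib.parse import urlsplit

def currently_allowed(page_url: str, agent: str = "ApertusCrawler") -> bool:
    """Return True if the site's *current* robots.txt permits fetching page_url.

    Pages already in the corpus are discarded when this returns False,
    honoring opt-outs retroactively. The agent string is a placeholder.
    """
    parts = urlsplit(page_url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # fetch and parse the live robots.txt
    return rp.can_fetch(agent, page_url)

# Example: retroactively filter an already-crawled corpus.
corpus = ["https://example.com/articles/1.html"]
kept = [url for url in corpus if currently_allowed(url)]
```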