Genomic Next-Token Predictors are In-Context Learners
November 16, 2025
Authors: Nathan Breslow, Aayush Mishra, Mahler Revsine, Michael C. Schatz, Anqi Liu, Daniel Khashabi
cs.AI
Abstract
In-context learning (ICL) -- the capacity of a model to infer and apply abstract patterns from examples provided within its input -- has been extensively studied in large language models trained for next-token prediction on human text. In fact, prior work often attributes this emergent behavior to distinctive statistical properties in human language. This raises a fundamental question: can ICL arise organically in other sequence domains purely through large-scale predictive training?
To explore this, we turn to genomic sequences, an alternative symbolic domain rich in statistical structure. Specifically, we study the Evo2 genomic model, trained predominantly on next-nucleotide (A/T/C/G) prediction, at a scale comparable to mid-sized LLMs. We develop a controlled experimental framework comprising symbolic reasoning tasks instantiated in both linguistic and genomic forms, enabling direct comparison of ICL across genomic and linguistic models. Our results show that genomic models, like their linguistic counterparts, exhibit log-linear gains in pattern induction as the number of in-context demonstrations increases. To the best of our knowledge, this is the first evidence of organically emergent ICL in genomic sequences, supporting the hypothesis that ICL arises as a consequence of large-scale predictive modeling over rich data. These findings extend emergent meta-learning beyond language, pointing toward a unified, modality-agnostic view of in-context learning.