

MachineLearningLM: Continued Pretraining Language Models on Millions of Synthetic Tabular Prediction Tasks Scales In-Context ML

September 8, 2025
Authors: Haoyu Dong, Pengkun Zhang, Mingzhe Lu, Yanzhen Shen, Guolin Ke
cs.AI

Abstract

Large language models (LLMs) possess broad world knowledge and strong general-purpose reasoning ability, yet they struggle to learn from many in-context examples on standard machine learning (ML) tasks, that is, to leverage many-shot demonstrations purely via in-context learning (ICL) without gradient descent. We introduce MachineLearningLM, a portable continued-pretraining framework that equips a general-purpose LLM with robust in-context ML capability while preserving its general knowledge and reasoning for broader chat workflows. Our pretraining procedure synthesizes ML tasks from millions of structural causal models (SCMs), spanning shot counts up to 1,024. We begin with a random-forest teacher, distilling tree-based decision strategies into the LLM to strengthen robustness in numerical modeling. All tasks are serialized with a token-efficient prompt, enabling 3x to 6x more examples per context window and delivering up to 50x amortized throughput via batch inference. Despite a modest setup (Qwen-2.5-7B-Instruct with LoRA rank 8), MachineLearningLM outperforms strong LLM baselines (e.g., GPT-5-mini) by an average of about 15% on out-of-distribution tabular classification across finance, physics, biology, and healthcare domains. It exhibits a striking many-shot scaling law: accuracy increases monotonically as in-context demonstrations grow from 8 to 1,024. Without any task-specific training, it attains random-forest-level accuracy across hundreds of shots. General chat capabilities, including knowledge and reasoning, are preserved: it achieves 75.4% on MMLU.
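To make the described pipeline concrete, below is a minimal sketch (not the authors' code) of the data-synthesis idea: sample a small random structural causal model (SCM), draw a tabular classification task from it, fit a random-forest teacher whose predictions serve as distillation targets, and serialize many-shot demonstrations into a compact prompt. Function names, hyperparameters, and the serialization format are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal, assumption-laden sketch of SCM-based task synthesis + random-forest teacher.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def sample_scm_task(n_features=6, n_shots=128, n_queries=8):
    """Draw one synthetic tabular task from a random linear SCM with a thresholded label."""
    # A random upper-triangular weight matrix defines a DAG over the features.
    W = np.triu(rng.normal(0.0, 1.0, (n_features, n_features)), k=1)
    n = n_shots + n_queries
    X = np.zeros((n, n_features))
    for j in range(n_features):                # ancestral sampling: parents are filled first
        X[:, j] = X @ W[:, j] + rng.normal(0.0, 1.0, n)
    w_y = rng.normal(0.0, 1.0, n_features)     # label depends on a random mix of features
    y = (X @ w_y + 0.1 * rng.normal(0.0, 1.0, n) > 0).astype(int)
    return (X[:n_shots], y[:n_shots]), (X[n_shots:], y[n_shots:])

def serialize_prompt(X_shots, y_shots, X_query, ndigits=2):
    """Token-efficient serialization: rounded, comma-separated rows, one example per line."""
    rows = [",".join(f"{v:.{ndigits}f}" for v in x) + f"->{label}"
            for x, label in zip(X_shots, y_shots)]
    queries = [",".join(f"{v:.{ndigits}f}" for v in x) + "->?" for x in X_query]
    return "\n".join(rows + queries)

(X_tr, y_tr), (X_q, y_q) = sample_scm_task()
teacher = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
teacher_labels = teacher.predict(X_q)          # distillation targets for the LLM
prompt = serialize_prompt(X_tr, y_tr, X_q)
print(prompt.splitlines()[0])                  # one compact demonstration row
print("teacher labels:", teacher_labels.tolist())
```

In this sketch, rounding feature values and dropping column names is one plausible way to fit 3x to 6x more examples per context window; the paper's actual serialization may differ.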