

MachineLearningLM: Continued Pretraining Language Models on Millions of Synthetic Tabular Prediction Tasks Scales In-Context ML

September 8, 2025
Authors: Haoyu Dong, Pengkun Zhang, Mingzhe Lu, Yanzhen Shen, Guolin Ke
cs.AI

Abstract

Large language models (LLMs) possess broad world knowledge and strong general-purpose reasoning ability, yet they struggle to learn from many in-context examples on standard machine learning (ML) tasks, that is, to leverage many-shot demonstrations purely via in-context learning (ICL) without gradient descent. We introduce MachineLearningLM, a portable continued-pretraining framework that equips a general-purpose LLM with robust in-context ML capability while preserving its general knowledge and reasoning for broader chat workflows. Our pretraining procedure synthesizes ML tasks from millions of structural causal models (SCMs), spanning shot counts up to 1,024. We begin with a random-forest teacher, distilling tree-based decision strategies into the LLM to strengthen robustness in numerical modeling. All tasks are serialized with a token-efficient prompt, enabling 3x to 6x more examples per context window and delivering up to 50x amortized throughput via batch inference. Despite a modest setup (Qwen-2.5-7B-Instruct with LoRA rank 8), MachineLearningLM outperforms strong LLM baselines (e.g., GPT-5-mini) by an average of about 15% on out-of-distribution tabular classification across finance, physics, biology, and healthcare domains. It exhibits a striking many-shot scaling law: accuracy increases monotonically as in-context demonstrations grow from 8 to 1,024. Without any task-specific training, it attains random-forest-level accuracy across hundreds of shots. General chat capabilities, including knowledge and reasoning, are preserved: it achieves 75.4% on MMLU.
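The abstract describes the pretraining pipeline only at a high level. The sketch below is a minimal, hypothetical illustration of three of the steps it names: sampling a linear-Gaussian structural causal model (SCM), drawing a many-shot tabular classification task from it, and serializing the shots into a compact, token-efficient prompt. The function names, the label-generation rule, and the row format are assumptions made for illustration, not the paper's released implementation.

```python
# Minimal sketch (not the authors' code) of the task-synthesis and prompting
# pipeline outlined in the abstract. Uses only NumPy.
import numpy as np

rng = np.random.default_rng(0)

def sample_scm(n_features=6):
    """Sample a random sparse linear-Gaussian SCM over a DAG: each variable
    is a noisy linear function of its lower-indexed parents."""
    weights = np.tril(rng.normal(size=(n_features, n_features)), k=-1)
    weights *= rng.random((n_features, n_features)) < 0.5  # sparse parent sets
    return weights

def make_task(weights, n_shots=1024):
    """Ancestrally sample rows from the SCM, then derive binary labels from a
    random projection of the features (a stand-in for the paper's label step)."""
    n = weights.shape[0]
    X = np.zeros((n_shots, n))
    for j in range(n):
        X[:, j] = X @ weights[j] + rng.normal(size=n_shots)
    logits = X @ rng.normal(size=n)
    y = (logits > np.median(logits)).astype(int)
    return np.round(X, 2), y

def serialize_prompt(X, y, x_query):
    """Token-efficient serialization: compact comma-separated rows rather than
    verbose key-value text, so more shots fit into one context window."""
    lines = [",".join(map(str, row)) + f"->{label}" for row, label in zip(X, y)]
    lines.append(",".join(map(str, x_query)) + "->?")
    return "\n".join(lines)

# Usage: synthesize one small task and print an 8-shot prompt.
weights = sample_scm()
X, y = make_task(weights, n_shots=8)
print(serialize_prompt(X[:-1], y[:-1], X[-1]))
```

In this sketch the compactness comes purely from the row format; the paper additionally reports batching many queries per serialized task to amortize prompt tokens, which is omitted here for brevity.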