

Training Chain-of-Thought via Latent-Variable Inference

November 28, 2023
Authors: Du Phan, Matthew D. Hoffman, David Dohan, Sholto Douglas, Tuan Anh Le, Aaron Parisi, Pavel Sountsov, Charles Sutton, Sharad Vikram, Rif A. Saurous
cs.AI

Abstract

Large language models (LLMs) solve problems more accurately and interpretably when instructed to work out the answer step by step using a "chain-of-thought" (CoT) prompt. One can also improve LLMs' performance on a specific task by supervised fine-tuning, i.e., by using gradient ascent on some tunable parameters to maximize the average log-likelihood of correct answers from a labeled training set. Naively combining CoT with supervised tuning requires supervision not just of the correct answers, but also of detailed rationales that lead to those answers; these rationales are expensive to produce by hand. Instead, we propose a fine-tuning strategy that tries to maximize the marginal log-likelihood of generating a correct answer using CoT prompting, approximately averaging over all possible rationales. The core challenge is sampling from the posterior over rationales conditioned on the correct answer; we address it using a simple Markov-chain Monte Carlo (MCMC) expectation-maximization (EM) algorithm inspired by the self-taught reasoner (STaR), memoized wake-sleep, Markovian score climbing, and persistent contrastive divergence. This algorithm also admits a novel control-variate technique that drives the variance of our gradient estimates to zero as the model improves. Applying our technique to GSM8K and the tasks in BIG-Bench Hard, we find that this MCMC-EM fine-tuning technique typically improves the model's accuracy on held-out examples more than STaR or prompt-tuning with or without CoT.
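To make the objective concrete: writing x for a question, y for its labeled answer, and z for an unobserved rationale, the fine-tuning target is the marginal log-likelihood log p(y | x) = log Σ_z p(z | x) p(y | x, z), and the E-step needs samples from the posterior p(z | x, y). The sketch below shows, in broad strokes, how an MCMC-EM loop of this kind can be organized; the model interface used here (sample_rationale, answer_is_correct, log_prob, gradient_step) is hypothetical, this is not the authors' code, and the control-variate technique is omitted.

```python
# A minimal sketch of an MCMC-EM fine-tuning loop for chain-of-thought, in the
# spirit of the abstract. The model interface (sample_rationale,
# answer_is_correct, log_prob, gradient_step) is hypothetical, and the paper's
# control-variate technique is not shown.

import random


def mcmc_em_finetune(model, dataset, num_steps, chains=None):
    """dataset is a list of (question, answer) pairs; chains stores one
    persistent, last-accepted rationale per question."""
    if chains is None:
        chains = [None] * len(dataset)

    for _ in range(num_steps):
        i = random.randrange(len(dataset))
        question, answer = dataset[i]

        # E-step (MCMC): propose a rationale from the current model and keep it
        # only if it leads to the correct answer; otherwise retain the rationale
        # previously accepted for this question (a persistent chain).
        proposal = model.sample_rationale(question)
        if model.answer_is_correct(question, proposal, answer):
            chains[i] = proposal
        rationale = chains[i]
        if rationale is None:
            continue  # no rationale accepted yet for this question

        # M-step: one gradient step on the joint log-likelihood of the accepted
        # rationale and the correct answer, approximating gradient ascent on
        # the marginal log-likelihood of the answer.
        loss = -model.log_prob(question, rationale, answer)
        model.gradient_step(loss)

    return chains
```

Keeping one persistent, last-accepted rationale per question gives each MCMC chain somewhere to resume from between updates, which is the persistent-chain flavor that the abstract's references to persistent contrastive divergence and memoized wake-sleep suggest.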