

Training Chain-of-Thought via Latent-Variable Inference

November 28, 2023
Authors: Du Phan, Matthew D. Hoffman, David Dohan, Sholto Douglas, Tuan Anh Le, Aaron Parisi, Pavel Sountsov, Charles Sutton, Sharad Vikram, Rif A. Saurous
cs.AI

Abstract

Large language models (LLMs) solve problems more accurately and interpretably when instructed to work out the answer step by step using a "chain-of-thought" (CoT) prompt. One can also improve LLMs' performance on a specific task by supervised fine-tuning, i.e., by using gradient ascent on some tunable parameters to maximize the average log-likelihood of correct answers from a labeled training set. Naively combining CoT with supervised tuning requires supervision not just of the correct answers, but also of detailed rationales that lead to those answers; these rationales are expensive to produce by hand. Instead, we propose a fine-tuning strategy that tries to maximize the marginal log-likelihood of generating a correct answer using CoT prompting, approximately averaging over all possible rationales. The core challenge is sampling from the posterior over rationales conditioned on the correct answer; we address it using a simple Markov-chain Monte Carlo (MCMC) expectation-maximization (EM) algorithm inspired by the self-taught reasoner (STaR), memoized wake-sleep, Markovian score climbing, and persistent contrastive divergence. This algorithm also admits a novel control-variate technique that drives the variance of our gradient estimates to zero as the model improves. Applying our technique to GSM8K and the tasks in BIG-Bench Hard, we find that this MCMC-EM fine-tuning technique typically improves the model's accuracy on held-out examples more than STaR or prompt-tuning with or without CoT.
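
To make the objective concrete: writing x for a question, z for a latent chain-of-thought rationale, and y for the answer (notation ours, not taken from the paper), the marginal log-likelihood being maximized has the standard latent-variable form

\[
\mathcal{L}(\theta) = \sum_i \log p_\theta(y_i \mid x_i) = \sum_i \log \sum_z p_\theta(z \mid x_i)\, p_\theta(y_i \mid x_i, z).
\]

By Fisher's identity, its gradient is an expectation under the posterior over rationales,

\[
\nabla_\theta \log p_\theta(y \mid x) = \mathbb{E}_{p_\theta(z \mid x,\, y)}\big[\nabla_\theta \log p_\theta(z, y \mid x)\big],
\]

which is exactly why the core challenge is sampling rationales conditioned on the correct answer: given such samples, the M-step reduces to ordinary gradient ascent on the log-probability of sampled rationale-answer pairs.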
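A minimal Python sketch of one training sweep may help fix ideas. The interface names, the simple accept-a-proposal-only-if-it-reaches-the-correct-answer rule, and the omission of the paper's control variate are all illustrative assumptions on our part, not the authors' specification:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical stand-ins for an LLM fine-tuning stack; none of these names
# or signatures come from the paper.
SampleFn = Callable[[str], str]          # question -> sampled CoT rationale ending in an answer
CheckFn = Callable[[str, str], bool]     # (rationale, gold_answer) -> does it reach the answer?
GradStepFn = Callable[[str, str], None]  # (question, rationale) -> one ascent step on log-prob


@dataclass
class Example:
    question: str
    answer: str
    rationale: Optional[str] = None  # persistent per-example MCMC state


def mcmc_em_sweep(
    examples: List[Example],
    sample: SampleFn,
    check: CheckFn,
    grad_step: GradStepFn,
    n_proposals: int = 4,
) -> None:
    """One sweep of an MCMC-EM loop in the spirit of the abstract.

    E-step: for each example, propose rationales from the current model and
    accept one that reaches the correct answer; otherwise keep the rationale
    retained from a previous sweep, one persistent chain per example.
    M-step: gradient ascent on log p(rationale, answer | question) for the
    currently accepted rationales.
    """
    for ex in examples:
        # E-step: a simple independence-sampler-style move targeting the
        # posterior over rationales conditioned on the correct answer.
        for _ in range(n_proposals):
            proposal = sample(ex.question)
            if check(proposal, ex.answer):
                ex.rationale = proposal  # accept the proposal
                break  # otherwise the chain stays at its previous state
        # M-step: only rationales that justify the correct answer are trained on.
        # (The paper additionally applies a control-variate correction to these
        # gradient estimates; that refinement is omitted in this sketch.)
        if ex.rationale is not None:
            grad_step(ex.question, ex.rationale)
```

Keeping one persistent rationale per example is what gives the loop its MCMC character: when no proposal is accepted, the chain simply stays where it was rather than discarding the example, which is plausibly where the inspiration from persistent contrastive divergence enters.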