The ART of LLM Refinement: Ask, Refine, and Trust
November 14, 2023
Authors: Kumar Shridhar, Koustuv Sinha, Andrew Cohen, Tianlu Wang, Ping Yu, Ram Pasunuru, Mrinmaya Sachan, Jason Weston, Asli Celikyilmaz
cs.AI
Abstract
In recent years, Large Language Models (LLMs) have demonstrated remarkable
generative abilities, but can they judge the quality of their own generations?
A popular concept, referred to as self-refinement, postulates that LLMs can
detect and correct the errors in their generations when asked to do so.
However, recent empirical evidence points in the opposite direction, suggesting
that LLMs often struggle to accurately identify errors when reasoning is
involved. To address this, we propose a reasoning with refinement objective
called ART: Ask, Refine, and Trust, which asks necessary questions to decide
when an LLM should refine its output, and either affirm or withhold trust in
its refinement by ranking the refinement and the initial prediction. On two
multistep reasoning tasks of mathematical word problems (GSM8K) and question
answering (StrategyQA), ART achieves a performance gain of +5 points over
self-refinement baselines, while using a much smaller model as the decision
maker. We also demonstrate the benefit of using smaller models to make
refinement decisions as a cost-effective alternative to fine-tuning a larger
model.
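
To make the three-stage decision flow concrete, below is a minimal Python sketch of the Ask, Refine, and Trust loop as described in the abstract. This is an illustrative reading of the method, not the authors' implementation: the function `art_answer` and every callable passed into it (`generate`, `needs_refinement`, `refine`, `prefer_refined`) are hypothetical placeholders standing in for real model calls.

```python
from typing import Callable


def art_answer(
    problem: str,
    generate: Callable[[str], str],                   # initial LLM prediction
    needs_refinement: Callable[[str, str], bool],     # ASK: smaller decision model
    refine: Callable[[str, str], str],                # REFINE: LLM revision step
    prefer_refined: Callable[[str, str, str], bool],  # TRUST: ranking model
) -> str:
    """Return an answer following the Ask, Refine, Trust decision flow."""
    initial = generate(problem)

    # ASK: a smaller model decides whether the initial prediction
    # needs refining at all; if not, return it unchanged.
    if not needs_refinement(problem, initial):
        return initial

    # REFINE: the LLM revises its initial output.
    refined = refine(problem, initial)

    # TRUST: rank the refinement against the initial prediction and
    # keep the refinement only if the ranker prefers it; otherwise
    # withhold trust and fall back to the initial answer.
    return refined if prefer_refined(problem, initial, refined) else initial


# Toy usage with stub callables (stand-ins for real model calls):
answer = art_answer(
    "If Ann has 3 apples and buys 2 more, how many does she have?",
    generate=lambda p: "5",
    needs_refinement=lambda p, a: False,   # stub: decision model trusts the answer
    refine=lambda p, a: a,
    prefer_refined=lambda p, a, r: False,
)
```

One design point worth noting from the abstract: the ask and trust decisions can be made by a much smaller model than the one generating and refining answers, which is what makes the approach a cost-effective alternative to fine-tuning the larger model.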