The ART of LLM Refinement: Ask, Refine, and Trust
November 14, 2023
作者: Kumar Shridhar, Koustuv Sinha, Andrew Cohen, Tianlu Wang, Ping Yu, Ram Pasunuru, Mrinmaya Sachan, Jason Weston, Asli Celikyilmaz
cs.AI
Abstract
In recent years, Large Language Models (LLMs) have demonstrated remarkable
generative abilities, but can they judge the quality of their own generations?
A popular concept, referred to as self-refinement, postulates that LLMs can
detect and correct the errors in their generations when asked to do so.
However, recent empirical evidence points in the opposite direction, suggesting
that LLMs often struggle to accurately identify errors when reasoning is
involved. To address this, we propose a reasoning with refinement objective
called ART: Ask, Refine, and Trust, which asks necessary questions to decide
when an LLM should refine its output, and either affirm or withhold trust in
its refinement by ranking the refinement and the initial prediction. On two
multistep reasoning tasks of mathematical word problems (GSM8K) and question
answering (StrategyQA), ART achieves a performance gain of +5 points over
self-refinement baselines, while using a much smaller model as the decision
maker. We also demonstrate the benefit of using smaller models to make
refinement decisions as a cost-effective alternative to fine-tuning a larger
model.
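The abstract describes ART as a three-stage decision pipeline: Ask (a smaller model decides whether the LLM's output needs refinement), Refine (the LLM revises its answer), and Trust (the refinement is ranked against the initial prediction, and the better one is kept). The sketch below illustrates that control flow only; all interface names (`generate`, `needs_refinement`, `refine`, `prefer_refined`) and the toy stand-ins are hypothetical assumptions, not the authors' implementation.

```python
# Minimal sketch of the ART (Ask, Refine, Trust) workflow described in the abstract.
# The callables are hypothetical placeholders: in the paper, the "Ask" and "Trust"
# decisions are made by a smaller model, while generation/refinement use the larger LLM.

from dataclasses import dataclass
from typing import Callable


@dataclass
class ARTPipeline:
    generate: Callable[[str], str]                    # larger LLM: initial answer
    needs_refinement: Callable[[str, str], bool]      # "Ask": should the answer be refined?
    refine: Callable[[str, str], str]                 # larger LLM: revised answer
    prefer_refined: Callable[[str, str, str], bool]   # "Trust": rank refinement vs. initial

    def answer(self, question: str) -> str:
        initial = self.generate(question)
        # Ask: a smaller decision-maker checks whether refinement is needed at all.
        if not self.needs_refinement(question, initial):
            return initial
        # Refine: the LLM revises its output.
        refined = self.refine(question, initial)
        # Trust: keep the refinement only if it ranks above the initial prediction.
        return refined if self.prefer_refined(question, initial, refined) else initial


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; real usage would wrap model calls.
    pipeline = ARTPipeline(
        generate=lambda q: "17",
        needs_refinement=lambda q, a: True,
        refine=lambda q, a: "19",
        prefer_refined=lambda q, a, r: True,
    )
    print(pipeline.answer("Natalia sold 48 clips in April and half as many in May. How many in total?"))
```

The key design point the abstract highlights is that the Ask and Trust decisions can be delegated to a much smaller model, which the authors report as a cost-effective alternative to fine-tuning the larger model itself.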