The ART of LLM Refinement: Ask, Refine, and Trust

Abstract

In recent years, Large Language Models (LLMs) have demonstrated remarkable generative abilities, but can they judge the quality of their own generations? A popular concept, referred to as self-refinement, postulates that LLMs can detect and correct the errors in their generations when asked to do so. However, recent empirical evidence points in the opposite direction, suggesting that LLMs often struggle to accurately identify errors when reasoning is involved. To address this, we propose a reasoning-with-refinement objective called ART: Ask, Refine, and Trust. ART asks the questions needed to decide when an LLM should refine its output, and then either affirms or withholds trust in the refinement by ranking it against the initial prediction. On two multistep reasoning tasks, mathematical word problems (GSM8K) and question answering (StrategyQA), ART achieves a performance gain of +5 points over self-refinement baselines while using a much smaller model as the decision maker. We also demonstrate the benefit of using smaller models to make refinement decisions as a cost-effective alternative to fine-tuning a larger model.
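The abstract describes ART only at a high level, so the control flow it implies is sketched below under stated assumptions. This is not the authors' implementation: the function `art_answer` and its four callable parameters (`generate`, `needs_refinement`, `refine`, `pick_better`) are hypothetical names introduced for illustration, and the toy stand-ins at the bottom replace what would be model calls in practice.

```python
from typing import Callable


def art_answer(
    question: str,
    generate: Callable[[str], str],
    needs_refinement: Callable[[str, str], bool],
    refine: Callable[[str, str], str],
    pick_better: Callable[[str, str, str], str],
) -> str:
    """Hypothetical Ask -> Refine -> Trust control flow.

    All four callables are assumptions: in the paper's setting they would
    wrap language models, with the Ask and Trust decisions made by a
    smaller model than the one that generates answers.
    """
    initial = generate(question)

    # Ask: decide whether the initial answer should be refined at all.
    if not needs_refinement(question, initial):
        return initial

    # Refine: produce a revised answer from the initial one.
    refined = refine(question, initial)

    # Trust: rank the refinement against the initial prediction and
    # keep whichever one the ranker prefers.
    return pick_better(question, initial, refined)


# Toy stand-ins so the sketch runs without any model (purely illustrative).
def toy_generate(question: str) -> str:
    return "20"  # a deliberately wrong first attempt


def toy_asker(question: str, answer: str) -> bool:
    return answer != "14"  # pretend a sub-question check flags the answer


def toy_refine(question: str, answer: str) -> str:
    return "14"


def toy_truster(question: str, initial: str, refined: str) -> str:
    return refined if refined == "14" else initial


if __name__ == "__main__":
    print(art_answer("What is 2 + 3 * 4?", toy_generate, toy_asker,
                     toy_refine, toy_truster))
```

Passing the decision steps in as separate callables mirrors the abstract's point that the model deciding whether to refine, and whether to trust the refinement, need not be the model that produced the answer.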
