Is Programming by Example solved by LLMs?

Abstract

Programming-by-Examples (PBE) aims to generate an algorithm from input-output examples. Such systems are practically and theoretically important: from an end-user perspective, they are deployed to millions of people, and from an AI perspective, PBE corresponds to a very general form of few-shot inductive inference. Given the success of Large Language Models (LLMs) in code-generation tasks, we investigate here the extent to which LLMs can be said to have "solved" PBE. We experiment on classic domains such as lists and strings, and on an uncommon graphics programming domain that is not well represented in typical pretraining data. We find that pretrained models are not effective at PBE, but that they can be fine-tuned for much higher performance, provided the test problems are in-distribution. We analyze empirically what causes these models to succeed and fail, and take steps toward understanding how to achieve better out-of-distribution generalization. Collectively, these results suggest that LLMs make strong progress toward solving the typical suite of PBE tasks, potentially increasing the flexibility and applicability of PBE systems, while also identifying ways in which LLMs still fall short.
