Classical Planning with LLM-Generated Heuristics: Challenging the State of the Art with Python Code

Abstract

In recent years, large language models (LLMs) have shown remarkable capabilities in various artificial intelligence problems. However, they fail to plan reliably, even when prompted with a detailed definition of the planning task. Attempts to improve their planning capabilities, such as chain-of-thought prompting, fine-tuning, and explicit "reasoning", still yield incorrect plans and usually fail to generalize to larger tasks. In this paper, we show how to use LLMs to generate correct plans, even for out-of-distribution tasks of increasing size. For a given planning domain, we ask an LLM to generate several domain-dependent heuristic functions in the form of Python code, evaluate them on a set of training tasks within a greedy best-first search, and choose the strongest one. The resulting LLM-generated heuristics solve many more unseen test tasks than state-of-the-art domain-independent heuristics for classical planning. They are even competitive with the strongest learning algorithm for domain-dependent planning. These findings are especially remarkable given that our proof-of-concept implementation is based on an unoptimized Python planner, whereas the baselines all build upon highly optimized C++ code. In some domains, the LLM-generated heuristics expand fewer states than the baselines, revealing that they are not only efficiently computable but sometimes even more informative than the state-of-the-art heuristics. Overall, our results show that sampling a set of planning heuristic function programs can significantly improve the planning capabilities of LLMs.
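The generate, evaluate, and select loop described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy grid domain, the three hand-written candidate heuristics (stand-ins for LLM output; no LLM is called), the expansion limit, and the scoring rule (most training tasks solved, ties broken by fewer expansions).

```python
import heapq
from itertools import count


def gbfs(start, goal, successors, h, max_expansions):
    """Greedy best-first search: always expand the open state with the lowest h.

    Returns (solved, number_of_expansions).
    """
    tie = count()  # FIFO tie-breaker so states themselves are never compared
    open_list = [(h(start, goal), next(tie), start)]
    seen = {start}
    expansions = 0
    while open_list and expansions < max_expansions:
        _, _, state = heapq.heappop(open_list)
        if state == goal:
            return True, expansions
        expansions += 1
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(open_list, (h(nxt, goal), next(tie), nxt))
    return False, expansions


def grid_successors(size):
    """Hypothetical toy domain: move in four directions on an empty size x size grid."""
    def successors(state):
        x, y = state
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < size and 0 <= ny < size:
                yield (nx, ny)
    return successors


# Hand-written stand-ins for candidate heuristics an LLM might return.
def h_blind(state, goal):
    return 0


def h_manhattan(state, goal):
    return abs(state[0] - goal[0]) + abs(state[1] - goal[1])


def h_misleading(state, goal):
    return -h_manhattan(state, goal)  # deliberately poor: prefers states far from the goal


CANDIDATES = {"blind": h_blind, "manhattan": h_manhattan, "misleading": h_misleading}


def evaluate(h, tasks, max_expansions):
    """Run GBFS with heuristic h on every task; return (tasks solved, total expansions)."""
    solved, total = 0, 0
    for size, start, goal in tasks:
        ok, expanded = gbfs(start, goal, grid_successors(size), h, max_expansions)
        solved += ok
        total += expanded
    return solved, total


def select_best(candidates, tasks, max_expansions):
    """Pick the candidate solving the most tasks; break ties by fewer expansions."""
    scores = {name: evaluate(h, tasks, max_expansions) for name, h in candidates.items()}
    best = max(scores, key=lambda name: (scores[name][0], -scores[name][1]))
    return best, scores


# Small training tasks: (grid size, start, goal).
TRAINING_TASKS = [(6, (0, 0), (5, 5)), (8, (0, 0), (7, 7)), (10, (0, 0), (9, 9))]
best_name, scores = select_best(CANDIDATES, TRAINING_TASKS, max_expansions=50)

# Apply the selected heuristic to a larger task that was not used for selection.
solved, expansions = gbfs((0, 0), (29, 29), grid_successors(30),
                          CANDIDATES[best_name], max_expansions=10_000)
print(best_name, scores[best_name], solved, expansions)
```

The selection step only needs a way to run each candidate inside the search and compare outcomes, so the candidates can be treated as opaque functions. In this toy setting the training tasks are kept small and the expansion limit tight, so that uninformative candidates fail on some of them and the comparison becomes meaningful.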

