Complex Logical Instruction Generation

Abstract

Instruction following has catalyzed the recent era of Large Language Models (LLMs) and is the foundational skill underpinning more advanced capabilities such as reasoning and agentic behaviors. As tasks grow more challenging, the logic structures embedded in natural language instructions become increasingly intricate. However, how well LLMs perform on such logic-rich instructions remains under-explored. We propose LogicIFGen and LogicIFEval. LogicIFGen is a scalable, automated framework for generating verifiable instructions from code functions, which naturally express rich logic such as conditionals, nesting, recursion, and function calls. We further curate a collection of complex code functions and use LogicIFGen to construct LogicIFEval, a benchmark comprising 426 verifiable logic-rich instructions. Our experiments demonstrate that current state-of-the-art LLMs still struggle to correctly follow the instructions in LogicIFEval. Most LLMs follow fewer than 60% of the instructions correctly, revealing significant deficiencies in their instruction-following ability. Code and Benchmark: https://github.com/mianzhang/LogicIF
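To make the idea of a verifiable, logic-rich instruction concrete, the following is a minimal sketch and not the LogicIFGen implementation. It assumes that an instruction is checked by comparing a model's reported answer with the output of the code function the instruction was derived from. The function `count_nested_evens`, the text in `INSTRUCTION`, and the helper `is_followed_correctly` are all hypothetical names introduced here for illustration.

```python
from typing import Callable


def count_nested_evens(values: list, depth: int = 0) -> int:
    """Hypothetical seed function mixing recursion, nesting, and conditionals."""
    total = 0
    for item in values:
        if isinstance(item, list):
            # Recurse into nested lists, tracking how deep we are.
            total += count_nested_evens(item, depth + 1)
        elif item % 2 == 0 and depth > 0:
            # Count even numbers only when they sit inside a nested list.
            total += 1
    return total


# Hypothetical natural-language instruction describing the same logic.
INSTRUCTION = (
    "Walk through the list item by item. If an item is itself a list, "
    "repeat this procedure on it. Otherwise, add one to your count only when "
    "the number is even and you are inside a nested list. Report the count."
)


def is_followed_correctly(
    fn: Callable[[list], int], test_input: list, model_answer: int
) -> bool:
    """Assumed check: the model's answer must equal the function's output."""
    return fn(test_input) == model_answer
```

For the input `[2, [4, 5, [6]], 3]` the function returns 2 (the top-level 2 is skipped, while 4 and 6 are counted), so a model that reports 2 after following `INSTRUCTION` would pass this check and any other answer would fail.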
