Complex Logical Instruction Generation

Abstract

Instruction following has catalyzed the recent era of Large Language Models (LLMs) and is the foundational skill underpinning more advanced capabilities such as reasoning and agentic behaviors. As tasks grow more challenging, the logic structures embedded in natural language instructions become increasingly intricate. However, how well LLMs perform on such logic-rich instructions remains under-explored. We propose LogicIFGen and LogicIFEval. LogicIFGen is a scalable, automated framework for generating verifiable instructions from code functions, which can naturally express rich logic such as conditionals, nesting, recursion, and function calls. We further curate a collection of complex code functions and use LogicIFGen to construct LogicIFEval, a benchmark comprising 426 verifiable logic-rich instructions. Our experiments demonstrate that current state-of-the-art LLMs still struggle to correctly follow the instructions in LogicIFEval. Most LLMs follow fewer than 60% of the instructions, revealing significant deficiencies in instruction-following ability. Code and Benchmark: https://github.com/mianzhang/LogicIF
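To make the idea of a "verifiable instruction from a code function" concrete, the following is a minimal illustrative sketch, not the LogicIFGen implementation. It assumes that an instruction counts as followed when the model's final answer equals the output of the reference function on the same input; the abstract does not spell out the checking procedure. The names count_peaks, INSTRUCTION, and is_followed_correctly are hypothetical.

```python
from typing import Callable, List

# Hypothetical seed function: it combines a conditional, a nested helper,
# and recursion, the kinds of logic the abstract mentions.
def count_peaks(values: List[int]) -> int:
    def helper(i: int) -> int:
        # Base case: the last element (or an empty list) cannot be a peak.
        if i >= len(values) - 1:
            return 0
        # A peak is strictly greater than both of its neighbours.
        is_peak = i > 0 and values[i] > values[i - 1] and values[i] > values[i + 1]
        return (1 if is_peak else 0) + helper(i + 1)

    return helper(0)

# Hypothetical natural-language rendering of the function above. In the
# paper this text would be produced by the framework; here it is hand-written.
INSTRUCTION = (
    "Walk through the list from the first element. Skip the first and last "
    "elements. For every other element, add one to your count if it is "
    "strictly greater than both neighbours. Report the final count."
)

# Assumed verification rule: compare the model's answer with the
# reference function's output on the same input.
def is_followed_correctly(
    model_answer: int,
    reference: Callable[[List[int]], int],
    test_input: List[int],
) -> bool:
    return model_answer == reference(test_input)

if __name__ == "__main__":
    data = [1, 3, 2, 5, 4]
    print(count_peaks(data))
    print(is_followed_correctly(2, count_peaks, data))
```

Because the reference function is executable, any number of test inputs can be checked automatically, which is one plausible reading of why code functions make the instructions verifiable.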
