Complex Logical Instruction Generation
August 12, 2025
Authors: Mian Zhang, Shujian Liu, Sixun Dong, Ming Yin, Yebowen Hu, Xun Wang, Steven Ma, Song Wang, Sathish Reddy Indurthi, Haoyun Deng, Zhiyu Zoey Chen, Kaiqiang Song
cs.AI
Abstract
Instruction following has catalyzed the recent era of Large Language Models (LLMs) and is the foundational skill underpinning more advanced capabilities such as reasoning and agentic behaviors. As tasks grow more challenging, the logical structures embedded in natural-language instructions become increasingly intricate. However, how well LLMs perform on such logic-rich instructions remains under-explored. We propose LogicIFGen and LogicIFEval. LogicIFGen is a scalable, automated framework for generating verifiable instructions from code functions, which can naturally express rich logic such as conditionals, nesting, recursion, and function calls. We further curate a collection of complex code functions and use LogicIFGen to construct LogicIFEval, a benchmark comprising 426 verifiable, logic-rich instructions. Our experiments demonstrate that current state-of-the-art LLMs still struggle to correctly follow the instructions in LogicIFEval: most can follow fewer than 60% of them, revealing significant deficiencies in their instruction-following abilities. Code and benchmark: https://github.com/mianzhang/LogicIF