CipherBank: Exploring the Boundary of LLM Reasoning Capabilities through Cryptography Challenges
April 27, 2025
Authors: Yu Li, Qizhi Pei, Mengyuan Sun, Honglin Lin, Chenlin Ming, Xin Gao, Jiang Wu, Conghui He, Lijun Wu
cs.AI
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities, with
recent advancements in reasoning, such as o1 and o3, pushing the
boundaries of AI. Despite these impressive achievements in mathematics and
coding, the reasoning abilities of LLMs in domains requiring cryptographic
expertise remain underexplored. In this paper, we introduce CipherBank, a
comprehensive benchmark designed to evaluate the reasoning capabilities of LLMs
in cryptographic decryption tasks. CipherBank comprises 2,358 meticulously
crafted problems, covering 262 unique plaintexts across 5 domains and 14
subdomains, with a focus on privacy-sensitive and real-world scenarios that
necessitate encryption. From a cryptographic perspective, CipherBank
incorporates 3 major categories of encryption methods, spanning 9 distinct
algorithms, ranging from classical ciphers to custom cryptographic techniques.
We evaluate state-of-the-art LLMs on CipherBank, e.g., GPT-4o, DeepSeek-V3, and
cutting-edge reasoning-focused models such as o1 and DeepSeek-R1. Our results
reveal significant gaps in reasoning abilities not only between general-purpose
chat LLMs and reasoning-focused LLMs but also in the performance of current
reasoning-focused models when applied to classical cryptographic decryption
tasks, highlighting the challenges these models face in understanding and
manipulating encrypted data. Through detailed analysis and error
investigations, we provide several key observations that shed light on the
limitations and potential improvement areas for LLMs in cryptographic
reasoning. These findings underscore the need for continuous advancements in
LLM reasoning capabilities.
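To make the task format concrete: the abstract does not list CipherBank's nine algorithms, but a Caesar shift is a representative classical cipher, so the sketch below (names and the sample plaintext are illustrative, not taken from the benchmark) shows the kind of decryption problem a model must solve — recovering a privacy-sensitive plaintext from its ciphertext by reasoning over the cipher's small key space.

```python
def caesar_encrypt(plaintext: str, shift: int) -> str:
    """Shift each letter by `shift` positions, preserving case and non-letters."""
    out = []
    for ch in plaintext:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return ''.join(out)

def caesar_candidates(ciphertext: str) -> list[str]:
    """Enumerate all 26 shifts -- the search space a solver must reason over."""
    return [caesar_encrypt(ciphertext, -s) for s in range(26)]

# Hypothetical privacy-sensitive plaintext in the spirit of the benchmark.
secret = "Meet at the bank on Friday"
ciphertext = caesar_encrypt(secret, 3)
assert secret in caesar_candidates(ciphertext)
```

A brute-force solver like `caesar_candidates` trivializes a Caesar shift; the benchmark's harder categories (custom cipher techniques) presumably resist such enumeration, which is where the reasoning gap the authors report would show up.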