TROPT: An Open Framework for Unifying and Advancing Discrete Text Optimization

Abstract

Discrete text-trigger optimization, the search for text sequences that steer a model toward a specified objective once the model ingests them, underpins model red-teaming (e.g., LLM jailbreaks) as well as auditing and interpretability research. However, the current state of discrete optimizers hinders both their adoption and their progress. First, existing optimizers, when open-sourced at all, are scattered across research codebases tied to specific models, objectives, and problem domains. Second, optimizer variants proliferate; each requires engineering overhead to use or extend, and they remain hard to compare head-to-head. Together, these issues raise the barrier to adopting optimizers in existing or new domains and to advancing them with new strategies. We address these gaps with TROPT, the first open-source framework that unifies the execution of discrete optimizers and standardizes their development under a single interface. TROPT makes it easy to customize end-to-end optimization recipes by swapping any component (model, objective, or optimizer), which extends its reach to different domains and new applications. TROPT currently ships with 30+ optimization recipes, covering applications such as jailbreaking and probing model internals, built from 15+ optimizers (spanning white-box to black-box access) and 15+ losses (from foundational to state-of-the-art methods). To demonstrate its utility, we use TROPT in several studies: (i) controlled, large-scale experiments that compare and enhance optimization strategies for LLM jailbreaks, revealing potent yet underadopted techniques; and (ii) porting optimizers from one domain (e.g., LLM jailbreaking) to new domains (e.g., corpus poisoning of embedding models). Overall, TROPT significantly lowers the barrier to adopting and advancing discrete text optimization.
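
To make the idea of a recipe with swappable components concrete, here is a minimal sketch in Python. It is not TROPT's actual API: every name in it (`Loss`, `Optimizer`, `RandomSubstitution`, `run_recipe`, `toy_loss`) is hypothetical and chosen only for illustration, and the toy objective stands in for a real model-based loss. The sketch assumes a black-box setting in which the optimizer sees only scalar loss values.

```python
import random
from typing import Callable, List, Protocol, Sequence

# All names below are hypothetical illustrations, NOT TROPT's real interface.
# A loss maps a candidate trigger string to a scalar (lower is better).
Loss = Callable[[str], float]


class Optimizer(Protocol):
    """Anything with a `step` method can be plugged into `run_recipe`."""

    def step(self, trigger: List[str], loss: Loss) -> List[str]: ...


class RandomSubstitution:
    """Black-box optimizer: propose one random token swap per step and
    keep it only if the loss does not get worse."""

    def __init__(self, vocab: Sequence[str], seed: int = 0) -> None:
        self.vocab = list(vocab)
        self.rng = random.Random(seed)

    def step(self, trigger: List[str], loss: Loss) -> List[str]:
        candidate = list(trigger)
        pos = self.rng.randrange(len(candidate))
        candidate[pos] = self.rng.choice(self.vocab)
        if loss(" ".join(candidate)) <= loss(" ".join(trigger)):
            return candidate
        return trigger


def run_recipe(optimizer: Optimizer, loss: Loss,
               init: Sequence[str], steps: int) -> List[str]:
    """Run a fixed number of optimizer steps; optimizer and loss are swappable."""
    trigger = list(init)
    for _ in range(steps):
        trigger = optimizer.step(trigger, loss)
    return trigger


def toy_loss(text: str) -> float:
    """Stand-in objective: count of token positions that differ from a fixed
    target. A real recipe would score a model's behavior instead."""
    target = "open the pod bay doors".split()
    return float(sum(a != b for a, b in zip(text.split(), target)))


if __name__ == "__main__":
    vocab = ["open", "the", "pod", "bay", "doors", "x", "y"]
    best = run_recipe(RandomSubstitution(vocab), toy_loss, ["x"] * 5, steps=200)
    print(" ".join(best), toy_loss(" ".join(best)))
```

Under these assumptions, changing the application means passing a different `loss` callable, and changing the search strategy means passing a different object with a `step` method; `run_recipe` itself stays unchanged.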