TROPT: An Open Framework for Unifying and Advancing Discrete Text Optimization

Abstract

Discrete text-trigger optimization, the search for text sequences that steer a model toward a specified objective when the model ingests them, underpins model red-teaming (e.g., LLM jailbreaking) as well as auditing and interpretability. However, the current state of discrete optimizers hinders their adoption and progress. First, existing optimizers, when open-sourced at all, are scattered across research codebases tied to specific models, objectives, and problem domains. Second, optimizer variants proliferate; each requires engineering overhead to use or extend, and they remain hard to compare head-to-head. Together, these factors raise the bar for adopting optimizers in existing or new domains and for advancing them with new strategies. We address these gaps with TROPT, the first open-source framework that unifies the execution of discrete optimizers and standardizes their development under a single interface. TROPT makes it easy to customize end-to-end optimization recipes by swapping any component (models, objectives, or optimizers), which extends its reach to diverse domains and new applications. TROPT currently ships with 30+ optimization recipes, covering applications such as jailbreaking and probing model internals, built from 15+ optimizers (spanning white-box to black-box access) and 15+ losses (from foundational to state-of-the-art methods). To demonstrate its utility, we use TROPT in several studies: (i) controlled, large-scale experiments that compare and enhance optimization strategies for LLM jailbreaking, revealing potent yet under-adopted techniques; and (ii) porting optimizers from one domain (e.g., LLM jailbreaking) to new domains (e.g., corpus-poisoning attacks on embedding models). In all, TROPT significantly lowers the barrier to adopting and advancing discrete text optimization.
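
To make the idea of a swappable optimization recipe concrete, the following is a minimal sketch of a black-box trigger search in which the objective is passed in as a plain callable. It is not TROPT's API: every name here (`random_search`, `make_match_loss`, `VOCAB`, `TARGET`) is hypothetical, and the toy loss, which counts token mismatches against a fixed target, merely stands in for a loss that would be computed from a real model's outputs.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical type for illustration only; not taken from TROPT.
# A loss maps a candidate trigger (list of tokens) to a score; lower is better.
Loss = Callable[[List[str]], float]


def random_search(
    loss: Loss,
    vocab: Sequence[str],
    trigger_len: int = 4,
    steps: int = 500,
    seed: int = 0,
) -> List[str]:
    """Black-box search: mutate one token per step, keep strict improvements."""
    rng = random.Random(seed)
    trigger = [rng.choice(vocab) for _ in range(trigger_len)]
    best = loss(trigger)
    for _ in range(steps):
        candidate = list(trigger)
        # Replace a single randomly chosen position with a random token.
        candidate[rng.randrange(trigger_len)] = rng.choice(vocab)
        score = loss(candidate)
        if score < best:
            trigger, best = candidate, score
    return trigger


def make_match_loss(target: List[str]) -> Loss:
    """Toy objective (assumption): number of positions that differ from target."""
    def loss(trigger: List[str]) -> float:
        return float(sum(a != b for a, b in zip(trigger, target)))
    return loss


VOCAB = ["alpha", "beta", "gamma", "delta", "omega"]
TARGET = ["gamma", "alpha", "omega", "beta"]

loss_fn = make_match_loss(TARGET)
found = random_search(loss_fn, VOCAB, trigger_len=len(TARGET), steps=500)
print(found, loss_fn(found))
```

Because the optimizer only ever calls `loss(trigger)`, a different objective or a different search strategy can be substituted without touching the other piece. That is the kind of component swapping the abstract describes, shown here at toy scale.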