TROPT: An Open Framework for Unifying and Advancing Discrete Text Optimization

Abstract

Discrete text-trigger optimization, the search for text sequences that steer a model toward a specified objective when the model ingests them, underpins model red-teaming (e.g., LLM jailbreaking) as well as auditing and interpretability. However, the current state of discrete optimizers hinders their adoption and progress. First, existing optimizers, when open-sourced at all, are scattered across research codebases tied to specific models, objectives, and problem domains. Second, optimizer variants proliferate; each requires engineering overhead to use or extend, and they remain hard to compare head-to-head. Together, these problems raise the barrier to adopting optimizers in existing or new domains and to advancing them through new strategies. We address these gaps with TROPT, the first open-source framework that unifies the execution of discrete optimizers and standardizes their development under a single interface. TROPT makes it easy to customize end-to-end optimization recipes by swapping any component (model, objective, or optimizer), which extends its reach across domains and to new applications. TROPT currently ships with 30+ optimization recipes, covering applications such as jailbreaking and probing model internals, built from 15+ optimizers (spanning white-box to black-box access) and 15+ losses, ranging from foundational to state-of-the-art methods. To demonstrate its utility, we use TROPT in several studies: (i) controlled, large-scale experiments that compare and enhance optimization strategies for LLM jailbreaks, revealing potent yet underadopted techniques; and (ii) porting optimizers from one domain (e.g., LLM jailbreaking) to new domains (e.g., corpus poisoning of embedding models). In all, TROPT significantly lowers the barrier to adopting and advancing discrete text optimization.
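
To make the idea of a recipe with swappable components concrete, the following is a minimal, self-contained sketch. It is not TROPT's actual API: every name in it (`run_recipe`, `RandomSearch`, `toy_model`, `hamming_objective`, `VOCAB`, `TARGET`) is hypothetical, the "model" is a toy function, and the optimizer is a plain black-box random search chosen only for brevity.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Protocol, Sequence, Tuple

# Hypothetical names throughout; this is an illustration, not TROPT's interface.
Loss = Callable[[str], float]  # maps a trigger string to a loss (lower is better)


class Optimizer(Protocol):
    def step(self, trigger: List[str], loss: Loss) -> List[str]: ...


@dataclass
class RandomSearch:
    """Black-box optimizer: try random single-token substitutions, keep the best."""
    vocab: Sequence[str]
    rng: random.Random
    n_candidates: int = 16

    def step(self, trigger: List[str], loss: Loss) -> List[str]:
        best, best_loss = trigger, loss(" ".join(trigger))
        for _ in range(self.n_candidates):
            cand = list(trigger)
            cand[self.rng.randrange(len(cand))] = self.rng.choice(self.vocab)
            cand_loss = loss(" ".join(cand))
            if cand_loss < best_loss:  # accept only strict improvements
                best, best_loss = cand, cand_loss
        return best


def run_recipe(model: Callable[[str], List[str]],
               objective: Callable[[List[str]], float],
               optimizer: Optimizer,
               init: Sequence[str],
               steps: int) -> Tuple[List[str], float]:
    """Compose model + objective into a loss, then let the optimizer refine the trigger."""
    def loss(text: str) -> float:
        return objective(model(text))

    trigger = list(init)
    for _ in range(steps):
        trigger = optimizer.step(trigger, loss)
    return trigger, loss(" ".join(trigger))


# Toy stand-ins (assumptions): the "model" just tokenizes its input, and the
# objective counts positions where the output differs from a fixed target.
VOCAB = ["open", "the", "door", "close", "a", "window"]
TARGET = ["open", "the", "door"]


def toy_model(text: str) -> List[str]:
    return text.split()


def hamming_objective(output: List[str]) -> float:
    return float(sum(a != b for a, b in zip(output, TARGET)))


if __name__ == "__main__":
    opt = RandomSearch(VOCAB, random.Random(0))
    trigger, final_loss = run_recipe(toy_model, hamming_objective, opt, ["x", "x", "x"], 20)
    print(trigger, final_loss)
```

In this sketch, each component can be replaced independently: passing a different `model`, `objective`, or `optimizer` to `run_recipe` yields a different recipe without touching the other two. A white-box optimizer would additionally need gradient access to the model, which this toy interface does not expose.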