HiLo-Token: Efficient Image Editing via Input-Adaptive High- and Low-Frequency Token Compression

Abstract

Creative image editing tools, such as Photoshop's Remove and Generative Fill buttons, are central to everyday customer use and account for a major share of traffic in Photoshop and Lightroom. However, current generative AI models face significant latency challenges, which become more pronounced when moving from convolution-based U-Nets to Diffusion Transformers (DiTs). In our evaluation on hundreds of representative image editing samples spanning a wide range of mask ratios, the DiT module alone accounts for an average of 73% of total model latency, even after the model is distilled from 50 timesteps to 8. To address this, we propose HiLo-Token, an input-adaptive token compression framework that allocates more of the token budget to high-frequency, context-rich regions and fewer tokens to low-frequency regions. Inside the editing region specified by the user mask, we retain all tokens within a dilated mask to preserve strong locality and contextual relevance. Outside the editing region, we introduce a simple yet effective high-frequency token selection strategy based on spatial frequency to capture important local details, and we use tokens from a 16x downsampled image to represent low-frequency components, preserving the blurry but global structure. Extensive experiments on production-level evaluation data validate the method. On an A100-80GB, HiLo-Token achieves DiT speedups of 3.13x, 2.59x, and 1.67x on image editing tasks in the small, medium, and large mask-ratio categories (average mask ratios of 6.38%, 15.92%, and 35.36%, respectively), with no regression in generation quality.
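
The three token groups described above (all tokens inside a dilated edit mask, high-frequency tokens selected outside it, and a small set of coarse low-frequency tokens) can be illustrated with a minimal NumPy sketch. It is not the paper's implementation. The function names, the pixel-space patch grid, the gradient-energy score used as a stand-in for "spatial frequency", the `keep_ratio` budget, and the reading of "16x" as a per-side factor on the token grid are all assumptions made for illustration.

```python
import numpy as np


def token_grid_reduce(x, patch, fn):
    """Reduce an (H, W) array to an (H//patch, W//patch) token grid using fn."""
    h, w = x.shape[0] // patch, x.shape[1] // patch
    blocks = x[: h * patch, : w * patch].reshape(h, patch, w, patch)
    return fn(blocks, axis=(1, 3))


def dilate(grid, radius):
    """Binary dilation of a boolean token grid with a square window."""
    h, w = grid.shape
    padded = np.pad(grid, radius, mode="constant", constant_values=False)
    out = np.zeros_like(grid)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy : dy + h, dx : dx + w]
    return out


def select_tokens(image, mask, patch=16, radius=1, keep_ratio=0.25, down=16):
    """Hypothetical HiLo-style token selection on a grayscale image.

    Returns (edit, high, n_low):
      edit  -- boolean token grid, tokens kept inside the dilated edit mask
      high  -- boolean token grid, high-frequency tokens kept outside it
      n_low -- number of tokens a `down`-times coarser grid would contribute
    """
    # 1) A token is in the edit region if any of its pixels is masked; dilate.
    edit = dilate(token_grid_reduce(mask, patch, np.any), radius)

    # 2) Assumed frequency proxy: mean gradient energy per token.
    gy, gx = np.gradient(image.astype(np.float64))
    score = token_grid_reduce(gx**2 + gy**2, patch, np.mean)
    score = np.where(edit, -np.inf, score)  # never rank edit tokens

    # 3) Keep the top-k highest-scoring tokens outside the edit region.
    n_outside = int((~edit).sum())
    k = int(round(keep_ratio * n_outside))
    order = np.argsort(score, axis=None)[::-1][:k]
    high = np.zeros_like(edit)
    high.flat[order] = True

    # 4) Low-frequency tokens: size of the coarse grid (assumed per-side factor).
    n_low = max(1, edit.shape[0] // down) * max(1, edit.shape[1] // down)
    return edit, high, n_low


# Usage: a flat 256x256 image with one textured corner and a small edit mask.
rng = np.random.default_rng(0)
image = np.zeros((256, 256))
image[192:, :64] = rng.random((64, 64))
mask = np.zeros((256, 256), dtype=bool)
mask[64:96, 64:96] = True

edit, high, n_low = select_tokens(image, mask, keep_ratio=16 / 240)
kept = int(edit.sum()) + int(high.sum()) + n_low
print(f"kept {kept} of {edit.size} tokens")
```

In this toy setup the flat background contributes almost no tokens, while the edit neighborhood and the textured corner are kept at full resolution. This is the input-adaptive behavior the abstract describes: the token count, and therefore the DiT cost, shrinks most when the mask is small and the surrounding image is smooth.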