HiLo-Token: Efficient Image Editing via Input-Adaptive High- and Low-Frequency Token Compression

Abstract

Creative image editing tools, such as the Remove and Generative Fill buttons in Photoshop, are central to everyday customer use and account for a major share of traffic in Photoshop and Lightroom. However, current generative AI models face significant latency challenges, which become more pronounced with the transition from convolution-based U-Nets to Diffusion Transformers (DiTs). In our evaluation on hundreds of representative image editing samples spanning a wide range of mask ratios, the DiT module alone accounts for 73% of total model latency on average, even after being distilled from 50 timesteps to 8. To address this, we propose HiLo-Token, an input-adaptive token compression framework that allocates more of the token budget to high-frequency, context-rich regions and fewer tokens to low-frequency regions. Specifically, for the editing region specified by the user mask, we retain all tokens within a dilated mask to preserve strong locality and contextual relevance. Outside the editing region, we introduce a simple yet effective high-frequency token selection strategy based on spatial frequency to capture important local details, and we use tokens from a 16x downsampled image to represent the low-frequency components, preserving the blurry but global structure. Extensive experiments on production-level evaluation data validate the effectiveness of the proposed method. For image editing tasks in the small, medium, and large mask ratio categories (average ratios of 6.38%, 15.92%, and 35.36%), it achieves DiT speedups of 3.13x, 2.59x, and 1.67x, respectively, on an A100-80GB, without any regression in generation quality.
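
The three token groups described above (all tokens inside the dilated edit mask, a high-frequency subset outside it, and a coarse low-frequency representation from a 16x downsampled image) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The following are all assumptions made for illustration: the function names, the patch size of 16, the dilation radius applied on the token grid, the `keep_ratio` budget, the use of mean squared gradient magnitude as a proxy for spatial frequency, average pooling as the downsampling operator, and working on a grayscale pixel image rather than on latents.

```python
import numpy as np

def dilate(grid: np.ndarray, radius: int) -> np.ndarray:
    """Binary dilation with a square (2*radius+1) window, built from shifted ORs."""
    padded = np.pad(grid, radius, mode="constant", constant_values=False)
    out = np.zeros_like(grid, dtype=bool)
    h, w = grid.shape
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def block_reduce(x: np.ndarray, size: int, fn) -> np.ndarray:
    """Apply fn over non-overlapping size x size blocks (trailing remainder is cropped)."""
    gh, gw = x.shape[0] // size, x.shape[1] // size
    blocks = x[:gh * size, :gw * size].reshape(gh, size, gw, size)
    return fn(blocks, axis=(1, 3))

def select_tokens(image, edit_mask, patch=16, dilation=1, keep_ratio=0.1, low_factor=16):
    """Hypothetical selection routine returning (edit_grid, hf_grid, low_res).

    edit_grid: tokens kept because they fall inside the dilated edit mask.
    hf_grid:   tokens outside the edit region kept for their high-frequency content.
    low_res:   average-pooled image standing in for the low-frequency tokens.
    """
    # A token belongs to the edit region if any pixel of its patch is masked.
    edit_grid = dilate(block_reduce(edit_mask.astype(bool), patch, np.any), dilation)

    # Assumed frequency proxy: mean squared gradient magnitude per patch.
    gy, gx = np.gradient(image.astype(np.float64))
    score = block_reduce(gx ** 2 + gy ** 2, patch, np.mean)
    score = np.where(edit_grid, -np.inf, score)  # never re-select edit tokens

    # Keep the top-k highest-scoring tokens outside the edit region.
    k = int(round(keep_ratio * int((~edit_grid).sum())))
    hf_flat = np.zeros(score.size, dtype=bool)
    if k > 0:
        hf_flat[np.argsort(score, axis=None)[-k:]] = True
    hf_grid = hf_flat.reshape(score.shape)

    low_res = block_reduce(image.astype(np.float64), low_factor, np.mean)
    return edit_grid, hf_grid, low_res

# Usage on a synthetic 128x128 image: flat background, one textured square,
# and a small edit mask in the top-left corner.
rng = np.random.default_rng(0)
image = np.zeros((128, 128))
image[64:96, 64:96] = rng.random((32, 32))
edit_mask = np.zeros((128, 128), dtype=bool)
edit_mask[0:16, 0:16] = True

edit_grid, hf_grid, low_res = select_tokens(image, edit_mask)
print("edit tokens:", int(edit_grid.sum()),
      "high-frequency tokens:", int(hf_grid.sum()),
      "low-resolution grid:", low_res.shape)
```

In this toy setup the textured square is the only high-frequency content outside the mask, so its patches are the first to be picked up by the top-k selection, while the flat background is represented only through the pooled low-resolution image. How the retained tokens are actually packed and fed to the DiT is not specified in the abstract and is left out of the sketch.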