HiLo-Token: Input-Adaptive High-Low Frequency Token Compression for Efficient Image Editing

Abstract

Creative image editing tools, such as Photoshop's Remove and Generative Fill buttons, are central to everyday customer use and account for a major share of traffic in Photoshop and Lightroom. However, current generative AI models face significant latency challenges, which become more pronounced when transitioning from convolution-based U-Nets to Diffusion Transformers (DiTs). In our evaluation on hundreds of representative image editing samples spanning a wide range of mask ratios, the DiT module alone accounts for an average of 73% of total model latency, even after being distilled from 50 timesteps to 8. To address this, we propose HiLo-Token, an input-adaptive token compression framework that allocates more of the token budget to high-frequency, context-rich regions and fewer tokens to low-frequency regions. Specifically, for the editing region specified by the user mask, we retain all tokens within a dilated mask to preserve strong locality and contextual relevance. Outside the editing region, we introduce a simple yet effective high-frequency token selection strategy based on spatial frequency to capture important local details, and we represent low-frequency components with tokens from a 16x downsampled image, which preserves the blurry but global structure. Extensive experiments on production-level evaluation data validate the method. On an A100-80GB, HiLo-Token achieves DiT speedups of 3.13x, 2.59x, and 1.67x on image editing tasks in the small, medium, and large mask-ratio categories (average mask ratios of 6.38%, 15.92%, and 35.36%, respectively), without any regression in generation quality.
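
The three token groups described above (all tokens inside the dilated mask, selected high-frequency tokens outside it, and coarse tokens from a downsampled image) can be illustrated with a minimal sketch. The abstract does not specify the frequency measure, the dilation radius, the token budget, or whether "16x" applies per side, so everything below is an assumption made for illustration: the function names, the gradient-based energy proxy, the square dilation window, the `keep_ratio` budget, and the block-average downsampling are hypothetical and are not taken from the HiLo-Token implementation.

```python
import math

def dilate(mask, r):
    """Binary dilation of a token-level mask with a (2r+1) x (2r+1) square window."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            if mask[i][j]:
                for a in range(max(0, i - r), min(h, i + r + 1)):
                    for b in range(max(0, j - r), min(w, j + r + 1)):
                        out[a][b] = 1
    return out

def patch_energy(img, p):
    """Mean absolute neighbor difference inside each p x p patch.

    Assumption: a crude stand-in for a spatial-frequency score.
    """
    th, tw = len(img) // p, len(img[0]) // p
    energy = [[0.0] * tw for _ in range(th)]
    for ti in range(th):
        for tj in range(tw):
            total, count = 0.0, 0
            for y in range(ti * p, (ti + 1) * p):
                for x in range(tj * p, (tj + 1) * p):
                    if x + 1 < (tj + 1) * p:  # horizontal neighbor in the same patch
                        total += abs(img[y][x + 1] - img[y][x])
                        count += 1
                    if y + 1 < (ti + 1) * p:  # vertical neighbor in the same patch
                        total += abs(img[y + 1][x] - img[y][x])
                        count += 1
            energy[ti][tj] = total / count if count else 0.0
    return energy

def select_tokens(img, mask, p=2, r=1, keep_ratio=0.25):
    """Return (edit_tokens, highfreq_tokens) as lists of (row, col) token coordinates."""
    dil = dilate(mask, r)
    energy = patch_energy(img, p)
    th, tw = len(energy), len(energy[0])
    coords = [(i, j) for i in range(th) for j in range(tw)]
    edit = [t for t in coords if dil[t[0]][t[1]]]         # keep every token here
    outside = [t for t in coords if not dil[t[0]][t[1]]]
    k = math.ceil(keep_ratio * len(outside))               # hypothetical budget rule
    ranked = sorted(outside, key=lambda t: energy[t[0]][t[1]], reverse=True)
    return edit, sorted(ranked[:k])

def avg_pool(img, f):
    """Downsample by a factor f per side using block averaging (low-frequency view)."""
    h, w = len(img) // f, len(img[0]) // f
    return [[sum(img[y * f + dy][x * f + dx] for dy in range(f) for dx in range(f)) / (f * f)
             for x in range(w)] for y in range(h)]

# Toy example: 8x8 grayscale image, 2x2 patches, so a 4x4 token grid.
img = [[0.0] * 8 for _ in range(8)]
img[6][6] = img[7][7] = 1.0          # a small textured spot far from the edit region
mask = [[0] * 4 for _ in range(4)]
mask[0][0] = 1                       # user edits the top-left token
edit, high = select_tokens(img, mask, p=2, r=1, keep_ratio=0.25)
low = avg_pool(img, 4)               # factor 4 for the toy image; the abstract uses 16x
```

In this toy setup the textured patch at token (3, 3) is selected as a high-frequency token even though it lies outside the dilated mask, while the flat background is represented mainly by the coarse `low` grid. A real pipeline would operate on latent tokens and use whatever frequency criterion the method actually defines.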