HiLo-Token: Input-Adaptive High-Low Frequency Token Compression for Efficient Image Editing

June 11, 2026
Authors: Haoran You, Yotam Nitzan, Lingzhi Zhang, Yifan Gong, Mang-Tik Chiu, Connelly Barnes, Yan Kang, Yuqian Zhou, Eli Shechtman, Sohrab Amirghodsi
cs.AI

Abstract

Creative image editing tools, such as Photoshop's Remove or Generative Fill buttons, are central to everyday customer use and account for a major share of traffic in Photoshop and Lightroom. However, current generative AI models face significant latency challenges, which become even more pronounced when transitioning from convolution-based U-Nets to Diffusion Transformers (DiTs). In our evaluation on hundreds of representative image editing samples spanning a wide range of mask ratios, the DiT module alone accounts for an average of 73% of the total model latency, even after being distilled from 50 timesteps down to 8 timesteps. To tackle this challenge, we propose HiLo-Token, an input-adaptive token compression framework that allocates more token budget to high-frequency, rich-context regions while assigning fewer tokens to low-frequency areas. Specifically, for the editing region specified by the user mask, we retain all tokens within a dilated mask to preserve strong locality and contextual relevance. Outside the editing region, we introduce a simple yet effective high-frequency token selection strategy based on spatial frequency to capture important local details, while using tokens from a 16x downsampled image to represent low-frequency components and preserve the blurry but global structure. Extensive experiments on production-level evaluation data validate the effectiveness of the proposed method, achieving 3.13x, 2.59x, and 1.67x DiT speedups on A100-80GB for image editing tasks across small, medium, and large mask ratio categories with average ratios of 6.38%, 15.92%, and 35.36%, respectively, without any regression in generation quality.
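
The token selection described in the abstract can be illustrated with a minimal sketch in pure Python. It is not the authors' implementation. The abstract does not define its spatial-frequency measure, dilation radius, or token budget, so the sketch makes the following assumptions: `patch_energy` (mean absolute difference between neighbouring pixels) stands in for "spatial frequency", `radius` and `hf_budget` are hypothetical parameters, and `avg_pool` takes the downsampling factor as an argument because the abstract does not say whether 16x is per side or by area.

```python
from typing import List, Set, Tuple

Grid = List[List[float]]   # grayscale image as rows of pixel values
Token = Tuple[int, int]    # (row, col) position in the token grid


def patch_energy(img: Grid, r: int, c: int, p: int) -> float:
    """Mean absolute neighbour difference inside one p x p patch.

    Assumed proxy for spatial frequency; the paper's measure may differ.
    """
    total, count = 0.0, 0
    for y in range(r * p, (r + 1) * p):
        for x in range(c * p, (c + 1) * p):
            if x + 1 < (c + 1) * p:      # horizontal neighbour in the patch
                total += abs(img[y][x + 1] - img[y][x])
                count += 1
            if y + 1 < (r + 1) * p:      # vertical neighbour in the patch
                total += abs(img[y + 1][x] - img[y][x])
                count += 1
    return total / count if count else 0.0


def dilate(mask: Set[Token], rows: int, cols: int, radius: int) -> Set[Token]:
    """Grow the masked token set by `radius` tokens in every direction."""
    grown: Set[Token] = set()
    for r, c in mask:
        for dr in range(-radius, radius + 1):
            for dc in range(-radius, radius + 1):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    grown.add((rr, cc))
    return grown


def select_tokens(img: Grid, mask: Set[Token], patch: int,
                  radius: int, hf_budget: int) -> Tuple[Set[Token], Set[Token]]:
    """Return (tokens kept in the dilated mask, high-frequency tokens outside it)."""
    rows, cols = len(img) // patch, len(img[0]) // patch
    keep = dilate(mask, rows, cols, radius)
    outside = [(r, c) for r in range(rows) for c in range(cols)
               if (r, c) not in keep]
    # Rank the remaining tokens by energy and keep the top `hf_budget`.
    outside.sort(key=lambda t: patch_energy(img, t[0], t[1], patch),
                 reverse=True)
    return keep, set(outside[:hf_budget])


def avg_pool(img: Grid, factor: int) -> Grid:
    """Average-pool the image; its patches would supply the low-frequency tokens."""
    h, w = len(img) // factor, len(img[0]) // factor
    return [[sum(img[y * factor + dy][x * factor + dx]
                 for dy in range(factor) for dx in range(factor)) / factor ** 2
             for x in range(w)]
            for y in range(h)]
```

In this sketch the transformer would process the union of three groups: every token in the dilated mask, the top-ranked high-frequency tokens outside it, and the tokens of the pooled image. The second and third groups shrink as the budget and pooling factor tighten, while the first grows with the mask, which is consistent with the abstract's smaller speedups for larger mask ratios.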