HiLo-Token: Input-Adaptive High-Low Frequency Token Compression for Efficient Image Editing
Source Evidence
What Changed
Creative image editing tools, such as Photoshop's Remove or Generative Fill buttons, are central to everyday customer...
Why It Matters
This work targets the latency of Diffusion Transformers (DiTs) in generative image editing. The authors measure the DiT module at an average of 73% of total model latency, even after distillation to 8 timesteps. By compressing tokens outside the user's edit region, the method reports DiT speedups of 1.67x to 3.13x on A100-80GB with no regression in generation quality. For products such as Photoshop that are moving from convolution-based U-Nets to DiTs, this bears directly on user-facing responsiveness and on serving cost.
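To put the reported numbers in end-to-end terms, the sketch below applies Amdahl's law to the figures given in the abstract: a 73% average DiT share of total latency and DiT speedups of 3.13x, 2.59x, and 1.67x. It is a back-of-the-envelope estimate, not a result from the paper. It assumes the 73% share holds within each mask-ratio category and that the non-DiT portion of latency is unchanged, neither of which the source states. The function and variable names are illustrative.

```python
def end_to_end_speedup(dit_share, dit_speedup):
    """Amdahl's law: only the DiT fraction of total latency is accelerated."""
    return 1.0 / ((1.0 - dit_share) + dit_share / dit_speedup)


# Figures quoted in the abstract.
DIT_SHARE = 0.73  # average DiT share of total model latency
REPORTED = {"small": 3.13, "medium": 2.59, "large": 1.67}  # DiT speedup per mask-ratio category

# Assumption: the average share applies to every category equally.
estimates = {name: end_to_end_speedup(DIT_SHARE, s) for name, s in REPORTED.items()}

for name, value in estimates.items():
    print(f"{name}: {value:.2f}x estimated end-to-end")
```

Under these assumptions the whole-model speedup would be roughly 2.0x, 1.8x, and 1.4x for the small, medium, and large categories, because the remaining 27% of latency is untouched.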
Confirmed Facts
Creative image editing tools, such as Photoshop's Remove or Generative Fill buttons, are central to everyday customer use and account for a major share of traffic in Photoshop and Lightroom. However, current generative AI models face significant latency challenges, which become even more pronounced when transitioning from convolution-based U-Nets to Diffusion Transformers (DiTs). In our evaluation on hundreds of representative image editing samples spanning a wide range of mask ratios, the DiT module alone accounts for an average of 73% of the total model latency, even after being distilled from 50 timesteps down to 8 timesteps. To tackle this challenge, we propose HiLo-Token, an input-adaptive token compression framework that allocates more token budget to high-frequency, rich-context regions while assigning fewer tokens to low-frequency areas. Specifically, for the editing region specified by the user mask, we retain all tokens within a dilated mask to preserve strong locality and contextual relevance. Outside the editing region, we introduce a simple yet effective high-frequency token selection strategy based on spatial frequency to capture important local details, while using tokens from a 16x downsampled image to represent low-frequency components and preserve the blurry but global structure. Extensive experiments on production-level evaluation data validate the effectiveness of the proposed method, achieving 3.13x, 2.59x, and 1.67x DiT speedups on A100-80GB for image editing tasks across small, medium, and large mask ratio categories with average ratios of 6.38%, 15.92%, and 35.36%, respectively, without any regression in generation quality.
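The three-way token split described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation. Several details are assumptions because the abstract does not specify them: per-patch pixel variance stands in for the spatial-frequency score, a fixed top-k budget stands in for the input-adaptive budget, the dilation uses a square neighborhood, and "16x downsampled" is read as 16x per side with the same patch size. All function and parameter names are hypothetical.

```python
from math import ceil


def dilate(mask, r):
    """Dilate a boolean token-grid mask by r tokens (square neighborhood)."""
    h, w = len(mask), len(mask[0])
    out = [[False] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            if mask[i][j]:
                for a in range(max(0, i - r), min(h, i + r + 1)):
                    for b in range(max(0, j - r), min(w, j + r + 1)):
                        out[a][b] = True
    return out


def patch_variance(img, i, j, p):
    """Pixel variance inside token (i, j); an assumed proxy for high-frequency content."""
    vals = [img[y][x]
            for y in range(i * p, (i + 1) * p)
            for x in range(j * p, (j + 1) * p)]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)


def select_tokens(img, mask, p=2, r=1, k=4, down=16):
    """Return (tokens kept inside the dilated mask,
               top-k high-frequency tokens outside it,
               number of low-frequency tokens from the downsampled image)."""
    h, w = len(mask), len(mask[0])
    dil = dilate(mask, r)
    inside = [(i, j) for i in range(h) for j in range(w) if dil[i][j]]
    outside = [(i, j) for i in range(h) for j in range(w) if not dil[i][j]]
    # Highest score first; sorted() is stable, so ties keep raster order.
    ranked = sorted(outside, key=lambda t: -patch_variance(img, t[0], t[1], p))
    high = ranked[:k]
    # Assumption: each low-frequency token covers a down x down block of full-resolution tokens.
    n_low = ceil(h / down) * ceil(w / down)
    return inside, high, n_low


# Toy input: 8x8 token grid, 2x2-pixel patches, flat image with one textured corner.
P, GRID = 2, 8
img = [[0.0] * (GRID * P) for _ in range(GRID * P)]
for y in range(12, 16):
    for x in range(12, 16):
        img[y][x] = float((x + y) % 2)  # checkerboard texture
mask = [[False] * GRID for _ in range(GRID)]
mask[0][0] = True  # user edits the top-left token

inside, high, n_low = select_tokens(img, mask, p=P, r=1, k=4, down=16)
print(len(inside) + len(high) + n_low, "tokens kept out of", GRID * GRID)
```

In this toy case the dilated mask keeps the four tokens around the edit, the high-frequency pick lands on the four textured tokens in the opposite corner, and a single low-frequency token summarizes the rest. A real system would pass the kept tokens, with their positions, to the DiT in place of the full grid.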
Who Is Affected
- AI product teams
What To Watch Next
- Watch for independent replications, benchmark scrutiny, and whether labs turn this work into shipped systems.
- Watch whether additional sources confirm the same claim.