HiLo-Token: Input-Adaptive High-Low Frequency Token Compression for Efficient Image Editing

Signal 44

Source Confidence 80%

Claim Status: verified

Source Evidence

Verified

Primary Source

Hugging Face (Haoran You)

huggingface.co

Source Type

research

Source Published

Jun 10, 2026, 20:00 UTC

AIFreshWire Pipeline

Ingested: Jun 18, 2026, 18:47 UTC

Last checked: Jun 18, 2026, 18:47 UTC

What Changed

Creative image editing tools, such as Photoshop's Remove or Generative Fill buttons, are central to everyday customer...

Why It Matters

This research targets the latency of generative image editing in products such as Photoshop, where the move from convolution-based U-Nets to Diffusion Transformers (DiTs) makes inference slower. The authors report that input-adaptive token compression yields DiT speedups of 1.67x to 3.13x on an A100-80GB, depending on mask size, with no reported regression in generation quality. If those results hold up, the approach could reduce both response time and serving cost for DiT-based editing tools.

Confirmed Facts

Creative image editing tools, such as Photoshop's Remove or Generative Fill buttons, are central to everyday customer use and account for a major share of traffic in Photoshop and Lightroom. However, current generative AI models face significant latency challenges, which become even more pronounced when transitioning from convolution-based U-Nets to Diffusion Transformers (DiTs). In our evaluation on hundreds of representative image editing samples spanning a wide range of mask ratios, the DiT module alone accounts for an average of 73% of the total model latency, even after being distilled from 50 timesteps down to 8 timesteps.

To tackle this challenge, we propose HiLo-Token, an input-adaptive token compression framework that allocates more token budget to high-frequency, rich-context regions while assigning fewer tokens to low-frequency areas. Specifically, for the editing region specified by the user mask, we retain all tokens within a dilated mask to preserve strong locality and contextual relevance. Outside the editing region, we introduce a simple yet effective high-frequency token selection strategy based on spatial frequency to capture important local details, while using tokens from a 16x downsampled image to represent low-frequency components and preserve the blurry but global structure.

Extensive experiments on production-level evaluation data validate the effectiveness of the proposed method, achieving 3.13x, 2.59x, and 1.67x DiT speedups on A100-80GB for image editing tasks across small, medium, and large mask ratio categories with average ratios of 6.38%, 15.92%, and 35.36%, respectively, without any regression in generation quality.
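
The three token groups described above (all tokens inside a dilated mask, high-frequency tokens outside it, and a small set of tokens from a downsampled image) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The summary does not specify the frequency measure, the dilation radius, the token budget, or the patch size, so everything here is an assumption: the names `dilate`, `patch_energy`, and `select_tokens` are hypothetical, the mean absolute Laplacian stands in for "spatial frequency", `keep_ratio` stands in for the token budget, and "16x" is read as a per-side downsampling factor.

```python
import numpy as np

PATCH = 16  # assumed pixels per token side (not given in the source)
DOWN = 16   # assumed per-side factor for the "16x downsampled image"

def dilate(token_mask, radius):
    """Binary dilation of a boolean grid with a (2*radius+1)^2 square."""
    padded = np.pad(token_mask, radius, mode="constant", constant_values=False)
    out = np.zeros_like(token_mask)
    h, w = token_mask.shape
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def patch_energy(gray, patch=PATCH):
    """Mean absolute Laplacian per patch (an assumed frequency proxy)."""
    p = np.pad(gray, 1, mode="edge")
    lap = p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4.0 * gray
    gh, gw = gray.shape[0] // patch, gray.shape[1] // patch
    return np.abs(lap).reshape(gh, patch, gw, patch).mean(axis=(1, 3))

def select_tokens(gray, mask, dilation=1, keep_ratio=0.05,
                  patch=PATCH, down=DOWN):
    """Return (edit_keep, hf_keep, low_tokens) for one grayscale image.

    Image sides must be divisible by patch * down in this sketch.
    """
    gray = np.asarray(gray, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    gh, gw = gray.shape[0] // patch, gray.shape[1] // patch

    # 1) Editing region: a token is masked if any of its pixels is masked;
    #    keep every token inside the dilated token mask.
    token_mask = mask.reshape(gh, patch, gw, patch).any(axis=(1, 3))
    edit_keep = dilate(token_mask, dilation)

    # 2) Outside the editing region: keep the top-k highest-energy tokens.
    energy = patch_energy(gray, patch)
    energy[edit_keep] = -np.inf  # never pick a token twice
    k = int(round(keep_ratio * (~edit_keep).sum()))
    hf_keep = np.zeros_like(edit_keep)
    if k > 0:
        hf_keep.flat[np.argsort(energy, axis=None)[-k:]] = True

    # 3) Low-frequency branch: average-pool the image by `down` per side
    #    and count the tokens that the pooled image would produce.
    low = gray.reshape(gray.shape[0] // down, down,
                       gray.shape[1] // down, down).mean(axis=(1, 3))
    low_tokens = (low.shape[0] // patch) * (low.shape[1] // patch)
    return edit_keep, hf_keep, low_tokens
```

Under these assumptions, a 512x512 image has a 32x32 grid of 1024 full-resolution tokens, and the sketch keeps only the dilated-mask tokens, a fraction `keep_ratio` of the remaining tokens, and a handful of low-frequency tokens. How the kept tokens are assembled into a DiT input sequence, and how the budget adapts per input, is not described in this summary and is left out.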

Who Is Affected

AI product teams

What To Watch Next

Watch for independent replications, benchmark scrutiny, and whether labs turn this work into shipped systems.
Watch whether additional sources confirm the same claim.
