ViT-Up: Faithful Feature Upsampling for Vision Transformers
Vision Transformers (ViTs) have become a dominant architecture for visual representation learning, providing exceptio...
Source Evidence
What Changed
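ViT-Up replaces the external image guidance used by recent feature upsamplers with layer-wise query construction from intermediate ViT hidden states, so features can be predicted at arbitrary continuous image coordinates while staying aligned with the backbone feature space.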
Why It Matters
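ViT-Up targets the dense-prediction bottleneck created by small patch-token grids without relying on external image guidance. It predicts features at arbitrary continuous image coordinates directly from intermediate ViT hidden states, which is meant to avoid the feature leakage, fragmentation, and blur that shallow image encoders can introduce in guided upsampling. The reported gains over prior upsamplers grow when moving from DINOv3-S+ to the larger DINOv3-B backbone, which makes the approach relevant to high-resolution, transformer-based dense vision.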
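To make "feature prediction at continuous image coordinates" concrete, here is a minimal sketch of the simplest possible baseline: bilinear interpolation of a patch-token grid. This is not ViT-Up's method, which the abstract describes only as layer-wise query construction from intermediate hidden states. The function name `sample_features`, the 14x14 grid, the 384-dimensional tokens, and the token-center convention are all assumptions chosen for illustration.

```python
import numpy as np


def sample_features(grid: np.ndarray, xy: np.ndarray) -> np.ndarray:
    """Bilinearly interpolate a (H, W, C) token grid at normalized coords.

    xy has shape (N, 2) with x and y in [0, 1] across the image. Token
    centers are assumed to sit at (i + 0.5) / size along each axis.
    """
    h, w, _ = grid.shape
    # Map normalized image coordinates to fractional token indices.
    gx = np.clip(xy[:, 0] * w - 0.5, 0.0, w - 1.0)
    gy = np.clip(xy[:, 1] * h - 0.5, 0.0, h - 1.0)
    # Integer neighbors on each side of the query point.
    x0 = np.floor(gx).astype(int)
    y0 = np.floor(gy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    # Interpolation weights, broadcast over the channel axis.
    wx = (gx - x0)[:, None]
    wy = (gy - y0)[:, None]
    top = grid[y0, x0] * (1 - wx) + grid[y0, x1] * wx
    bottom = grid[y1, x0] * (1 - wx) + grid[y1, x1] * wx
    return top * (1 - wy) + bottom * wy


# Hypothetical 14x14 grid of 384-dim patch tokens (random stand-in values).
tokens = np.random.default_rng(0).normal(size=(14, 14, 384))
queries = np.array([[0.5, 0.5], [0.1234, 0.9]])
features = sample_features(tokens, queries)
print(features.shape)  # (2, 384)
```

A baseline like this can be queried at any coordinate but only blends neighboring tokens, so it adds no detail beyond the coarse grid. That limitation is what learned upsamplers such as the ones discussed here try to overcome.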
Confirmed Facts
Vision Transformers (ViTs) have become a dominant architecture for visual representation learning, providing exceptionally strong and broadly reusable backbone features. However, ViTs are commonly operated on relatively small patch-token grids due to the quadratic cost of global self-attention, which creates a persistent bottleneck for dense prediction tasks such as semantic segmentation and depth estimation. This has motivated the development of task-agnostic feature upsamplers. While recent state-of-the-art methods produce visually sharp dense representations, their reliance on shallow image encoders for guided upsampling can introduce feature leakage, fragmentation, and blur. We introduce ViT-Up, an implicit feature upsampling framework that replaces external image guidance with layer-wise query construction from intermediate ViT hidden states. This enables feature prediction at arbitrary continuous image coordinates while preserving alignment with the backbone feature space. Experiments demonstrate that ViT-Up consistently outperforms state-of-the-art image-guided upsamplers across dense prediction and semantic correspondence. On DINOv3-S+, ViT-Up improves over prior methods by up to +2.07 mIoU on Cityscapes and +4.17 PCK@0.10 on SPair-71k. With the larger DINOv3-B backbone, these gains increase to +3.36 mIoU and +8.09 PCK@0.10, demonstrating that ViT-Up scales favorably with backbone capacity.
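The quadratic cost mentioned above can be illustrated with a short calculation. The sketch below counts patch tokens and query-key pairs for a square image. The patch size of 16 and the image sizes are assumptions chosen for illustration, not figures from the paper, and the helper names are hypothetical.

```python
def patch_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (tokens_per_side, total_tokens) for a square image."""
    side = image_size // patch_size
    return side, side * side


def attention_pairs(num_tokens: int) -> int:
    """Number of query-key pairs in one global self-attention map."""
    return num_tokens * num_tokens


# Assumed patch size of 16 and illustrative input resolutions.
for size in (224, 448, 896):
    side, tokens = patch_grid(size, 16)
    pairs = attention_pairs(tokens)
    print(f"{size}px -> {side}x{side} grid, {tokens} tokens, {pairs} pairs")
```

Doubling the image side quadruples the token count and multiplies the number of attention pairs by 16, which is why backbones are usually run on small grids and dense features are recovered afterwards by an upsampler.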
Who Is Affected
- Hugging Face
- AI product teams
What To Watch Next
- Watch for independent replications, benchmark scrutiny, and whether labs turn this work into shipped systems.
- Look for corroboration from an official source or a second reliable report.
Still Developing
- The reported results are plausible but have not yet been independently confirmed.