RATS! Patches Talk Through Registers: Emergent Parts in Register Attention Transformers

When humans see a bird, they recognize far more than just "bird" -- they see a head, wings, and talons, a structured ...

Signal 50

Source Confidence 80%

Claim Status: developing

Source Evidence

Primary Source

arXiv (Timing Yang)

arxiv.org

Source Type

research

Source Published

Jun 12, 2026, 17:59 UTC

AIFreshWire Pipeline

Ingested: 8 days ago / Jun 15, 2026, 03:02 UTC

Last checked: 8 days ago / Jun 15, 2026, 03:02 UTC

Why It Matters

RATS introduces a minimalist architectural prior that learns part-level structure without supervision. The abstract reports improved segmentation performance along with a reusable, interpretable register dictionary whose parts stay consistent across related categories, which could appeal to both research and production teams looking for modular, explainable vision models.

Confirmed Facts

When humans see a bird, they recognize far more than just "bird" -- they see a head, wings, and talons, a structured assembly of reusable parts that can be identified across every bird they have ever seen. We ask whether a self-supervised visual model can discover the same compositional structure on its own. To this end, we propose RATS (Register Attention Transformers), which decomposes the classification token into N learnable register tokens that route patch information through an L->N->N->L bottleneck via a three-step compress-communicate-broadcast attention. The N registers are partitioned across the H attention heads, so that registers assigned to different heads do not interact with each other. Without auxiliary losses or part annotations, each register spontaneously specializes into a proto-semantic region whose emerging structure resembles object parts. RATS surpasses all baselines by +12 mIoU on average across five segmentation benchmarks, with consistent gains on ADE20K (+1.11 mIoU) and COCO (+0.2 AP^m). Its register dictionary further exhibits part-level consistency and semantic proximity across related categories. Our results suggest that RATS may provide a useful architectural prior for structured and interpretable visual representation learning.
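
The compress-communicate-broadcast routing described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names (`attend`, `register_attention`), the absence of learned query/key/value projections, the per-head register width, and the contiguous assignment of registers to heads are all assumptions made for the example, since the abstract does not specify them.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(q, k, v):
    """Scaled dot-product attention; q is (..., A, d), k and v are (..., B, d)."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


def register_attention(patches, registers, num_heads):
    """Hypothetical L -> N -> N -> L bottleneck (no learned projections).

    patches:   (L, D) patch tokens
    registers: (N, D // num_heads) register tokens, N divisible by num_heads
    """
    L, D = patches.shape
    N = registers.shape[0]
    d = D // num_heads          # channels per head
    n = N // num_heads          # registers per head (assumed even split)

    # Split patch channels into heads: (H, L, d).
    x = patches.reshape(L, num_heads, d).transpose(1, 0, 2)
    # Assign n contiguous registers to each head: (H, n, d).
    r = registers.reshape(num_heads, n, d)

    # 1. Compress (L -> N): registers query the patches.
    r, _ = attend(r, x, x)
    # 2. Communicate (N -> N): registers attend only to registers in their own head.
    r, _ = attend(r, r, r)
    # 3. Broadcast (N -> L): patches query the registers.
    out, route = attend(x, r, r)

    # Merge heads back to (L, D); route is (H, L, n) patch-to-register weights.
    return out.transpose(1, 0, 2).reshape(L, D), route


rng = np.random.default_rng(0)
L, D, H, N = 196, 64, 4, 16
patches = rng.standard_normal((L, D))
registers = rng.standard_normal((N, D // H))
out, route = register_attention(patches, registers, H)
print(out.shape, route.shape)  # (196, 64) (4, 196, 4)
```

Because every attention call is batched over the head axis, registers assigned to different heads never exchange information, which mirrors the partitioning the abstract describes. In this sketch, the `route` weights are the quantity one would inspect to see which register each patch draws from.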

Who Is Affected

AI product teams

What To Watch Next

Watch for independent replications, benchmark scrutiny, and whether labs turn this work into shipped systems.
Look for corroboration from an official source or a second reliable report.

Still Developing

The claim is plausible but still developing.
