Complementary Attention Head Pruning for Efficient Transformers

The remarkable success of Transformer-based models in natural language processing stems from architectural scaling, w...

Signal 71

Source Confidence 100%

Claim Status: verified

Source Evidence

Primary Source

arXiv (Yaniv Livertovsky)

arxiv.org

Source Type

research

Source Published

Jun 17, 2026, 14:56 UTC

AIFreshWire Pipeline

Ingested: Jun 18, 2026, 01:31 UTC

Last checked: Jun 18, 2026, 01:31 UTC

Why It Matters

CAHP compresses Transformers after training by treating attention head selection as a global graph problem: it keeps a diverse set of complementary heads and determines how many to keep automatically, so no pruning ratio or sparsity level has to be set by hand. The authors report that this avoids the proximity bias of gradient-based methods and outperforms competitive baselines at high compression ratios, which is relevant to anyone deploying Transformer models in resource-constrained environments.

Confirmed Facts

The remarkable success of Transformer-based models in natural language processing stems from architectural scaling, which leads to a large number of parameters and hinders deployment in resource-constrained environments. While structured pruning offers a pathway to compression, existing state-of-the-art methods often rely on gradient-based importance ranking or stochastic gating, which suffer from instability, structural degeneration, and the need for extensive manual hyperparameter tuning. In this paper, we introduce CAHP (Complementary Attention Head Pruning), a novel post-hoc framework that redefines head selection as a global graph-theoretical problem. Rather than evaluating heads in isolation, CAHP utilizes graph-based clustering combined with information-theoretic distance measures to identify and preserve a topologically diverse subset of complementary attention heads. Without requiring a predefined sparsity level or pruning ratio, the framework automatically determines the number of selected attention heads across layers by identifying a diminishing marginal performance curve, where pruning additional heads leads to a sharp degradation in performance, as determined by the chosen polynomial degree. Extensive evaluations on the SST-5 and MNLI benchmarks, across different Transformer model scales, demonstrate that CAHP consistently outperforms competitive baselines, particularly in high-compression regimes. Furthermore, our structural analysis shows that CAHP avoids the "proximity bias" of gradient-based pruning methods, which tend to preserve heads mainly in layers close to the output, and instead retains a functionally critical set of attention heads in the model's intermediate layers.
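
The idea of keeping a diverse set of complementary heads, rather than ranking heads one by one, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not name the distance measure or the clustering algorithm, so the Jensen-Shannon distance, the threshold graph with connected components, the representative rule, and every function name and toy value below are assumptions chosen for illustration. The sketch also omits the automatic stopping criterion described above.

```python
import math
from itertools import combinations


def js_distance(p, q):
    """Jensen-Shannon distance (base 2) between two discrete distributions."""
    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return math.sqrt(max(0.0, (kl(p, m) + kl(q, m)) / 2))


def cluster_heads(head_dists, threshold):
    """Link heads whose distance is below `threshold`, then return the
    connected components of that graph (assumed clustering rule)."""
    names = list(head_dists)
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        if js_distance(head_dists[a], head_dists[b]) < threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return list(clusters.values())


def select_complementary(head_dists, threshold):
    """Keep one head per cluster: the member with the smallest mean
    distance to the other members of its cluster (assumed rule)."""
    kept = []
    for cluster in cluster_heads(head_dists, threshold):
        def mean_dist(h):
            others = [o for o in cluster if o != h]
            if not others:
                return 0.0
            total = sum(js_distance(head_dists[h], head_dists[o]) for o in others)
            return total / len(others)

        kept.append(min(cluster, key=mean_dist))
    return sorted(kept)


# Toy averaged attention distributions over four token positions
# (hypothetical values, not taken from the paper).
heads = {
    "L0.H0": [0.70, 0.10, 0.10, 0.10],
    "L0.H1": [0.68, 0.12, 0.10, 0.10],  # nearly redundant with L0.H0
    "L1.H0": [0.10, 0.10, 0.10, 0.70],
    "L1.H1": [0.25, 0.25, 0.25, 0.25],
}

kept = select_complementary(heads, threshold=0.2)
print(kept)
```

In this toy setup the two nearly identical layer-0 heads fall into one cluster and only one of them is kept, while the two distinct layer-1 heads each survive. The threshold here is a hand-picked parameter of the sketch; the paper states that its framework needs no predefined sparsity level.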

Who Is Affected

AI product teams

What To Watch Next

Watch for independent replications, benchmark scrutiny, and whether labs turn this work into shipped systems.
Watch whether additional sources confirm the same claim.
