Complementary Attention Head Pruning for Efficient Transformers
Source Evidence
What Changed
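The paper introduces CAHP (Complementary Attention Head Pruning), a post-hoc framework that recasts attention-head selection as a global graph-theoretical problem instead of ranking heads in isolation by gradient-based importance or stochastic gating.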
Why It Matters
CAHP compresses Transformers without a manually chosen pruning ratio: it uses graph-based clustering to select a diverse set of complementary attention heads and determines automatically how many to keep. According to the authors, this avoids the proximity bias of gradient-based methods, which tend to keep heads mainly in layers close to the output, and it outperforms competitive baselines at high compression on SST-5 and MNLI. If the results hold up, the approach could ease deployment of large NLP models in resource-constrained environments.
Confirmed Facts
The remarkable success of Transformer-based models in natural language processing stems from architectural scaling, which leads to a large number of parameters and hinders deployment in resource-constrained environments. While structured pruning offers a pathway to compression, existing state-of-the-art methods often rely on gradient-based importance ranking or stochastic gating, which suffer from instability, structural degeneration, and the need for extensive manual hyperparameter tuning. In this paper, we introduce CAHP (Complementary Attention Head Pruning), a novel post-hoc framework that redefines head selection as a global graph-theoretical problem. Rather than evaluating heads in isolation, CAHP utilizes graph-based clustering combined with information-theoretic distance measures to identify and preserve a topologically diverse subset of complementary attention heads. Without requiring a predefined sparsity level or pruning ratio, the framework automatically determines the number of selected attention heads across layers by identifying the point of diminishing marginal returns on the performance curve, beyond which pruning additional heads leads to a sharp degradation in performance; where this point falls depends on the chosen polynomial degree. Extensive evaluations on the SST-5 and MNLI benchmarks, across different Transformer model scales, demonstrate that CAHP consistently outperforms competitive baselines, particularly in high-compression regimes. Furthermore, our structural analysis shows that CAHP avoids the "proximity bias" of gradient-based pruning methods, which tend to preserve heads mainly in layers close to the output, and instead retains a functionally critical set of attention heads in the model's intermediate layers.
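To make the head-selection idea concrete, here is a minimal sketch under stated assumptions. The abstract does not name the distance measure or the clustering algorithm, so this example substitutes the Jensen-Shannon distance, a thresholded similarity graph, connected components as clusters, and a medoid as each cluster's representative. The function names and the threshold parameter are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def js_distance(p, q):
    """Jensen-Shannon distance (base 2) between two discrete distributions."""
    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the KL sum.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    # Clamp at 0 to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, 0.5 * kl(p, m) + 0.5 * kl(q, m)))


def select_complementary_heads(head_dists, threshold):
    """Return one representative head per cluster of similar heads.

    head_dists: dict mapping a head id to its (averaged) attention distribution.
    threshold:  heads closer than this are linked in the similarity graph
                (a hypothetical knob, not a parameter from the paper).
    """
    heads = list(head_dists)
    dist = {h: {h: 0.0} for h in heads}
    neighbors = {h: set() for h in heads}
    for a, b in combinations(heads, 2):
        d = js_distance(head_dists[a], head_dists[b])
        dist[a][b] = dist[b][a] = d
        if d < threshold:
            neighbors[a].add(b)
            neighbors[b].add(a)

    selected, seen = [], set()
    for start in heads:
        if start in seen:
            continue
        # Depth-first search collects one connected component (a cluster).
        component, stack = [], [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbors[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        # Keep the medoid: the head with the smallest total distance
        # to the other members of its cluster.
        medoid = min(component, key=lambda h: sum(dist[h][o] for o in component))
        selected.append(medoid)
    return selected
```

Calling select_complementary_heads with a dict of per-head attention distributions and a small threshold collapses near-duplicate heads into a single representative while leaving dissimilar heads untouched. This sketch requires a threshold, whereas the paper reports determining the number of retained heads automatically, so it illustrates only the diversity-based selection step.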
Who Is Affected
- AI product teams deploying Transformer models in resource-constrained environments
What To Watch Next
- Watch for independent replications, benchmark scrutiny, and whether labs turn this work into shipped systems.
- Watch whether additional sources confirm the reported gains over baselines in high-compression regimes.