RepSelect: Robust LLM Unlearning via Representation Selectivity
Making large language models (LLMs) deeply forget specific knowledge and values without sacrificing general capabilit...
Source Evidence
What Changed
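The paper proposes RepSelect (Representation Selectivity), an unlearning method that collapses the top principal components of weight gradients before each update so that only forget-set-specific representations are changed. The authors report that this makes forgetting much harder to reverse by fine-tuning or few-shot prompting than with existing methods.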
Why It Matters
RepSelect aims for deeper forgetting by updating only the representations specific to the forget set, which the authors report leaves general capabilities intact. If the results hold, operators could remove sensitive or harmful knowledge from a deployed LLM without retraining it from scratch and with little loss of general performance. The reported resilience to fine-tuning and few-shot prompting attacks would also make unlearning more useful for compliance and audit purposes, because forgotten content would be harder to restore after release. Companies that adopt it early may be better placed if regulation comes to require verifiable removal.
Confirmed Facts
Making large language models (LLMs) deeply forget specific knowledge and values without sacrificing general capabilities remains a central challenge in unlearning. However, current methods are easily reversed by fine-tuning or few-shot prompting, suggesting their forgetting is only shallow. We identify the root cause: existing methods target representations shared with both the retain set and the subspace recovered by a fine-tuning attacker, making unlearning both disruptive to general capabilities and easy to reverse. We propose RepSelect (Representation Selectivity), which isolates forget-set-specific representations by collapsing the top principal components of weight gradients before each update, leaving general capabilities intact while limiting what fine-tuning can recover. We evaluate across two forget categories, biohazardous knowledge and abusive tendencies, and four model families spanning dense and Mixture-of-Experts architectures (Llama 3, Qwen 3.5, Gemma 4 E4B, DeepSeek V2 Lite). Compared to five popular baselines (GradDiff, NPO, SimNPO, RMU, UNDIAL), RepSelect achieves a 4-50x larger reduction in post-relearning answer accuracy than the strongest baseline, and is near-perfectly robust to few-shot prompting attacks. Targeting selective representations is thus an important step towards deep and robust LLM forgetting.
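The abstract describes the core step only as collapsing the top principal components of weight gradients before each update. The following is a minimal, dependency-free sketch of one possible reading of that step, not the authors' implementation. It treats a layer's gradient as a matrix, estimates its dominant singular directions by power iteration, subtracts them, and then applies a plain gradient step. The function names, the use of singular directions in place of a full PCA, the value of `k`, and the learning rate are all assumptions made for illustration.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def top_singular_triplet(grad, iters=200):
    """Estimate the largest singular value of grad and its left/right
    singular vectors by power iteration on grad^T grad."""
    grad_t = [list(col) for col in zip(*grad)]
    # Fixed start vector (an assumption): it must not be orthogonal
    # to the dominant right singular vector for the iteration to find it.
    v = [1.0] * len(grad[0])
    for _ in range(iters):
        w = matvec(grad_t, matvec(grad, v))
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # nothing left to extract
            return 0.0, [0.0] * len(grad), v
        v = [x / norm for x in w]
    u = matvec(grad, v)
    sigma = math.sqrt(sum(x * x for x in u))
    if sigma == 0.0:
        return 0.0, u, v
    return sigma, [x / sigma for x in u], v


def collapse_top_components(grad, k):
    """Return a copy of grad with its k dominant singular directions removed."""
    residual = [row[:] for row in grad]
    for _ in range(k):
        sigma, u, v = top_singular_triplet(residual)
        # Deflation: subtract the rank-1 term sigma * u * v^T.
        residual = [
            [residual[i][j] - sigma * u[i] * v[j] for j in range(len(v))]
            for i in range(len(u))
        ]
    return residual


def filtered_update(weight, grad, k=1, lr=0.1):
    """Hypothetical update step: gradient descent on the filtered gradient."""
    filtered = collapse_top_components(grad, k)
    return [
        [w - lr * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weight, filtered)
    ]


# Toy gradient whose dominant direction (magnitude 3) stands in for a
# component shared with the retain set; the smaller one stands in for
# a forget-set-specific component.
toy_grad = [[3.0, 0.0], [0.0, 1.0]]
toy_weight = [[1.0, 1.0], [1.0, 1.0]]
print(filtered_update(toy_weight, toy_grad, k=1))
```

In this toy case only the entry tied to the smaller component moves, which mirrors the stated goal of leaving shared directions alone. A practical version would more likely call an SVD routine from a numerical library on each layer's gradient instead of running power iteration on Python lists.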
Who Is Affected
- DeepSeek (DeepSeek V2 Lite)
- Meta AI (Llama 3)
- Qwen (Qwen 3.5)
- Google (Gemma 4 E4B)
- AI product teams
What To Watch Next
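- Watch for independent benchmark validation of the post-relearning and few-shot prompting results, a code or model release, and early adoption by teams deploying LLMs.
- Watch whether additional sources confirm the reported 4-50x improvement over the strongest baseline.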