Sub-Billion, Super-Frontier: Small Language Models Rival Zero-Shot Frontier LLMs on General and Literary Relation Extraction
Large language models (LLMs) achieve strong relation extraction (RE), but their computational demands and reliance on...
Source Evidence
What Changed
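Fine-tuned small language models (360M to 3B parameters) are reported to outperform zero-shot frontier LLMs on relation extraction (RE). The best sub-billion model, Qwen2.5-0.5B fine-tuned on pooled general-domain data, reaches a general-domain positive-class micro-F1 of 0.83, versus 0.69 for GPT-5.4 and 0.66 for Claude Sonnet 4.6.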
Why It Matters
If a fine-tuned 0.5B-parameter model can outperform zero-shot GPT-5.4 on relation extraction, privacy-sensitive deployments could move from costly cloud-hosted LLM APIs to on-premises inference on a single consumer GPU, reducing both cost and data-exposure risk. The advantage comes from task-specific adaptation rather than model size, so it depends on labeled task data being available. Where such data exist, the shift could ease adoption in regulated industries and make domain-specific extraction practical at scale.
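The hardware claim can be sanity-checked with back-of-the-envelope arithmetic. The sketch below estimates weight storage only, assuming nominal parameter counts and exactly 4 bits per weight. The function name `weight_memory_gb` is hypothetical, and real 4-bit formats add overhead for quantization scales, activations, and the KV cache, so actual usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int = 4) -> float:
    """Approximate storage for model weights alone, in gigabytes (1 GB = 1e9 bytes).

    Assumption: every parameter is stored at exactly `bits_per_weight` bits.
    This ignores quantization metadata, activations, and the KV cache.
    """
    total_bits = num_params * bits_per_weight
    total_bytes = total_bits / 8
    return total_bytes / 1e9


# Nominal sizes spanning the range named in the abstract (360M to 3B parameters).
for label, params in [("360M", 360e6), ("0.5B", 0.5e9), ("3B", 3e9)]:
    print(f"{label}: about {weight_memory_gb(params):.2f} GB of 4-bit weights")
```

Under these assumptions even the largest model in the study needs well under 2 GB for its weights, which is consistent with the single-consumer-GPU framing.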
Confirmed Facts
Large language models (LLMs) achieve strong relation extraction (RE), but their computational demands and reliance on proprietary APIs limit deployment in resource-constrained or privacy-sensitive settings. We investigate how far small language models (SLMs) can close this gap across general-domain and literary text. We evaluate five models from 360M to 3B parameters under three domain-composition regimes and two prompt-conditioned tuning styles (30 configurations), comparing them with zero-shot frontier LLMs and a discriminative RoBERTa baseline. Across nine benchmarks, the best sub-billion model, Qwen2.5-0.5B fine-tuned on pooled general-domain data, achieves a general-domain positive-class micro-F1 of 0.83, versus 0.69 for GPT-5.4 and 0.66 for Claude Sonnet 4.6 evaluated zero-shot. This does not imply that SLMs are intrinsically stronger; rather, targeted task adaptation enables 4-bit models deployable on a single consumer GPU to outperform general-purpose frontier systems under this protocol. An in-domain RoBERTa baseline also exceeds both frontier models, indicating that the gain stems from task adaptation rather than generative decoding. On literary RE, tuned SLMs reach 0.92 on the human-annotated Biographical benchmark versus 0.83 for GPT-5.4, and 0.833 versus 0.578 on the two-benchmark literary average. A targeted domain-adaptive pretraining case study yields no practically meaningful gain over supervised fine-tuning, while the cleanest within-family scale comparison shows only marginal improvement. These results show that, when task-specific data are available, compact task-adapted models can provide accurate, private, and hardware-efficient RE.
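The headline figures are positive-class micro-F1 scores. The sketch below shows one common way to compute that metric, assuming each example carries a single gold label and a single predicted label, with a dedicated negative label excluded from scoring. The function name, the `no_relation` label, and the toy labels are illustrative assumptions. The paper's exact scoring script is not described in this summary.

```python
from typing import Sequence


def positive_class_micro_f1(
    gold: Sequence[str],
    pred: Sequence[str],
    negative_label: str = "no_relation",  # assumed name for the negative class
) -> float:
    """Micro-averaged F1 over all relation labels except the negative class."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != negative_label and p == g:
            tp += 1  # correct positive prediction
            continue
        if p != negative_label:
            fp += 1  # predicted a relation that is absent or of the wrong type
        if g != negative_label:
            fn += 1  # a gold relation was missed or mislabeled

    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Toy data with hypothetical relation labels.
gold = ["born_in", "no_relation", "spouse", "no_relation", "born_in"]
pred = ["born_in", "spouse", "no_relation", "no_relation", "spouse"]
print(round(positive_class_micro_f1(gold, pred), 3))
```

Because correct negative predictions contribute nothing to the counts, a model cannot raise this score by predicting the negative class everywhere, which matters when most candidate pairs carry no relation.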
Who Is Affected
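- Qwen (Qwen2.5-0.5B is the best sub-billion model in the study)
- Anthropic (Claude Sonnet 4.6 was evaluated zero-shot as a frontier baseline)
- GPT (GPT-5.4 was evaluated zero-shot as a frontier baseline)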
- AI product teams
What To Watch Next
- Watch for independent validation of the benchmark results, including whether the gap holds when frontier LLMs are evaluated beyond the zero-shot protocol used here.
- Watch whether additional sources confirm the same claim.