Sub-Billion, Super-Frontier: Small Language Models Rival Zero-Shot Frontier LLMs on General and Literary Relation Extraction

Large language models (LLMs) achieve strong relation extraction (RE), but their computational demands and reliance on...

Signal 70

Source Confidence 100%

Claim Status: verified

Source Evidence

Verified

Primary Source

arXiv (Despina Christou)

arxiv.org

Source Type

research

Source Published

Jun 21, 2026, 17:24 UTC

AIFreshWire Pipeline

Ingested: 1 day ago / Jun 23, 2026, 03:01 UTC

Last checked: 1 day ago / Jun 23, 2026, 03:01 UTC

Why It Matters

If a fine-tuned 0.5B-parameter model can outperform zero-shot GPT-5.4 on relation extraction, privacy-sensitive deployments could move from cloud-hosted proprietary LLM APIs to on-premises models that run 4-bit quantized on a single consumer GPU, reducing both cost and data-exposure risk. The result depends on having task-specific training data: the paper attributes the gain to task adaptation, not to any intrinsic advantage of small models. Where such data exist, this could speed adoption in regulated industries and make domain-specific inference practical at scale.

Confirmed Facts

Large language models (LLMs) achieve strong relation extraction (RE), but their computational demands and reliance on proprietary APIs limit deployment in resource-constrained or privacy-sensitive settings. We investigate how far small language models (SLMs) can close this gap across general-domain and literary text. We evaluate five models from 360M to 3B parameters under three domain-composition regimes and two prompt-conditioned tuning styles (30 configurations), comparing them with zero-shot frontier LLMs and a discriminative RoBERTa baseline. Across nine benchmarks, the best sub-billion model, Qwen2.5-0.5B fine-tuned on pooled general-domain data, achieves a general-domain positive-class micro-F1 of 0.83, versus 0.69 for GPT-5.4 and 0.66 for Claude Sonnet 4.6 evaluated zero-shot. This does not imply that SLMs are intrinsically stronger; rather, targeted task adaptation enables 4-bit models deployable on a single consumer GPU to outperform general-purpose frontier systems under this protocol. An in-domain RoBERTa baseline also exceeds both frontier models, indicating that the gain stems from task adaptation rather than generative decoding. On literary RE, tuned SLMs reach 0.92 on the human-annotated Biographical benchmark versus 0.83 for GPT-5.4, and 0.833 versus 0.578 on the two-benchmark literary average. A targeted domain-adaptive pretraining case study yields no practically meaningful gain over supervised fine-tuning, while the cleanest within-family scale comparison shows only marginal improvement. These results show that, when task-specific data are available, compact task-adapted models can provide accurate, private, and hardware-efficient RE.
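The headline numbers above are positive-class micro-F1 scores. As a rough illustration of how such a score is commonly computed in relation extraction, the sketch below pools true positives over all relation labels except a negative "no relation" class. This follows the usual convention for the metric and is not taken from the paper; the function name, the `no_relation` label, and the toy relation labels are all assumptions made for the example.

```python
from typing import Sequence


def positive_micro_f1(
    gold: Sequence[str],
    pred: Sequence[str],
    negative_label: str = "no_relation",  # assumed name for the negative class
) -> float:
    """Micro-averaged F1 over positive relation labels only.

    Examples whose gold AND predicted label are both the negative class
    contribute nothing, so the score cannot be inflated by easy negatives.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    # True positives: the prediction matches gold and gold is a real relation.
    true_pos = sum(
        1 for g, p in zip(gold, pred) if g == p and g != negative_label
    )
    # Every non-negative prediction counts toward the precision denominator.
    n_pred_pos = sum(1 for p in pred if p != negative_label)
    # Every non-negative gold label counts toward the recall denominator.
    n_gold_pos = sum(1 for g in gold if g != negative_label)

    if true_pos == 0:
        return 0.0

    precision = true_pos / n_pred_pos
    recall = true_pos / n_gold_pos
    return 2 * precision * recall / (precision + recall)


# Toy data with hypothetical relation labels (not from any benchmark).
gold = ["born_in", "no_relation", "spouse_of", "no_relation", "author_of"]
pred = ["born_in", "spouse_of", "spouse_of", "no_relation", "no_relation"]
score = positive_micro_f1(gold, pred)
print(f"positive-class micro-F1: {score:.3f}")
```

In the toy data, two of three positive predictions are correct and two of three gold relations are recovered, so precision and recall are both 2/3. The correctly predicted negative in the fourth position does not affect the score.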

Who Is Affected

Qwen (Qwen2.5-0.5B)
Anthropic (Claude Sonnet 4.6)
OpenAI (GPT-5.4)
AI product teams

What To Watch Next

Watch for benchmark validation, API availability, pricing, limits, and early customer adoption.
Watch whether additional sources confirm the same claim.
