ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues

Reproducing research results from papers and released code is central to scientific progress. Existing works have int...

Signal 55

Source Confidence 100%

Claim Status: developing

Source Evidence

Primary Source

arXiv (Shanda Li)

arxiv.org

Source Type

research

Source Published

Jun 16, 2026, 17:58 UTC

AIFreshWire Pipeline

Ingested: 6 days ago / Jun 17, 2026, 02:50 UTC

Last checked: 6 days ago / Jun 17, 2026, 02:50 UTC

Why It Matters

ReproRepo turns reproducibility auditing into a scalable process that uses human-raised GitHub issues as supervision, so LLM agents are evaluated against blockers that real users actually encountered. In the study, the best agent (Codex with GPT-5.5) surfaced at least one semantically related human-reported blocker for ~90% of papers without executing code. This is a per-paper hit rate rather than recall over all blockers, and the authors note that exact localization may still be insufficient. If the results hold up, automated audits could become a low-cost screening step for conferences, funding agencies, and publishers, tightening the feedback loop between published results and reproducible practice.

Confirmed Facts

Reproducing research results from papers and released code is central to scientific progress. Existing works have introduced benchmarks to evaluate whether LLM agents can assist with reproducibility, but they are difficult to scale due to their reliance on substantial manual effort for data curation and evaluation. We introduce ReproRepo, a scalable framework for reproducibility evaluation that leverages human-raised GitHub issues as naturally occurring supervision on realistic reproduction blockers. We instantiate ReproRepo on 1,149 recent machine learning papers from major conferences and evaluate four frontier model-agent configurations. Our results show that LLM agents, even without executing code, can identify many real-world reproducibility problems from paper-repository pairs: the best agent in our study, namely Codex with GPT-5.5, surfaces at least one semantically related human-reported blocker for ~90% of papers in the study. Further analysis shows that agents are particularly effective for surfacing visible failures and identifying the right semantic region, but may still be insufficient in exact localization. ReproRepo can serve as a reusable, scalable framework for future evaluations of LLM agents on real-world reproducibility auditing. Our code is released at https://github.com/LithiumDA/ReproRepo.
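The headline figure above is a per-paper measure: a paper counts as a hit if the agent surfaces at least one blocker that is semantically related to a human-reported one. The following is a minimal sketch of how such a score could be computed and how it differs from recall over all blockers. The `PaperAudit` structure, the function names, and the toy data are hypothetical and are not taken from the ReproRepo code. The semantic matching step is assumed to have been done already, since the abstract does not describe how it is performed.

```python
from dataclasses import dataclass, field


@dataclass
class PaperAudit:
    """One paper-repository pair (hypothetical structure, not from ReproRepo)."""
    paper_id: str
    # IDs of blockers that humans reported in the repository's GitHub issues.
    human_blockers: list[str] = field(default_factory=list)
    # IDs of human blockers that some agent finding was judged to be
    # semantically related to. The judging step is assumed, not shown.
    matched_blockers: set[str] = field(default_factory=set)


def paper_level_hit_rate(audits: list[PaperAudit]) -> float:
    """Fraction of papers with at least one matched human-reported blocker."""
    scored = [a for a in audits if a.human_blockers]
    if not scored:
        return 0.0
    hits = sum(1 for a in scored if a.matched_blockers & set(a.human_blockers))
    return hits / len(scored)


def blocker_level_recall(audits: list[PaperAudit]) -> float:
    """Fraction of all human-reported blockers that were matched."""
    total = sum(len(set(a.human_blockers)) for a in audits)
    if total == 0:
        return 0.0
    found = sum(len(a.matched_blockers & set(a.human_blockers)) for a in audits)
    return found / total


# Toy data for illustration only; these are not results from the paper.
audits = [
    PaperAudit("paper-1", ["i1", "i2"], {"i1"}),
    PaperAudit("paper-2", ["i3"], set()),
    PaperAudit("paper-3", ["i4", "i5"], {"i4", "i5"}),
    PaperAudit("paper-4", ["i6"], {"i6"}),
]

if __name__ == "__main__":
    print(f"paper-level hit rate: {paper_level_hit_rate(audits):.2f}")
    print(f"blocker-level recall: {blocker_level_recall(audits):.2f}")
```

On this toy data, three of the four papers have a matched blocker but only four of the six blockers are matched. The per-paper rate is therefore higher than the blocker-level recall, so the two measures should not be read interchangeably.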

Who Is Affected

GPT
AI product teams

What To Watch Next

Watch for benchmark validation, API availability, pricing, limits, and early customer adoption.
Look for corroboration from an official source or a second reliable report confirming the same claim.

Still Developing

The claim is plausible but still developing.
