ReproRepo: Scaling Reproducibility Audits with GitHub Repository Issues
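Reproducing research results from papers and released code is central to scientific progress. Existing benchmarks that test whether LLM agents can assist with reproducibility are difficult to scale because they rely on substantial manual effort for data curation and evaluation.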
Source Evidence
What Changed
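The authors introduce ReproRepo, a scalable framework for reproducibility evaluation that uses human-raised GitHub issues as naturally occurring supervision on realistic reproduction blockers. They instantiate it on 1,149 recent machine learning papers from major conferences and evaluate four frontier model-agent configurations.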
Why It Matters
ReproRepo turns reproducibility auditing into a scalable process grounded in human-raised GitHub issues, so agents are evaluated against blockers that real users actually reported. The best configuration in the study, Codex with GPT-5.5, surfaces at least one semantically related human-reported blocker for ~90% of papers without executing any code. That figure is paper-level coverage, not per-issue recall, and the authors note that agents may still be insufficient at exact localization. If the results hold up, automated audits could become a low-cost check for conferences, funding agencies, and publishers, tightening the feedback loop between research output and reproducible practice.
Confirmed Facts
Reproducing research results from papers and released code is central to scientific progress. Existing works have introduced benchmarks to evaluate whether LLM agents can assist with reproducibility, but they are difficult to scale due to their reliance on substantial manual effort for data curation and evaluation. We introduce ReproRepo, a scalable framework for reproducibility evaluation that leverages human-raised GitHub issues as naturally occurring supervision on realistic reproduction blockers. We instantiate ReproRepo on 1,149 recent machine learning papers from major conferences and evaluate four frontier model-agent configurations. Our results show that LLM agents, even without executing code, can identify many real-world reproducibility problems from paper-repository pairs: the best agent in our study, namely Codex with GPT-5.5, surfaces at least one semantically related human-reported blocker for ~90% of papers in the study. Further analysis shows that agents are particularly effective for surfacing visible failures and identifying the right semantic region, but may still be insufficient in exact localization. ReproRepo can serve as a reusable, scalable framework for future evaluations of LLM agents on real-world reproducibility auditing. Our code is released at https://github.com/LithiumDA/ReproRepo.
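The headline ~90% figure counts papers, not individual issues, and the difference is easy to miss. The Python sketch below contrasts the two readings under stated assumptions: the function names, the data layout, and the `keyword_overlap` matcher are all hypothetical and are not taken from the ReproRepo code. In particular, the keyword matcher is only a toy stand-in for whatever semantic matching the authors use.

```python
from typing import Callable, Dict, List

# Hypothetical layout: paper id -> list of short text descriptions.
Findings = Dict[str, List[str]]
Matcher = Callable[[str, str], bool]


def keyword_overlap(finding: str, issue: str) -> bool:
    """Toy matcher (assumption): related if the texts share two or more words."""
    shared = set(finding.lower().split()) & set(issue.lower().split())
    return len(shared) >= 2


def paper_level_hit_rate(agent: Findings, human: Findings, related: Matcher) -> float:
    """Fraction of papers where at least one agent finding matches a human issue."""
    papers = [p for p, issues in human.items() if issues]
    if not papers:
        return 0.0
    hits = sum(
        any(related(f, i) for f in agent.get(p, []) for i in human[p])
        for p in papers
    )
    return hits / len(papers)


def issue_level_recall(agent: Findings, human: Findings, related: Matcher) -> float:
    """Fraction of individual human issues matched by some agent finding."""
    total = matched = 0
    for paper, issues in human.items():
        findings = agent.get(paper, [])
        for issue in issues:
            total += 1
            matched += any(related(f, issue) for f in findings)
    return matched / total if total else 0.0


# Made-up example data, not drawn from the paper.
human_issues = {
    "paper_a": ["missing requirements file", "checkpoint link is dead"],
    "paper_b": ["training script crashes on import"],
}
agent_findings = {
    "paper_a": ["no requirements file in repository"],
    "paper_b": ["dataset path is hard-coded"],
}

hit_rate = paper_level_hit_rate(agent_findings, human_issues, keyword_overlap)
recall = issue_level_recall(agent_findings, human_issues, keyword_overlap)
print(f"paper-level hit rate: {hit_rate:.2f}, issue-level recall: {recall:.2f}")
```

In this made-up example the agent gets credit for one of two papers but matches only one of three issues, so the paper-level rate is higher than the issue-level recall. A high paper-level number therefore does not by itself imply that most individual blockers were found.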
Who Is Affected
- Users of LLM coding agents such as Codex with GPT-5.5
- AI product teams
What To Watch Next
- Watch for independent validation of the benchmark and for evaluations of additional agent configurations.
- Look for corroboration from an official source or a second reliable report.
Still Developing
- The claim is plausible but still developing.