LLM4SCREENLIT: Recommendations on assessing the performance of large language models for screening literature in systematic reviews

Madeyski, L; Kitchenham, B; Shepperd, M

Please use this identifier to cite or link to this item: https://bura.brunel.ac.uk/handle/2438/33549

Title:	LLM4SCREENLIT: Recommendations on assessing the performance of large language models for screening literature in systematic reviews
Authors:	Madeyski, L; Kitchenham, B; Shepperd, M
Keywords:	Large language models;LLM;classification metrics;class imbalance;systematic reviews;lost evidence;cost-sensitive
Issue Date:	8-Jun-2026
Publisher:	Elsevier
Citation:	Madeyski, L. et al. (2026) ‘LLM4SCREENLIT: Recommendations on assessing the performance of large language models for screening literature in systematic reviews’, Information and Software Technology, 198, 108204, pp. 1–20. doi: 10.1016/j.infsof.2026.108204.
Abstract:	Context: Large language models (LLMs) are increasingly used to screen literature for systematic reviews (SRs), but the standard confusion-matrix metrics used to evaluate them can mislead under the imbalanced, cost-asymmetric conditions of screening. Objective: We develop and justify LLM4SCREENLIT, a set of practical recommendations for researchers conducting LLM-screening evaluations and for editors and reviewers assessing such studies, differentiated by study type (retrospective benchmarking vs. deployment for a specific SR). Method: Using Delgado-Chaves et al. (2025), an 18-LLM benchmark across three biomedical SRs, as a motivating example, we reviewed 28 additional papers and extracted their reported metrics. We propose a Weighted Matthews Correlation Coefficient (WMCC) that integrates MCC’s chance-correction with asymmetric misclassification costs, and validated it on three software-engineering (SE) reanalyses (Felizardo et al. 2024; Syriani et al. 2024; Huotala et al. 2025), the largest covering 9 LLMs and 24 SE secondary studies (34,528 articles). Results: Across the 29 papers, only 10% reported MCC, only 24% reported full confusion matrices, and none of the five papers claiming workload savings priced false-negative cost. In the largest SE reanalysis, MCC and WMCC disagree on the best LLM in 55% of evaluable studies; in the most striking 9695-article SE study, the Accuracy-best LLM loses 63.3% of relevant evidence (Lost Evidence), the MCC-best 43.9%, but the WMCC-best only 5.8%. Sensitivity analysis (median crossover at , all ) supports as a conservative default. Conclusions: SR-screening evaluations should prioritise Lost Evidence and use cost-sensitive WMCC alongside MCC for ranking. Reporting must include the full confusion matrix and treat unclassifiable outputs as positives requiring human review. Designs should be leakage-aware, with non-LLM baselines when the study aims to inform SR practice and labels are available. Editors and reviewers should require these elements as routine. Extension to full-text screening and data extraction is principled but pending empirical validation.
Description:	Data availability: The replication package (extracted data, analysis scripts, and documentation) is available at https://doi.org/10.6084/m9.figshare.31356613. Supplementary data are available online at: https://www.sciencedirect.com/science/article/pii/S095058492600193X?via%3Dihub#appSC.
URI:	https://bura.brunel.ac.uk/handle/2438/33549
DOI:	https://doi.org/10.1016/j.infsof.2026.108204
ISSN:	0950-5849
Other Identifiers:	ORCiD: Lech Madeyski https://orcid.org/0000-0003-3907-3357 ORCiD: Barbara Kitchenham https://orcid.org/0000-0002-6134-8460 ORCiD: Martin Shepperd https://orcid.org/0000-0003-1874-6145
Appears in Collections:	Department of Computer Science Research Papers
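
Illustrative note on the abstract's reporting recommendations: the following is a minimal Python sketch, not taken from the paper or its replication package, of how a full confusion matrix, MCC, and Lost Evidence might be computed for one screening run. It assumes that Lost Evidence is the share of truly relevant articles that the screener excludes, FN / (TP + FN), which is one reading of the abstract's "loses 63.3% of relevant evidence". Unclassifiable outputs are represented as None and counted as inclusions, following the recommendation to treat them as positives requiring human review. All function names are hypothetical. WMCC is not sketched because its definition is not given in this record.

```python
import math


def confusion_counts(gold, predicted):
    """Return (tp, fp, tn, fn) for include/exclude screening decisions.

    gold: iterable of bools, True meaning the article is relevant.
    predicted: iterable of True, False, or None. None stands for an
    unclassifiable model output and is counted as an inclusion
    (assumption: such articles go to human review).
    """
    tp = fp = tn = fn = 0
    for is_relevant, decision in zip(gold, predicted):
        included = True if decision is None else decision
        if included and is_relevant:
            tp += 1
        elif included and not is_relevant:
            fp += 1
        elif not included and is_relevant:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Share of all decisions that are correct."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient (standard formula).

    Returns 0.0 when any marginal total is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def lost_evidence(tp, fn):
    """Assumed definition: fraction of relevant articles wrongly excluded."""
    relevant = tp + fn
    return fn / relevant if relevant else 0.0


if __name__ == "__main__":
    # Made-up, imbalanced example: 10 relevant and 90 irrelevant articles.
    gold = [True] * 10 + [False] * 90
    predicted = [True, True, None] + [False] * 7 + [True, True] + [False] * 88
    tp, fp, tn, fn = confusion_counts(gold, predicted)
    print("confusion matrix (tp, fp, tn, fn):", (tp, fp, tn, fn))
    print("accuracy:", accuracy(tp, fp, tn, fn))
    print("MCC:", mcc(tp, fp, tn, fn))
    print("lost evidence:", lost_evidence(tp, fn))
```

In this made-up example, accuracy is high because most articles are irrelevant, yet most of the relevant articles are excluded. This is the kind of disagreement between metrics that the abstract warns about under class imbalance.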

Files in This Item:

File	Description	Size	Format
FullText.pdf	Copyright © 2026 The Author(s). Published by Elsevier B.V. This is an open access article under a Creative Commons license (https://creativecommons.org/licenses/by/4.0/).	2.59 MB	Adobe PDF

This item is licensed under a Creative Commons License