Please use this identifier to cite or link to this item:
http://bura.brunel.ac.uk/handle/2438/33549
Full metadata record
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Madeyski, L | - |
| dc.contributor.author | Kitchenham, B | - |
| dc.contributor.author | Shepperd, M | - |
| dc.date.accessioned | 2026-07-02T08:47:00Z | - |
| dc.date.available | 2026-10 | - |
| dc.date.available | 2026-07-02T08:47:00Z | - |
| dc.date.issued | 2026-06-08 | - |
| dc.identifier | 108204 | - |
| dc.identifier.citation | Madeyski, L. et al. (2026) ‘LLM4SCREENLIT: Recommendations on assessing the performance of large language models for screening literature in systematic reviews’, Information and Software Technology, 198, p. 108204. https://doi.org/10.1016/j.infsof.2026.108204 | en_US |
| dc.identifier.issn | 108204 | - |
| dc.identifier.issn | 0950-5849 | - |
| dc.identifier.uri | http://bura.brunel.ac.uk/handle/2438/33549 | - |
| dc.description.abstract | Context: Large language models (LLMs) are increasingly used to screen literature for systematic reviews (SRs), but the standard confusion-matrix metrics used to evaluate them can mislead under the imbalanced, cost-asymmetric conditions of screening. Objective: We develop and justify LLM4SCREENLIT — practical recommendations for researchers conducting LLM-screening evaluations and for editors and reviewers assessing such studies — differentiated by study type (retrospective benchmarking vs. deployment for a specific SR). Method: Using Delgado-Chaves et al. (2025), an 18-LLM benchmark across three biomedical SRs, as a motivating example, we reviewed 28 additional papers and extracted their reported metrics. We propose a Weighted Matthews Correlation Coefficient (WMCC) that integrates MCC’s chance-correction with asymmetric misclassification costs, and validated it on three software-engineering (SE) reanalyses (Felizardo et al. 2024; Syriani et al. 2024; Huotala et al. 2025), the largest covering 9 LLMs and 24 SE secondary studies (34,528 articles). Results: Across the 29 papers, only 10% reported MCC, only 24% reported full confusion matrices, and none of the five papers claiming workload savings priced false-negative cost. In the largest SE reanalysis, MCC and WMCC disagree on the best LLM in 55% of evaluable studies; in the most striking 9695-article SE study, the Accuracy-best LLM loses 63.3% of relevant evidence (Lost Evidence), the MCC-best 43.9%, but the WMCC-best only 5.8%. Sensitivity analysis (median crossover at , all ) supports as a conservative default. Conclusions: SR-screening evaluations should prioritise Lost Evidence and use cost-sensitive WMCC alongside MCC for ranking. Reporting must include the full confusion matrix and treat unclassifiable outputs as positives requiring human review. Designs should be leakage-aware, with non-LLM baselines when the study aims to inform SR practice and labels are available. Editors and reviewers should require these elements as routine. Extension to full-text screening and data extraction is principled but pending empirical validation. | en_US |
| dc.format.extent | 108204 - 108204 | - |
| dc.language | en | - |
| dc.language.iso | en | en_US |
| dc.publisher | Elsevier BV | en_US |
| dc.subject | Large language models | en_US |
| dc.subject | LLM | en_US |
| dc.subject | Classification metrics | en_US |
| dc.subject | Class imbalance | en_US |
| dc.subject | Systematic reviews | en_US |
| dc.subject | Lost evidence | en_US |
| dc.subject | Cost-sensitive | en_US |
| dc.title | LLM4SCREENLIT: Recommendations on assessing the performance of large language models for screening literature in systematic reviews | en_US |
| dc.identifier.doi | http://dx.doi.org/10.1016/j.infsof.2026.108204 | - |
| dc.relation.isPartOf | Information and Software Technology | - |
| pubs.publication-status | Accepted | - |
| pubs.volume | 198 | - |
| Appears in Collections: | Department of Computer Science Research Papers | |
Files in This Item:
| File | Description | Size | Format | |
|---|---|---|---|---|
| LLM4SCREENLIT_IST_R2_20260413.pdf | | 1.05 MB | Adobe PDF | View/Open |
Items in BURA are protected by copyright, with all rights reserved, unless otherwise indicated.
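The abstract recommends reporting the full confusion matrix, treating unclassifiable LLM outputs as positives that go to human review, and prioritising Lost Evidence alongside MCC. The following is a minimal Python sketch of that reporting step, not the authors' implementation. It assumes Lost Evidence means the share of relevant articles wrongly excluded, FN / (TP + FN), which is how the abstract's wording reads but is not defined in this record. The function names and the toy data are hypothetical. WMCC is deliberately not implemented because its formula and cost weight are not given here.

```python
from math import sqrt


def confusion_counts(labels, predictions):
    """Return (tp, fp, fn, tn) for include/exclude screening decisions.

    labels: True if the article is truly relevant.
    predictions: True (include), False (exclude), or None when the LLM
    output could not be classified. Anything other than False is counted
    as a positive, i.e. the article is passed on for human review.
    """
    tp = fp = fn = tn = 0
    for relevant, pred in zip(labels, predictions):
        include = pred is not False
        if include and relevant:
            tp += 1
        elif include:
            fp += 1
        elif relevant:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def mcc(tp, fp, fn, tn):
    """Standard Matthews Correlation Coefficient; 0.0 if undefined."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def lost_evidence(tp, fn):
    """Assumed definition: fraction of relevant articles excluded."""
    return fn / (tp + fn) if (tp + fn) else 0.0


# Hypothetical toy data: 3 relevant and 5 irrelevant articles.
labels = [True, True, True, False, False, False, False, False]
predictions = [True, None, False, False, False, True, False, False]

counts = confusion_counts(labels, predictions)
tp, fp, fn, tn = counts
mcc_value = mcc(tp, fp, fn, tn)
lost = lost_evidence(tp, fn)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"MCC={mcc_value:.3f} LostEvidence={lost:.3f}")
```

In this toy case the unclassifiable output (`None`) is counted as an inclusion, so it adds to the human review load rather than to the missed relevant articles.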