Identifying and handling data bias within primary healthcare data using synthetic data generators

Draghi, B; Wang, Z; Myles, P; Tucker, A

Please use this identifier to cite or link to this item: http://bura.brunel.ac.uk/handle/2438/27695

Title:	Identifying and handling data bias within primary healthcare data using synthetic data generators
Authors:	Draghi, B; Wang, Z; Myles, P; Tucker, A
Keywords:	synthetic data generators;data bias;over-sampling;Bayesian networks;machine learning
Issue Date:	10-Jan-2024
Publisher:	Elsevier
Citation:	Draghi, B. et al. (2024) 'Identifying and handling data bias within primary healthcare data using synthetic data generators', Heliyon, 10 (2), e24164, pp. 1 - 15. doi: 10.1016/j.heliyon.2024.e24164.
Abstract:	Copyright © 2024 The Authors. Advanced synthetic data generators can simulate data samples that closely resemble sensitive personal datasets while significantly reducing the risk of individual identification. The use of these advanced generators holds enormous potential in the medical field, as it allows for the simulation and sharing of sensitive patient data. This enables the development and rigorous validation of novel AI technologies for accurate diagnosis and efficient disease management. Despite the availability of massive ground truth datasets (such as UK-NHS databases that contain millions of patient records), the risk of biases being carried over to data generators still exists. These biases may arise from the under-representation of specific patient cohorts due to cultural sensitivities within certain communities or standardised data collection procedures. Machine learning models can exhibit bias in various forms, including the under-representation of certain groups in the data. This can lead to missing data and inaccurate correlations and distributions, which may also be reflected in synthetic data. Our paper aims to improve synthetic data generators by introducing probabilistic approaches to first detect difficult-to-predict data samples in ground truth data and then boost them when applying the generator. In addition, we explore strategies to generate synthetic data that can reduce bias and, at the same time, improve the performance of predictive models.
Description:	Data availability: The anonymised electronic healthcare record data used in this research is not publicly available but can be requested from CPRD subject to a data licence and research data governance (RDG) approval. The generated synthetic data set discussed in this paper can also be requested from CPRD subject to a data sharing agreement (DSA). Data access licence fees apply (https://cprd.com/data). Code availability: All our R code is available via GitHub (https://github.com/barbaraDraghi/BayesBoost). The R package bnlearn (v4.8.1) is used for all Bayesian network inference. Appendix A: Additional results are available online at https://www.sciencedirect.com/science/article/pii/S2405844024001956#se0130.
URI:	https://bura.brunel.ac.uk/handle/2438/27695
DOI:	https://doi.org/10.1016/j.heliyon.2024.e24164
Other Identifiers:	ORCID iD: Allan Tucker https://orcid.org/0000-0001-5105-3506; Article number: e24164
Appears in Collections:	Dept of Computer Science Research Papers

Files in This Item:

File	Description	Size	Format
FullText.pdf	Copyright © 2024 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (https://creativecommons.org/licenses/by/4.0/).	2.14 MB	Adobe PDF

This item is licensed under a Creative Commons License