Generating High-Fidelity Synthetic Patient Data for Assessing Machine Learning Healthcare Software

Tucker, A; Wang, Z; Rotalinti, Y; Myles, P

Please use this identifier to cite or link to this item: http://bura.brunel.ac.uk/handle/2438/21683

Full metadata record

DC Field	Value	Language
dc.contributor.author	Tucker, A	-
dc.contributor.author	Wang, Z	-
dc.contributor.author	Rotalinti, Y	-
dc.contributor.author	Myles, P	-
dc.date.accessioned	2020-10-23T14:26:24Z	-
dc.date.available	2020-10-23T14:26:24Z	-
dc.date.issued	2020-11-09	-
dc.identifier	147	-
dc.identifier.citation	Tucker, A., Wang, Z., Rotalinti, Y. and Myles, P. (2020) 'Generating high-fidelity synthetic patient data for assessing machine learning healthcare software', npj Digital Medicine, 3, 147, pp. 1-13. doi:10.1038/s41746-020-00353-9.	en_US
dc.identifier.uri	https://bura.brunel.ac.uk/handle/2438/21683	-
dc.description.abstract	© The Author(s) 2020. There is a growing demand for the uptake of modern Artificial Intelligence technologies within healthcare systems. Many of these technologies exploit historical patient health data to build powerful predictive models that can be used to improve diagnosis and understanding of disease. However, there are many issues concerning patient privacy that need to be accounted for in order to enable this data to be better harnessed by all sectors. One approach that could offer a method of circumventing privacy issues is the creation of realistic synthetic datasets that capture as many of the complexities of the original dataset (distributions, non-linear relationships and noise) but that does not actually include any real patient data. Whilst previous research has explored models for generating synthetic datasets, here we explore the integration of resampling, probabilistic graphical modelling, latent variable identification, and outlier analysis for producing realistic synthetic data based on UK primary care patient data. In particular, we focus on handling missingness, complex interactions between variables and the resulting sensitivity analysis statistics from machine learning classifiers, whilst quantifying the risks of patient re-identification from synthetic datapoints. We show that through our approach of integrating outlier analysis with graphical modelling and resampling, we can achieve synthetic datasets that are not significantly different from original ground truth data in terms of feature distributions, feature dependencies and sensitivity analysis statistics when inferring machine learning classifiers. What is more, the risk of generating synthetic data that is identical or very similar to real patients is shown to be low.	en_US
dc.description.sponsorship	Department for Business, Energy and Industrial Strategy, 104676; Innovate UK, Pioneer Fund	en_US
dc.format.extent	1 - 13	-
dc.format.medium	Electronic	-
dc.language.iso	en	en_US
dc.publisher	Springer Nature in partnership with the Scripps Research Translational Institute	en_US
dc.rights	Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit https://creativecommons.org/licenses/by/4.0/.	-
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/	-
dc.subject	synthetic data	en_US
dc.subject	machine learning	en_US
dc.subject	probabilistic graphical models	en_US
dc.subject	latent variables	en_US
dc.subject	outliers	en_US
dc.title	Generating High-Fidelity Synthetic Patient Data for Assessing Machine Learning Healthcare Software	en_US
dc.type	Article	en_US
dc.identifier.doi	https://doi.org/10.1038/s41746-020-00353-9	-
dc.relation.isPartOf	npj Digital Medicine	-
pubs.publication-status	Published	-
dc.identifier.eissn	2398-6352	-
Appears in Collections:	Department of Computer Science Research Papers

Files in This Item:

File	Description	Size	Format
FullText.pdf		2.1 MB	Adobe PDF

This item is licensed under a Creative Commons License