Title: Generating High-Fidelity Synthetic Patient Data for Assessing Machine Learning Healthcare Software
Authors: Tucker, A; Wang, Z; Rotalinti, Y; Myles, P
Keywords: synthetic data; machine learning; probabilistic graphical models; latent variables; outliers
Issue Date: 2020
Publisher: Springer Nature in partnership with the Scripps Research Translational Institute
Citation: npj Digital Medicine
Abstract: There is a growing demand for the uptake of modern Artificial Intelligence technologies within healthcare systems. Many of these technologies exploit historical patient health data to build powerful predictive models that can be used to improve diagnosis and understanding of disease. However, there are many patient privacy issues that must be addressed before this data can be better harnessed by all sectors. One approach that could circumvent these privacy issues is the creation of realistic synthetic datasets that capture as many of the complexities of the original dataset as possible (distributions, non-linear relationships and noise) without including any real patient data. Whilst previous research has explored models for generating synthetic datasets, here we explore the integration of resampling, probabilistic graphical modelling, latent variable identification, and outlier analysis to produce realistic synthetic data based on UK primary care patient data. In particular, we focus on handling missingness, complex interactions between variables and the resulting sensitivity analysis statistics from machine learning classifiers, whilst quantifying the risks of patient re-identification from synthetic datapoints. We show that by integrating outlier analysis with graphical modelling and resampling, we can produce synthetic datasets that are not significantly different from the original ground truth data in terms of feature distributions, feature dependencies and sensitivity analysis statistics when inferring machine learning classifiers. Moreover, the risk of generating synthetic data that is identical or very similar to real patients is shown to be low.
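The core generative step the abstract refers to, sampling synthetic records from a probabilistic graphical model fitted to real data, can be illustrated with a minimal sketch. This is not the paper's pipeline: it uses a hypothetical two-node network (A -> B) on a toy binary dataset, fits the conditional probability tables by counting, and draws synthetic records by ancestral (forward) sampling.

```python
import random

random.seed(0)

# Toy "real" binary dataset standing in for patient records: B depends on A.
real = []
for _ in range(2000):
    a = int(random.random() < 0.4)            # true P(A=1) = 0.4
    b = int(random.random() < (0.8 if a else 0.2))
    real.append((a, b))

# "Fit" the network A -> B: estimate P(A=1) and P(B=1 | A=a) by counting.
p_a = sum(a for a, _ in real) / len(real)
p_b = {a: sum(b for x, b in real if x == a)
          / max(1, sum(1 for x, _ in real if x == a))
       for a in (0, 1)}

# Generate synthetic records by ancestral sampling along A -> B:
# draw each parent first, then each child conditional on its parent.
def sample(m):
    synth = []
    for _ in range(m):
        a = int(random.random() < p_a)
        b = int(random.random() < p_b[a])
        synth.append((a, b))
    return synth

synthetic = sample(2000)
```

With enough samples, the synthetic marginals and dependencies converge to the fitted tables, which is the sense in which such data preserves the feature distributions and dependencies of the original; the paper's contribution layers resampling, latent variables and outlier analysis on top of this basic mechanism.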
ISSN: 2398-6352
Appears in Collections:Dept of Computer Science Research Papers

Files in This Item:
FullText.pdf (432.93 kB, Adobe PDF)

Items in BURA are protected by copyright, with all rights reserved, unless otherwise indicated.