Assessing the impact of synthetic data generated by Bayesian networks on heart disease prediction

Lazzaro, I; Milano, M; Tucker, A; Cannataro, M

Please use this identifier to cite or link to this item: http://bura.brunel.ac.uk/handle/2438/33520

Title:	Assessing the impact of synthetic data generated by Bayesian networks on heart disease prediction
Authors:	Lazzaro, I; Milano, M; Tucker, A; Cannataro, M
Keywords:	Bayesian networks;synthetic data generation;heart disease prediction;data quality
Issue Date:	15-Jun-2026
Publisher:	Elsevier
Citation:	Lazzaro, I. et al. (2026) 'Assessing the impact of synthetic data generated by Bayesian networks on heart disease prediction', Journal of Computational Science, 99, 102940, pp. 1–10. doi: 10.1016/j.jocs.2026.102940.
Abstract:	Synthetic data generation using Bayesian networks (BN) offers a promising approach to overcoming data scarcity in clinical prediction tasks, yet its actual impact on model performance remains underexplored. This study investigates the use of Bayesian network-based generative models to produce synthetic patient data and examines how the quality of the original real data influences the effectiveness of such augmentation. Three benchmark datasets from the UCI Heart Disease repository (Cleveland, Hungary, and Switzerland) were employed, all sharing an identical structure comprising 13 clinical predictors. The Cleveland dataset, which is the most complete and consistent among the three, was used exclusively as the training source for learning the Bayesian network structure and parameters under clinically informed constraints. To ensure robust evaluation, the dataset was partitioned into two independent subsets: 153 patients were used to train the Bayesian network, while 150 held-out patients were used exclusively to generate synthetic records. Predictive models were trained under three configurations: real data only, synthetic data only, and a hybrid real + synthetic (filtered) dataset, and evaluated using 10-fold cross-validation and external validation on independent cohorts. Results indicate that integrating real and synthetic data significantly improved accuracy and precision, particularly for the Switzerland cohort (F(2,27) = 23.06, η² = 0.63), whereas improvements were smaller and partially non-significant in the noisier Hungarian dataset. These findings demonstrate that the effectiveness of synthetic augmentation depends on the structure and completeness of the source data, underscoring the importance of data quality for reliable generative modelling in clinical prediction.
Description:	Data availability: All datasets used in this work are freely available in the UCI repository.
URI:	https://bura.brunel.ac.uk/handle/2438/33520
DOI:	https://doi.org/10.1016/j.jocs.2026.102940
ISSN:	1877-7503
Other Identifiers:	ORCiD: Ilaria Lazzaro https://orcid.org/0009-0007-1612-2538 ORCiD: Allan Tucker https://orcid.org/0000-0001-5105-3506
Appears in Collections:	Department of Computer Science Research Papers

Files in This Item:

File	Description	Size	Format
FullText.pdf	Copyright © 2026 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (https://creativecommons.org/licenses/by/4.0/).	2.31 MB	Adobe PDF

This item is licensed under a Creative Commons License