Please use this identifier to cite or link to this item:
http://bura.brunel.ac.uk/handle/2438/32454
| Title: | Probabilistic vs Deep Generative Models: A Fairness Centred Evaluation of Synthetic Healthcare Tabular Data |
| Authors: | Alattal, D Draghi, B Myles, P Branson, R Tucker, A |
| Keywords: | synthetic data generation;tabular data;fairness in machine learning;healthcare data;generative models;data fidelity;bias mitigation;Bayes boost;GAN;VAE |
| Issue Date: | 26-Feb-2026 |
| Publisher: | Springer |
| Citation: | Alattal, D. et al. (2025) 'Probabilistic vs Deep Generative Models: A Fairness Centred Evaluation of Synthetic Healthcare Tabular Data', International Journal of Computational Intelligence Systems, 0 (in press, pre-proof), pp. 1–49. doi: 10.1007/s44196-026-01173-7. |
| Abstract: | Synthetic data offers a promising avenue for addressing privacy, scarcity, and fairness challenges in healthcare datasets. However, there is limited evaluation of how different generation methods balance fidelity, utility, and fairness, particularly for underrepresented subgroups. This study addresses this gap by comparing representative generative modelling techniques, both probabilistic and deep approaches, that are popular in the research literature. We empirically evaluate BayesBoost, CTGAN, TVAE, CopulaGAN, and DECAF on two healthcare datasets containing numerical, binary, and categorical features. Each model’s performance is assessed along three axes: data fidelity, machine learning utility, and fairness, using Accuracy Parity, Equalised Odds, and Predictive Rate Parity. Results show that BayesBoost consistently achieved superior fidelity, utility, and fairness preservation, particularly when paired with Random Forest classifiers, achieving around 60–63% higher downstream utility than GAN-based deep generative baselines (e.g., Random Forest accuracy up to 0.88 with BayesBoost versus 0.54 to 0.55 for GAN-based methods). Deep generative models, while effective in capturing complex structures, often degraded fairness, especially for underrepresented groups, with equalised odds deviating by over 100% from the ideal parity value of 1.0 in some settings. The Variational Autoencoder outperformed other deep generative models in fairness preservation, especially for equalised odds, although with some reduction in fidelity and utility. Overall, these findings suggest that synthetic data generation for healthcare must move beyond fidelity evaluations to explicitly assess fairness and subgroup impacts, with probabilistic models such as BayesBoost showing strong potential for ethical deployment, while deep generative models require further adaptation for fairness-sensitive applications. |
| Description: | Data Availability:
CPRD cardiovascular disease synthetic dataset used in this paper can be requested from CPRD (https://cprd.com/cprd-cardiovascular-disease-synthetic-dataset). The diabetes dataset is publicly available on Kaggle (https://www.kaggle.com/datasets/rabieelkharoua/diabetes-health-datasetanalysis). Code availability: Not applicable. Springer is providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply. A preprint version of the article is available on Research Square at https://doi.org/10.21203/rs.3.rs-7565139/v1 . It has not been certified by peer review. |
| URI: | https://bura.brunel.ac.uk/handle/2438/32454 |
| DOI: | https://doi.org/10.1007/s44196-026-01173-7 |
| ISSN: | 1875-6891 |
| Other Identifiers: | ORCiD: Allan Tucker https://orcid.org/0000-0001-5105-3506 |
| Appears in Collections: | Department of Computer Science Research Papers |
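
The abstract reports fairness as parity values with an ideal of 1.0 and quotes a relative utility gain of around 60–63%. The sketch below illustrates one common way such figures can be computed. It is an assumption, not the authors' code: the ratio-based formulation (unprivileged group rate divided by privileged group rate) and the helper names `relative_gain`, `group_rates`, and `parity_ratios` are hypothetical and may differ from the definitions used in the paper.

```python
def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


def group_rates(y_true, y_pred):
    """Per-group rates for binary labels (1 = positive, 0 = negative).

    Assumes the group contains both classes and at least one positive
    prediction, so no denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),  # used for accuracy parity
        "tpr": tp / (tp + fn),               # equalised odds, part 1
        "fpr": fp / (fp + tn),               # equalised odds, part 2
        "ppv": tp / (tp + fp),               # predictive rate parity
    }


def parity_ratios(y_true, y_pred, groups, unprivileged, privileged):
    """Ratio of each rate, unprivileged over privileged (assumed form).

    A value of 1.0 means parity; 2.0 is a 100% deviation from parity.
    Assumes the privileged group's rates are all non-zero.
    """
    def rates_for(g):
        idx = [i for i, v in enumerate(groups) if v == g]
        return group_rates([y_true[i] for i in idx],
                           [y_pred[i] for i in idx])

    u, p = rates_for(unprivileged), rates_for(privileged)
    return {name: u[name] / p[name] for name in u}
```

With the accuracies quoted in the abstract, `relative_gain(0.88, 0.55)` and `relative_gain(0.88, 0.54)` bracket the reported 60–63% range. Under this ratio form, a false positive rate twice as high in one subgroup yields a parity value of 2.0, which is the kind of deviation "by over 100% from the ideal parity value of 1.0" that the abstract describes.
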
Files in This Item:
| File | Description | Size | Format |
|---|---|---|---|
| FullText.pdf | Copyright © The Author(s) 2026. Rights and permissions: Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/. | 1.6 MB | Adobe PDF |
Items in BURA are protected by copyright, with all rights reserved, unless otherwise indicated.