Bayesboost: Identifying and Handling Bias Using Synthetic Data Generators

Draghi, B; Wang, Z; Myles, P; Tucker, A

Please use this identifier to cite or link to this item: http://bura.brunel.ac.uk/handle/2438/28873

Full metadata record

DC Field	Value	Language
dc.contributor.author	Draghi, B	-
dc.contributor.author	Wang, Z	-
dc.contributor.author	Myles, P	-
dc.contributor.author	Tucker, A	-
dc.coverage.spatial	online	-
dc.date.accessioned	2024-04-26T13:39:12Z	-
dc.date.available	2024-04-26T13:39:12Z	-
dc.date.issued	2021-09-01	-
dc.identifier	ORCiD: Zhenchen Wang https://orcid.org/0000-0003-4710-0298	-
dc.identifier	ORCiD: Allan Tucker https://orcid.org/0000-0001-5105-3506	-
dc.identifier.citation	Draghi, B. et al. (2021) 'Bayesboost: Identifying and Handling Bias Using Synthetic Data Generators', Proceedings of Machine Learning Research, 154, pp. 49-62. Available at: https://proceedings.mlr.press/v154/draghi21a.html (Accessed: 28 March 2023).	en_US
dc.identifier.issn	2640-3498	-
dc.identifier.uri	https://bura.brunel.ac.uk/handle/2438/28873	-
dc.description	Paper presented at the Third International Workshop on Learning with Imbalanced Domains: Theory and Applications (LIDTA 2021), online, 17 September 2021.	-
dc.description.abstract	Advanced synthetic data generators can model sensitive personal datasets by creating simulated samples of data with realistic correlation structures and distributions, but with a greatly reduced risk of identifying individuals. This has huge potential in medicine where sensitive patient data can be simulated and shared, enabling the development and robust validation of new AI technologies for diagnosis and disease management. However, even when massive ground truth datasets are available (such as UK-NHS databases which contain patient records in the order of millions) there is a high risk that biases still exist which are carried over to the data generators. For example, certain cohorts of patients may be under-represented due to cultural sensitivities amongst some communities, or due to institutionalised procedures in data collection. The under-representation of groups is one of the forms in which bias can manifest itself in machine learning, and it is the one we investigate in this work. These factors may also lead to structurally missing data or incorrect correlations and distributions which will be mirrored in the synthetic data generated from biased ground truth datasets. In this paper, we explore methods to improve synthetic data generators by using probabilistic methods to first identify the difficult-to-predict data samples in ground truth data, and then to boost these types of data when generating synthetic samples. The paper explores attempts to create synthetic data that contain more realistic distributions and that lead to predictive models with better performance.	en_US
dc.description.sponsorship	NHSX.	en_US
dc.language.iso	en	en_US
dc.publisher	ML Research Press	en_US
dc.relation.uri	https://proceedings.mlr.press/v154/draghi21a.html	-
dc.rights	Copyright © 2021 B. Draghi, Z. Wang, P. Myles & A. Tucker. Published by ML Research Press.	-
dc.source	Proceedings of the Third International Workshop on Learning with Imbalanced Domains: Theory and Applications	-
dc.subject	synthetic data generators	en_US
dc.subject	data bias	en_US
dc.subject	over-sampling	en_US
dc.subject	Bayesian network	en_US
dc.title	Bayesboost: Identifying and Handling Bias Using Synthetic Data Generators	en_US
dc.type	Conference Paper	en_US
dc.relation.isPartOf	PMLR 154:49-62, 2021	-
pubs.publication-status	Published	-
pubs.volume	154	-
dc.rights.holder	B. Draghi, Z. Wang, P. Myles & A. Tucker	-
Appears in Collections:	Department of Computer Science Research Papers

Files in This Item:

File	Description	Size	Format
FullText.pdf	Copyright © 2021 B. Draghi, Z. Wang, P. Myles & A. Tucker. Published by ML Research Press.	662.4 kB	Adobe PDF
