Please use this identifier to cite or link to this item: http://bura.brunel.ac.uk/handle/2438/29710
Title: MedT5SQL: a transformers-based large language model for text-to-SQL conversion in the healthcare domain
Authors: Marshan, A
Almutairi, AN
Ioannou, A
Bell, D
Monaghan, A
Arzoky, M
Keywords: text-to-SQL conversion;large language model;transformers;T5 model;NLP;MIMICSQL dataset;healthcare domain
Issue Date: 26-Jun-2024
Publisher: Frontiers Media
Citation: Marshan, A. et al. (2024) 'MedT5SQL: a transformers-based large language model for text-to-SQL conversion in the healthcare domain', Frontiers in Big Data, 7, 1371680, pp. 1 - 20. doi: 10.3389/fdata.2024.1371680.
Abstract: Introduction: In response to the increasing prevalence of electronic medical records (EMRs) stored in databases, healthcare staff are encountering difficulties retrieving these records due to their limited technical expertise in database operations. As these records are crucial for delivering appropriate medical care, there is a need for an accessible method for healthcare staff to access EMRs. Methods: To address this, natural language processing (NLP) for Text-to-SQL has emerged as a solution, enabling non-technical users to generate SQL queries using natural language text. This research assesses existing work on Text-to-SQL conversion and proposes the MedT5SQL model, specifically designed for EMR retrieval. The proposed model utilizes the Text-to-Text Transfer Transformer (T5) model, a Large Language Model (LLM) commonly used in various text-based NLP tasks. The model is fine-tuned on the MIMICSQL dataset, the first Text-to-SQL dataset for the healthcare domain. Performance evaluation involves benchmarking the MedT5SQL model with two optimizers, varying numbers of training epochs, and two datasets, MIMICSQL and WikiSQL. Results: On the MIMICSQL dataset, the model demonstrates considerable effectiveness in generating question-SQL pairs, achieving accuracies of 80.63%, 98.937%, and 90% for the exact match accuracy metric, approximate string-matching, and manual evaluation, respectively. When tested on the WikiSQL dataset, the model generates SQL queries with an accuracy of 44.2% and an approximate string-matching score of 94.26%. Discussion: Results indicate improved performance with increased training epochs. This work highlights the potential of a fine-tuned T5 model to convert medical-related questions written in natural language to Structured Query Language (SQL) in the healthcare domain, providing a foundation for future research in this area.
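Illustrative note: the abstract reports both an exact match accuracy and an approximate string-matching score. The sketch below shows one plausible way to compute two such figures for predicted versus reference SQL strings. It is not the authors' evaluation code: the normalization step, the use of Python's standard-library `difflib.SequenceMatcher` as the similarity measure, the function names, and the sample queries are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(sql: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization step)."""
    return " ".join(sql.lower().split())


def exact_match_accuracy(predicted, reference):
    """Fraction of predictions identical to their reference after normalization."""
    pairs = list(zip(predicted, reference))
    if not pairs:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in pairs)
    return hits / len(pairs)


def mean_string_similarity(predicted, reference):
    """Mean character-level similarity ratio in [0, 1] across all pairs.

    SequenceMatcher is a stand-in; the paper's exact approximate
    string-matching method is not specified in this record.
    """
    pairs = list(zip(predicted, reference))
    if not pairs:
        return 0.0
    total = sum(
        SequenceMatcher(None, normalize(p), normalize(r)).ratio()
        for p, r in pairs
    )
    return total / len(pairs)


# Hypothetical predictions and references, not taken from MIMICSQL or WikiSQL.
predicted = ["SELECT a FROM t", "select b from t"]
reference = ["select a  from t", "select c from t"]

em = exact_match_accuracy(predicted, reference)
sim = mean_string_similarity(predicted, reference)
print(f"exact match: {em:.2%}, mean similarity: {sim:.2%}")
```

A sketch like this also suggests why the two reported figures can differ widely: a query that differs from the reference by a single token counts as a complete miss for exact match but still scores close to 1 on a character-level similarity measure.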
Description: Data availability statement: Publicly available datasets were analyzed in this study. This data can be found at: https://github.com/wangpinggl/TREQS/tree/master/mimicsql_data/mimicsql_natural_v2; https://huggingface.co/datasets/wikisql .
Supplementary material: The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fdata.2024.1371680/full#supplementary-material .
URI: https://bura.brunel.ac.uk/handle/2438/29710
DOI: https://doi.org/10.3389/fdata.2024.1371680
Other Identifiers: ORCiD: Alaa Marshan https://orcid.org/0000-0001-6764-9160
ORCiD: David Bell https://orcid.org/0000-0003-3148-6691
ORCiD: Mahir Arzoky https://orcid.org/0000-0002-2721-643X
1371680
Appears in Collections: Dept of Computer Science Research Papers

Files in This Item:
File: FullText.pdf
Description: Copyright © 2024 Marshan, Almutairi, Ioannou, Bell, Monaghan and Arzoky. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Size: 1.18 MB
Format: Adobe PDF


This item is licensed under a Creative Commons License.