Please use this identifier to cite or link to this item:
http://bura.brunel.ac.uk/handle/2438/30635
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Malin, B | - |
dc.contributor.author | Kalganova, T | - |
dc.contributor.author | Boulgouris, N | - |
dc.date.accessioned | 2025-02-02T11:37:55Z | - |
dc.date.available | 2025-02-02T11:37:55Z | - |
dc.date.issued | 2024-12-31 | - |
dc.identifier | ORCiD: Ben Malin https://orcid.org/0009-0006-5791-2555 | - |
dc.identifier | ORCiD: Tatiana Kalganova https://orcid.org/0000-0003-4859-7152 | - |
dc.identifier | ORCiD: Nikolaos Boulgouris https://orcid.org/0000-0002-5382-6856 | - |
dc.identifier | arXiv:2501.00269v1 [cs.CL] | - |
dc.identifier.citation | Malin, B., Kalganova, T. and Boulgouris, N. (2024) 'A review of faithfulness metrics for hallucination assessment in Large Language Models', arXiv preprint, arXiv:2501.00269v1 [cs.CL], pp. 1 - 13. doi: 10.48550/arXiv.2501.00269. | en_US |
dc.identifier.uri | https://bura.brunel.ac.uk/handle/2438/30635 | - |
dc.description.abstract | This review examines the means by which faithfulness has been evaluated across open-ended summarization, question-answering and machine translation tasks. We find that using an LLM as a faithfulness evaluator is commonly the metric most highly correlated with human judgement. The means by which other studies have mitigated hallucinations are discussed, with both retrieval-augmented generation (RAG) and prompting-framework approaches having been linked with superior faithfulness, and further recommendations for mitigation are provided. Research into faithfulness is integral to the continued widespread use of LLMs, as unfaithful responses can pose major risks in many areas where LLMs would otherwise be suitable. Furthermore, evaluating open-ended generation provides a more comprehensive measure of LLM performance than commonly used multiple-choice benchmarking, which can help advance the trust that can be placed in LLMs. | en_US |
dc.description.sponsorship | This work has been funded by the European Union. | en_US |
dc.format.extent | 1 - 13 | - |
dc.format.medium | Electronic | - |
dc.language.iso | en_US | en_US |
dc.publisher | Cornell University | en_US |
dc.relation.uri | https://arxiv.org/abs/2501.00269v1 | - |
dc.rights | Attribution 4.0 International | - |
dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | - |
dc.subject | cs.CL | en_US |
dc.subject | evaluation | en_US |
dc.subject | fact extraction | en_US |
dc.subject | faithfulness | en_US |
dc.subject | hallucination | en_US |
dc.subject | LLM | en_US |
dc.subject | machine translation | en_US |
dc.subject | question-answering | en_US |
dc.subject | RAG | en_US |
dc.subject | summarization | en_US |
dc.title | A review of faithfulness metrics for hallucination assessment in Large Language Models | en_US |
dc.type | Preprint | en_US |
dc.identifier.doi | https://doi.org/10.48550/arXiv.2501.00269 | - |
dc.rights.license | https://creativecommons.org/licenses/by/4.0/legalcode.en | - |
dc.rights.holder | The Author(s) | - |
Appears in Collections: Dept of Electronic and Electrical Engineering Research Papers
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
Preprint.pdf | Copyright © 2024 The Author(s). This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/). | 307.69 kB | Adobe PDF |