On the Reliability of Watermarks for Large Language Models

Kirchenbauer, J; Geiping, J; Wen, Y; Shu, M; Saifullah, K; Kong, K; Fernando, K; Saha, A; Goldblum, M; Goldstein, T

Please use this identifier to cite or link to this item: http://bura.brunel.ac.uk/handle/2438/29014

Title:	On the Reliability of Watermarks for Large Language Models
Authors:	Kirchenbauer, J; Geiping, J; Wen, Y; Shu, M; Saifullah, K; Kong, K; Fernando, K; Saha, A; Goldblum, M; Goldstein, T
Keywords:	machine learning (cs.LG); computation and language (cs.CL); cryptography and security (cs.CR)
Issue Date:	7-May-2024
Publisher:	ICLR
Citation:	Kirchenbauer, J. et al. (2024) 'On the Reliability of Watermarks for Large Language Models', Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7-11 May, pp. 1-9. doi: 10.48550/arXiv.2306.04634 [Available at: https://arxiv.org/abs/2306.04634v4 (Accessed: 15 May 2024)].
Abstract:	As LLMs become commonplace, machine-generated text has the potential to flood the internet with spam, social media bots, and valueless content. Watermarking is a simple and effective strategy for mitigating such harms by enabling the detection and documentation of LLM-generated text. Yet a crucial question remains: How reliable is watermarking in realistic settings in the wild? There, watermarked text may be modified to suit a user's needs, or entirely rewritten to avoid detection. We study the robustness of watermarked text after it is re-written by humans, paraphrased by a non-watermarked LLM, or mixed into a longer hand-written document. We find that watermarks remain detectable even after human and machine paraphrasing. While these attacks dilute the strength of the watermark, paraphrases are statistically likely to leak n-grams or even longer fragments of the original text, resulting in high-confidence detections when enough tokens are observed. For example, after strong human paraphrasing the watermark is detectable after observing 800 tokens on average, when setting a 1e-5 false positive rate. We also consider a range of new detection schemes that are sensitive to short spans of watermarked text embedded inside a large document, and we compare the robustness of watermarking to other kinds of detectors.
Description:	This is the accepted version of the conference paper archived online at arXiv:2306.04634v4 [cs.LG], https://arxiv.org/abs/2306.04634v4. Comments: 9 pages in the main body. Published at ICLR 2024 (https://iclr.cc/virtual/2024/poster/19147). Code is available at https://github.com/jwkirchenbauer/lm-watermarking
URI:	https://bura.brunel.ac.uk/handle/2438/29014
DOI:	https://doi.org/10.48550/arXiv.2306.04634
Other Identifiers:	ORCiD: Kasun Fernando https://orcid.org/0000-0003-1489-9566; arXiv:2306.04634v4 [cs.LG]
Appears in Collections:	Dept of Mathematics Research Papers

Files in This Item:

File	Description	Size	Format
FullText.pdf	Copyright © 2024 The Authors. The submitter granted the following license to arXiv.org on submission of an article: I grant arXiv.org a perpetual, non-exclusive license to distribute this article. I certify that I have the right to grant this license. I understand that submissions cannot be completely removed once accepted. I understand that arXiv.org reserves the right to reclassify or reject any submission.	3.04 MB	Adobe PDF
