Text Indexing for Long Patterns using Locally Consistent Anchors

Pissis, S; Ayad, LAK; Loukides, G

Please use this identifier to cite or link to this item: http://bura.brunel.ac.uk/handle/2438/30312

Full metadata record

DC Field	Value	Language
dc.contributor.author	Pissis, S	-
dc.contributor.author	Ayad, LAK	-
dc.contributor.author	Loukides, G	-
dc.date.accessioned	2024-12-04T13:57:37Z	-
dc.date.available	2024-12-04T13:57:37Z	-
dc.date.issued	2024-07-16	-
dc.identifier	ORCiD: Lorraine A.K. Ayad https://orcid.org/0000-0003-0846-2616	-
dc.identifier	arXiv:2407.11819v1 [cs.DS]	-
dc.identifier.citation	Pissis, S., Ayad, L.A.K. and Loukides, G. (2024) 'Text Indexing for Long Patterns using Locally Consistent Anchors', arXiv:2407.11819v1 [cs.DS], pp. 1 - 30. doi: https://doi.org/10.48550/arXiv.2407.11819.	en_US
dc.identifier.uri	https://bura.brunel.ac.uk/handle/2438/30312	-
dc.description	The article is available as a preprint, arXiv:2407.11819v1 [cs.DS], available online at https://arxiv.org/abs/2407.11819 , It has not been certified by peer review. Comments: Extended version of a PVLDB 2023 paper. Abstract abridged to satisfy arXiv requirements.	en_US
dc.description.abstract	In many real-world database systems, a large fraction of the data is represented by strings: sequences of letters over some alphabet. This is because strings can easily encode data arising from different sources. It is often crucial to represent such string datasets in a compact form but also to simultaneously enable fast pattern matching queries. This is the classic text indexing problem. The four absolute measures anyone should pay attention to when designing or implementing a text index are: (i) index space; (ii) query time; (iii) construction space; and (iv) construction time. Unfortunately, however, most (if not all) widely-used indexes (e.g., suffix tree, suffix array, or their compressed counterparts) are not optimized for all four measures simultaneously, as it is difficult to have the best of all four worlds. Here, we take an important step in this direction by showing that text indexing with sampling based on locally consistent anchors (lc-anchors) offers remarkably good performance in all four measures, when we have at hand a lower bound ℓ on the length of the queried patterns -- which is arguably a quite reasonable assumption in practical applications. Our index offers average-case guarantees. In our experiments using real benchmark datasets, we show that it compares favorably based on the four measures to all classic indexes: (compressed) suffix tree; (compressed) suffix array; and the FM-index. Notably, we also present a counterpart of our index with worst-case guarantees based on the lc-anchors notion of partitioning sets. To the best of our knowledge, this is the first index achieving the best of all worlds in the regime where we have at hand a lower bound ℓ on the length of the queried patterns.	en_US
dc.description.sponsorship	...	en_US
dc.format.extent	1 - 30	-
dc.format.medium	Electronic	-
dc.publisher	Cornell University	en_US
dc.rights	Attribution 4.0 International	-
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/	-
dc.subject	data structures and algorithms (cs.DS)	en_US
dc.subject	databases (cs.DB)	en_US
dc.title	Text Indexing for Long Patterns using Locally Consistent Anchors	en_US
dc.type	Other	en_US
dc.identifier.doi	https://doi.org/10.48550/arXiv.2407.11819	-
pubs.confidential	false	-
pubs.confidential	false	-
pubs.place-of-publication	arXiv	-
dc.rights.license	https://creativecommons.org/licenses/by/4.0/legalcode.en	-
dc.rights.holder	The Author(s)	-
Appears in Collections:	Department of Computer Science Research Papers

Files in This Item:

File	Description	Size	Format
Preprintv1.pdf	Copyright © 2024 The Author(s). This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/).	3.96 MB	Adobe PDF	View/Open

Show simple item record

This item is licensed under a Creative Commons License