Please use this identifier to cite or link to this item: http://bura.brunel.ac.uk/handle/2438/26881
Title: Text Indexing for Long Patterns: Anchors are All you Need
Authors: Ayad, LAK
Pissis, SP
Loukides, G
Issue Date: 10-Jul-2023
Publisher: Association for Computing Machinery
Citation: Ayad, L.A.K., Loukides, G. and Pissis, S.P. (2023) ‘Text Indexing for Long Patterns: Anchors are All you Need’, Proceedings of the VLDB Endowment, 16 (9), pp. 2117 - 2131. doi; 10.14778/3598581.3598586.
Abstract: Copyright © 2023 the owner/author(s). In many real-world database systems, a large fraction of the data is represented by strings: sequences of letters over some alphabet. This is because strings can easily encode data arising from different sources. It is often crucial to represent such string datasets in a compact form but also to simultaneously enable fast pattern matching queries. This is the classic text indexing problem. The four absolute measures anyone should pay attention to when designing or implementing a text index are: (i) index space; (ii) query time; (iii) construction space; and (iv) construction time. Unfortunately, however, most (if not all) widely-used indexes (e.g., suffix tree, suffix array, or their compressed counterparts) are not optimized for all four measures simultaneously, as it is difficult to have the best of all four worlds. Here, we take an important step in this direction by showing that text indexing with locally consistent anchors (lc-anchors) offers remarkably good performance in all four measures, when we have at hand a lower bound l on the length of the queried patterns --- which is arguably a quite reasonable assumption in practical applications. Specifically, we improve on the construction of the index proposed by Loukides and Pissis, which is based on bidirectional string anchors (bd-anchors), a new type of lc-anchors, by: (i) designing an average-case linear-time algorithm to compute bd-anchors; and (ii) developing a semi-external-memory implementation to construct the index in small space using near-optimal work. We then present an extensive experimental evaluation, based on the four measures, using real benchmark datasets. The results show that, for long patterns, the index constructed using our improved algorithms compares favorably to all classic indexes: (compressed) suffix tree; (compressed) suffix array; and the FM-index.
Description: PVLDB Artifact Availability: The source code, data, and/or other artifacts have been made available at https://github.com/lorrainea/BDA- index.
URI: https://bura.brunel.ac.uk/handle/2438/26881
DOI: https://doi.org/10.14778/3598581.3598586
Other Identifiers: ORCID iD: Lorraine A.K. Ayad https://orcid.org/0000-0003-0846-2616; Grigorios Loukides https://orcid.org/0000-0003-0888-5061; Solon P. Pissis https://orcid.org/0000-0002-1445-1932.
Appears in Collections:Dept of Computer Science Research Papers

Files in This Item:
File Description SizeFormat 
FullText.pdf© 2023 Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Proceedings of the VLDB Endowment, Vol. 16,No. 9I SSN 2150-8097. doi:10.14778/3598581.35985864.11 MBAdobe PDFView/Open


This item is licensed under a Creative Commons License Creative Commons