Please use this identifier to cite or link to this item: http://bura.brunel.ac.uk/handle/2438/32676
Full metadata record
DC FieldValueLanguage
dc.contributor.authorAyad, LAK-
dc.contributor.authorFici, G-
dc.contributor.authorKoerkamp, RG-
dc.contributor.authorLoukides, G-
dc.contributor.authorPatro, R-
dc.contributor.authorPibiri, GE-
dc.contributor.authorPissis, SP-
dc.coverage.spatialVenice, Italy-
dc.date.accessioned2026-01-19T14:13:47Z-
dc.date.available2026-01-19T14:13:47Z-
dc.date.issued2025-07-15-
dc.identifierORCiD: Lorraine A. K. Ayad https://orcid.org/0000-0003-0846-2616-
dc.identifierORCiD: Gabriele Fici https://orcid.org/0000-0002-3536-327X-
dc.identifierORCiD: Ragnar Groot Koerkamp https://orcid.org/0000-0002-2091-1237-
dc.identifierORCiD: Grigorios Loukides https://orcid.org/0000-0003-0888-5061-
dc.identifierORCiD: Rob Patro https://orcid.org/0000-0001-8463-1675-
dc.identifierORCiD: Giulio Ermanno Pibiri https://orcid.org/0000-0003-0724-7092-
dc.identifierORCiD: Solon P. Pissis https://orcid.org/0000-0002-1445-1932-
dc.identifierArticle number: 4-
dc.identifierarXiv:2502.14488v3 [cs.DS]-
dc.identifier.citationAyad, L.A.K. et al. (2025) 'U-Index: A Universal Indexing Framework for Matching Long Patterns', Leibniz International Proceedings in Informatics Lipics, 2025, 338, pp. 4-1 - 4-18. doi: 10.4230/LIPIcs.SEA.2025.4.en_US
dc.identifier.urihttps://bura.brunel.ac.uk/handle/2438/32676-
dc.descriptionA related version of this article is available at arXiv:2502.14488v3 [cs.DS] (https://arxiv.org/abs/2502.14488). Comments: SEA-2025 version. 18 pages, 6 figures, code available at https://github.com/u-index/u-index-rs . ACM classes: F.2.2; J.3. Submission history: From: Ragnar Groot Koerkamp: [v1] Thu, 20 Feb 2025 12:09:34 UTC (383 KB); [v2] Fri, 21 Feb 2025 13:35:43 UTC (383 KB); [v3] Tue, 27 May 2025 12:05:04 UTC (459 KB).-
dc.description.abstractMotivation. Text indexing is a fundamental and well-studied problem. Classic solutions to this problem either replace the original text with a compressed representation, e.g., the FM-index and its variants, or keep it uncompressed but attach some redundancy - an index - to accelerate matching, e.g., the suffix array. The former solutions thus retain excellent compressed space, but are practically slow to construct and query. The latter approaches, instead, sacrifice space efficiency but are typically faster; for example, the suffix array takes much more space than the text itself for commonly used alphabets, like ASCII or DNA, but it is very fast to construct and query. Methods. In this paper, we show that efficient text indexing can be achieved using just a small extra space on top of the original text, provided that the query patterns are sufficiently long. More specifically, we develop a new indexing paradigm in which a sketch of a query pattern is first matched against a sketch of the text. Once candidate matches are retrieved, they are verified using the original text. This paradigm is thus universal in the sense that it allows us to use any solution to index the sketched text, like a suffix array, FM-index, or r-index. Results. We explore both the theory and the practice of this universal framework. With an extensive experimental analysis, we show that, surprisingly, universal indexes can be constructed much faster than their unsketched counterparts and take a fraction of the space, as a direct consequence of (i) having a lower bound on the length of patterns and (ii) working in sketch space. Furthermore, these data structures have the potential of retaining or even improving query time, because matching against the sketched text is faster and verifying candidates can be theoretically done in constant time per occurrence (or, in practice, by short and cache-friendly scans of the text). Finally, we discuss some important applications of this novel indexing paradigm to computational biology. We hypothesize that such indexes will be particularly effective when the queries are sufficiently long, and so we demonstrate applications in long-read mapping.en_US
dc.description.sponsorshipGabriele Fici: Supported by MUR project PRIN 2022 APML – 20229BCXNW, funded by the European Union -– Mission 4 “Education and Research” C2 - Investment 1.1. CUP Master_B53D23012910006. Ragnar Groot Koerkamp: ETH Research Grant ETH-1721-1 to Gunnar Rätsch. Rob Patro: NIH grant award number R01HG009937, NSF award CNS-1763680 and grants 252586 and 2024342821 from the Chan Zuckerberg Initiative DAF, an advised fund of Silicon Valley Community Foundation. RP is a co-founder of Ocean Genomics, Inc. Giulio Ermanno Pibiri: European Union’s Horizon Europe research and innovation programme (EFRA project, Grant Agreement Number 101093026). This work was also partially supported by DAIS – Ca’ Foscari University of Venice within the IRIDE program. Solon P. Pissis: Supported by the PANGAIA and ALPACA projects that have received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements No 872539 and 956229, respectively.en_US
dc.format.extent4-1 - 4-18-
dc.format.mediumElectronic-
dc.languageEnglish-
dc.language.isoen_USen_US
dc.publisherSchloss Dagstuhl Leibniz-Zentrum für Informatiken_US
dc.relation.isversionofarXiv:2502.14488v3 [cs.DS]-
dc.relation.urihttps://github.com/u-index/u-index-rs-
dc.relation.urihttps://arxiv.org/abs/2502.14488-
dc.relation.urihttps://creativecommons.org/licenses/by/4.0/-
dc.rightsCreative Commons Attribution 4.0 International-
dc.source23rd International Symposium on Experimental Algorithms (SEA 2025)-
dc.source23rd International Symposium on Experimental Algorithms (SEA 2025)-
dc.subjecttext Indexingen_US
dc.subjectsketchingen_US
dc.subjectminimizersen_US
dc.subjecthashingen_US
dc.titleU-Index: A Universal Indexing Framework for Matching Long Patternsen_US
dc.typeConference Paperen_US
dc.identifier.doihttps://doi.org/10.4230/LIPIcs.SEA.2025.4-
dc.relation.isPartOfLeibniz International Proceedings in Informatics Lipics-
pubs.finish-date2025-07-24-
pubs.finish-date2025-07-24-
pubs.publication-statusPublished-
pubs.start-date2025-07-22-
pubs.start-date2025-07-22-
pubs.volume338-
dc.identifier.eissn1868-8969-
dc.rights.licensehttps://creativecommons.org/licenses/by/4.0/legalcode.en-
dc.rights.holderLorraine A. K. Ayad, Gabriele Fici, Ragnar Groot Koerkamp, Grigorios Loukides, Rob Patro, Giulio, Ermanno Pibiri, and Solon P. Pissis-
dc.contributor.orcidAyad, Lorraine A. K. [0000-0003-0846-2616]-
dc.contributor.orcidFici, Gabriele [0000-0002-3536-327X]-
dc.contributor.orcidKoerkamp, Ragnar Groot [0000-0002-2091-1237]-
dc.contributor.orcidLoukides, Grigorios [0000-0003-0888-5061]-
dc.contributor.orcidPatro, Rob [0000-0001-8463-1675]-
dc.contributor.orcidPibiri , Giulio Ermanno [0000-0003-0724-7092]-
dc.contributor.orcidPissis, Solon P. [0000-0002-1445-1932]-
Appears in Collections:Dept of Computer Science Research Papers

Files in This Item:
File Description SizeFormat 
FullText.pdfCopyright © 2025 Lorraine A. K. Ayad, Gabriele Fici, Ragnar Groot Koerkamp, Grigorios Loukides, Rob Patro, Giulio, Ermanno Pibiri, and Solon P. Pissis; licensed under Creative Commons Attribution 4.0 International license (https://creativecommons.org/licenses/by/4.0/legalcode).2.23 MBAdobe PDFView/Open


Items in BURA are protected by copyright, with all rights reserved, unless otherwise indicated.