Improving Bloom filter performance on sequence data using k-mer Bloom filters

David Pellow, Darya Filippova, Carl Kingsford

    Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

    Abstract

    Using a sequence’s k-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. Since k-mer sets often reach hundreds of millions of elements, traditional data structures are impractical for k-mer set storage, and Bloom filters and their variants are used instead. Bloom filters reduce the memory footprint required to store millions of k-mers while allowing for fast set containment queries, at the cost of a low false positive rate. We show that, because k-mers are derived from sequencing reads, the information about k-mer overlap in the original sequence can be used to reduce the false positive rate up to 30× with little or no additional memory and with set containment queries that are only 1.3 to 1.6 times slower. Alternatively, we can leverage k-mer overlap information to store k-mer sets in about half the space while maintaining the original false positive rate. We consider several variants of such k-mer Bloom filters (kBF), derive theoretical upper bounds for their false positive rate, and discuss their range of applications and limitations. We provide a reference implementation of kBF at https://github.com/Kingsford-Group/kbf/.
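
The overlap idea in the abstract can be illustrated with a minimal Python sketch. This is not the reference implementation from the repository above: the `BloomFilter` class, the helpers `kmers`, `neighbors`, and `overlap_query`, the SHA-256-based hashing, and the filter sizes are all hypothetical choices made for brevity. The sketch accepts a query k-mer only if the filter reports both the k-mer itself and at least one k-mer overlapping it by k-1 bases. It assumes every stored k-mer came from a sequence of length at least k+1, so that a true k-mer always has a stored neighbor.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over strings (illustrative, not tuned for speed)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by prefixing the item with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def kmers(seq, k):
    """All length-k substrings of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def neighbors(kmer):
    """The 8 k-mers that overlap kmer by k-1 bases (4 right, 4 left extensions)."""
    for base in "ACGT":
        yield kmer[1:] + base   # shift right, append a base
        yield base + kmer[:-1]  # shift left, prepend a base


def overlap_query(bf, kmer):
    """Accept kmer only if it and at least one overlapping neighbor are in bf."""
    return kmer in bf and any(n in bf for n in neighbors(kmer))


# Hypothetical toy input; real k-mer sets are far larger.
seq, k = "ACGTACGGTCA", 5
bf = BloomFilter(num_bits=1 << 16, num_hashes=3)
for km in kmers(seq, k):
    bf.add(km)

# Every k-mer from seq has a stored neighbor, so none is rejected.
print(all(overlap_query(bf, km) for km in kmers(seq, k)))  # True
```

A false positive now requires a spurious hit on the query and on at least one of its eight neighbors, which is the intuition behind the reduced false positive rate. The price is up to eight extra filter lookups per query.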

    Original language: English
    Title of host publication: Research in Computational Molecular Biology - 20th Annual Conference, RECOMB 2016, Proceedings
    Editors: Mona Singh
    Publisher: Springer Verlag
    Pages: 137-151
    Number of pages: 15
    ISBN (Print): 9783319319568
    State: Published - 2016
    Event: 20th Annual Conference on Research in Computational Molecular Biology, RECOMB 2016 - Santa Monica, United States
    Duration: 17 Apr 2016 to 21 Apr 2016

    Publication series

    Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
    Volume: 9649
    ISSN (Print): 0302-9743
    ISSN (Electronic): 1611-3349

    Conference

    Conference: 20th Annual Conference on Research in Computational Molecular Biology, RECOMB 2016
    Country/Territory: United States
    City: Santa Monica
    Period: 17/04/16 to 21/04/16

    Keywords

    • Bloom filters
    • Efficient data structures
    • K-mers
