TY - JOUR
T1 - Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters
AU - Pellow, David
AU - Filippova, Darya
AU - Kingsford, Carl
N1 - Publisher Copyright: © 2017, Mary Ann Liebert, Inc.
PY - 2017/6
Y1 - 2017/6
N2 - Using a sequence's k-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. As k-mer sets often reach hundreds of millions of elements, traditional data structures are often impractical for k-mer set storage, and Bloom filters (BFs) and their variants are used instead. BFs reduce the memory footprint required to store millions of k-mers while allowing for fast set containment queries, at the cost of a low false positive rate (FPR). We show that, because k-mers are derived from sequencing reads, the information about k-mer overlap in the original sequence can be used to reduce the FPR by up to 30× with little or no additional memory and with set containment queries that are only 1.3 to 1.6 times slower. Alternatively, we can leverage k-mer overlap information to store k-mer sets in about half the space while maintaining the original FPR. We consider several variants of such k-mer Bloom filters (kBFs), derive theoretical upper bounds for their FPR, and discuss their range of applications and limitations.
AB - Using a sequence's k-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. As k-mer sets often reach hundreds of millions of elements, traditional data structures are often impractical for k-mer set storage, and Bloom filters (BFs) and their variants are used instead. BFs reduce the memory footprint required to store millions of k-mers while allowing for fast set containment queries, at the cost of a low false positive rate (FPR). We show that, because k-mers are derived from sequencing reads, the information about k-mer overlap in the original sequence can be used to reduce the FPR by up to 30× with little or no additional memory and with set containment queries that are only 1.3 to 1.6 times slower. Alternatively, we can leverage k-mer overlap information to store k-mer sets in about half the space while maintaining the original FPR. We consider several variants of such k-mer Bloom filters (kBFs), derive theoretical upper bounds for their FPR, and discuss their range of applications and limitations.
KW - Bloom filters
KW - Efficient data structures
KW - genomics
KW - k-mers
KW - string algorithms
UR - http://www.scopus.com/inward/record.url?scp=85020483730&partnerID=8YFLogxK
U2 - 10.1089/cmb.2016.0155
DO - 10.1089/cmb.2016.0155
M3 - Article
C2 - 27828710
AN - SCOPUS:85020483730
SN - 1066-5277
VL - 24
SP - 547
EP - 557
JO - Journal of Computational Biology
JF - Journal of Computational Biology
IS - 6
ER -