Data Set-Adaptive Minimizer Order Reduces Memory Usage in k-Mer Counting

Dan Flomin, David Pellow, Ron Shamir*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

5 Scopus citations

Abstract

The rapid continuous growth of deep sequencing experiments requires development and improvement of many bioinformatic applications for analysis of large sequencing data sets, including k-mer counting and assembly. Several applications reduce memory usage by binning sequences. Binning is done by using minimizer schemes, which rely on a specific order of the minimizers. It has been demonstrated that the choice of the order has a major impact on the performance of the applications. Here we introduce a method for tailoring the order to the data set. Our method repeatedly samples the data set and modifies the order so as to flatten the k-mer load distribution across minimizers. We integrated our method into Gerbil, a state-of-the-art memory-efficient k-mer counter, and were able to reduce its memory footprint by 30%-50% for large k, with only a minor increase in runtime. Our tests also showed that the orders produced by our method yielded superior results when transferred across data sets from the same species, with little or no order change. This enables memory reduction with essentially no increase in runtime.
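
The idea of tailoring the order by repeated sampling can be illustrated with a minimal Python sketch. It is not the published algorithm or the Gerbil integration: the function names (`lexicographic_order`, `minimizer`, `bin_loads`, `adapt_order`), the parameters (`rounds`, `sample_size`, `seed`), and the update rule (in each round, demote the most heavily loaded minimizer to the end of the order) are all assumptions chosen for illustration.

```python
import random
from collections import Counter
from itertools import product


def lexicographic_order(m):
    """Starting order: rank every m-mer over ACGT lexicographically."""
    return {"".join(p): i for i, p in enumerate(product("ACGT", repeat=m))}


def kmers(seq, k):
    """Yield all k-mers of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def minimizer(kmer, m, rank):
    """Return the m-mer inside `kmer` that has the lowest rank in the order."""
    candidates = (kmer[i:i + m] for i in range(len(kmer) - m + 1))
    return min(candidates, key=rank.__getitem__)


def bin_loads(reads, k, m, rank):
    """Count how many k-mers fall into each minimizer's bin."""
    loads = Counter()
    for read in reads:
        for km in kmers(read, k):
            loads[minimizer(km, m, rank)] += 1
    return loads


def adapt_order(reads, k, m, rounds=20, sample_size=50, seed=0):
    """Hypothetical adaptation loop (an assumption, not the paper's rule).

    Each round samples some reads, measures the k-mer load per minimizer,
    and moves the most loaded minimizer to the end of the order so that
    its k-mers are redistributed to other minimizers.
    """
    rng = random.Random(seed)
    rank = lexicographic_order(m)
    next_rank = len(rank)  # ranks beyond the initial range mean "last"
    for _ in range(rounds):
        sample = rng.sample(reads, min(sample_size, len(reads)))
        loads = bin_loads(sample, k, m, rank)
        if not loads:
            break
        heaviest = max(loads, key=loads.get)
        rank[heaviest] = next_rank
        next_rank += 1
    return rank
```

Under these assumptions, `adapt_order(reads, k=28, m=7)` returns a dictionary mapping each 7-mer to its rank, and comparing `max(bin_loads(reads, 28, 7, rank).values())` before and after adaptation gives a rough view of how much the largest bin changed. Because every k-mer is still assigned to exactly one minimizer, the total load is unchanged; only its distribution across bins moves.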

Original language: English
Pages (from-to): 825-838
Number of pages: 14
Journal: Journal of Computational Biology
Volume: 29
Issue number: 8
State: Published - Aug 2022

Funding

Funders (with funder numbers where given):
Zimin Institute for Engineering Solutions Advancing Better Lives
Blavatnik Family Foundation
Ministry of Aliyah and Immigrant Absorption
Israel Science Foundation: 3165/19, 1339/18
Tel Aviv University

    Keywords

    • bin mapping
    • k-mer counting
    • minimizer order
    • minimizer scheme
    • sequencing
