TY - JOUR
T1 - A Biterm Topic Model for Sparse Mutation Data
AU - Sason, Itay
AU - Chen, Yuexi
AU - Leiserson, Mark D.M.
AU - Sharan, Roded
N1 - Publisher Copyright: © 2023 by the authors.
PY - 2023/3
Y1 - 2023/3
AB - Mutational signature analysis promises to reveal the processes that shape cancer genomes for applications in diagnosis and therapy. However, most current methods are geared toward rich mutation data that has been extracted from whole-genome or whole-exome sequencing. Methods that process sparse mutation data typically found in practice are only in the earliest stages of development. In particular, we previously developed the Mix model that clusters samples to handle data sparsity. However, the Mix model had two hyper-parameters, including the number of signatures and the number of clusters, that were very costly to learn. Therefore, we devised a new method that was several orders-of-magnitude more efficient for handling sparse data, was based on mutation co-occurrences, and imitated word co-occurrence analyses of Twitter texts. We showed that the model produced significantly improved hyper-parameter estimates that led to higher likelihoods of discovering overlooked data and had better correspondence with known signatures.
KW - biterm topic model
KW - mutational signature
KW - panel sequencing data
UR - http://www.scopus.com/inward/record.url?scp=85149789087&partnerID=8YFLogxK
U2 - 10.3390/cancers15051601
DO - 10.3390/cancers15051601
M3 - Article
C2 - 36900390
AN - SCOPUS:85149789087
SN - 2072-6694
VL - 15
JO - Cancers
JF - Cancers
IS - 5
M1 - 1601
ER -