Statistical modeling of SARS-CoV-2 substitution processes: predicting the next variant

Keren Levinstein Hallak, Saharon Rosset*

*Corresponding author for this work

Research output: Contribution to journal, Article, peer-review


We build statistical models that describe the substitution process in SARS-CoV-2 as a function of explanatory factors describing the sequence, its function, and more. These models serve two purposes: first, to gain knowledge about the evolutionary biology of the virus; and second, to predict future mutations in the virus, in particular non-synonymous amino acid substitutions that create new variants. We use tens of thousands of publicly available SARS-CoV-2 sequences and consider tens of thousands of candidate models. Through a careful validation process, we confirm that our chosen models are indeed able to predict new amino acid substitutions: candidates ranked highly by our models are eight times more likely to occur than random amino acid changes. We also show that named variants were ranked highly by our models before their appearance, emphasizing the value of our models for identifying likely variants and for potentially using this knowledge in vaccine design and other aspects of the ongoing battle against COVID-19.
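
To make the "eight times more likely" statement concrete, the following is a minimal sketch of how such an enrichment ratio could be computed: the occurrence rate among the top-ranked candidates divided by the occurrence rate among all candidates. It is not the authors' validation procedure; the function name `top_k_enrichment`, the choice of a top-k cutoff, and the toy scores and outcomes are all assumptions made for illustration.

```python
def top_k_enrichment(scores, occurred, k):
    """Occurrence rate among the k highest-scored candidates,
    divided by the occurrence rate among all candidates."""
    if len(scores) != len(occurred):
        raise ValueError("scores and occurred must have the same length")
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of candidates")

    # Indices of candidates sorted from highest to lowest model score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    top_rate = sum(occurred[i] for i in order[:k]) / k
    base_rate = sum(occurred) / len(occurred)
    if base_rate == 0:
        raise ValueError("no candidate occurred; enrichment is undefined")
    return top_rate / base_rate


# Hypothetical toy data: one model score per candidate substitution and
# a 0/1 flag for whether it was later observed (not real results).
scores = [0.9, 0.8, 0.1, 0.2, 0.3, 0.05, 0.4, 0.15]
occurred = [1, 1, 0, 0, 0, 0, 0, 0]

enrichment = top_k_enrichment(scores, occurred, k=2)
print(enrichment)
```

In this toy example both top-ranked candidates occurred, while only a quarter of all candidates did. A ratio above 1 means the ranking concentrates observed substitutions near the top; a ratio of 8 would correspond to the figure reported above.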

Original language: English
Article number: 285
Journal: Communications Biology
Issue number: 1
State: Published - Dec 2022

