TY - JOUR
T1 - A random forest-based framework for genotyping and accuracy assessment of copy number variations
AU - Zhuang, Xuehan
AU - Ye, Rui
AU - So, Man Ting
AU - Lam, Wai Yee
AU - Karim, Anwarul
AU - Yu, Michelle
AU - Ngo, Ngoc Diem
AU - Cherny, Stacey S.
AU - Tam, Paul Kwong Hang
AU - Garcia-Barcelo, Maria Mercè
AU - Tang, Clara Sze Man
AU - Sham, Pak Chung
N1 - Publisher Copyright: © The Author(s) 2020. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.
PY - 2020/9/1
Y1 - 2020/9/1
AB - Detection of copy number variations (CNVs) is essential for uncovering genetic factors underlying human diseases. However, CNV detection by current methods is prone to error, and precisely identifying CNVs from paired-end whole genome sequencing (WGS) data is still challenging. Here, we present a framework, CNV-JACG, for Judging the Accuracy of CNVs and Genotyping using paired-end WGS data. CNV-JACG is based on a random forest model trained on 21 distinctive features characterizing the CNV region and its breakpoints. Using the data from the 1000 Genomes Project, Genome in a Bottle Consortium, the Human Genome Structural Variation Consortium and in-house technical replicates, we show that CNV-JACG has superior sensitivity over the latest genotyping method, SV2, particularly for the small CNVs (≤1 kb). We also demonstrate that CNV-JACG outperforms SV2 in terms of Mendelian inconsistency in trios and concordance between technical replicates. Our study suggests that CNV-JACG would be a useful tool in assessing the accuracy of CNVs to meet the ever-growing needs for uncovering the missing heritability linked to CNVs.
UR - https://www.scopus.com/pages/publications/85113271334
U2 - 10.1093/nargab/lqaa071
DO - 10.1093/nargab/lqaa071
M3 - Article
AN - SCOPUS:85113271334
SN - 2631-9268
VL - 2
JO - NAR Genomics and Bioinformatics
JF - NAR Genomics and Bioinformatics
IS - 3
M1 - lqaa071
ER -