The benefit of using a cohort
I am working on a large project and I am stuck on whether I should call variants on the whole cohort jointly or call each sample individually.
Here is my sample info:
- Cohort f42: 42 female Caucasian
- Among the 42, s1 and s5 are NA12878
- All samples were sequenced by the same vendor using the Agilent v4pUTR exome capture kit.
- Mean exome sequencing depth: f42: ~80X for each sample; s1: 108X; s5: 70X
- Gold standard: Genome in a Bottle (GIB) NA12878 call set; the bed file is the intersection of the GIB bed file and the v4pUTR region.
My comparison results on SNP:
Table 1: cohort vs individual calling performance

| Cohort | Filter | Sample | NRS | NRD | OGC |
|--------|----------|--------|-------|-------|-------|
| f42 | filter | s5 | 0.920 | 0.003 | 0.997 |
| s5 | filter | s5 | 0.937 | 0.001 | 0.999 |
| f42 | filter | s1 | 0.921 | 0.002 | 0.998 |
| s1 | filter | s1 | 0.945 | 0.001 | 0.999 |
| f42 | noFilter | s5 | 0.932 | 0.003 | 0.997 |
| s5 | noFilter | s5 | 0.947 | 0.001 | 0.999 |
| f42 | noFilter | s1 | 0.934 | 0.003 | 0.997 |
| s1 | noFilter | s1 | 0.951 | 0.001 | 0.999 |

1. NRS: Non-Reference Sensitivity; NRD: Non-Reference Discrepancy; OGC: Overall Genotype Concordance
2. A cohort value of s5 (or s1) means that sample was called individually
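For readers unfamiliar with the three metrics, here is a minimal sketch of how they can be derived from a genotype concordance matrix. It assumes the commonly quoted definitions (NRS: fraction of truth variant sites that are also called variant; NRD: genotype mismatch rate after excluding sites where both calls are hom-ref; OGC: fraction of matching genotypes over all compared sites). It also assumes that a missed site is counted as hom-ref in the eval set. The function name and the toy counts are made up for illustration and are not taken from my runs.

```python
from typing import Dict, Tuple

GENOTYPES = ("HOM_REF", "HET", "HOM_VAR")
NON_REF = ("HET", "HOM_VAR")

# Keys are (eval_genotype, truth_genotype); values are site counts.
Counts = Dict[Tuple[str, str], int]


def concordance_metrics(counts: Counts) -> Dict[str, float]:
    """Compute NRS, NRD and OGC under the assumed definitions above."""

    def n(eval_gt: str, truth_gt: str) -> int:
        return counts.get((eval_gt, truth_gt), 0)

    total = sum(n(e, t) for e in GENOTYPES for t in GENOTYPES)
    matches = sum(n(g, g) for g in GENOTYPES)

    # Sites that are variant in the truth set, whatever the eval call is.
    truth_variant = sum(n(e, t) for e in GENOTYPES for t in NON_REF)
    # Truth variant sites that the eval set also calls as variant.
    found_variant = sum(n(e, t) for e in NON_REF for t in NON_REF)

    # NRD ignores the (usually dominant) concordant hom-ref cell.
    non_ref_total = total - n("HOM_REF", "HOM_REF")
    non_ref_matches = matches - n("HOM_REF", "HOM_REF")

    return {
        "NRS": found_variant / truth_variant,
        "NRD": 1.0 - non_ref_matches / non_ref_total,
        "OGC": matches / total,
    }


# Hypothetical counts, for illustration only.
toy_counts: Counts = {
    ("HOM_REF", "HOM_REF"): 900,
    ("HET", "HET"): 60,
    ("HOM_VAR", "HOM_VAR"): 30,
    ("HOM_REF", "HET"): 5,   # missed het site
    ("HET", "HOM_VAR"): 3,   # wrong genotype at a true variant site
    ("HET", "HOM_REF"): 2,   # false positive
}
metrics = concordance_metrics(toy_counts)
print(metrics)
```

Because NRD drops the concordant hom-ref cell, it is much more sensitive to genotype errors than OGC, which is why the two can look so different in size for the same call set.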
I do not see a filter option in VariantEval, so I assume the following results cover all called sites:
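Table 2. Number of variants and presence in dbSNP

| Cohort | Sample | Variants | novelSites |
|--------|--------|----------|------------|
| f42 | s5 | 41,978 | 48 |
| s5 | s5 | 42,714 | 47 |
| f42 | s1 | 42,042 | 77 |
| s1 | s1 | 43,088 | 68 |

Table 3. Number of common and unique variants in cohort and individual calling

| Sample | Cohort 1 | Cohort 2 | Intersection | Unique 1 | Unique 2 |
|--------|----------|----------|--------------|----------|----------|
| s5 | f42 | s5 | 41,284 | 694 | 1,430 |
| s1 | f42 | s1 | 41,653 | 389 | 1,435 |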
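As a sanity check on the counts in Tables 2 and 3, the intersection plus each unique count should add up to the corresponding total in Table 2. The short sketch below does that arithmetic and also reports the share of each call set that is unique to it. The variable names are my own, and the numbers are copied from the tables.

```python
# Variant totals from Table 2, keyed by (cohort, sample).
table2_variants = {
    ("f42", "s5"): 41978,
    ("s5", "s5"): 42714,
    ("f42", "s1"): 42042,
    ("s1", "s1"): 43088,
}

# Table 3 rows: sample -> (intersection, unique to f42, unique to individual).
table3_overlap = {
    "s5": (41284, 694, 1430),
    "s1": (41653, 389, 1435),
}

summary = {}
for sample, (shared, only_cohort, only_single) in table3_overlap.items():
    cohort_total = shared + only_cohort
    single_total = shared + only_single

    # The two tables should agree with each other.
    assert cohort_total == table2_variants[("f42", sample)]
    assert single_total == table2_variants[(sample, sample)]

    summary[sample] = {
        "frac_unique_to_cohort": only_cohort / cohort_total,
        "frac_unique_to_single": only_single / single_total,
        "jaccard": shared / (shared + only_cohort + only_single),
    }

for sample, stats in summary.items():
    print(sample, {name: round(value, 4) for name, value in stats.items()})
```

The assertions pass for both samples, so the two tables are internally consistent.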
Basically, the three tables suggest the following:
Cohort calling leads to lower sensitivity and concordance, a higher discrepancy rate, and fewer called variants, although it does find slightly more novel sites.
Did I do something wrong here, or does the benefit of a cohort not apply to samples at 80X depth? The papers demonstrating improved accuracy with multi-sample calling use low-depth samples, such as 4X. Has anyone checked this on deeply sequenced samples?
I appreciate any comments!