CombineGVCF & Joint GT - Information reduction?
I ran HaplotypeCaller (HC) on many individuals, which fall into two groups.
Let's call them A and B.
A has 4 sets of CombinedGVCFs (~400 individuals).
B has 1 set of CombinedGVCFs (~100 individuals).
I joint-genotyped (JointGT) all of A and all of B separately.
A gives 507248 variants
B gives 1030483 variants
Combining these with CombineVariants gives
When doing JointGT on A and B together,
I get 1196703 variants (11270 fewer than in the combined set).
Looking at the discordance gives these numbers:
Present in Combined, but not in Joint:
7335 variants with QualityScore Mean 71, Median 34
Present in Joint, but not in Combined:
978 variants with QualityScore Mean 641, Median 38
I am wondering: Why does that happen?
I was assuming that I don't lose information when using CombineVariants.
I was also assuming that joint genotyping should give me at least as many variants as calling every individual on its own (based on a comment by @Geraldine_VdAuwera in this forum), because the context of the other individuals helps to call variants that are hard to call.
Are these assumptions correct?
And if yes, why do I get these results?
In case anyone checks the numbers:
The discordant variants don't add up to the difference between the sets:
7335 + 978 = 8313 = 11270 - 2957
What are these 2957 variants?
The discordances from both sides (switching -V and -disc) should add up to the difference of the two sets, right?
And the two concordances (from both sides, switching -V and -conc) should be the same.
A ∩ B + A \ B + B \ A = A ∪ B
A \ B + B \ A = A ∪ B - A ∩ B
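
To make the set arithmetic concrete, here is a minimal Python sketch. The toy variant keys, the `venn_counts` helper and the names `combined` and `joint` are made up for illustration; only the four counts at the bottom come from the numbers above. It checks the two identities on toy data, and it also shows that the difference in set sizes equals the difference of the two discordance counts, whereas their sum equals the union minus the intersection.

```python
def venn_counts(c: set, j: set) -> dict:
    """Count the regions of a two-set Venn diagram."""
    return {
        "only_c": len(c - j),   # present in c, absent from j
        "only_j": len(j - c),   # present in j, absent from c
        "both": len(c & j),     # concordant records
        "union": len(c | j),
    }

# Hypothetical variant keys (chromosome, position), not real data.
combined = {("1", 100), ("1", 200), ("1", 300), ("2", 50)}
joint = {("1", 100), ("1", 200), ("2", 75)}

counts = venn_counts(combined, joint)

# The two identities written out above.
assert counts["both"] + counts["only_c"] + counts["only_j"] == counts["union"]
assert counts["only_c"] + counts["only_j"] == counts["union"] - counts["both"]

# The size difference is the DIFFERENCE of the discordances, not their sum.
assert len(combined) - len(joint) == counts["only_c"] - counts["only_j"]

# Counts reported in this post.
only_combined = 7335   # present in Combined, but not in Joint
only_joint = 978       # present in Joint, but not in Combined
reported_gap = 11270   # Combined total minus Joint total

symmetric_difference = only_combined + only_joint   # sum of discordances
net_difference = only_combined - only_joint         # what the identity predicts
unexplained = reported_gap - symmetric_difference   # the 2957 asked about

print(symmetric_difference, net_difference, unexplained)
```

If both discordance sets were computed on the same record keys as the totals, the identity would predict a gap of `net_difference`, so a mismatch with `reported_gap` may indicate that records are being matched or counted differently in the two steps (an assumption, not something verified here).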
Apart from that, given the quality scores and the circumstances, could one consider the discordant variants very likely to be false positives (FP)?
Thanks for clearing things up!