CombineGVCFs & Joint GT - Information reduction?
I ran HaplotypeCaller (HC) on many individuals, which fall into two groups.
Let's call them A and B.
A has 4 sets of combined GVCFs (~400 individuals).
B has 1 set of combined GVCFs (~100 individuals).
I ran joint genotyping (JointGT) on all of A and, separately, on all of B.
A gives 507248 variants
B gives 1030483 variants
Combining these with CombineVariants gives 1207973 variants (1196703 + 11270, from the numbers below).
When doing JointGT on A and B together,
I get 1196703 variants (11270 fewer than combined).
Looking at the discordance gives these numbers:
Present in Combined, but not in Joint:
7335 variants with QualityScore Mean 71, Median 34
Present in Joint, but not in Combined:
978 variants with QualityScore Mean 641, Median 38
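For anyone who wants to reproduce counts like these outside of GATK, here is a minimal sketch in plain Python. It is an assumption-laden illustration, not how SelectVariants decides discordance: sites are matched on the exact (CHROM, POS, REF, ALT) tuple, so a site whose ALT allele list differs between the two callsets counts as discordant on both sides. The function names `read_sites`, `qual_summary` and `compare` are made up for this example.

```python
import statistics


def read_sites(lines):
    """Map (CHROM, POS, REF, ALT) -> QUAL for every record line of a VCF."""
    sites = {}
    for line in lines:
        # Skip blank lines and header lines (## meta lines and #CHROM).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt, qual = fields[:6]
        # A missing QUAL is written as "." in VCF.
        sites[(chrom, int(pos), ref, alt)] = None if qual == "." else float(qual)
    return sites


def qual_summary(sites, keys):
    """Return (count, mean QUAL, median QUAL) for the given site keys."""
    quals = [sites[k] for k in keys if sites[k] is not None]
    if not quals:
        return len(keys), None, None
    return len(keys), statistics.mean(quals), statistics.median(quals)


def compare(combined_lines, joint_lines):
    """Count shared and discordant sites between two callsets."""
    combined = read_sites(combined_lines)
    joint = read_sites(joint_lines)
    only_combined = combined.keys() - joint.keys()
    only_joint = joint.keys() - combined.keys()
    shared = combined.keys() & joint.keys()
    return {
        "shared": len(shared),
        "only_combined": qual_summary(combined, only_combined),
        "only_joint": qual_summary(joint, only_joint),
    }
```

With real files this would be called as `compare(open("combined.vcf"), open("joint.vcf"))`, where both file names are placeholders. Gzipped VCFs would need `gzip.open(path, "rt")` instead.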
I am wondering: Why does that happen?
I was assuming that when using CombineVariants, I don't lose information.
I also assumed that joint genotyping should yield at least as many variants as calling every individual on its own (based on a comment by @Geraldine_VdAuwera in this forum), because the context from the other individuals helps to call hard-to-call variants.
Are these assumptions correct?
And if yes, why do I get these results?
In case someone checks the numbers:
The discordant variants don't add up to the difference between the sets:
7335 + 978 = 11270 - 2957
What are these 2957 variants?
The discordances from both sides (switching -V and -disc) should add up to the difference between the two sets, right?
And the two concordances (from both sides, switching -V and -conc) should be the same.
A ∩ B + A \ B + B \ A = A ∪ B
A \ B + B \ A = A ∪ B - A ∩ B
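The identities above can be checked on toy data. This is only a sketch with made-up variant IDs, using Python's built-in sets; `combined` and `joint` stand in for the two callsets.

```python
# Toy callsets with made-up variant IDs.
combined = {"v1", "v2", "v3", "v4", "v5"}
joint = {"v4", "v5", "v6"}

only_combined = combined - joint   # discordant: in combined, not in joint
only_joint = joint - combined      # discordant: in joint, not in combined
shared = combined & joint          # concordant (same set from either side)
union = combined | joint

# |A ∩ B| + |A \ B| + |B \ A| = |A ∪ B|
assert len(shared) + len(only_combined) + len(only_joint) == len(union)

# |A \ B| + |B \ A| = |A ∪ B| - |A ∩ B|
assert len(only_combined) + len(only_joint) == len(union) - len(shared)

# The difference between the two total counts is the DIFFERENCE of the
# two discordance counts, not their sum.
assert len(combined) - len(joint) == len(only_combined) - len(only_joint)
```

Note the last check: the sum of the two discordances equals the union minus the intersection, while the gap between the two total counts equals one discordance minus the other. Applied to the numbers in this post, 11270 would be compared against 7335 - 978 = 6357 rather than against 7335 + 978 = 8313, assuming every record is counted once per site in both files.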
Apart from that: based on the quality scores and the circumstances, could one consider the discordant variants very likely to be false positives (FP)?
Thanks for clearing things up!