# CombineGVCF & Joint GT - Information reduction?

BerlinMember

Hi Team,

I ran HC on many individuals, which fall into two groups.
Let's call them A and B.
A has 4 sets of CombinedGVCFs (~400 individuals).
B has 1 set of CombinedGVCFs (~100 individuals).

I ran JointGT on all of A and, separately, on all of B.
A gives 507248 variants
B gives 1030483 variants

Combining these with CombineVariants gives
1207973 variants

When doing JointGT on A and B together,
I get 1196703 variants (11270 fewer than combined)

Looking at the discordance gives these numbers:
Present in Combined, but not in Joint:
7335 variants with QualityScore Mean 71, Median 34

Present in Joint, but not in Combined:
978 variants with QualityScore Mean 641, Median 38
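
For anyone who wants to reproduce this kind of comparison outside GATK, here is a minimal sketch of how the two discordance lists and their quality summaries could be computed. It assumes two records are the same variant when CHROM, POS, REF and ALT match exactly; that is an assumption of this sketch and not necessarily how the -disc option decides. The function names are hypothetical and the records are made up for illustration.

```python
import statistics

def load_sites(vcf_lines):
    """Map (CHROM, POS, REF, ALT) -> QUAL for every data line.

    Assumes tab-separated VCF lines with a numeric QUAL column.
    """
    sites = {}
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        key = (fields[0], int(fields[1]), fields[3], fields[4])
        sites[key] = float(fields[5])
    return sites

def discordant(sites_a, sites_b):
    """QUAL values of sites present in sites_a but absent from sites_b."""
    return [qual for key, qual in sites_a.items() if key not in sites_b]

# Made-up records, NOT taken from the real call sets.
combined = load_sites([
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tG\tA\t37.89\t.\t.",
    "1\t200\t.\tT\tTA\t36.35\t.\t.",
    "1\t300\t.\tC\tT\t500.00\t.\t.",
])
joint = load_sites([
    "1\t300\t.\tC\tT\t512.40\t.\t.",
    "1\t400\t.\tA\tG\t38.00\t.\t.",
])

only_combined = discordant(combined, joint)
only_joint = discordant(joint, combined)

for label, quals in (("Combined only", only_combined), ("Joint only", only_joint)):
    print(label, len(quals), statistics.mean(quals), statistics.median(quals))
```

With this key, a site that appears at the same position but with a different ALT representation in the two files counts as discordant on both sides, so the choice of key directly affects the counts.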

I am wondering: why does that happen?
I was assuming that I don't lose information when using CombineVariants.
And when joint genotyping, I should actually get as many or more variants than when calling every individual on its own (based on a comment of @Geraldine_VdAuwera in this forum), because the context from the other individuals helps to call 'complicated to call' variants.
Are these assumptions correct?
And if so, why do I get these results?

In case someone checks the numbers:
The discordant variants don't add up to the difference between the sets:
7335 + 978 = 11270 - 2957
What are these 2957 variants???
The discordances from both sides (switching -V and -disc) should add up to the difference between the two sets, right?
And the two concordances (from both sides, switching -V and -conc) should be the same.
A ∩ B + A \ B + B \ A = A ∪ B
A \ B + B \ A = A ∪ B - A ∩ B
So.. ???
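
A small sketch of the set arithmetic may help here. For two sets, the two discordances add up to the union minus the intersection, while the difference in set sizes equals the difference of the two discordances, not their sum. The toy sets below are illustrative only, and the check on the thread's numbers assumes every record is counted exactly once under a unique key in both files, which may not hold for real VCFs.

```python
# Toy sets standing in for variant keys; the letters are illustrative only.
combined = {"a", "b", "c", "d", "e"}
joint = {"c", "d", "e", "f"}

only_combined = combined - joint   # present in Combined, not in Joint
only_joint = joint - combined      # present in Joint, not in Combined
shared = combined & joint          # concordant in both directions

# The union splits into three disjoint parts.
assert len(shared) + len(only_combined) + len(only_joint) == len(combined | joint)

# The two discordances sum to union minus intersection ...
assert len(only_combined) + len(only_joint) == len(combined | joint) - len(shared)

# ... but the difference in set sizes is the difference of the discordances.
assert len(combined) - len(joint) == len(only_combined) - len(only_joint)

# The same arithmetic applied to the counts reported in this post.
size_gap = 1207973 - 1196703        # Combined minus Joint
disc_sum = 7335 + 978               # sum of both discordances
disc_gap = 7335 - 978               # difference of both discordances
unexplained = size_gap - disc_gap   # remainder if records were unique keys

print(size_gap, disc_sum, disc_gap, unexplained)
```

Under that uniqueness assumption the reported size gap would have to equal `disc_gap`. A nonzero `unexplained` remainder would then suggest that the files are not being compared as plain sets of unique records, for example because one position carries several records or alleles.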

Apart from that: based on the quality scores and the circumstances, could one consider the discordant variants very likely to be false positives (FP)?

Thanks for clearing things up!
Alexander

@AlexanderV
Hi Alexander,

Can you please post some example records that are in the Combined VCF and not in the Joint Genotyped VCF? Also, please post some records that are in the Joint Genotyped VCF and not in the Combined VCF. I suspect this has to do with the way some indels and mixed records are handled.

Thanks,
Sheila

• BerlinMember
edited September 2015

Okay.

I removed the genotype columns (too many to post).

Present in the Combined VCF but not in the Joint Genotyped VCF: the variants always seem to have genotypes from either group A or group B, but not both (XOR).
For the other group, the genotype is ./.

Present in the Joint Genotyped VCF but not in the Combined VCF:
All sites have at least ./.:0,0:0 and not just ./. as in the other set.

Combined VCF and not in the Joint Genotyped VCF, B genotyped:

1       472174  .       G       A       37.89   .       AC=2;AF=0.013;AN=160;DP=279;FS=0.000;InbreedingCoeff=0.0251;MLEAC=1;MLEAF=6.250e-03;MQ=60.00;QD=18.94;SOR=2.303;set=
1       5139699 .       T       TA      36.35   .       AC=2;AF=0.048;AN=42;DP=123;FS=0.000;InbreedingCoeff=0.1972;MLEAC=1;MLEAF=0.024;MQ=60.00;QD=18.17;SOR=2.303;set=varia

Combined VCF and not in the Joint Genotyped VCF, A genotyped:

Joint Genotyped VCF and not in the Combined VCF:

For position 1:29495911, I checked the Combined VCF (where it is reported as missing), and it is indeed not there.
The surrounding variants are at ...05 and ...23.

Tell me if you need something else.

@AlexanderV
Hi Alexander,

Hmm. I am not sure what is happening here, but I think it would be easier if you submit a bug report. http://gatkforums.broadinstitute.org/discussion/1894/how-do-i-submit-a-detailed-bug-report

Thanks,
Sheila

• BerlinMember

Done.
The file is called 2015-09-10_error_variant_overlap.tar.bz2

http://potato.plantbiology.msu.edu/data/PGSC_DM_v4.03_pseudomolecules.fasta.zip
http://potato.plantbiology.msu.edu/data/PGSC_DM_v4.03_unanchored_regions_chr00.fasta.zip
zcat PGSC_DM_v4.03_pseudomolecules.fasta.zip PGSC_DM_v4.03_unanchored_regions_chr00.fasta.zip | sed -e 's/ST4\.03ch00/999/g' -e 's/ST4\.03ch0//g' -e 's/ST4\.03ch//g' > PGSC_DM_v4.03_all.renamed.fasta