Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

CombineGVCF & Joint GT - Information reduction?

Hi Team,

I used HC on many individuals and have two groups.
Lets call them A and B.
A has 4 sets of CombinedGVCFs (~400 individuals).
B has 1 set of CombinedGVCFs (~100 individuals).

I have JointGT of all A and of all B.
A gives 507248 variants
B gives 1030483 variants

Combining these with CombineVariants gives
1207973 variants

When doing JointGT on A and B together,
I get 1196703 variants (11270 less than combined)

Looking at the disconcordance give these numbers:
Present in Combined, but not in Joint:
7335 variants with QualityScore Mean 71, Median 34

Present in Joint, but not in Combined:
978 variants with QualityScore Mean 641, Median 38

I am wondering: Why does that happen?
I was assuming, that when using CombineVariants, I don't loose information.
And when JointGenotyping, I should actually get equal or more variants than when calling every individual on its own (based on a comment of @Geraldine_VdAuwera in this forum), because the other individuals context helps to call 'complicated to call' variants.
Are these assumptions correct?
And if yes, why do I get these results?

If someone checked the numbers:
The concordance variants don't add up to the difference of the sets:
7335 + 978 = 11270 - 2957
What are these 2957 variants???
The disconcordances from both sides (switching -V and -disc) should add up to the difference of two sets, right?
And the two concordances (from both sides, switching -V and -conc) should be the same.
A ∩ B + A \ B + B \ A = A ∪ B
A \ B + B \ A = A ∪ B - A ∩ B
So.. ???

Apart from that: based on the Quality scores and the circumstances: Could one consider the disconcordance variants as very likely to be FP?

Thanks for clearing things up!
Alexander

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @AlexanderV
    Hi Alexander,

    Can you please post some example records that are in the Combined VCF and not in the Joint Genotyped VCF? Also, please post some records that are in the Joint Genotyped VCF and not in the Combined VCF. I suspect this has to do with the way some indels and mixed records are handled.

    Thanks,
    Sheila

  • AlexanderVAlexanderV BerlinMember
    edited September 2015

    Okay.

    I deleted the genotypes (too many).
    Following observation about them:

    For Combined VCF and not in the Joint Genotyped VCF:The variants seem always to be with genotypes from group A XOR group B.
    For the other group, it is ./.

    For Joint Genotyped VCF and not in the Combined VCF:
    All sites have at least ./.:0,0:0 and not just ./. like in the other set.

    Combined VCF and not in the Joint Genotyped VCF, B genotyped:

    1       472174  .       G       A       37.89   .       AC=2;AF=0.013;AN=160;DP=279;FS=0.000;InbreedingCoeff=0.0251;MLEAC=1;MLEAF=6.250e-03;MQ=60.00;QD=18.94;SOR=2.303;set=
    variant     GT:AD:DP:GQ:PL
    1       5139699 .       T       TA      36.35   .       AC=2;AF=0.048;AN=42;DP=123;FS=0.000;InbreedingCoeff=0.1972;MLEAC=1;MLEAF=0.024;MQ=60.00;QD=18.17;SOR=2.303;set=varia
    nt  GT:AD:DP:GQ:PL
    1       16608841        .       A       G       33.69   .       AC=1;AF=0.013;AN=76;BaseQRankSum=0.922;ClippingRankSum=-9.220e-01;DP=221;FS=0.000;InbreedingCoeff=-0.1148;MLEAC=1;MLEAF=0.013;MQ=22.49;MQRankSum=-1.495e+00;QD=4.81;ReadPosRankSum=1.50;SOR=0.223;set=variant   GT:AD:DP:GQ:PGT:PID:PL
    

    Combined VCF and not in the Joint Genotyped VCF, A genotyped:

    1       16979365        .       C       A       30.96   .       AC=1;AF=4.425e-03;AN=226;BaseQRankSum=0.736;ClippingRankSum=0.736;DP=198;FS=0.000;InbreedingCoeff=-0.1483;MLEAC=1;MLEAF=4.425e-03;MQ=45.75;MQRankSum=-7.360e-01;QD=10.32;ReadPosRankSum=-7.360e-01;SOR=1.179;set=variant2       GT:AD:DP:GQ:PL
    1       17293989        .       T       C       42.37   .       AC=2;AF=5.714e-03;AN=350;BaseQRankSum=0.736;ClippingRankSum=0.736;DP=291;FS=0.000;InbreedingCoeff=-0.1213;MLEAC=1;MLEAF=2.857e-03;MQ=60.00;MQRankSum=-7.360e-01;QD=14.12;ReadPosRankSum=0.736;SOR=1.179;set=variant2    GT:AD:DP:GQ:PGT:PID:PL
    1       17627011        .       AT      A       36.83   .       AC=1;AF=0.250;AN=4;BaseQRankSum=1.03;ClippingRankSum=-1.026e+00;DP=7;FS=0.000;MLEAC=1;MLEAF=0.250;MQ=29.12;MQRankSum=-1.026e+00;QD=9.21;ReadPosRankSum=0.00;SOR=0.693;set=variant2      GT:AD:DP:GQ:PL
    

    Joint Genotyped VCF and not in the Combined VCF:

    1       29495911        .       A       G       64.28   .       AC=2;AF=2.488e-03;AN=804;BaseQRankSum=1.38;ClippingRankSum=-6.700e-02;DP=2926;FS=0.000;InbreedingCoeff=-0.0477;MLEAC=2;MLEAF=2.488e-03;MQ=60.00;MQRankSum=0.762;QD=4.29;ReadPosRankSum=0.720;SOR=0.160  GT:AD:DP:GQ:PL
    1       37961206        .       A       G       37.50   .       AC=2;AF=2.475e-03;AN=808;BaseQRankSum=1.00;ClippingRankSum=1.67;DP=3523;FS=0.000;InbreedingCoeff=-0.0263;MLEAC=2;MLEAF=2.475e-03;MQ=60.00;MQRankSum=1.67;QD=1.63;ReadPosRankSum=1.18;SOR=0.061  GT:AD:DP:GQ:PL
    12      2246816 .       AAAC    A,<*:DEL>       2207.54 .       AC=3,27;AF=5.208e-03,0.047;AN=576;BaseQRankSum=-1.026e+00;ClippingRankSum=1.03;DP=1144;FS=0.000;InbreedingCoeff=0.2718;MLEAC=2,22;MLEAF=3.472e-03,0.038;MQ=60.00;MQRankSum=1.03;QD=26.60;ReadPosRankSum=0.00;SOR=1.504  GT:AD:DP:GQ:PGT:PID:PL
    

    For position 1:29495911 I checked in the Combined VCF (where it supposedly is missing) and it actually is not there.
    The surrounding variants are at ...05 and ...23.

    Tell me if you need something else.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @AlexanderV
    Hi Alexander,

    Hmm. I am not sure what is happening here, but I think it would be easier if you submit a bug report. http://gatkforums.broadinstitute.org/discussion/1894/how-do-i-submit-a-detailed-bug-report

    Thanks,
    Sheila

  • AlexanderVAlexanderV BerlinMember

    Done.
    The file is called 2015-09-10_error_variant_overlap.tar.bz2
    The reference has to be downloaded and modified a little:

    http://potato.plantbiology.msu.edu/data/PGSC_DM_v4.03_pseudomolecules.fasta.zip
    http://potato.plantbiology.msu.edu/data/PGSC_DM_v4.03_unanchored_regions_chr00.fasta.zip
    zcat PGSC_DM_v4.03_pseudomolecules.fasta.zip PGSC_DM_v4.03_unanchored_regions_chr00.fasta.zip | sed 's/ST4.03ch00/999/g' | sed 's/ST4.03ch0//g' | sed 's/ST4.03ch//g' > PGSC_DM_v4.03_all.renamed.fasta

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @AlexanderV
    Hi Alexander,

    Sorry to get to this so late. I just ran your test files, but unfortunately, I cannot replicate the issue without the original GVCFs. First, can I ask you to test this again with the latest version? If the issue still replicates, can you submit the GVCF snippets? I just need GVCF snippets from A and B.

    Thanks,
    Sheila

Sign In or Register to comment.