We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

InbreedingCoeff in VCF not matching my calculation

Initially I was going to ask about why I might be seeing some InbreedingCoeff values less than -1 (based on the calculation as explained at https://software.broadinstitute.org/gatk/documentation/article.php?id=8032, I believe it should always be between -1 and 1). But then I checked a random sample of 10,000 InbreedingCoeff values from my VCF against the values I calculate myself, and I see a strange mismatch generally, not only in the IC values given by GATK as less than -1. Attached is my code and a plot, with the y = x line in blue and y = -1 in red. I see the same pattern in another dataset which was produced by the same GATK-based pipeline.

The VCF referred to in the attached document is generated from 157 exome-capture samples (unrelated individuals) using GATK 3.6 with JDK 1.8.0. We use HaplotypeCaller on each sample, then GenotypeGVCFs on the collected results, then VariantRecalibrator/ApplyRecalibration with the recommended parameters/resources. I can provide the full commands if helpful.

Any ideas about why there is this mismatch? Is there something I'm misunderstanding about the InbreedingCoeff values?


Best Answer


  • lopezclopezc Member
    edited March 2018

    I have found that part of the issue here is multithreading.

    I ran GenotypeGVCFs once with the options our pipeline currently uses, including "-nt 6". Then I ran exactly the same command but without "-nt 6". The InbreedingCoeff values in the output VCF are sometimes < -1 when using "-nt 6" but not when using a single thread.

    There is still a mismatch between the IC values calculated by GenotypeVCFs and the ones I calculate based on the genotype counts. It's not a severe mismatch, but I'm still curious to know why it's there.

    Please see the attached documents for plots of IC values from single-threaded and multi-threaded GenotypeVCFs.

  • In the previously posted documents, I was failing to count alternate alleles "2" and "3". Making the correction only slightly improves the match in IC values, though. See attached.

  • SheilaSheila Broad InstituteMember, Broadie ✭✭✭✭✭

    Hi Camden,

    Interesting. Those graphs show a pretty wide range of InbreedingCoeff values for multithreading. I don't think that is expected, but I also am not sure if there were changes to InbreedingCoeff in newer versions. To test this, can you please try with latest GATK4? If this is a bug in GATK3, I don't know if it will be fixed, as efforts are all in GATK4 now.


  • lopezclopezc Member
    edited April 2018

    Hi @Sheila,

    It looks like multi-threading must be done with Spark when using GATK 4. I can look into setting up GATK 4 and Spark and checking the InbreedingCoeff values, but it probably won't happen quickly. (Maybe someone else reading this can post whether their IC values look good when using GATK 4 with multi-threading.)

    Meanwhile, I can add that among the annotations we're using for VQSR, there are a few QD values that also mismatch between the single-threading and multi-threading output. All of the other annotations besides InbreedingCoeff values match.

    Also, I looked at the GATK 3 source code, and it looks like InbreedingCoeff is calculated using genotype likelihoods (normalized to sum to 1), not "hard" genotype calls --- in other words, the counts of hom-ref, het, and hom-alt used to calculate IC are actually sums across all samples of likelihood-based weights between 0 and 1 --- and that probably explains why there is a discrepancy between the IC value from single-threaded GATK and the value I calculate based on the genotype calls. The values match exactly for variants with high-confidence calls, where the likelihood weights are basically all 0 or 1. The mismatches appear to occur among variants with lower-confidence calls.

  • zmarotizmaroti HUNGARYMember

    using -nt 4 at GenotypeGVCFs (GATK3.5) causes strange behaviour. I am pasting just the relevant part of one vcf entry where we have InbreedingCoeff < -0.8 when we have one heterozygote call of 906 alleles:

    Just pasting the relevant parts of the vcf entry
    1 987068 . G A 1408.16 PASS AC=1;AF=0.001104;AN=906;BaseQRankSum=0.894;ClippingRankSum=-0.449;DP=22636;ExcessHet=3.0103;FS=0.752;InbreedingCoeff=-0.851;MLEAC=1;MLEAF=0.001104;MQ=60;MQRankSum=-0.401;QD=13.94;ReadPosRankSum=-0.223;SOR=0.563;VQSLOD=3.91;culprit=MQ GT:AD:DP:GQ:PL 0/0:33,0:33:99:0,99,986 0/0:34,0:34:99:0,102,1025 0/0:36,0:36:99:0,108,1081 ....

    and here is the single heterozygote call in the vcf entries

    Without multithreading (just copied the entry from the g.vcf as the whole VQSR will take a while to complete) the calculated InbreedingCoeff value seems to be correct:
    1 987068 . G A 1408.16 . AC=1;AF=1.104e-03;AN=906;BaseQRankSum=0.894;ClippingRankSum=-4.490e-01;DP=22636;ExcessHet=3.0103;FS=0.752;InbreedingCoeff=-0.0011;MLEAC=1;MLEAF=1.104e-03;MQ=60.00;MQRankSum=-4.010e-01;QD=13.94;ReadPosRankSum=-2.230e-01;SOR=0.563 GT:AD:DP:GQ:PL 0/0:33,0:33:99:0,99,986 0/0:34,0:34:99:0,102,1025

    So I am not sure, whether it is true that it only affects low confidence calls. According to the various parameters this variant does not seems to be a low confidence variant. We have we have enough depth, we have unique regions with good quality reads (see IGV screenshot), but we would hard filter this variant as technically wrong or pseudogenic (all het) value if we follow the < -0.8 rule.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @zmaroti

    We do not support GATK3 anymore and -nt option has been deprecated in GATK4 because of various errors it caused.

    Please update to the latest version of GATK v4.1.2.0.

Sign In or Register to comment.