
InbreedingCoeff in VCF not matching my calculation

Initially I was going to ask about why I might be seeing some InbreedingCoeff values less than -1 (based on the calculation as explained at, I believe it should always be between -1 and 1). But then I checked a random sample of 10,000 InbreedingCoeff values from my VCF against the values I calculate myself, and I see a strange mismatch generally, not only in the IC values given by GATK as less than -1. Attached is my code and a plot, with the y = x line in blue and y = -1 in red. I see the same pattern in another dataset which was produced by the same GATK-based pipeline.

The VCF referred to in the attached document is generated from 157 exome-capture samples (unrelated individuals) using GATK 3.6 with JDK 1.8.0. We use HaplotypeCaller on each sample, then GenotypeGVCFs on the collected results, then VariantRecalibrator/ApplyRecalibration with the recommended parameters/resources. I can provide the full commands if helpful.

Any ideas about why there is this mismatch? Is there something I'm misunderstanding about the InbreedingCoeff values?
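For readers following along, here is a minimal sketch of the kind of check described above: computing an inbreeding coefficient from hard genotype calls as 1 - (observed hets / expected hets under Hardy-Weinberg). This is not the attached code and not GATK's implementation. The function names are hypothetical, and the handling of multi-allelic genotypes (any two different alleles count as a het, any two identical non-reference alleles count as hom-alt) is a simplifying assumption.

```python
def count_genotypes(gts):
    """Count hom-ref, het, and hom-alt calls from VCF GT strings.

    Assumption (not taken from GATK): all alternate alleles are lumped
    together, so "0/2" is a het and "2/2" is hom-alt. No-calls are skipped.
    """
    hom_ref = het = hom_alt = 0
    for gt in gts:
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles or len(alleles) != 2:
            continue  # skip no-calls and non-diploid genotypes
        a, b = alleles
        if a != b:
            het += 1
        elif a == "0":
            hom_ref += 1
        else:
            hom_alt += 1
    return hom_ref, het, hom_alt


def inbreeding_coeff(n_hom_ref, n_het, n_hom_alt):
    """Return 1 - observed_het / expected_het, or None if undefined."""
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        return None
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1 - p
    expected_het = 2 * p * q * n           # Hardy-Weinberg expectation
    if expected_het == 0:
        return None                        # monomorphic site
    return 1 - n_het / expected_het


counts = count_genotypes(["0/0", "0/1", "1/1", "./.", "0|2", "2/2"])
print(counts, inbreeding_coeff(*counts))
```

With this formula the most negative value occurs when every sample is a het (p = q = 0.5, so the expected count is half the observed count), which gives exactly -1. That is why values below -1 look suspicious.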


Best Answer


  • lopezc Member
    edited March 2018

    I have found that part of the issue here is multithreading.

    I ran GenotypeGVCFs once with the options our pipeline currently uses, including "-nt 6". Then I ran exactly the same command but without "-nt 6". The InbreedingCoeff values in the output VCF are sometimes < -1 when using "-nt 6" but not when using a single thread.

    There is still a mismatch between the IC values calculated by GenotypeGVCFs and the ones I calculate based on the genotype counts. It's not a severe mismatch, but I'm still curious to know why it's there.

    Please see the attached documents for plots of IC values from single-threaded and multi-threaded GenotypeGVCFs.

  • In the previously posted documents, I was failing to count alternate alleles "2" and "3". Making the correction only slightly improves the match in IC values, though. See attached.

  • Sheila Broad Institute Member, Broadie, Moderator admin

    Hi Camden,

    Interesting. Those graphs show a pretty wide range of InbreedingCoeff values for multithreading. I don't think that is expected, but I am also not sure whether InbreedingCoeff changed in newer versions. To test this, can you please try with the latest GATK4? If this is a bug in GATK3, I don't know if it will be fixed, as efforts are all in GATK4 now.


  • lopezc Member
    edited April 2018

    Hi @Sheila,

    It looks like multi-threading must be done with Spark when using GATK 4. I can look into setting up GATK 4 and Spark and checking the InbreedingCoeff values, but it probably won't happen quickly. (Maybe someone else reading this can post whether their IC values look good when using GATK 4 with multi-threading.)

    Meanwhile, I can add that among the annotations we're using for VQSR, there are a few QD values that also mismatch between the single-threading and multi-threading output. All of the other annotations besides InbreedingCoeff values match.

    Also, I looked at the GATK 3 source code, and it looks like InbreedingCoeff is calculated using genotype likelihoods (normalized to sum to 1), not "hard" genotype calls --- in other words, the counts of hom-ref, het, and hom-alt used to calculate IC are actually sums across all samples of likelihood-based weights between 0 and 1 --- and that probably explains why there is a discrepancy between the IC value from single-threaded GATK and the value I calculate based on the genotype calls. The values match exactly for variants with high-confidence calls, where the likelihood weights are basically all 0 or 1. The mismatches appear to occur among variants with lower-confidence calls.
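To illustrate the explanation above, here is a rough sketch of how likelihood-weighted ("soft") genotype counts could produce an IC that differs from one based on hard calls. It is not the GATK source: the function names are hypothetical, and it assumes biallelic sites with phred-scaled PL triples ordered hom-ref, het, hom-alt.

```python
def soft_counts(pls):
    """Sum normalized genotype likelihoods across samples.

    Each element of `pls` is a (hom-ref, het, hom-alt) PL triple.
    PLs are phred-scaled, so likelihood = 10 ** (-PL / 10).
    """
    hom_ref = het = hom_alt = 0.0
    for pl in pls:
        lik = [10 ** (-x / 10) for x in pl]
        total = sum(lik)
        hom_ref += lik[0] / total  # each weight is between 0 and 1
        het += lik[1] / total
        hom_alt += lik[2] / total
    return hom_ref, het, hom_alt


def inbreeding_coeff(n_hom_ref, n_het, n_hom_alt):
    """1 - observed_het / expected_het; counts may be fractional."""
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        return None
    p = (2 * n_hom_ref + n_het) / (2 * n)
    expected_het = 2 * p * (1 - p) * n
    if expected_het == 0:
        return None
    return 1 - n_het / expected_het


# Confident calls: the weights are essentially 0 or 1, so soft and hard agree.
confident = [[0, 255, 255]] * 2 + [[255, 0, 255]] * 4 + [[255, 255, 0]] * 2
print(inbreeding_coeff(*soft_counts(confident)), inbreeding_coeff(2, 4, 2))

# Low-confidence hom-ref calls (PL 0,3,30): part of each sample's weight
# shifts to het, so the soft IC drifts away from the hard-call IC.
uncertain = [[0, 3, 30]] * 4 + [[30, 0, 30]] * 4
print(inbreeding_coeff(*soft_counts(uncertain)), inbreeding_coeff(4, 4, 0))
```

In this toy example the two values agree for the confident calls and diverge for the low-confidence ones, which is the same pattern as in the plots.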
