InbreedingCoeff in VCF not matching my calculation

Initially I was going to ask about why I might be seeing some InbreedingCoeff values less than -1 (based on the calculation as explained at https://software.broadinstitute.org/gatk/documentation/article.php?id=8032, I believe it should always be between -1 and 1). But then I checked a random sample of 10,000 InbreedingCoeff values from my VCF against the values I calculate myself, and I see a strange mismatch generally, not only in the IC values given by GATK as less than -1. Attached is my code and a plot, with the y = x line in blue and y = -1 in red. I see the same pattern in another dataset which was produced by the same GATK-based pipeline.

The VCF referred to in the attached document is generated from 157 exome-capture samples (unrelated individuals) using GATK 3.6 with JDK 1.8.0. We use HaplotypeCaller on each sample, then GenotypeGVCFs on the collected results, then VariantRecalibrator/ApplyRecalibration with the recommended parameters/resources. I can provide the full commands if helpful.

Any ideas about why there is this mismatch? Is there something I'm misunderstanding about the InbreedingCoeff values?

Thanks!

Best Answer

Answers

  • lopezclopezc Member
    edited March 2018

    I have found that part of the issue here is multithreading.

    I ran GenotypeGVCFs once with the options our pipeline currently uses, including "-nt 6". Then I ran exactly the same command but without "-nt 6". The InbreedingCoeff values in the output VCF are sometimes < -1 when using "-nt 6" but not when using a single thread.

    There is still a mismatch between the IC values calculated by GenotypeVCFs and the ones I calculate based on the genotype counts. It's not a severe mismatch, but I'm still curious to know why it's there.

    Please see the attached documents for plots of IC values from single-threaded and multi-threaded GenotypeVCFs.

  • In the previously posted documents, I was failing to count alternate alleles "2" and "3". Making the correction only slightly improves the match in IC values, though. See attached.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @lopezc
    Hi Camden,

    Interesting. Those graphs show a pretty wide range of InbreedingCoeff values for multithreading. I don't think that is expected, but I also am not sure if there were changes to InbreedingCoeff in newer versions. To test this, can you please try with latest GATK4? If this is a bug in GATK3, I don't know if it will be fixed, as efforts are all in GATK4 now.

    -Sheila

  • lopezclopezc Member
    edited April 2018

    Hi @Sheila,

    It looks like multi-threading must be done with Spark when using GATK 4. I can look into setting up GATK 4 and Spark and checking the InbreedingCoeff values, but it probably won't happen quickly. (Maybe someone else reading this can post whether their IC values look good when using GATK 4 with multi-threading.)

    Meanwhile, I can add that among the annotations we're using for VQSR, there are a few QD values that also mismatch between the single-threading and multi-threading output. All of the other annotations besides InbreedingCoeff values match.

    Also, I looked at the GATK 3 source code, and it looks like InbreedingCoeff is calculated using genotype likelihoods (normalized to sum to 1), not "hard" genotype calls --- in other words, the counts of hom-ref, het, and hom-alt used to calculate IC are actually sums across all samples of likelihood-based weights between 0 and 1 --- and that probably explains why there is a discrepancy between the IC value from single-threaded GATK and the value I calculate based on the genotype calls. The values match exactly for variants with high-confidence calls, where the likelihood weights are basically all 0 or 1. The mismatches appear to occur among variants with lower-confidence calls.

Sign In or Register to comment.