CalculateGenotypePosteriors produces a bunch of zero-coverage variants

Martin_Ziegler · Munich, Germany · Member

Hi GATK Team,

we are running small targeted panels on GATK4. It seems most of the variants (~90%) are DP=0 variants that appear after applying CalculateGenotypePosteriors. Before this step we run VQSR together with over thirty exomes. Should we use external databases for CalculateGenotypePosteriors?

This is what we do now:

GATK version 4.0.4.0
${tool_gatk} --java-options "${javaArg_xms} ${javaArg_xmx}" CalculateGenotypePosteriors \
    -R ${reference} \
    -V ${outDirectory}/${variant_vcf}.vcf \
    -O ${outDirectory}/${variant_vcf}.postCGP.vcf \
    --supporting ${knownsite_hapmap} \
    --pedigree ${pedigree}

Greetings
Martin

Answers

  • bhanuGandham · Cambridge MA · Member, Administrator, Broadie, Moderator · admin

    Hi @Martin_Ziegler

    CalculateGenotypePosteriors does not affect the DP of variants. There might be something happening before this step that's causing 90% of the variants to have 0 depth. Please look at the tool docs for information about CalculateGenotypePosteriors.

    Regards
    Bhanu

  • Martin_Ziegler · Munich, Germany · Member

    Hi @bhanuGandham

    DP is not modified by CalculateGenotypePosteriors. What I mean is that we have thousands of new "heterozygous" variants with DP = 0 after applying CalculateGenotypePosteriors. Before that step, the genotype of these variants was ./.. In some cases, 0/0 genotypes with DP > 0 switch to a heterozygous state even with no alt read in the region. I think CalculateGenotypePosteriors in our configuration predicts a lot of variants with no evidence.
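
    One way to quantify this, as a rough sketch (assuming bcftools is available; the file name is a placeholder):

    # Count records where some sample is called het with zero depth
    # ('&' requires both conditions to hold in the same sample)
    bcftools view -H -i 'GT="het" & FMT/DP=0' post.vcf | wc -l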

    Best regards
    Martin

  • bhanuGandham · Cambridge MA · Member, Administrator, Broadie, Moderator · admin

    Hi @Martin_Ziegler

    Would you please send me a few variant records from before and after the CalculateGenotypePosteriors step?

    Regards
    Bhanu

  • Martin_Ziegler · Munich, Germany · Member

    Hi @bhanuGandham

    I uploaded two chr22 VCFs from before ("pre") and after ("post") applying CalculateGenotypePosteriors. Some specs:
    Number of variants:
    pre: 2334
    post: 35020

    Thank you for your help
    Martin

  • bhanuGandham · Cambridge MA · Member, Administrator, Broadie, Moderator · admin

    Hi @Martin_Ziegler

    I am looking into this issue and will have an update for you by next week. Given the holiday week we are backed up on our end, but I will definitely get to this by next week.

    Regards
    Bhanu

  • bhanuGandham · Cambridge MA · Member, Administrator, Broadie, Moderator · admin

    Hi @Martin_Ziegler

    It seems that CalculateGenotypePosteriors predicts a lot of variants with no evidence. The problem can NOT be fixed easily, since filtering out the "wrong" variants manually is not easy. At this stage it seems to me that:
    1. the "pre" VCF is fine
    2. the "post" VCF has a lot of variants which are called without evidence.
    I think we should investigate this further. The only way to reproduce the problem is to get the files (or the part of each file corresponding to chr22) that you used as --supporting evidence and as --pedigree.

    Would you be willing to share those two files?
    To share the files, please follow the steps here.

    Regards
    Bhanu

  • Martin_Ziegler · Munich, Germany · Member

    Hi @bhanuGandham,
    we use a HapMap VCF from the hg38 resource bundle (ftp://ftp.broadinstitute.org/bundle/hg38/) and a pedigree file for our exome trios. Only a few samples are exome trios; most of the data are small panels.

    Greetings
    Martin

  • bhanuGandham · Cambridge MA · Member, Administrator, Broadie, Moderator · admin

    Hi @Martin_Ziegler

    We are looking into this and will get back to you.

  • Martin_Ziegler · Munich, Germany · Member

    Hi @bhanuGandham,

    I ran some more experiments. It seems the HapMap file has only a small impact on the number of predicted low-evidence variants. The pedigree file has no impact at all. I ran CalculateGenotypePosteriors without --supporting and got similar results. In conclusion, it has to be the tool itself or our exome cohort for VQSR.
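
    A sketch of such a stripped-down run, with placeholder paths:

    gatk CalculateGenotypePosteriors \
        -R reference.fasta \
        -V pre.vcf \
        -O post.noSupporting.vcf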

    Greetings from Munich
    Martin

  • bhanuGandham · Cambridge MA · Member, Administrator, Broadie, Moderator · admin

    Hi @Martin_Ziegler

    Thanks for the feedback. I will send this information to the dev team and get back to you with what they have to say. Thank you for your patience.

  • bhanuGandham · Cambridge MA · Member, Administrator, Broadie, Moderator · admin

    Hi @Martin_Ziegler

    We ran the exact commands on our end but got a very different result.
    For me, CalculateGenotypePosteriors does not introduce any new "no evidence" variants.
    Here are the commands I ran:

    cat user_pre.vcf | grep chr22 | wc -l
    2336

    cat user_post.vcf | grep chr22 | wc -l
    35022

    gatk CalculateGenotypePosteriors -R $hg38 -V user_pre.vcf -O test_post.vcf --supporting hapmap_3.3.hg38.vcf_remodel_chr22.vcf --pedigree pedigree.ped

    cat test_post.vcf | grep chr22 | wc -l
    2336

    My conclusion is that the tool CalculateGenotypePosteriors is fine and maybe you have an incorrect setup/installation.
    Would you please check and verify your results against mine?
    Thank you.

  • Martin_Ziegler · Munich, Germany · Member

    Hi @bhanuGandham,

    the single-sample VCF that I uploaded was extracted from a multisample VCF (SelectVariants with the --exclude-non-variants and --remove-unused-alternates options). Variants with no evidence are only introduced when I run CalculateGenotypePosteriors on the multisample VCF. I am really sorry for that! I uploaded the "pre" multisample file here for you (chr22 only); the sample name that I checked was GIABGI7a-AG_v7a-181023_NB551378_0043_AHF57JBGX7. The "post" file is slightly too big to upload here. If you need it for troubleshooting, I can try to upload it to your ftp server.
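
    For reference, the extraction step looked roughly like this (the reference and file names are placeholders):

    gatk SelectVariants \
        -R reference.fasta \
        -V multisample.vcf \
        --sample-name GIABGI7a-AG_v7a-181023_NB551378_0043_AHF57JBGX7 \
        --exclude-non-variants \
        --remove-unused-alternates \
        -O singlesample.vcf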

    Greetings
    Martin

  • Martin_Ziegler · Munich, Germany · Member

    Additionally, I uploaded the two datasets to your ftp server:
    folder: martin_ziegler

    thanks
    Martin

  • Martin_Ziegler · Munich, Germany · Member

    Hi @bhanuGandham,

    we are calling the WES samples for the VQSR cohort without a BED file for the target regions, so there should be a lot of low-coverage variants in the VCF. Is it possible that these low-coverage variants affect CalculateGenotypePosteriors and lead to the observed "no evidence" variants?
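
    One way to test that hypothesis would be to restrict the cohort VCF to the panel's target regions before the refinement step; a sketch, assuming a targets.bed file and placeholder names:

    gatk SelectVariants \
        -R reference.fasta \
        -V cohort.vcf \
        -L targets.bed \
        -O cohort.on_target.vcf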

    Merry Christmas and a happy New Year to your team from Germany.
    You're doing a great job.

    Martin

  • AdelaideR · Unconfirmed, Member, Broadie, Moderator · admin

    Hi @Martin_Ziegler, Happy Christmas and New Year to your team as well. We are about to take a short holiday break, so we will be back at troubleshooting in the New Year.

  • Martin_Ziegler · Munich, Germany · Member

    Hi @bhanuGandham,
    this is my last day on the job; I am starting a new one in January 2019. Therefore, I have to pass this question on to my colleague @dbecker.

    Happy new year
    Martin

  • bhanuGandham · Cambridge MA · Member, Administrator, Broadie, Moderator · admin

    @Martin_Ziegler Thank you for using GATK and I hope you continue to do so in the future. All the best with the new job!

  • bhanuGandham · Cambridge MA · Member, Administrator, Broadie, Moderator · admin

    @dbecker

    Do you have questions related to this topic, and how would you like to proceed?

  • dbecker · Munich · Member ✭✭

    Hi @bhanuGandham,

    Happy new year!

    The question did not change. Martin uploaded new files, which are multi-sample VCFs. We believe that the variants that occur in only a few of these samples introduce the 'no evidence' variants in the others. We definitely produce them by running CalculateGenotypePosteriors. As long as we can't figure out why they appear, we can't use the refinement workflow.

    Best,
    Daniel

    @Martin_Ziegler said:
    Hi @bhanuGandham,

    the single-sample VCF that I uploaded was extracted from a multisample VCF (SelectVariants with the --exclude-non-variants and --remove-unused-alternates options). Variants with no evidence are only introduced when I run CalculateGenotypePosteriors on the multisample VCF. I am really sorry for that! I uploaded the "pre" multisample file here for you (chr22 only); the sample name that I checked was GIABGI7a-AG_v7a-181023_NB551378_0043_AHF57JBGX7. The "post" file is slightly too big to upload here. If you need it for troubleshooting, I can try to upload it to your ftp server.

    Greetings
    Martin

  • gauthier · Member, Broadie, Moderator, Dev · admin
    Accepted Answer

    Hi Daniel (@dbecker),

    The priors for CalculateGenotypePosteriors can come from three different sources, depending on the command line and the number of input samples: the supporting VCF (if supplied), the pedigree (if supplied), and the allele counts of the input samples themselves (if the number of samples is at least 10). Given that your VCF has 70 samples, the ACs from your cohort's genotypes will go into the prior calculation unless you specify otherwise with --ignore-input-samples.

    For the homozygous reference genotypes with DP > 0 and no alt alleles, I'm not convinced that CalculateGenotypePosteriors is doing the wrong thing. For example, if there are only 4 reads and all of them are reference, the most likely genotype is homRef, but the heterozygous genotype is not so unlikely. Once priors are applied based on the allele frequency in an external population, from the input cohort, or from the pedigree, it's perfectly reasonable for a variant with a very high allele frequency to change genotype to heterozygous. The mechanism behind the math is undersampling of variant reads. Given no reference bias (i.e. 50% probability of sampling from either haplotype), a binomial model shows that for four reads, 1 in 16 times they will all be sampled from the reference. Over a whole callset with millions of variants, that's a lot of incorrectly genotyped variants. The goal of using prior probabilities is to leverage orthogonal data to try to correct for undersampling or sequencing errors.
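
    To spell out the arithmetic behind that 1-in-16 figure: with no reference bias, each read is drawn from the reference haplotype with probability 1/2, so

    P(all 4 reads reference | true het) = (1/2)^4 = 1/16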

    The way the tool works is that if the genotype has a PL field, then it gets posteriors. Genotypes with no depth didn't used to get PLs, but that must have changed when we moved to HaplotypeCaller GVCF mode. So one issue with my implementation is that no-calls with no depth should remain no-calls, even if they have PLs -- would you agree that that is the expected behavior? The other issue is that I'm looking at the total number of samples to determine whether the input cohort AC should be used, but I should be using the number of called samples at each site. That's why Martin's experiments with excluding the supporting file and the pedigree still yielded modified genotypes. (It's actually a pretty egregious bug. For example, at chr22:10510718 the variant is AC=8, AN=8, which leads to a weak prior favoring genotypes with alternate alleles, but strong enough to change a PL=0,0,0 genotype to a het.) I will fix this, but a good way to protect against this behavior is to supply --num-reference-samples-if-no-call, as in the example command with --supporting 1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf.gz in the docs. If that argument is used, then if a variant is missing from the supporting callset, it's assumed that that site was reference in all the samples in the supporting callset. I don't remember off the top of my head what the number of samples in the HapMap resource is, but you can find it by taking the max AN in the callset and dividing by two.
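
    A sketch of that lookup, assuming bcftools is available and using the bundle's HapMap file name:

    # Largest AN in the supporting callset; divide by 2 to get the sample count
    bcftools query -f '%INFO/AN\n' hapmap_3.3.hg38.vcf.gz | sort -n | tail -1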

    We're planning a GATK 4.1 release, hopefully by the end of the month, and I'll make sure this bug fix makes it in, so you could wait to rerun CalculateGenotypePosteriors until then. For best results, even after the fix, I would suggest using --num-reference-samples-if-no-call, since that resource is WGS-based and applying priors at every site is more consistent. As the tool is now, using that arg should prevent a lot of the spurious variants in your callset, but the no-call genotypes will still become calls, though mostly homRefs. You can also use --ignore-input-samples without --num-reference-samples-if-no-call, and the no-call genotypes at sites where there is no variant in HapMap should stay no-calls because there will be no information to calculate the prior. Hopefully one of those options works for you.
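
    A sketch of that protective invocation (file names are placeholders, and the sample count of 2504 for the 1000G phase 3 resource is an assumption to verify with the max-AN check above):

    gatk CalculateGenotypePosteriors \
        -R reference.fasta \
        -V input.vcf \
        -O input.postCGP.vcf \
        --supporting 1000G.phase3.integrated.sites_only.no_MATCHED_REV.hg38.vcf.gz \
        --num-reference-samples-if-no-call 2504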

    Best,
    Laura

  • Martin_Ziegler · Munich, Germany · Member

    Hi @gauthier, @dbecker,

    New job, still working with GATK. Thank you for managing this issue. I completely agree with you ("no-calls with no depth should remain no-calls, even if they have PLs"). I'm really looking forward to working with GATK 4.1.

    Best wishes from Munich

    Martin

  • dbecker · Munich · Member ✭✭

    Hi @Martin_Ziegler,

    Nice to see that you are still working on the same challenges. You can fill me in when you solve it. ;)

    Best,
    Daniel

  • gauthier · Member, Broadie, Moderator, Dev · admin

    Great, @Martin_Ziegler! I wanted to confirm the expected behavior before trying to address this, but I can probably throw together a PR on Friday. Unfortunately our Docker nightly build isn't working right now. Can you build your own GATK jar from master? Otherwise we could cut a new release, but it might take a little bit longer.

  • Martin_Ziegler · Munich, Germany · Member

    Hi Laura @gauthier,
    that's not urgent, especially since @dbecker is the man in charge now. ;)

    All the best
    Martin

  • gauthier · Member, Broadie, Moderator, Dev · admin

    Hi @dbecker ,

    I just merged a fix for this. Let me know if you still have issues.
