Force GATK HaplotypeCaller/GenotypeGVCFs to report genotypes at a whitelist of sites, even if WT?

Hello,

We have a project generating WGS and WES data. We have nearly 1000 samples and currently perform one round of HaplotypeCaller/GenotypeGVCFs on the WGS data and one with the WES to produce two VCFs. We run CombineVariants on these to make our final VCF.

An inconvenient problem happens with this merge. If VCF 1 has coverage at a site, but all subjects are WT, that site is omitted from the VCF. Therefore when you CombineVariants, there is no difference between actual 'No Data', and 'all wild-type'.
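
To make the ambiguity concrete, here is a toy sketch in Python. It does not call GATK. The sample names, sites, and the `merge_variant_only` helper are all hypothetical, and the merge is only assumed to behave roughly like a union of sites in which samples absent from a record become no-calls (`./.`).

```python
# Toy model: each variant-only VCF is a dict of site -> {sample: genotype}.
# Sites where every sample is hom-ref (WT) are simply absent, as in the
# default GenotypeGVCFs output described above. All data here is made up.

NO_CALL = "./."


def merge_variant_only(vcf_a, samples_a, vcf_b, samples_b):
    """Union the sites of two variant-only call sets.

    A sample whose source VCF has no record at a site gets a no-call,
    which is the assumed merge behaviour for this illustration.
    """
    merged = {}
    for site in sorted(set(vcf_a) | set(vcf_b)):
        row = {}
        for sample in samples_a:
            row[sample] = vcf_a.get(site, {}).get(sample, NO_CALL)
        for sample in samples_b:
            row[sample] = vcf_b.get(site, {}).get(sample, NO_CALL)
        merged[site] = row
    return merged


# WGS call set: one variant at each of two sites.
wgs = {
    ("chr1", 100): {"WGS_1": "0/1"},
    ("chr1", 200): {"WGS_1": "1/1"},
}
# WES call set: covered chr1:100 but every subject was WT, so the site was
# omitted; chr1:200 had no coverage at all. Both cases leave no record.
wes = {}

merged = merge_variant_only(wgs, ["WGS_1"], wes, ["WES_1"])
```

In `merged`, the WES sample is `./.` at both chr1:100 (covered, all WT) and chr1:200 (no data), so the two situations cannot be told apart after the merge.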

For our purposes, we have a whitelist of sites where we would like to force genotypes to get reported (including if they are all WT). Is there a mechanism in the GATK tools to do this? I assume it would need to occur in HaplotypeCaller/GenotypeGVCFs, since after this point that information is lost.

Thanks,
Ben

Answers

  • bhanuGandham (Cambridge, MA) Member, Administrator, Broadie, Moderator

    Hi @bbimber

    Take a look at this doc: https://software.broadinstitute.org/gatk/documentation/article?id=11004

    This should help you with your question.

  • bbimber (Home) Member

    Thank you, but I don't think this guide quite answers it. Yes, I understand the gVCFs contain information about all sites. Nonetheless, when running GenotypeGVCFs, the default output will only contain sites where at least one subject is non-reference. Therefore the set of sites being output will vary based on the set of subjects (different combinations of subjects have different sets of non-ref sites). While in theory one could run GenotypeGVCFs on all the subjects at once, in practice we're finding that calling 1000s of subjects at once simply isn't practical.

    I realize I missed this initially, but I see GenotypeGVCFs does support "--includeNonVariantSites", which seems like part of what we need. If the description is accurate, this would cause the VCF to have all callable sites, including ones where all samples are non-variant. For our purposes, this might work, but it would likely create a huge VCF. I'm going to explore (outside this thread) whether we can put in a PR to let the user provide a whitelist of sites to output, rather than simply including every callable one.
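
Until such an option exists, one possible workaround is to post-filter the all-sites VCF down to the whitelist. The following is a minimal Python sketch, not a GATK feature: `load_whitelist` and `filter_vcf_lines` are hypothetical helper names, the whitelist is assumed to be a text file with one whitespace-separated `chrom pos` pair per line, and the input is assumed to be an uncompressed VCF produced with "--includeNonVariantSites".

```python
def load_whitelist(lines):
    """Parse 'chrom pos' lines into a set of (chrom, pos) tuples.

    Blank lines and lines starting with '#' are skipped. The two-column
    format is an assumption made for this sketch.
    """
    sites = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        chrom, pos = line.split()[:2]
        sites.add((chrom, int(pos)))
    return sites


def filter_vcf_lines(vcf_lines, whitelist):
    """Yield header lines plus records whose (CHROM, POS) is whitelisted.

    Only the record's start position is checked, so a whitelisted site
    that falls inside a multi-base record starting earlier is not matched.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # keep all header lines unchanged
            continue
        chrom, pos = line.split("\t", 2)[:2]
        if (chrom, int(pos)) in whitelist:
            yield line


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python filter_sites.py whitelist.txt < all_sites.vcf
    if len(sys.argv) == 2:
        with open(sys.argv[1]) as handle:
            wanted = load_whitelist(handle)
        sys.stdout.writelines(filter_vcf_lines(sys.stdin, wanted))
```

This keeps non-variant records at whitelisted positions, so a later merge can distinguish "all WT" from "no data" at those sites, but it still requires writing the full all-sites VCF first.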
