Any change in the recommendation for genotype-based QC of GVCF-called variants?


For QC, our long-standing practice has been to set low-quality genotypes (low DP and/or GQ) to missing, e.g. DP < 8 or GQ < 20.

We noticed that, for variants called from GVCF files produced by HaplotypeCaller, these annotations for homozygous-reference (0/0) genotypes no longer describe the variant site itself but the individual's whole non-variant block: DP appears to reflect the MIN_DP of the block (as far as we can tell), and likewise GQ.

Is it still advisable to do this kind of genotype-based QC on GQ and DP? If not, what further QC would you advise beyond VQSR? This has a substantial impact on our protocol for discovering de novo variants.
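For concreteness, the masking practice described above can be sketched as a small Python routine that sets genotypes below the DP/GQ thresholds to missing. This is an illustrative sketch over a bare VCF data line, not a GATK tool; the helper name and the simplistic field parsing are assumptions.

```python
# Minimal sketch: set genotypes with low DP or GQ to missing ("./.").
# Illustrative only -- real pipelines would use a proper VCF library
# or GATK's own filtering tools rather than hand-parsing lines.

MIN_DP = 8   # thresholds from the question: DP < 8 or GQ < 20
MIN_GQ = 20

def as_int(value, default=0):
    """Parse a FORMAT value, treating '.' or missing as the default."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return default

def mask_low_quality(vcf_line: str) -> str:
    """Return the VCF data line with low-confidence genotypes set to missing."""
    fields = vcf_line.rstrip("\n").split("\t")
    fmt_keys = fields[8].split(":")          # e.g. ["GT", "DP", "GQ"]
    for i in range(9, len(fields)):          # one column per sample
        sample = dict(zip(fmt_keys, fields[i].split(":")))
        if as_int(sample.get("DP")) < MIN_DP or as_int(sample.get("GQ")) < MIN_GQ:
            sample["GT"] = "./."             # mask the genotype call
        fields[i] = ":".join(sample[k] for k in fmt_keys)
    return "\t".join(fields)

line = "1\t1000\t.\tA\tG\t50\tPASS\t.\tGT:DP:GQ\t0/1:30:99\t0/0:5:12"
print(mask_low_quality(line))  # second sample (DP=5) becomes ./.:5:12
```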


Best Answer


  • claratsm Member

    BTW, is the GQ of the 0/0 genotype the minimum or the average over the non-var block?

  • claratsm Member

    Yup, we are working on the VCF generated by the GenotypeGVCFs tool, not on the GVCF files themselves.

    We combine samples from multiple WES projects, mostly for rare variant association analysis. We use VQSR to decide whether a variant should be kept, but we are not sure how to handle the uncertainty in individual genotypes; e.g. a hom-ref call in a sample with low DP could be a false negative. That is why we used genotype filters to flag, or set to missing, those uncertain genotypes.

    In this GVCF mode, however, the DP/GQ for a hom-ref genotype (unlike het/hom-alt) seem to reflect not the coverage or quality at the variant site but the whole non-variant block. Am I right about that? And how exactly is a hom-ref genotype derived from a non-variant block, say GT:DP:GQ:MIN_DP:PL = 0/0:22:20:9:0,9,30? Would the hom-ref genotype take the block's DP, its MIN_DP, or an average?

    For unrelated samples, it seems that phasing-based genotype refinement is not very useful. Would you suggest another genotype refinement strategy for this scenario?

  • Geraldine_VdAuwera Cambridge, MA; Member, Administrator, Broadie

    You're correct that in the default (compressed) GVCF mode, some records are combined into blocks. Note that the blocked records are grouped into bands of similar GQ (see the FAQ on the GVCF format for more details), so you don't actually lose much granularity. But if you really want maximum granularity with zero compression (and have the storage space to spare), you can use -ERC BP_RESOLUTION instead of -ERC GVCF.

    You can refine the genotypes of unrelated samples using population-based imputation. GATK does not offer this as a post-hoc process (though it is essentially what GenotypeGVCFs does when joint-genotyping with data from a very large cohort, e.g. the ExAC dataset recently made public by Daniel MacArthur and collaborators), but you can use other software such as BEAGLE (from UWash, iirc).
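To see why GQ banding costs little granularity, here is a toy sketch of the compression idea: consecutive hom-ref sites whose GQs fall in the same band are merged into one block that records the band and the minimum depth. The band edges below are illustrative choices for this sketch, not GATK's actual defaults.

```python
# Toy sketch of GVCF-style GQ banding: consecutive hom-ref sites whose
# GQ falls in the same band collapse into one block carrying the band
# label and the minimum depth (a MIN_DP-like value). Edges illustrative.
from bisect import bisect_right

BAND_EDGES = [20, 60]  # three bands: GQ in [0,20), [20,60), [60,+inf)

def band(gq: int) -> int:
    """Index of the GQ band a genotype quality falls into."""
    return bisect_right(BAND_EDGES, gq)

def compress(sites):
    """sites: list of (pos, dp, gq) for consecutive hom-ref positions."""
    blocks = []
    for pos, dp, gq in sites:
        b = band(gq)
        if blocks and blocks[-1]["band"] == b and blocks[-1]["end"] == pos - 1:
            blocks[-1]["end"] = pos                       # extend the block
            blocks[-1]["min_dp"] = min(blocks[-1]["min_dp"], dp)
        else:
            blocks.append({"start": pos, "end": pos, "band": b, "min_dp": dp})
    return blocks

sites = [(100, 25, 70), (101, 22, 65), (102, 9, 30), (103, 12, 35)]
print(compress(sites))  # four sites compress into two banded blocks
```

Within a block, per-site GQ detail is lost, but the band bounds it and MIN_DP bounds the depth, which is why the banded representation still supports threshold-style QC.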
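On the questioner's DP-vs-MIN_DP concern: since a banded hom-ref genotype's DP may summarize the whole block while MIN_DP is the block's guaranteed floor, a conservative genotype-QC pass can filter hom-ref calls on MIN_DP when it is present and fall back to DP otherwise. A minimal sketch of that idea, with the thresholds from the question (the helper and its field handling are illustrative assumptions, not GATK behavior):

```python
# Sketch: conservative QC for a hom-ref genotype from a non-variant block.
# MIN_DP, when present, is the block's depth floor, so filtering on it is
# the conservative choice; otherwise fall back to DP. Illustrative only.

MIN_DP_THRESHOLD = 8
MIN_GQ_THRESHOLD = 20

def hom_ref_passes(sample_fields: dict) -> bool:
    """Decide whether a hom-ref genotype survives QC.

    sample_fields maps FORMAT keys to values, e.g. the block from the
    question: GT:DP:GQ:MIN_DP:PL = 0/0:22:20:9:0,9,30.
    """
    depth = int(sample_fields.get("MIN_DP", sample_fields.get("DP", "0")))
    gq = int(sample_fields.get("GQ", "0"))
    return depth >= MIN_DP_THRESHOLD and gq >= MIN_GQ_THRESHOLD

block = {"GT": "0/0", "DP": "22", "GQ": "20", "MIN_DP": "9", "PL": "0,9,30"}
print(hom_ref_passes(block))  # True: MIN_DP=9 >= 8 and GQ=20 >= 20
```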
