We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

Distribution of GQ called by HaplotypeCaller GenomeAnalysisTK-3.2-2

evagserraevagserra Cambridge, UKMember
edited April 2015 in Ask the GATK team

Hi GATK team,

I have recently performed an analysis to try and select a GQ cut-off to be applied to WES data post-VQSR (applied 99.9% to the data). The WES data was called using HaplotypeCaller GenomeAnalysisTK-3.2-2 (over ~3000) samples and VQSR was applied (using the same GATK version). To decide on a GQ threshold, I looked at the correlation (over different GQ filters applied to the WES data) of chip genotypes and the sequencing genotypes (~350 samples were both genotyped and sequenced). The genotype data has been QC'ed s normally is in GWAS. The correlation is just the r squared (r2) for each variant between 2 vectors: one with the 350 chip genotypes and the other with the 350 sequencing genotypes. I finally estimated the average r2 per GQ quality filter applied and also counted how many genotypes were being dropped (ie., no longer variant sites). The result of this is the following figure, which I think looks a bit odd and suggests that the GQ is perhaps multi-modal. Have you ever seen this or have any suggestions as to why this might be?
The blue line is the correlation (left y axis) and the green is the proportion of GTs dropped (right y axis). The x axis is the GQ filters applied to the data from 0 to 50.

The calling command line used was this:
-ERC GVCF -variant_index_type LINEAR -variant_index_parameter 128000 -L S04380110_Padded.interval_list

Many thanks,
Eva

image

Tagged:

Best Answers

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @evagserra, have you tried plotting the GQ values themselves, as well as some properties that influence GQ such as DP? This may shed some light on what are the subpopulations of variants and what the underlying factors might be.

  • evagserraevagserra Cambridge, UKMember

    Hi @Geraldine_VdAuwera, thanks for your quick reply - I haven't plotted the GQ and DP yet, will do that as soon as I can and update on it. Cheers, Eva

  • evagserraevagserra Cambridge, UKMember

    Hi @SteveL, great point! I did not call the files myself so just inspected the header and indeed, looks like the GATK command used GVCFGQBands=[5,20,60], two of which I can clearly see in this graph. However, what I do not understand is why is that showing up in the graph I've posted above, since the graph only includes sites that are variant in both vectors (chip and WES), so no 0/0 (which is the only sites to be affected by GVCFGQBands, right)?
    Thanks a lot for your input!

  • SteveLSteveL BarcelonaMember ✭✭

    Unfortunately your graph has diappeared @evagserra - can you post it again.

  • evagserraevagserra Cambridge, UKMember

    Here it is - are you able to view it?

  • SteveLSteveL BarcelonaMember ✭✭

    Hi Eva - visible there fine. I can't divine exactly what the explanation is, but I had a feeling it would be related to the banding. Have you checked to make sure that your variant positions in the post-VQSR VCF don't include any "0/0"?

  • evagserraevagserra Cambridge, UKMember

    Thanks both for your help! @Geraldine_VdAuwera, I do include sites which are variant in some samples but hom-ref (0/0) in others. The reason why I was surprised to see this effect in the graph is because I thought the banding would only be for sites which are hom-ref (0/0) in all samples., ie., if one sample is 0/1 (singleton) but all the others are 0/0, I thought these latter ones will not be affected by the banding (and would get their normal GQ value), but you are saying they are, is this correct? This is all very useful and very good to know.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    That's right, any samples that were hom-ref in the GVCF come into the joint genotyping with annotations that are derived from the band they were part of. So although the GQ will be recalculated for each sample during joint genotyping, the underlying metrics (such as DP) are those of the band, so we expect the GQ to be close to the band's GQ, if I'm not mistaken. One way to check whether this is what's actually happening would be to run HC in BP_RESOLUTION mode, which does not use bands at all, and compare the resulting distribution of GQs.

  • evagserraevagserra Cambridge, UKMember

    @Geraldine_VdAuwera, thanks Geraldine - very helpful! Will try that but at least is great to know why this is happening and it makes sense now.

Sign In or Register to comment.