Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Attention:
We will be out of the office for a Broad Institute event from Dec 10th to Dec 11th 2019. We will be back to monitor the GATK forum on Dec 12th 2019. In the meantime we encourage you to help out other community members with their queries.
Thank you for your patience!

Weird VQSR filtering pattern

Hi

I performed HC joint calling against 300 normal tissue samples. I did every step according to the best practice and got a VQSR plot very similar to this one. Most variants have the same MQ and FS value and make these feature distribution non-gaussian. Would such result be still valid? Can anyone share some insight?

Cheng

Answers

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @UniCorn

    Please post the version of GATK you are using, the exact command and the plots showing the weird VQSR filtering pattern.

  • UniCornUniCorn USMember

    The version is GATK4.0.8.1

    command line is :
    /biocluster/data/bioexec/software/gatk-4.0.8.1/gatk VariantRecalibrator \
    -R $REF \
    -V $input_vcf \
    -tranche 100.0 -tranche 99.95 -tranche 99.9 -tranche 99.8 -tranche 99.6 -tranche 99.5 -tranche 99.4 \
    -tranche 99.3 -tranche 99.0 -tranche 98.0 -tranche 97.0 -tranche 90.0 \
    --resource hapmap,known=false,training=true,truth=true,prior=15.0:$ref_dir/hapmap_3.3.hg19.sites.vcf \
    --resource omni,known=false,training=true,truth=false,prior=12.0:$ref_dir/1000G_omni2.5.hg19.sites.vcf \
    --resource dbsnp,known=true,training=false,truth=false,prior=2.0:$dbsnp_vcf \
    -an MQRankSum -an ReadPosRankSum -an BaseQRankSum -an SOR -an MQ -an FS -an DQ \
    -mode SNP \
    -O $outdir/$output_prefix.snp.recal \
    --tranches-file $outdir/$output_prefix.snp.tranches \
    --rscript-file $outdir/$output_prefix.snp.plots.R

    I guess the reason is that MQ, FS and DQ does not follow Gaussian distribution (shows as very sharp peak with little standard deviation). Should I get ride of these non-Gaussian features? Is there a role of thrum regarding which features should be included? Thanks

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @UniCorn

    Can you also please post the VQSR plots that are showing the weird results.

  • UniCornUniCorn USMember
    edited October 18

    Thanks for getting back to me . Please see attached plot below. The problem is more severe with MQRankSum and SOR. Rest of features including ReadPosRankSum and BaseQRankSum are pretty normally distributed.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @UniCorn

    Maybe try plotting MQRankSum and SOR density functions to see why those are more severe than other annotations. Also if VQSR filtering is not working for your data you could try hard filtering as described here: https://software.broadinstitute.org/gatk/documentation/article?id=11069

  • UniCornUniCorn USMember

    I went back to bam file and found that most of the alignments have mapping quality (MQ) of 60 which makes most of the MQRankSum as exactly 0. If that is the reason, I guess it is OK use VQSR, right? Those variants having MQRankSum of 0 will be label pass anyway, right?

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
    edited November 2

    Hi @UniCorn

    Yes I think that explains it and in that case you should be able to use vqsr. What I am wondering is why all/most the reads have MQ of 60. What kind of sequencing data are you using? And have the variants been filtered before?

  • UniCornUniCorn USMember

    We used NovoSeq platform / typical GATK best practice pipeline and nothing else. My impression is that it is common to have MQ of 60 for the most of the alignments when we have raw data of good quality, right? If my impression is correct, this will make the MQRankSum non-normally distributed. So I am kind of confused...

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @UniCorn

    I was wrong, I apologize. This data and the plots look fine. You are right, the plots look concentrated because most of the MQRankSum are exactly 0. This is fine and not something to worry about and yes you can still VQSR for this data.

Sign In or Register to comment.