We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

Weird VQSR filtering pattern

Hi

I performed HC joint calling against 300 normal tissue samples. I did every step according to the best practice and got a VQSR plot very similar to this one. Most variants have the same MQ and FS value and make these feature distribution non-gaussian. Would such result be still valid? Can anyone share some insight?

Cheng

Answers

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @UniCorn

    Please post the version of GATK you are using, the exact command and the plots showing the weird VQSR filtering pattern.

  • UniCornUniCorn USMember

    The version is GATK4.0.8.1

    command line is :
    /biocluster/data/bioexec/software/gatk-4.0.8.1/gatk VariantRecalibrator \
    -R $REF \
    -V $input_vcf \
    -tranche 100.0 -tranche 99.95 -tranche 99.9 -tranche 99.8 -tranche 99.6 -tranche 99.5 -tranche 99.4 \
    -tranche 99.3 -tranche 99.0 -tranche 98.0 -tranche 97.0 -tranche 90.0 \
    --resource hapmap,known=false,training=true,truth=true,prior=15.0:$ref_dir/hapmap_3.3.hg19.sites.vcf \
    --resource omni,known=false,training=true,truth=false,prior=12.0:$ref_dir/1000G_omni2.5.hg19.sites.vcf \
    --resource dbsnp,known=true,training=false,truth=false,prior=2.0:$dbsnp_vcf \
    -an MQRankSum -an ReadPosRankSum -an BaseQRankSum -an SOR -an MQ -an FS -an DQ \
    -mode SNP \
    -O $outdir/$output_prefix.snp.recal \
    --tranches-file $outdir/$output_prefix.snp.tranches \
    --rscript-file $outdir/$output_prefix.snp.plots.R

    I guess the reason is that MQ, FS and DQ does not follow Gaussian distribution (shows as very sharp peak with little standard deviation). Should I get ride of these non-Gaussian features? Is there a role of thrum regarding which features should be included? Thanks

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @UniCorn

    Can you also please post the VQSR plots that are showing the weird results.

  • UniCornUniCorn USMember
    edited October 2019

    Thanks for getting back to me . Please see attached plot below. The problem is more severe with MQRankSum and SOR. Rest of features including ReadPosRankSum and BaseQRankSum are pretty normally distributed.

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @UniCorn

    Maybe try plotting MQRankSum and SOR density functions to see why those are more severe than other annotations. Also if VQSR filtering is not working for your data you could try hard filtering as described here: https://software.broadinstitute.org/gatk/documentation/article?id=11069

  • UniCornUniCorn USMember

    I went back to bam file and found that most of the alignments have mapping quality (MQ) of 60 which makes most of the MQRankSum as exactly 0. If that is the reason, I guess it is OK use VQSR, right? Those variants having MQRankSum of 0 will be label pass anyway, right?

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
    edited November 2019

    Hi @UniCorn

    Yes I think that explains it and in that case you should be able to use vqsr. What I am wondering is why all/most the reads have MQ of 60. What kind of sequencing data are you using? And have the variants been filtered before?

  • UniCornUniCorn USMember

    We used NovoSeq platform / typical GATK best practice pipeline and nothing else. My impression is that it is common to have MQ of 60 for the most of the alignments when we have raw data of good quality, right? If my impression is correct, this will make the MQRankSum non-normally distributed. So I am kind of confused...

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @UniCorn

    I was wrong, I apologize. This data and the plots look fine. You are right, the plots look concentrated because most of the MQRankSum are exactly 0. This is fine and not something to worry about and yes you can still VQSR for this data.

Sign In or Register to comment.