Best way to filter out common SNPs from a GATK output VCF file

rcholic Denver

In my Picard/GATK pipeline, I already include the 1000G gold standard and dbSNP files in my VQSR step, and I am wondering whether I should further filter the final VCF files. The two files I use are Mills_and_1000G_gold_standard.indels.hg19.vcf and dbsnp_137.hg19.vcf, downloaded from the GATK resource bundle.

I recently came across the NHLBI exome sequencing data and the more complete 1000G variants.

These made me wonder whether I should use these available VCFs to further filter my VCF files and remove the common SNPs. If so, can I use the "--mask" parameter of GATK's VariantFiltration to do the filtering? The example below is copied from the documentation page:

    java -Xmx2g -jar GenomeAnalysisTK.jar \
       -R ref.fasta \
       -T VariantFiltration \
       -o output.vcf \
       --variant input.vcf \
       --filterExpression "AB < 0.2 || MQ0 > 50" \
       --filterName "Nov09filters" \
       --mask mask.vcf \
       --maskName InDel
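
As I read the tool documentation, --mask marks any call that overlaps a record in the mask file by writing the --maskName text into its FILTER column. It works on position only and does not look at allele frequency. A minimal Python sketch of that behaviour follows. The function names are hypothetical, this is not GATK code, and it assumes an exact chromosome/position match (no mask extension, no multi-base records):

```python
def load_mask_sites(mask_lines):
    """Collect (chrom, pos) pairs from the data lines of a mask VCF."""
    sites = set()
    for line in mask_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos = line.split("\t")[:2]
        sites.add((chrom, int(pos)))
    return sites


def apply_mask(vcf_lines, mask_sites, mask_name="InDel"):
    """Yield VCF lines, adding mask_name to FILTER for calls at masked sites.

    Assumption: a call is "masked" only when its own CHROM/POS is in mask_sites.
    """
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            yield line
            continue
        cols = line.split("\t")
        if (cols[0], int(cols[1])) in mask_sites:
            # Replace "." or "PASS"; otherwise append to the existing filters.
            if cols[6] in (".", "PASS"):
                cols[6] = mask_name
            else:
                cols[6] = cols[6] + ";" + mask_name
        yield "\t".join(cols)
```

Masked calls are flagged rather than dropped, so removing them afterwards is a separate step.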


  Geraldine_VdAuwera Administrator

    Hi there,

    I'm not sure what you mean. It really depends on what you're trying to do: whether you want to determine all quality variants, just the novel variants, etc. You will need to clarify your intentions a little more; otherwise we can't really help you.

    Geraldine Van der Auwera, PhD

  rcholic Denver

    @Geraldine: Sorry for not being clear in my first post.

    I have my VCF output from the GATK pipeline and want to further filter out the common SNPs. For this, I want to use the NHLBI exome sequencing data and the 1000G common SNP VCF to clean up my VCF files.

    My questions are:
    1. whether I can use "--mask commonSNPs.vcf" parameter to do this in VariantFiltration of GATK.
    2. How to filter the VCF files of GATK by minor allele frequency (MAF > 0.1)?


  mmterpstra Netherlands
    edited March 2014
    2. How to filter the VCF files of GATK by minor allele frequency (MAF > 0.1)?

    Use VariantAnnotator like this (to lift annotations over from one file to another):

         --resource:cosmic,vcf $cosmicVcf \
         -E 'cosmic.ID' \
         --resource:1000g,vcf $oneKgP1wgsVcf \
         -E '1000g.AF' \
         -E '1000g.AFR_AF' \
         -E '1000g.AMR_AF' \
         -E '1000g.ASN_AF' \
         -E '1000g.EUR_AF' \

    Then use JEXL with VariantFiltration to filter on the lifted-over annotations:

         --filterExpression "(vc.hasAttribute('1000g.EUR_AF') && (vc.getAttribute('1000g.EUR_AF') > 0.1 && vc.getAttribute('1000g.EUR_AF') < 0.9))" \
         --filterName "1000gEURMAFgt0.1" \

    (Not checked! See the JEXL documentation page if this expression needs fixing.)
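
    To make the intent of that expression concrete: an alternate-allele frequency between 0.1 and 0.9 is the same thing as a minor allele frequency above 0.1. Below is a small Python sketch of the same logic, applied to VCF lines that already carry the 1000g.EUR_AF annotation. It is only an illustration with hypothetical function names, not what GATK runs internally. It assumes a single numeric AF value per record and does not add the matching ##FILTER header line.

```python
def minor_allele_frequency(af):
    """Fold an alternate-allele frequency into a minor-allele frequency."""
    return min(af, 1.0 - af)


def parse_info(info):
    """Turn a VCF INFO string such as 'DP=10;1000g.EUR_AF=0.25' into a dict."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def flag_common(vcf_lines, key="1000g.EUR_AF", threshold=0.1,
                filter_name="1000gEURMAFgt0.1"):
    """Yield VCF lines, adding filter_name to FILTER when MAF > threshold."""
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            yield line
            continue
        cols = line.split("\t")
        value = parse_info(cols[7]).get(key)
        try:
            common = (value is not None
                      and minor_allele_frequency(float(value)) > threshold)
        except ValueError:
            # Assumption: non-numeric values (e.g. "0.1,0.2") are left unflagged.
            common = False
        if common:
            if cols[6] in (".", "PASS"):
                cols[6] = filter_name
            else:
                cols[6] = cols[6] + ";" + filter_name
        yield "\t".join(cols)
```

    Records without the annotation are left untouched, which mirrors the vc.hasAttribute(...) guard in the expression above.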

  rcholic Denver


    Thanks for your reply. One thing I am not sure about: what are $cosmicVcf and $oneKgP1wgsVcf? Where do I get those VCFs?

