Best way of filtering out common SNPs in the VCF file output by GATK

rcholic (Denver; Posts: 68; Member)

In my Picard/GATK pipeline, I already include the 1000G gold standard and dbSNP files in my VQSR step, and I am wondering whether I should filter the final VCF files further. The two files I use are Mills_and_1000G_gold_standard.indels.hg19.vcf and dbsnp_137.hg19.vcf, both downloaded from the GATK resource bundle.

I recently came across the NHLBI exome sequencing data (http://evs.gs.washington.edu/EVS/#tabs-7) and the more complete 1000G variants (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/).

These made me wonder whether I should use these available VCFs to filter my VCF files further and remove the common SNPs. If so, can I use the "--mask" parameter of VariantFiltration in GATK to do the filtering? The example below is copied from the documentation page:

    java -Xmx2g -jar GenomeAnalysisTK.jar \
       -R ref.fasta \
       -T VariantFiltration \
       -o output.vcf \
       --variant input.vcf \
       --filterExpression "AB < 0.2 || MQ0 > 50" \
       --filterName "Nov09filters" \
       --mask mask.vcf \
       --maskName InDel

Answers

  • Geraldine_VdAuwera (Posts: 6,464; Administrator, GATK Developer)

    Hi there,

    I'm not sure what you mean; it really depends on what you're trying to do, for example whether you want to identify all high-quality variants, only the novel variants, etc. You will need to clarify your intentions a little more, otherwise we can't really help you.

    Geraldine Van der Auwera, PhD

  • rcholic (Denver; Posts: 68; Member)

    @Geraldine, sorry for not being clear in my first post.

    I have my VCF output from the GATK pipeline and want to filter out the common SNPs. To do this, I would like to use the NHLBI exome sequencing data and the 1000G common SNP VCF to clean up my VCF files.

    My questions are:

    1. Can I use the "--mask commonSNPs.vcf" parameter of VariantFiltration in GATK to do this?
    2. How to filter the VCF files of GATK by minor allele frequency (MAF > 0.1)?

    Thanks

  • mmterpstra (Netherlands; Posts: 29; Member; edited March 27)
    2. How to filter the VCF files of GATK by minor allele frequency (MAF > 0.1)?

    Use VariantAnnotator like this to copy annotations from one file to another:

         --resource:cosmic,vcf $cosmicVcf \
         -E 'cosmic.ID' \
         --resource:1000g,vcf $oneKgP1wgsVcf \
         -E '1000g.AF' \
         -E '1000g.AFR_AF' \
         -E '1000g.AMR_AF' \
         -E '1000g.ASN_AF' \
         -E '1000g.EUR_AF' \
    

    Then use a JEXL expression with VariantFiltration to filter on the copied annotations:

         --filterExpression "(vc.hasAttribute('1000g.EUR_AF') && (vc.getAttribute('1000g.EUR_AF') > 0.1 && vc.getAttribute('1000g.EUR_AF') < 0.9))" \
         --filterName "1000gEURMAFgt0.1" \
    

    (Not checked! See also the JEXL documentation page if this expression needs fixing.)
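
    For reference, the two argument fragments above could be assembled into complete commands roughly as follows. This is only a sketch in the GATK 3-style syntax used elsewhere in this thread: GATK_JAR, REF and ONEKG_VCF are placeholder paths, the function names are hypothetical, and the JEXL expression is the same unchecked one from above.

```shell
#!/usr/bin/env bash
# Sketch only: GATK 3-style invocations assembled from the fragments in this thread.
# GATK_JAR, REF and ONEKG_VCF are placeholder paths (assumptions), not real defaults.
GATK_JAR="${GATK_JAR:-GenomeAnalysisTK.jar}"
REF="${REF:-ref.fasta}"
ONEKG_VCF="${ONEKG_VCF:-1000g_phase1.vcf}"

# Step 1: copy the 1000 Genomes allele frequencies into the INFO field of the call set.
# Usage: annotate_1000g_af <input.vcf> <annotated.vcf>
annotate_1000g_af() {
    local in_vcf="$1" out_vcf="$2"
    java -Xmx2g -jar "$GATK_JAR" \
        -T VariantAnnotator \
        -R "$REF" \
        --variant "$in_vcf" \
        --resource:1000g,vcf "$ONEKG_VCF" \
        -E '1000g.AF' \
        -E '1000g.EUR_AF' \
        -o "$out_vcf"
}

# Step 2: flag sites whose European allele frequency lies strictly between 0.1 and 0.9.
# VariantFiltration only writes the filter name into the FILTER column; records stay in the file.
# Usage: flag_common_eur <annotated.vcf> <filtered.vcf>
flag_common_eur() {
    local in_vcf="$1" out_vcf="$2"
    java -Xmx2g -jar "$GATK_JAR" \
        -T VariantFiltration \
        -R "$REF" \
        --variant "$in_vcf" \
        --filterExpression "vc.hasAttribute('1000g.EUR_AF') && vc.getAttribute('1000g.EUR_AF') > 0.1 && vc.getAttribute('1000g.EUR_AF') < 0.9" \
        --filterName "1000gEURMAFgt0.1" \
        -o "$out_vcf"
}
```

    After sourcing this, calling annotate_1000g_af input.vcf annotated.vcf and then flag_common_eur annotated.vcf filtered.vcf would run the two steps in order. Because flagged records remain in the output, dropping them entirely needs a further step, for example SelectVariants with its --excludeFiltered option.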

  • rcholic (Denver; Posts: 68; Member)

    @mmterpstra,

    Thanks for your reply. One thing I am not sure about is what $cosmicVcf and $oneKgP1wgsVcf are. Where do I get these VCFs?
