Bug Bulletin: The GenomeLocPArser error in SplitNCigarReads has been fixed; if you encounter it, use the latest nightly build.

best way of filtering out common SNPs in the GATK outputted VCF file

rcholicrcholic DenverPosts: 67Member

In my PiCard/GATK pipeline, I already include the 1000G_gold_standard and dbsnp files in my VQSR step, I am wondering if I should further filter the final vcf files. The two files I use are Mills_and_1000G_gold_standard.indels.hg19.vcf and dbsnp_137.hg19.vcf, downloaded from the GATK resource bundle.

I recently came across the NHLBI exome seq data http://evs.gs.washington.edu/EVS/#tabs-7, and the more complete 1000G variants ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/

These made me wonder if I should use these available VCFs to further filter my VCF files to remove the common SNPs. If so, can I use the "--mask" parameter in VariantFiltration of GATK to do the filtration? Examples below copied from documentation page:

    java -Xmx2g -jar GenomeAnalysisTK.jar \
       -R ref.fasta \
       -T VariantFiltration \
       -o output.vcf \
       --variant input.vcf \
       --filterExpression "AB < 0.2 || MQ0 > 50" \
       --filterName "Nov09filters" \
       --mask mask.vcf \
       --maskName InDel
Tagged:
Sign In or Register to comment.