The current GATK version is 3.4-0

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

GATK licensing moves to direct-through-Broad model -- read about it on the GATK blog

# Filtering bam files based on known site vcf filters for mouse

DEPosts: 23Member
edited April 2014

Hello again,

More fun with mouse known site data! I'm using the Sanger MGP v3 known indel/known SNP sites for the IndelRealigner and BQSR steps.

I'm working with whole-genome sequence; however, the known sites have been filtered for the following contigs (example from the SNP vcf):

##fileformat=VCFv4.1
##samtoolsVersion=0.1.18-r572
##reference=ftp://ftp-mouse.sanger.ac.uk/ref/GRCm38_68.fa
##source_20130026.2=vcf-annotate(r813) -f +/D=200/d=5/q=20/w=2/a=5 (AJ,AKR,CASTEiJ,CBAJ,DBA2J,FVBNJ,LPJ,PWKPhJ,WSBEiJ)
##source_20130026.2=vcf-annotate(r813) -f +/D=250/d=5/q=20/w=2/a=5 (129S1,BALBcJ,C3HHeJ,C57BL6NJ,NODShiLtJ,NZO,Spretus)
##source_20130305.2=vcf-annotate(r818) -f +/D=155/d=5/q=20/w=2/a=5 (129P2)
##source_20130304.2=vcf-annotate(r818) -f +/D=100/d=5/q=20/w=2/a=5 (129S5)
##contig=<ID=1,length=195471971>
##contig=<ID=10,length=130694993>
##contig=<ID=11,length=122082543>
##contig=<ID=12,length=120129022>
##contig=<ID=13,length=120421639>
##contig=<ID=14,length=124902244>
##contig=<ID=15,length=104043685>
##contig=<ID=16,length=98207768>
##contig=<ID=17,length=94987271>
##contig=<ID=18,length=90702639>
##contig=<ID=19,length=61431566>
##contig=<ID=2,length=182113224>
##contig=<ID=3,length=160039680>
##contig=<ID=4,length=156508116>
##contig=<ID=5,length=151834684>
##contig=<ID=6,length=149736546>
##contig=<ID=7,length=145441459>
##contig=<ID=8,length=129401213>
##contig=<ID=9,length=124595110>
##contig=<ID=X,length=171031299>
##FILTER=<ID=BaseQualBias,Description="Min P-value for baseQ bias (INFO/PV4) [0]">
##FILTER=<ID=EndDistBias,Description="Min P-value for end distance bias (INFO/PV4) [0.0001]">
##FILTER=<ID=GapWin,Description="Window size for filtering adjacent gaps [3]">
##FILTER=<ID=Het,Description="Genotype call is heterozygous (low quality) []">
##FILTER=<ID=MapQualBias,Description="Min P-value for mapQ bias (INFO/PV4) [0]">
##FILTER=<ID=MaxDP,Description="Maximum read depth (INFO/DP or INFO/DP4) [200]">
##FILTER=<ID=MinAB,Description="Minimum number of alternate bases (INFO/DP4) [5]">
##FILTER=<ID=MinDP,Description="Minimum read depth (INFO/DP or INFO/DP4) [5]">
##FILTER=<ID=MinMQ,Description="Minimum RMS mapping quality for SNPs (INFO/MQ) [20]">
##FILTER=<ID=Qual,Description="Minimum value of the QUAL field [10]">
##FILTER=<ID=RefN,Description="Reference base is N []">
##FILTER=<ID=SnpGap,Description="SNP within INT bp around a gap to be filtered [2]">
##FILTER=<ID=StrandBias,Description="Min P-value for strand bias (INFO/PV4) [0.0001]">
##FILTER=<ID=VDB,Description="Minimum Variant Distance Bias (INFO/VDB) [0]">


When I was trying to use these known sites at the VariantRecalibration step, I got a lot of walker errors saying that (I paraphrase) "it's dangerous to use this known site data on your VCF because the contigs of your references do not match".

However, if you look at the GRCm38_68.fai it DOES include the smaller scaffolds which are present in my data.

So, my question is: how should I filter my bam files for the IndelRealigner and downstream steps? I feel like the best option is to filter on the contigs present in the known site vcfs, but obviously that would throw out a proportion of my data.

Thanks very much!

Post edited by Geraldine_VdAuwera on
Tagged:

• Posts: 463Member, GATK Dev, DSDE Member mod

Are the contigs in your fai file also in lexical order (1, 10, 11, etc)? GATK is very conservative when it comes to references - both name and order must be identical

• DEPosts: 23Member

@pdexheimer‌ that's not the issue - apologies if my previous post was unclear! (Although as for the answer - no, the .fai files for the mm10 reference I'm using locally do not match the .fai file on the Sanger's ftp server in order OR contig name, sigh. I'll make a note to keep an eye on that one too)

My question is how I should filter my bam files, if at all, given that the known site vcfs are filtered to those contigs listed above.

e.g. if I try and run IndelRealigner on my whole-genome, unfiltered bam, will it work with the MGP's contig-filtered set of known sites?

And if IndelRealigner works, will further downstream steps work as well?

• DEPosts: 23Member

Thanks very much @pdexheimer‌! My inclination was to filter my data based on the resources, but I'll try out various options and see how they perform.

• Posts: 463Member, GATK Dev, DSDE Member mod

Also, I think (but am not certain) that filtered records in the VCF will not be used - so you could leave the records in place and add a "not on the right config" filter. This approach is probably more in keeping with the spirit of VCF files