Filtering bam files based on known site vcf filters for mouse

mfletchermfletcher DEPosts: 23Member
edited April 17 in Ask the GATK team

Hello again,

More fun with mouse known site data! I'm using the Sanger MGP v3 known indel/known SNP sites for the IndelRealigner and BQSR steps.

I'm working with whole-genome sequence; however, the known sites have been filtered for the following contigs (example from the SNP vcf):

##fileformat=VCFv4.1
##samtoolsVersion=0.1.18-r572
##reference=ftp://ftp-mouse.sanger.ac.uk/ref/GRCm38_68.fa
##source_20130026.2=vcf-annotate(r813) -f +/D=200/d=5/q=20/w=2/a=5 (AJ,AKR,CASTEiJ,CBAJ,DBA2J,FVBNJ,LPJ,PWKPhJ,WSBEiJ)
##source_20130026.2=vcf-annotate(r813) -f +/D=250/d=5/q=20/w=2/a=5 (129S1,BALBcJ,C3HHeJ,C57BL6NJ,NODShiLtJ,NZO,Spretus)
##source_20130305.2=vcf-annotate(r818) -f +/D=155/d=5/q=20/w=2/a=5 (129P2)
##source_20130304.2=vcf-annotate(r818) -f +/D=100/d=5/q=20/w=2/a=5 (129S5)
##contig=<ID=1,length=195471971>
##contig=<ID=10,length=130694993>
##contig=<ID=11,length=122082543>
##contig=<ID=12,length=120129022>
##contig=<ID=13,length=120421639>
##contig=<ID=14,length=124902244>
##contig=<ID=15,length=104043685>
##contig=<ID=16,length=98207768>
##contig=<ID=17,length=94987271>
##contig=<ID=18,length=90702639>
##contig=<ID=19,length=61431566>
##contig=<ID=2,length=182113224>
##contig=<ID=3,length=160039680>
##contig=<ID=4,length=156508116>
##contig=<ID=5,length=151834684>
##contig=<ID=6,length=149736546>
##contig=<ID=7,length=145441459>
##contig=<ID=8,length=129401213>
##contig=<ID=9,length=124595110>
##contig=<ID=X,length=171031299>
##FILTER=<ID=BaseQualBias,Description="Min P-value for baseQ bias (INFO/PV4) [0]">
##FILTER=<ID=EndDistBias,Description="Min P-value for end distance bias (INFO/PV4) [0.0001]">
##FILTER=<ID=GapWin,Description="Window size for filtering adjacent gaps [3]">
##FILTER=<ID=Het,Description="Genotype call is heterozygous (low quality) []">
##FILTER=<ID=MapQualBias,Description="Min P-value for mapQ bias (INFO/PV4) [0]">
##FILTER=<ID=MaxDP,Description="Maximum read depth (INFO/DP or INFO/DP4) [200]">
##FILTER=<ID=MinAB,Description="Minimum number of alternate bases (INFO/DP4) [5]">
##FILTER=<ID=MinDP,Description="Minimum read depth (INFO/DP or INFO/DP4) [5]">
##FILTER=<ID=MinMQ,Description="Minimum RMS mapping quality for SNPs (INFO/MQ) [20]">
##FILTER=<ID=Qual,Description="Minimum value of the QUAL field [10]">
##FILTER=<ID=RefN,Description="Reference base is N []">
##FILTER=<ID=SnpGap,Description="SNP within INT bp around a gap to be filtered [2]">
##FILTER=<ID=StrandBias,Description="Min P-value for strand bias (INFO/PV4) [0.0001]">
##FILTER=<ID=VDB,Description="Minimum Variant Distance Bias (INFO/VDB) [0]">

When I was trying to use these known sites at the VariantRecalibration step, I got a lot of walker errors saying that (I paraphrase) "it's dangerous to use this known site data on your VCF because the contigs of your references do not match".

However, if you look at the GRCm38_68.fai it DOES include the smaller scaffolds which are present in my data.

So, my question is: how should I filter my bam files for the IndelRealigner and downstream steps? I feel like the best option is to filter on the contigs present in the known site vcfs, but obviously that would throw out a proportion of my data.

Thanks very much!

Post edited by Geraldine_VdAuwera on

Best Answer

Answers

Sign In or Register to comment.