Filtering bam files based on known site vcf filters for mouse

mfletcher DE · Posts: 23 · Member
edited April 17 in Ask the GATK team

Hello again,

More fun with mouse known site data! I'm using the Sanger MGP v3 known indel/known SNP sites for the IndelRealigner and BQSR steps.

I'm working with whole-genome sequence; however, the known sites have been restricted to the following contigs (example header from the SNP vcf):

##fileformat=VCFv4.1
##samtoolsVersion=0.1.18-r572
##reference=ftp://ftp-mouse.sanger.ac.uk/ref/GRCm38_68.fa
##source_20130026.2=vcf-annotate(r813) -f +/D=200/d=5/q=20/w=2/a=5 (AJ,AKR,CASTEiJ,CBAJ,DBA2J,FVBNJ,LPJ,PWKPhJ,WSBEiJ)
##source_20130026.2=vcf-annotate(r813) -f +/D=250/d=5/q=20/w=2/a=5 (129S1,BALBcJ,C3HHeJ,C57BL6NJ,NODShiLtJ,NZO,Spretus)
##source_20130305.2=vcf-annotate(r818) -f +/D=155/d=5/q=20/w=2/a=5 (129P2)
##source_20130304.2=vcf-annotate(r818) -f +/D=100/d=5/q=20/w=2/a=5 (129S5)
##contig=<ID=1,length=195471971>
##contig=<ID=10,length=130694993>
##contig=<ID=11,length=122082543>
##contig=<ID=12,length=120129022>
##contig=<ID=13,length=120421639>
##contig=<ID=14,length=124902244>
##contig=<ID=15,length=104043685>
##contig=<ID=16,length=98207768>
##contig=<ID=17,length=94987271>
##contig=<ID=18,length=90702639>
##contig=<ID=19,length=61431566>
##contig=<ID=2,length=182113224>
##contig=<ID=3,length=160039680>
##contig=<ID=4,length=156508116>
##contig=<ID=5,length=151834684>
##contig=<ID=6,length=149736546>
##contig=<ID=7,length=145441459>
##contig=<ID=8,length=129401213>
##contig=<ID=9,length=124595110>
##contig=<ID=X,length=171031299>
##FILTER=<ID=BaseQualBias,Description="Min P-value for baseQ bias (INFO/PV4) [0]">
##FILTER=<ID=EndDistBias,Description="Min P-value for end distance bias (INFO/PV4) [0.0001]">
##FILTER=<ID=GapWin,Description="Window size for filtering adjacent gaps [3]">
##FILTER=<ID=Het,Description="Genotype call is heterozygous (low quality) []">
##FILTER=<ID=MapQualBias,Description="Min P-value for mapQ bias (INFO/PV4) [0]">
##FILTER=<ID=MaxDP,Description="Maximum read depth (INFO/DP or INFO/DP4) [200]">
##FILTER=<ID=MinAB,Description="Minimum number of alternate bases (INFO/DP4) [5]">
##FILTER=<ID=MinDP,Description="Minimum read depth (INFO/DP or INFO/DP4) [5]">
##FILTER=<ID=MinMQ,Description="Minimum RMS mapping quality for SNPs (INFO/MQ) [20]">
##FILTER=<ID=Qual,Description="Minimum value of the QUAL field [10]">
##FILTER=<ID=RefN,Description="Reference base is N []">
##FILTER=<ID=SnpGap,Description="SNP within INT bp around a gap to be filtered [2]">
##FILTER=<ID=StrandBias,Description="Min P-value for strand bias (INFO/PV4) [0.0001]">
##FILTER=<ID=VDB,Description="Minimum Variant Distance Bias (INFO/VDB) [0]">

When I tried to use these known sites at the VariantRecalibration step, I got a lot of walker errors saying (I paraphrase) "it's dangerous to use this known site data on your VCF because the contigs of your references do not match".

However, if you look at the GRCm38_68.fai it DOES include the smaller scaffolds which are present in my data.

So, my question is: how should I filter my bam files for the IndelRealigner and downstream steps? I feel like the best option is to filter on the contigs present in the known site vcfs, but obviously that would throw out a proportion of my data.

Thanks very much!

Post edited by Geraldine_VdAuwera on

Answers

  • pdexheimer · Posts: 388 · Member, GSA Collaborator ✭✭✭✭

    Are the contigs in your fai file also in lexical order (1, 10, 11, etc)? GATK is very conservative when it comes to references - both name and order must be identical
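One way to check the name-and-order point above is to compare the contig list in the VCF header with the one in the reference index. The following is only a minimal sketch: the function names are made up for illustration, it assumes the `##contig` lines look like the header quoted in the question, and it assumes the `.fai` is the usual tab-separated index with the contig name in the first column.

```python
import re

# Matches header lines such as ##contig=<ID=1,length=195471971>
CONTIG_ID_RE = re.compile(r"^##contig=<ID=([^,>]+)")


def vcf_contigs(header_lines):
    """Return contig IDs from ##contig header lines, in file order."""
    names = []
    for line in header_lines:
        match = CONTIG_ID_RE.match(line)
        if match:
            names.append(match.group(1))
    return names


def fai_contigs(fai_lines):
    """Return contig names (first tab-separated column) in file order."""
    return [line.split("\t")[0] for line in fai_lines if line.strip()]


def compare_contigs(vcf_names, ref_names):
    """Summarise how two contig lists differ in membership and in order."""
    vcf_set, ref_set = set(vcf_names), set(ref_names)
    shared_in_vcf_order = [n for n in vcf_names if n in ref_set]
    shared_in_ref_order = [n for n in ref_names if n in vcf_set]
    return {
        "only_in_vcf": [n for n in vcf_names if n not in ref_set],
        "only_in_ref": [n for n in ref_names if n not in vcf_set],
        # True only if the contigs both files share appear in the same order
        "same_order": shared_in_vcf_order == shared_in_ref_order,
    }
```

Called on the header lines of the known-sites VCF and the lines of the local `.fai`, `compare_contigs` would list the scaffolds present only in the reference under `only_in_ref` and report whether the shared contigs are sorted the same way (lexical 1, 10, 11 versus numeric 1, 2, 3, for example).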

  • mfletcher DE · Posts: 23 · Member

    @pdexheimer that's not the issue - apologies if my previous post was unclear! (Although, to answer your question: no, the .fai file for the mm10 reference I'm using locally does not match the .fai file on Sanger's FTP server in order OR contig name, sigh. I'll make a note to keep an eye on that one too.)

    My question is how I should filter my bam files, if at all, given that the known site vcfs are filtered to those contigs listed above.

    E.g. if I try to run IndelRealigner on my whole-genome, unfiltered bam, will it work with the MGP's contig-filtered set of known sites?

    And if IndelRealigner works, will further downstream steps work as well?
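If restricting the data to the known-sites contigs turns out to be the route taken, the contig list does not have to be typed by hand; it can be derived from the VCF header itself. The sketch below is an illustration under stated assumptions: `contigs_to_bed` is a hypothetical helper name, and it assumes each `##contig` line has `ID` followed directly by `length`, as in the header quoted in the question. It only builds whole-contig BED records; whether and how to hand them to a region-restriction option is left to the toolchain in use.

```python
import re

# Matches header lines such as ##contig=<ID=19,length=61431566>
CONTIG_RE = re.compile(r"^##contig=<ID=([^,>]+),length=(\d+)")


def contigs_to_bed(header_lines):
    """Turn ##contig header lines into whole-contig BED records.

    BED coordinates are 0-based and half-open, so a contig of length N
    is written as start 0, end N.
    """
    bed = []
    for line in header_lines:
        match = CONTIG_RE.match(line)
        if match:
            name, length = match.group(1), int(match.group(2))
            bed.append(f"{name}\t0\t{length}")
    return bed
```

For the header above this would yield one record per listed contig (1 to 19 and X), and nothing for the smaller scaffolds, which is exactly the portion of the data that such a restriction would set aside.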

  • mfletcher DE · Posts: 23 · Member

    Thanks very much @pdexheimer! My inclination was to filter my data based on the resources, but I'll try out various options and see how they perform.

  • pdexheimer · Posts: 388 · Member, GSA Collaborator ✭✭✭✭

    Also, I think (but am not certain) that filtered records in the VCF will not be used - so you could leave the records in place and add a "not on the right contig" filter. This approach is probably more in keeping with the spirit of VCF files.
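The "leave the records in place and add a filter" idea could be sketched roughly as follows. This is an illustration only, not an established tool: the function name and the filter ID `OffKnownContig` are made up here, the input is assumed to be plain-text VCF lines without trailing newlines, and whether downstream tools actually skip filtered records is the uncertainty already flagged above.

```python
def mark_off_contig(vcf_lines, allowed, filter_id="OffKnownContig"):
    """Flag records whose CHROM is not in `allowed`, keeping every record.

    A ##FILTER declaration for `filter_id` (a hypothetical ID) is added just
    before the #CHROM column header; records themselves are never dropped.
    """
    declaration = (
        f'##FILTER=<ID={filter_id},'
        'Description="Contig not covered by the known-sites resource">'
    )
    out = []
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            out.append(declaration)
            out.append(line)
        elif line.startswith("#") or not line.strip():
            out.append(line)  # other header lines and blanks pass through
        else:
            fields = line.split("\t")
            if fields[0] not in allowed:
                current = fields[6]  # FILTER is the seventh VCF column
                # "." and "PASS" both mean no filter has been failed so far
                if current in (".", "PASS"):
                    fields[6] = filter_id
                else:
                    fields[6] = f"{current};{filter_id}"
            out.append("\t".join(fields))
    return out
```

Records on the allowed contigs come back unchanged, while a record on any other contig keeps its existing filters and gains the new one, so the original calls remain recoverable.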
