# Known site file format for Indel Realignment and BQSR (mouse/mm10)

DE, Posts: 23, Member

Hello,

I was wondering about the format of the known site vcfs used by the RealignerTargetCreator and BaseRecalibrator walkers.

I'm working with mouse whole genome sequence data, so I've been using the Sanger Mouse Genome project known sites from the Keane et al. 2011 Nature paper. From the output, it seems that the RealignerTargetCreator walker is able to recognise and use the gzipped vcf fine:

INFO  15:12:09,747 HelpFormatter - --------------------------------------------------------------------------------
INFO  15:12:09,751 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.5-2-gf57256b, Compiled 2013/05/01 09:27:02
INFO  15:12:09,752 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO  15:12:09,758 HelpFormatter - Program Args: -T RealignerTargetCreator -R mm10.fa -I DUK01M.sorted.dedup.bam -known /tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.indels.rsIDdbSNPv137.vcf.gz -o DUK01M.indel.intervals.list
INFO  15:12:09,758 HelpFormatter - Date/Time: 2014/03/25 15:12:09
INFO  15:12:09,758 HelpFormatter - --------------------------------------------------------------------------------
INFO  15:12:09,759 HelpFormatter - --------------------------------------------------------------------------------
INFO  15:12:09,918 ArgumentTypeDescriptor - Dynamically determined type of /fml/chones/tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.indels.rsIDdbSNPv137.vcf.gz to be VCF
INFO  15:12:10,010 GenomeAnalysisEngine - Strictness is SILENT
INFO  15:12:10,367 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO  15:12:10,377 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO  15:12:10,439 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.06
INFO  15:12:10,468 RMDTrackBuilder - Attempting to blindly load /fml/chones/tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.indels.rsIDdbSNPv137.vcf.gz as a tabix indexed file
INFO  15:12:11,066 IndexDictionaryUtils - Track known doesn't have a sequence dictionary built in, skipping dictionary validation
INFO  15:12:11,201 GenomeAnalysisEngine - Creating shard strategy for 1 BAM files
INFO  15:12:12,333 GenomeAnalysisEngine - Done creating shard strategy
INFO  15:12:12,334 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]

I've checked the indel interval lists for my samples and they do all appear to contain different intervals.

However, when I use the equivalent SNP vcf in the following BQSR step, GATK errors as follows:

##### ERROR ------------------------------------------------------------------------------------------

##### ERROR ------------------------------------------------------------------------------------------

This means that the SNP vcf (which has the same format as the indel vcf) is not used by BQSR.

My question is: given that the BQSR step failed, should I be worried that there are no errors from the Indel Realignment step? As the known SNP/indel vcfs are in the same format, I don't know whether I can trust the realigned .bams.

Thanks very much!

• Posts: 344, Member, GSA Collaborator ✭✭✭

Have you tried gunzipping the files directly? We've actually been playing with exactly the same data, except downloaded from NCBI/dbSNP. I don't know the very latest status, but yesterday we were having a hard enough time downloading an intact file that we were suspecting a corrupt file on the NCBI site. Since the filenames are (I think) identical, perhaps we're seeing different symptoms of a common problem?

• DE, Posts: 23, Member

I actually only discovered that the gzipped vcfs were incorrectly formatted because I pushed a sample through the HaplotypeCaller to the VQSR step, and it took me 2 days to figure out why the VariantRecalibrator walker kept on spitting out "Annotation not present" errors when the annotations I was passing were clearly present in both my HC output vcf and mgp.v3.SNPs.rsIDdbSNPv137.vcf.gz!

Once I gunzipped the files, the VQSR errors became much more understandable. They said that the vcfs were:

1) incorrectly labelled in the CHROM field (MGP uses "1, 2, 3... X" and not "chr1, chr2, chr3... chrX");
2) not sorted correctly for GATK (i.e. 1, 10, 11, 12... instead of 1, 2, 3...);
3) missing the small scaffolds: MGP's vcfs have had them filtered out, leaving only chr1-19 and chrX. I aligned my data to mm10 but haven't filtered out these small scaffolds, which was causing the VariantRecalibrator to error before the run even began.
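For anyone hitting the first two problems (CHROM naming and sort order), here is a rough sketch of one way to fix them in plain Python. It is not a GATK or Sanger tool: `reorder_vcf`, its `prefix` argument and the example contig list are hypothetical names made up for illustration. It assumes the whole VCF fits in memory and that the desired contig order is taken from the reference (for instance the first column of `mm10.fa.fai`). It does not rewrite any `##contig` header lines, and a contig whose name differs by more than the prefix would need its own mapping.

```python
def reorder_vcf(vcf_lines, contig_order, prefix="chr"):
    """Rename CHROM values and sort records into the reference contig order.

    vcf_lines    : iterable of text lines from an uncompressed VCF
    contig_order : contig names in reference order, e.g. ["chr1", "chr2", ...]
    prefix       : string prepended to CHROM values that lack it (assumption)
    Returns a list of lines (no trailing newlines): header first, then records.
    """
    rank = {name: i for i, name in enumerate(contig_order)}
    header, records = [], []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            header.append(line)  # header lines are kept untouched
            continue
        chrom, pos, rest = line.split("\t", 2)
        if not chrom.startswith(prefix):
            chrom = prefix + chrom
        if chrom not in rank:
            # Fail loudly rather than silently dropping known sites.
            raise ValueError(f"contig {chrom!r} is not in the reference order")
        records.append((rank[chrom], int(pos), chrom, rest))
    # Sort by reference contig rank, then by position within each contig.
    records.sort(key=lambda r: (r[0], r[1]))
    return header + [f"{chrom}\t{pos}\t{rest}" for _, pos, chrom, rest in records]
```

Usage would be along the lines of reading the gunzipped VCF with `open(...)`, passing the file handle and the reference contig list to `reorder_vcf`, and writing the returned lines back out joined with newlines.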

So now I'll have to re-do the BQSR step at the very least, because that clearly failed (lesson learned: check your error logs...!), but it's unclear to me how I can tell whether the Indel Realignment actually did anything, given the lack of error messages.
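One indirect check, sketched below under some assumptions, is to compare the contig names in the known-sites vcf against those in the reference index. If the two share no names, that would suggest the known indels could not have matched anything on the reference, whatever the log says. The helper names (`vcf_contigs`, `fai_contigs`, `shared_contigs`) are hypothetical, and the sketch assumes a `.fai` index for the reference already exists (as produced by `samtools faidx`). It scans the whole vcf, so it will be slow on a genome-wide file.

```python
import gzip


def vcf_contigs(path):
    """Return the distinct CHROM values of a plain or gzipped VCF, in file order."""
    opener = gzip.open if path.endswith(".gz") else open
    ordered, seen = [], set()
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip header and blank lines
            chrom = line.split("\t", 1)[0]
            if chrom not in seen:
                seen.add(chrom)
                ordered.append(chrom)
    return ordered


def fai_contigs(path):
    """Return contig names from a FASTA index (.fai): first column of each line."""
    with open(path) as handle:
        return [line.split("\t", 1)[0] for line in handle if line.strip()]


def shared_contigs(vcf_path, fai_path):
    """Return the VCF contigs that also appear in the reference index."""
    reference = set(fai_contigs(fai_path))
    return [c for c in vcf_contigs(vcf_path) if c in reference]
```

An empty list from `shared_contigs("mgp.v3.indels.rsIDdbSNPv137.vcf.gz", "mm10.fa.fai")` (the index file name here is an assumption) would be a strong hint that the naming mismatch affected the realignment step as well.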

If you're having trouble downloading those files from the NCBI, I'd suggest getting them directly from the Sanger's mirror at ftp://ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/

• DE, Posts: 23, Member

@pdexheimer Thanks very much for the help! I'll go right back to the realignment step and make sure that everything works properly this time around.