
Known site file format for Indel Realignment and BQSR (mouse/mm10)

mfletcher, DE, Posts: 23, Member

Hello,

I was wondering about the format of the known site vcfs used by the RealignerTargetCreator and BaseRecalibrator walkers.

I'm working with mouse whole genome sequence data, so I've been using the Sanger Mouse Genome project known sites from the Keane et al. 2011 Nature paper. From the output, it seems that the RealignerTargetCreator walker is able to recognise and use the gzipped vcf fine:

INFO  15:12:09,747 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  15:12:09,751 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.5-2-gf57256b, Compiled 2013/05/01 09:27:02 
INFO  15:12:09,751 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  15:12:09,752 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk 
INFO  15:12:09,758 HelpFormatter - Program Args: -T RealignerTargetCreator -R mm10.fa -I DUK01M.sorted.dedup.bam -known /tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.indels.rsIDdbSNPv137.vcf.gz -o DUK01M.indel.intervals.list 
INFO  15:12:09,758 HelpFormatter - Date/Time: 2014/03/25 15:12:09 
INFO  15:12:09,758 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  15:12:09,759 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  15:12:09,918 ArgumentTypeDescriptor - Dynamically determined type of /fml/chones/tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.indels.rsIDdbSNPv137.vcf.gz to be VCF 
INFO  15:12:10,010 GenomeAnalysisEngine - Strictness is SILENT 
INFO  15:12:10,367 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
INFO  15:12:10,377 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  15:12:10,439 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.06 
INFO  15:12:10,468 RMDTrackBuilder - Attempting to blindly load /fml/chones/tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.indels.rsIDdbSNPv137.vcf.gz as a tabix indexed file 
INFO  15:12:11,066 IndexDictionaryUtils - Track known doesn't have a sequence dictionary built in, skipping dictionary validation 
INFO  15:12:11,201 GenomeAnalysisEngine - Creating shard strategy for 1 BAM files 
INFO  15:12:12,333 GenomeAnalysisEngine - Done creating shard strategy 
INFO  15:12:12,334 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
I've checked the indel interval lists for my samples and they do all appear to contain different intervals.

However, when I use the equivalent SNP vcf in the following BQSR step, GATK errors as follows:

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.5-2-gf57256b):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Invalid command line: This calculation is critically dependent on being able to skip over known variant sites. Please provide a VCF file containing known sites of genetic variation.
##### ERROR ------------------------------------------------------------------------------------------

This suggests that the SNP vcf (which has the same format as the indel vcf) is not being used by BQSR.

My question is: given that the BQSR step failed, should I be worried that there are no errors from the Indel Realignment step? As the known SNP/indel vcfs are in the same format, I don't know whether I can trust the realigned .bams.

Thanks very much!

Answers

  • pdexheimer, Posts: 344, Member, GSA Collaborator

    Have you tried gunzipping the files directly? We've actually been playing with exactly the same data, except downloaded from NCBI/dbSNP. I don't know the very latest status, but yesterday we were having a hard enough time downloading an intact file that we were suspecting a corrupt file on the NCBI site. Since the filenames are (I think) identical, perhaps we're seeing different symptoms of a common problem?

  • mfletcher, DE, Posts: 23, Member

    I actually only discovered that the gzipped vcfs were incorrectly formatted because I pushed a sample through the HaplotypeCaller to the VQSR step, and it took me 2 days to figure out why the VariantRecalibrator walker kept on spitting out "Annotation not present" errors when the annotations I was passing were clearly present in both my HC output vcf and mgp.v3.SNPs.rsIDdbSNPv137.vcf.gz!

    Once I gunzipped the files, the VQSR errors became much more understandable. They said that the vcfs were:

    1) incorrectly labelled in the CHROM field (MGP uses "1, 2, 3... X" and not "chr1, chr2, chr3... chrX");
    2) not sorted correctly for GATK (i.e. 1, 10, 11, 12... instead of 1, 2, 3...);
    3) missing the small scaffolds: MGP's vcfs have had them filtered out, leaving only chr1-19 and chrX. I aligned my data to mm10 but haven't filtered out these small scaffolds, which was causing the VariantRecalibrator to error before the run even began.
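    For anyone hitting the same three problems, here is a minimal Python sketch of one way to relabel and re-sort such a vcf. It is only an illustration under stated assumptions: that adding a "chr" prefix is enough to match the reference names (plausible for files limited to 1-19 and X, but not checked against mm10 here), that the desired order is the contig order of the reference .fai index, and that the file fits in memory. The function names are hypothetical and not part of GATK.

```python
def read_fai_contigs(fai_lines):
    """Return contig names, in reference order, from the lines of a .fai index."""
    return [line.split("\t")[0] for line in fai_lines if line.strip()]


def rename_and_sort_vcf(vcf_lines, ref_contigs, prefix="chr"):
    """Prefix CHROM values and sort records by reference contig order, then POS.

    Assumption: every renamed contig exists in ref_contigs; a ValueError is
    raised otherwise so that mismatches are not silently ignored.
    """
    rank = {name: i for i, name in enumerate(ref_contigs)}
    header, records = [], []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##contig=<ID="):
            # Keep the header contig declarations consistent with the records.
            header.append(line.replace("##contig=<ID=", "##contig=<ID=" + prefix, 1))
        elif line.startswith("#"):
            header.append(line)
        else:
            chrom, pos, rest = line.split("\t", 2)
            chrom = prefix + chrom
            if chrom not in rank:
                raise ValueError("contig %r is not in the reference index" % chrom)
            records.append((rank[chrom], int(pos), chrom, rest))
    # sorted() is stable, so records sharing a position keep their input order.
    records.sort(key=lambda r: (r[0], r[1]))
    body = ["\t".join((chrom, str(pos), rest)) for _, pos, chrom, rest in records]
    return header + body
```

    The input lines could come from `gzip.open(path, "rt")` for the compressed file and from a plain `open()` on the reference's .fai file. Any existing index for the vcf would no longer match the rewritten file and would need to be regenerated.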

    So I'll have to re-do the BQSR step at the very least, because that clearly failed (lesson learned: check your error logs...!), but it's unclear to me how I can tell whether the Indel Realignment actually did anything, given the lack of error messages.
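    One rough way to approach that question is to compare the contig names used by the known-sites vcf with those in the reference index. The sketch below is a hypothetical helper, not a GATK feature. If no names are shared, it seems unlikely that any known site could have been matched to a reference position, regardless of whether the tool reported an error. Note also that interval lists differing between samples would not by themselves show that the known indels were used, since targets can also be derived from the reads.

```python
def vcf_contigs(vcf_lines):
    """Return the distinct CHROM values of the records, in first-seen order."""
    seen, order = set(), []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom = line.split("\t", 1)[0]
        if chrom not in seen:
            seen.add(chrom)
            order.append(chrom)
    return order


def compare_contigs(vcf_lines, ref_contigs):
    """Split the vcf's contig names into those present in / absent from the reference."""
    ref = set(ref_contigs)
    names = vcf_contigs(vcf_lines)
    shared = [c for c in names if c in ref]
    missing = [c for c in names if c not in ref]
    return shared, missing
```

    Here `ref_contigs` is assumed to be the first column of the reference's .fai file, and `vcf_lines` can be read with `gzip.open(path, "rt")`. An empty `shared` list would point to a naming mismatch such as "1" versus "chr1".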

    If you're having trouble downloading those files from the NCBI, I'd suggest getting them directly from the Sanger's mirror at ftp://ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/

  • mfletcher, DE, Posts: 23, Member

    @pdexheimer Thanks very much for the help! I'll go right back to the realignment step and make sure that everything works properly this time around.

  • Geraldine_VdAuwera, Posts: 6,192, Administrator, GATK Developer

    "... took me 2 days to figure out why the VariantRecalibrator walker kept on spitting out 'Annotation not present' errors when the annotations I was passing were clearly present ..."

    This is a known issue; we have a todo item to make VariantRecalibrator check that the resources match the same reference, and error out if they don't. Sorry about that.

    Geraldine Van der Auwera, PhD
