Known site file format for Indel Realignment and BQSR (mouse/mm10)

mfletcher, DE, Member, Posts: 23


I was wondering about the format of the known site vcfs used by the RealignerTargetCreator and BaseRecalibrator walkers.

I'm working with mouse whole-genome sequence data, so I've been using the Sanger Mouse Genomes Project (MGP) known sites from the Keane et al. 2011 Nature paper. From the output, it seems that the RealignerTargetCreator walker is able to recognise and use the gzipped VCF without problems:

INFO  15:12:09,747 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  15:12:09,751 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.5-2-gf57256b, Compiled 2013/05/01 09:27:02 
INFO  15:12:09,751 HelpFormatter - Copyright (c) 2010 The Broad Institute 
INFO  15:12:09,752 HelpFormatter - For support and documentation go to 
INFO  15:12:09,758 HelpFormatter - Program Args: -T RealignerTargetCreator -R mm10.fa -I DUK01M.sorted.dedup.bam -known /tmp/mgp.v3.SNPs.indels/ -o DUK01M.indel.intervals.list 
INFO  15:12:09,758 HelpFormatter - Date/Time: 2014/03/25 15:12:09 
INFO  15:12:09,758 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  15:12:09,759 HelpFormatter - -------------------------------------------------------------------------------- 
INFO  15:12:09,918 ArgumentTypeDescriptor - Dynamically determined type of /fml/chones/tmp/mgp.v3.SNPs.indels/ to be VCF 
INFO  15:12:10,010 GenomeAnalysisEngine - Strictness is SILENT 
INFO  15:12:10,367 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 
INFO  15:12:10,377 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
INFO  15:12:10,439 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.06 
INFO  15:12:10,468 RMDTrackBuilder - Attempting to blindly load /fml/chones/tmp/mgp.v3.SNPs.indels/ as a tabix indexed file 
INFO  15:12:11,066 IndexDictionaryUtils - Track known doesn't have a sequence dictionary built in, skipping dictionary validation 
INFO  15:12:11,201 GenomeAnalysisEngine - Creating shard strategy for 1 BAM files 
INFO  15:12:12,333 GenomeAnalysisEngine - Done creating shard strategy 

I've checked the indel interval lists for my samples and they do all appear to contain different intervals.

However, when I use the equivalent SNP vcf in the following BQSR step, GATK errors as follows:

```
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.5-2-gf57256b):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions
##### ERROR MESSAGE: Invalid command line: This calculation is critically dependent on being able to skip over known variant sites. Please provide a VCF file containing known sites of genetic variation.
##### ERROR ------------------------------------------------------------------------------------------
```

This suggests that BQSR is not using the SNP VCF, even though it has the same format as the indel VCF.

My question is: given that the BQSR step failed, should I be worried that there are no errors from the Indel Realignment step? As the known SNP/indel vcfs are in the same format, I don't know whether I can trust the realigned .bams.

Thanks very much!

Best Answer


  • pdexheimer, Member, Dev, Posts: 544 ✭✭✭✭

    Have you tried gunzipping the files directly? We've actually been playing with exactly the same data, except downloaded from NCBI/dbSNP. I don't know the very latest status, but yesterday we were having a hard enough time downloading an intact file that we were suspecting a corrupt file on the NCBI site. Since the filenames are (I think) identical, perhaps we're seeing different symptoms of a common problem?
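
One way to test for a damaged download without unpacking the file to disk is to decompress it all the way through and see whether that fails. The following is a minimal Python sketch of that idea; the function name `gzip_is_intact` is made up for illustration, and a clean pass only shows that the compression layer is sound, not that the VCF content inside is valid.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream at `path` decompresses cleanly.

    Hypothetical helper: it checks the compression layer only and says
    nothing about whether the contents are a well-formed VCF.
    """
    try:
        with gzip.open(path, "rb") as handle:
            # Read to the very end so the trailing checksum is verified too.
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # gzip.BadGzipFile is an OSError; EOFError signals a truncated file.
        return False
    return True
```

A truncated or partially downloaded file should come back as `False`, which would point to the download rather than to GATK.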

  • mfletcher, DE, Member, Posts: 23

    I actually only discovered that the gzipped vcfs were incorrectly formatted because I pushed a sample through the HaplotypeCaller to the VQSR step, and it took me 2 days to figure out why the VariantRecalibrator walker kept on spitting out "Annotation not present" errors when the annotations I was passing were clearly present in both my HC output vcf and mgp.v3.SNPs.rsIDdbSNPv137.vcf.gz!

    Once I gunzipped the files, the VQSR errors became much more understandable. They showed that the MGP VCFs:
    1) are labelled differently in the CHROM field (MGP uses "1, 2, 3... X" and not "chr1, chr2, chr3... chrX")
    2) are not sorted in the order GATK expects (i.e. 1, 10, 11, 12... instead of 1, 2, 3...)
    3) have had the small scaffolds filtered out, leaving only chr1-19 and chrX. I aligned my data to mm10 but haven't filtered out these small scaffolds, which was causing the VariantRecalibrator to error before the run even began.
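
The first two problems can be illustrated with a small Python sketch that renames contigs and re-sorts records into reference order. Everything here is an assumption for illustration: the function name `harmonise_vcf_lines` is hypothetical, the simple "chr" prefix mapping only covers names like 1-19 and X, the reference contig list is assumed to come from something like the first column of the reference's .fai index, and all records are held in memory, which will not scale to a full whole-genome SNP file. Header lines are passed through untouched, so any ##contig lines would need the same treatment.

```python
def harmonise_vcf_lines(lines, reference_contigs, prefix="chr"):
    """Rename contigs to match the reference and sort records in its order.

    `lines` is an iterable of uncompressed VCF text lines and
    `reference_contigs` is the list of contig names in reference order.
    Hypothetical in-memory sketch, not a replacement for a dedicated tool.
    """
    order = {name: index for index, name in enumerate(reference_contigs)}
    header, records = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Header lines are kept as they are (see the note above).
            header.append(line)
            continue
        chrom, pos, rest = line.split("\t", 2)
        if chrom not in order:
            # Assumed mapping: "1" -> "chr1", "X" -> "chrX".
            chrom = prefix + chrom
        if chrom not in order:
            raise ValueError("contig %r is not in the reference" % chrom)
        records.append((order[chrom], int(pos), chrom, rest))
    # Sort by reference contig order, then by position; the sort is stable,
    # so records sharing a position keep their original relative order.
    records.sort(key=lambda record: (record[0], record[1]))
    body = ["\t".join((chrom, str(pos), rest)) for _, pos, chrom, rest in records]
    return header + body
```

With a reference ordered chr1, chr2, ... chr10, records on "10" would be moved after those on "2" and relabelled "chr10". The third point does not need fixing in the VCF itself under this sketch, since contigs that are only in the reference are simply never looked up.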

    So now I'll have to re-do the BQSR step at the very least, because that clearly failed (lesson learned: check your error logs...!), but it's unclear to me how I can tell whether the Indel Realignment actually used the known sites, given the lack of error messages.

    If you're having trouble downloading those files from the NCBI, I'd suggest getting them directly from the Sanger's mirror at

  • mfletcher, DE, Member, Posts: 23

    @pdexheimer Thanks very much for the help! I'll go right back to the realignment step and make sure that everything works properly this time around.

  • Geraldine_VdAuwera, Cambridge, MA, Member, Administrator, Broadie, Posts: 11,631

    > took me 2 days to figure out why the VariantRecalibrator walker kept on spitting out "Annotation not present" errors when the annotations I was passing were clearly present

    This is a known issue; we have a todo item to make VariantRecalibrator check that the resources match the same reference, and error out if they don't. Sorry about that.

    Geraldine Van der Auwera, PhD
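
Until such a check exists in the tool, a rough version of the idea can be run by hand before starting a job. The Python sketch below is not GATK's implementation; the function names `vcf_contig_order` and `contig_mismatches` are hypothetical, and the reference contig list is assumed to be supplied by the caller (for example, read from the reference's .fai or .dict file).

```python
import gzip


def vcf_contig_order(path):
    """Return the contig names of a VCF's records, in order of appearance.

    A contig that shows up again after another contig is listed twice, so
    the caller can detect records that are not grouped by contig.
    """
    opener = gzip.open if path.endswith(".gz") else open
    seen = []
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            chrom = line.split("\t", 1)[0]
            if not seen or seen[-1] != chrom:
                seen.append(chrom)
    return seen


def contig_mismatches(vcf_contigs, reference_contigs):
    """Return a list of human-readable problems; an empty list means none found."""
    rank = {name: index for index, name in enumerate(reference_contigs)}
    problems = []
    # dict.fromkeys removes repeats while keeping first-seen order.
    unknown = [c for c in dict.fromkeys(vcf_contigs) if c not in rank]
    if unknown:
        problems.append("contigs absent from reference: " + ", ".join(unknown))
    ranks = [rank[c] for c in vcf_contigs if c in rank]
    # Ranks must strictly increase; a repeat or a drop means wrong order.
    if any(a >= b for a, b in zip(ranks, ranks[1:])):
        problems.append("contigs are not in reference order")
    return problems
```

For a resource file labelled "1, 10, 11..." against a reference labelled "chr1, chr2...", every contig would be reported as absent from the reference, which is the situation described earlier in this thread.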
