Known site file format for Indel Realignment and BQSR (mouse/mm10)


I was wondering about the format of the known site vcfs used by the RealignerTargetCreator and BaseRecalibrator walkers.

I'm working with mouse whole genome sequence data, so I've been using the Sanger Mouse Genome project known sites from the Keane et al. 2011 Nature paper. From the output, it seems that the RealignerTargetCreator walker is able to recognise and use the gzipped vcf fine:

INFO 15:12:09,747 HelpFormatter - --------------------------------------------------------------------------------
INFO 15:12:09,751 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.5-2-gf57256b, Compiled 2013/05/01 09:27:02
INFO 15:12:09,751 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 15:12:09,752 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 15:12:09,758 HelpFormatter - Program Args: -T RealignerTargetCreator -R mm10.fa -I DUK01M.sorted.dedup.bam -known /tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.indels.rsIDdbSNPv137.vcf.gz -o DUK01M.indel.intervals.list
INFO 15:12:09,758 HelpFormatter - Date/Time: 2014/03/25 15:12:09
INFO 15:12:09,758 HelpFormatter - --------------------------------------------------------------------------------
INFO 15:12:09,759 HelpFormatter - --------------------------------------------------------------------------------
INFO 15:12:09,918 ArgumentTypeDescriptor - Dynamically determined type of /fml/chones/tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.indels.rsIDdbSNPv137.vcf.gz to be VCF
INFO 15:12:10,010 GenomeAnalysisEngine - Strictness is SILENT
INFO 15:12:10,367 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 15:12:10,377 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 15:12:10,439 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.06
INFO 15:12:10,468 RMDTrackBuilder - Attempting to blindly load /fml/chones/tmp/mgp.v3.SNPs.indels/ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/mgp.v3.indels.rsIDdbSNPv137.vcf.gz as a tabix indexed file
INFO 15:12:11,066 IndexDictionaryUtils - Track known doesn't have a sequence dictionary built in, skipping dictionary validation
INFO 15:12:11,201 GenomeAnalysisEngine - Creating shard strategy for 1 BAM files
INFO 15:12:12,333 GenomeAnalysisEngine - Done creating shard strategy
INFO 15:12:12,334 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]

I've checked the indel interval lists for my samples and they do all appear to contain different intervals.

However, when I use the equivalent SNP vcf in the following BQSR step, GATK errors as follows:

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A USER ERROR has occurred (version 2.5-2-gf57256b):
##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
##### ERROR Please do not post this error to the GATK forum
##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR MESSAGE: Invalid command line: This calculation is critically dependent on being able to skip over known variant sites. Please provide a VCF file containing known sites of genetic variation.
##### ERROR ------------------------------------------------------------------------------------------

This suggests that BQSR is not reading the SNP vcf, even though it has the same format as the indel vcf.

My question is: given that the BQSR step failed, should I be worried that there are no errors from the Indel Realignment step? As the known SNP/indel vcfs are in the same format, I don't know whether I can trust the realigned .bams.

Thanks very much!

Best Answer


  • pdexheimer Member

    Have you tried gunzipping the files directly? We've actually been playing with exactly the same data, except downloaded from NCBI/dbSNP. I don't know the very latest status, but yesterday we were having a hard enough time downloading an intact file that we were suspecting a corrupt file on the NCBI site. Since the filenames are (I think) identical, perhaps we're seeing different symptoms of a common problem?

  • mfletcher DE Member

    I actually only discovered that the gzipped vcfs were incorrectly formatted because I pushed a sample through the HaplotypeCaller to the VQSR step, and it took me 2 days to figure out why the VariantRecalibrator walker kept on spitting out "Annotation not present" errors when the annotations I was passing were clearly present in both my HC output vcf and mgp.v3.SNPs.rsIDdbSNPv137.vcf.gz!

    Once I gunzipped the files, the VQSR errors became much more understandable. They pointed to three problems with the vcfs:
    1) the CHROM field is labelled differently from my reference (MGP uses "1, 2, 3... X" and not "chr1, chr2, chr3... chrX")
    2) the records are not sorted in the order GATK expects (i.e. 1, 10, 11, 12... instead of 1, 2, 3...)
    3) MGP's vcfs have had the small scaffolds filtered out, leaving only chr1-19 and chrX. I aligned my data to mm10 but haven't filtered out these small scaffolds, which was causing the VariantRecalibrator to error before the run even began.
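
A minimal sketch of how the first two problems could be fixed in Python, assuming the contig order is taken from the reference's `.fai` index (e.g. one made with `samtools faidx mm10.fa`). The function names `read_fai_contigs` and `harmonise_vcf` are hypothetical, not part of GATK or any library. It holds all records in memory, does not rewrite any `##contig` header lines, and does not map mitochondrial names (`MT` vs `chrM`), so treat it as an illustration, not a production tool.

```python
def read_fai_contigs(fai_text):
    """Return contig names in reference order from the text of a .fai index."""
    return [line.split("\t")[0] for line in fai_text.splitlines() if line.strip()]


def harmonise_vcf(vcf_lines, ref_contigs, prefix="chr"):
    """Rename CHROM values to match the reference and sort records
    into reference contig order, then by position.

    vcf_lines   -- iterable of uncompressed VCF lines (header + records)
    ref_contigs -- contig names in the order the reference lists them
    prefix      -- assumed prefix that turns "1" into "chr1"
    """
    rank = {name: i for i, name in enumerate(ref_contigs)}
    header, records = [], []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            header.append(line)  # header lines are passed through unchanged
            continue
        fields = line.split("\t")
        chrom = fields[0]
        # Only add the prefix when the bare name is not already a reference contig.
        if chrom not in rank and prefix + chrom in rank:
            chrom = prefix + chrom
        if chrom not in rank:
            raise ValueError("contig %r is not in the reference" % fields[0])
        fields[0] = chrom
        records.append((rank[chrom], int(fields[1]), "\t".join(fields)))
    # Python's sort is stable, so records at the same position keep their input order.
    records.sort(key=lambda rec: (rec[0], rec[1]))
    return header + [rec[2] for rec in records]
```

The third problem (scaffolds present in the reference but absent from the vcf) is not touched here: the sketch only guarantees that every contig in the output exists in the reference and appears in reference order.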

    So I'll have to re-do the BQSR step at the very least, because that clearly failed (lesson learned: check your error logs...!), but it's unclear to me how I can tell whether the Indel Realignment actually used the known sites, given the lack of error messages.

    If you're having trouble downloading those files from the NCBI, I'd suggest getting them directly from the Sanger's mirror at ftp://ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/

  • mfletcher DE Member

    @pdexheimer Thanks very much for the help! I'll go right back to the realignment step and make sure that everything works properly this time around.

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    mfletcher said: took me 2 days to figure out why the VariantRecalibrator walker kept on spitting out "Annotation not present" errors when the annotations I was passing were clearly present

    This is a known issue; we have a todo item to make VariantRecalibrator check that the resources match the same reference, and error out if they don't. Sorry about that.
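
Until a check like that exists in the tool, something similar can be run by hand before starting a pipeline. The following is only a sketch of the idea, not how GATK does or will do it: `vcf_contigs` and `compare_contigs` are hypothetical helper names, and the reference contig list is assumed to come from the first column of the reference's `.fai` index.

```python
import gzip


def vcf_contigs(path):
    """Return the CHROM names of a (optionally gzipped) vcf in order of first appearance."""
    opener = gzip.open if path.endswith(".gz") else open
    order, seen = [], set()
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue
            chrom = line.split("\t", 1)[0]
            if chrom not in seen:
                seen.add(chrom)
                order.append(chrom)
    return order


def compare_contigs(resource_contigs, reference_contigs):
    """Summarise how a resource's contigs relate to the reference's.

    unknown      -- resource contigs the reference does not have (e.g. "1" vs "chr1")
    missing      -- reference contigs with no records in the resource
    out_of_order -- True if the shared contigs appear in a different order
    """
    rank = {name: i for i, name in enumerate(reference_contigs)}
    unknown = [c for c in resource_contigs if c not in rank]
    shared_ranks = [rank[c] for c in resource_contigs if c in rank]
    present = set(resource_contigs)
    missing = [c for c in reference_contigs if c not in present]
    return {
        "unknown": unknown,
        "missing": missing,
        "out_of_order": shared_ranks != sorted(shared_ranks),
    }
```

A non-empty `unknown` list is the case described in this thread: none of the known sites can be matched to the reference, even though the tool may start without complaint.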
