We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Errors about contigs in BAM or VCF files not being properly ordered or sorted

This is not as common as the "wrong reference build" problem, but it still pops up every now and then: a collaborator gives you a BAM or VCF file that's derived from the correct reference, but for whatever reason the contigs are not sorted in the same order. The GATK can be particular about the ordering BAM and VCF files so it will fail with an error in this case.
So what do you do?
For BAM files
You run Picard's ReorderSam tool on your BAM file, using the reference genome dictionary as a template, like this:
java -jar picard.jar ReorderSam \ I=original.bam \ O=reordered.bam \ R=reference.fasta \ CREATE_INDEX=TRUE
Where reference.fasta
is your genome reference, which must be accompanied by a valid *.dict
dictionary file. The CREATE_INDEX
argument is optional but useful if you plan to use the resulting file directly with GATK (otherwise you'll need to run another tool to create an index).
Be aware that this tool will drop reads that don't have equivalent contigs in the new reference (potentially bad or not, depending on what you want). If contigs have the same name in the BAM and the new reference, this tool assumes that the alignment of the read in the new BAM is the same. This is not a liftover tool!
For VCF files
You run Picard's SortVcf tool on your VCF file, using the reference genome dictionary as a template, like this:
java -jar picard.jar SortVcf \ I=original.vcf \ O=sorted.vcf \ SEQUENCE_DICTIONARY=reference.dict
Where reference.dict
is the sequence dictionary of your genome reference.
Note that you may need to delete the index file that gets created automatically for your new VCF by the Picard tool. GATK will automatically regenerate an index file for your VCF.
Version-specific alert for GATK 3.5
In version 3.5, we added some beefed-up VCF sequence dictionary validation. Unfortunately, as a side effect of the additional checks, some users have experienced an error that starts with "ERROR MESSAGE: Lexicographically sorted human genome sequence detected in variant." that is due to unintentional activation of a check that is not necessary. This will be fixed in the next release; in the meantime -U ALLOW_SEQ_DICT_INCOMPATIBILITY can be used (with caution) to override the check.
Comments
Hello,
Is there anyway to adjust/sort the genome fasta file instead of the Bam file?
I have hundreds of Bams taking up TBs of storage. To reorder them all and save double copies (since some information is lost in the process) is a huge endeavour. It is much more feasible for me to tweak the reference instead. Is that possible?
Thanks!
Rola
Yes, you could reorder the contigs within the reference file -- but if you use the same reference that was used to generate all these bams you shouldn't need to. What exactly is the error you're getting?
It might not be the exact reference genome file. The data is publicly available and it is aligned to hg19 but I dont have the exact genome file. What should I do? Do I look at the order of contigs in the Bam file and order the genome file the exact same way?
This is the error log:
INFO 23:29:03,978 GenomeAnalysisEngine - Strictness is SILENT
INFO 23:29:04,344 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 23:29:04,365 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 23:29:04,748 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.38
INFO 23:29:04,972 IntervalUtils - Processing 159138663 bp from intervals
INFO 23:29:08,711 GATKRunReport - Uploaded run statistics report to AWS S3
ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.5-0-g36282e4):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Lexicographically sorted human genome sequence detected in reads. Please see http://gatkforums.broadinstitute.org/discussion/58/companion-utilities-reordersamfor more information. Error details: reads contigs = [chr1, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr2, chr20, chr21, chr22, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chrM, chrX, chrY]
I tried to order the contigs in the genome fasta file the same way the Bams are sorted:
pulled out contig order:
samtools view $BAM | cut -f3 | uniq
chr1
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr2
chr20
chr21
chr22
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chrM
chrX
chrY
*
recreate genome file with same contigs in the same order:
cat chr1.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chr2.fa chr20.fa chr21.fa chr22.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa chrM.fa chrX.fa chrY.fa > hg19.gatk.fa
index genome and create dictionary:
samtools faidx hg19.gatk.fa
java -jar picard.jar CreateSequenceDictionary R= hg19.gatk.fa O= hg19.gatk.dict
run muTect:
java -jar GenomeAnalysisTK.jar -T MuTect2 -R hg19.gatk.fa -I:tumor $tumor -I:normal $normal --intervals "chr7" -o NormalTumorPair_chr7.muTect.vcf
######### Same error:
INFO 20:20:08,335 HelpFormatter - --------------------------------------------------------------------------------
INFO 20:20:08,339 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:56
INFO 20:20:08,339 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 20:20:08,339 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 20:20:08,344 HelpFormatter - Program Args: -T MuTect2 -R hg19.gatk.fa -I:tumor tumor.bam -I:normal normal.bam --intervals chr7 -o NormalTumorPair_chr7.muTect.vcf
INFO 20:20:08,361 HelpFormatter - Executing as x on Linux 2.6.32-504.30.3.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_85-mockbuild_2015_07_15_12_57-b00.
INFO 20:20:08,361 HelpFormatter - Date/Time: 2016/01/10 20:20:08
INFO 20:20:08,361 HelpFormatter - --------------------------------------------------------------------------------
INFO 20:20:08,361 HelpFormatter - --------------------------------------------------------------------------------
INFO 20:20:08,595 GenomeAnalysisEngine - Strictness is SILENT
INFO 20:20:08,980 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 20:20:08,996 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 20:20:09,142 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.14
INFO 20:20:09,232 IntervalUtils - Processing 159138663 bp from intervals
INFO 20:20:10,879 GATKRunReport - Uploaded run statistics report to AWS S3
ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.5-0-g36282e4):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Lexicographically sorted human genome sequence detected in reads. Please see http://gatkforums.broadinstitute.org/discussion/58/companion-utilities-reordersamfor more information. Error details: reads contigs = [chr1, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr2, chr20, chr21, chr22, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chrM, chrX, chrY]
ERROR ------------------------------------------------------------------------------------------
Ah, what you're encountering is not the "contigs mismatch" error, it's a different error that complains specifically about the lexicographical order. We'll add some text to disambiguate that in the doc.
This is something of a legacy requirement that contigs be ordered in pure alphanumeric order. Once you've made sure the dictionaries match between your bams and reference, and any VCF resources you plan to use, you can override this requirement by setting
-U ALLOW_SEQ_DICT_INCOMPATIBILITY
which will also allow lexicographical order.Note that
-U
stands forUNSAFE
and that skipping dictionary validation is not recommended. However in your case I think it may be the only way to work without reordering all your bams.Be sure to test this on a full run of your pipeline on a subset of data to make sure it won't cause any fatal problems downstream.
I am currently trying to resort contigs from a vcf relative to a reference with the picard task SortVCF; however, I am running into the error quoted below. This may be because of an incompatible dictionary but the error was repeated even after creating a new dict file by following the documentation here: http://gatkforums.broadinstitute.org/gatk/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference.
Do you have any suggestions? I am using the reference genome as specified at the top of the vcf.
Thank you!
--
[Fri Jan 22 16:50:24 EST 2016] picard.vcf.SortVcf INPUT=[ALL.autosomes.phase3_shapeit2_mvncall_integrated_v4.20130502.sites.vcf] OUTPUT=sorted.vcf SEQUENCE_DICTIONARY=/xchip/cga_home/breardon/misc_projects/_travis_vcf/reference_hs37d5/hs37d5.dict VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=true CREATE_MD5_FILE=false
[Fri Jan 22 16:50:24 EST 2016] Executing as [email protected] on Linux 2.6.32-573.7.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_71-b14; Picard version: 1.834(b2a94f76e786204c2ea48814b4a4a1018ff9b338_1422570082) JdkDeflater
[Fri Jan 22 16:50:25 EST 2016] picard.vcf.SortVcf done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=2058354688
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.IllegalArgumentException: java.lang.AssertionError: SAM dictionaries are not the same: SAMSequenceRecord(name=GL000191.1,length=106433,dict_index=22,assembly=b37) was found when SAMSequenceRecord(name=X,length=155270560,dict_index=22,assembly=null) was expected.
at picard.vcf.SortVcf.collectFileReadersAndHeaders(SortVcf.java:112)
at picard.vcf.SortVcf.doWork(SortVcf.java:81)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:187)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
Caused by: java.lang.AssertionError: SAM dictionaries are not the same: SAMSequenceRecord(name=GL000191.1,length=106433,dict_index=22,assembly=b37) was found when SAMSequenceRecord(name=X,length=155270560,dict_index=22,assembly=null) was expected.
at htsjdk.samtools.SAMSequenceDictionary.assertSameDictionary(SAMSequenceDictionary.java:142)
at picard.vcf.SortVcf.collectFileReadersAndHeaders(SortVcf.java:110)
... 4 more
Issue · Github
by Sheila
@breardon
Hi,
I am going to have our Picard expert look at this. He will get back to you.
Sheila
@breardon, did you run the UpdateVcfSequenceDictionaryTool prior to running the SortVcf tool?
Hi, I'm starting to get crazy.
I run HaplotypeCaller without a dbsnp.vcf file and it worked well.
java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19.fa -I GATK_BaseRecalibrator/Ind1_Recal.bam -o GATK_HaploCaller/Ind1_raw_g.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF
I wanted to include a dbsnp.vcf file and the problems started.
java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19.fa --dbsnp dbsnp_138.hg19.vcf -I GATK_BaseRecalibrator/Ind1_Recal.bam -o GATK_HaploCaller/Ind1_raw.g.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF
ERROR MESSAGE: Input files dbsnp and reference have incompatible contigs: The contig order in dbsnp and referenceis not the same; to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files..
Ok, So, I order it:
java -jar picard.jar SortVcf I=.dbsnp_138.hg19.vcf O=dbsnp_138.hg19_sorted.vcf SEQUENCE_DICTIONARY=.ucsc.hg19.dict.gz
So, I try again with the sorted vcf file:
java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19.fa --dbsnp dbsnp_138.hg19_sorted.vcf -I GATK_BaseRecalibrator/Ind1_Recal.bam -o GATK_HaploCaller/Ind1_raw.g.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF
ERROR MESSAGE: Lexicographically sorted human genome sequence detected in dbsnp.
ERROR For safety's sake the GATK requires human contigs in karyotypic order: 1, 2, ..., 10, 11, ..., 20, 21, 22, X, Y with M either leading or trailing these contigs.
ERROR This is because all distributed GATK resources are sorted in karyotypic order, and your processing will fail when you need to use these files.
ERROR You can use the ReorderSam utility to fix this problem: http://gatkforums.broadinstitute.org/discussion/58/companion-utilities-reordersam
Since it proposes to reorder the bam file, I do it:
java -Xmx6g -jar picard.jar ReorderSam I= GATK_BaseRecalibrator/Ind1_Recal.bam O= GATK_BaseRecalibrator/Ind1_Recal_reordered.bam R= ucsc.hg19.fasta CREATE_INDEX=TRUE
So, now, using this bam file reordered, I try again.
java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta --dbsnp dbsnp_138.hg19_sorted.vcf -I GATK_BaseRecalibrator/Ind1_Recal_reordered.bam -o GATK_HaploCaller/Ind1_raw.g.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF
it doesn't work. I'll try to use the original dbsnp.vcf
java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta --dbsnp dbsnp_138.hg19.vcf -I GATK_BaseRecalibrator/Ind1_Recal_reordered.bam -o GATK_HaploCaller/Ind1_raw.g.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF
YES!!!!!!! YES!!!!!!! YES!!!!!!! YES!!!!!!! YES!!!!!!! it seems to work.....
But after 2h just before finishing it appears this error:
ERROR MESSAGE: File /Users/...GATK_BaseRecalibrator/Ind1_Recal_reordered.bai is malformed: Premature end-of-file while reading BAM index file
GATK_BaseRecalibrator/Ind1_Recal_reordered.bai. It's likely that this file is truncated or corrupt -- Please try re-indexing the corresponding BAM file.
..... Well, I reindexed, re run, try that, and that and have been the last 2 days trying all combinations of vcf, reference, bam files, original, reordered...
I do not what can I do.
Magda
@Magda
Hi Magda,
Where did you get the reference and dbsnp file you are using? We have a bundle with files you should be able to use with no problem. You can also try adding
-U ALLOW_SEQ_DICT_INCOMPATIBILITY
as Geraldine suggested above.-Sheila
Premature end of file means the file is truncated and will not be usable. Whatever process was used to generate it must be repeated.
Just realized my comment wasn't helpful -- but what happened after you re-indexed the file? You don't say if there is still an error or what it is.
Hi,
Thanks for answering so quick.
1) to Sheila,
I tried to use the -U ALLOW _QES_DICT_INCOMPATIBILITY and although it runs it gives a warn:
java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta --dbsnp dbsnp_138.hg19_sorted.vcf -I GATK_BaseRecalibrator/Ind1_Recal_reordered.bam -o GATK_HaploCaller/Ind1_raw.g.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF
WARN 11:29:52,095 SequenceDictionaryUtils - Input files /dbsnp_138.hg19_sorted.vcf and reference have incompatible contigs: The contig order in /dbsnp_138.hg19_sorted.vcf and referenceis not the same; to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files..
dbsnp_138.hg19_sorted.vcf contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY]
reference contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM]
It's weird, because precisely in this case I downloaded the reference (ucsc.hg19.fasta), the dbsnp (dbsnp_138.hg19_sorted.vcf ) from the bundle and the bam files were reordered using this reference from the bundle too (R= ../../ucsc.hg19.fasta).
2) To Geraldine,
When I reindexed:
java -jar -Xmx4g picard.jar BuildBamIndex I= GATK_BaseRecalibrator/Ind1_Recal_reordered.bam O= GATK_BaseRecalibrator/Ind1_Recal_reordered_reindex.bai VALIDATION_STRINGENCY=LENIENT
I tried to rerun HaplotypeCaller:
Ijava -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta --dbsnp dbsnp_138.hg19.vcf -I GATK_BaseRecalibrator/Ind1_Recal_reordered.bam -o GATK_HaploCaller/Ind1_raw.g2.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF
ERROR MESSAGE: GVCF output requires a specific indexing strategy. Please re-run including the arguments -variant_index_type LINEAR -variant_index_parameter 128000.
I think I'm just going to desist of using a dbsnp file.
Thank you
Hi,
I try a last thing and it worked well. The only different thing is that now I run it on a Test directory where there were only the original Ind1.bam file, the Ind1_reordered.bam and Ind1_reordered.bai (and not the Ind1.bai file). So I think that the old indexed bam file even though the name is different interfere with the command.
java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta --dbsnp dbsnp_138.hg19.vcf -I GATK_BaseRecalibrator/Test/Ind1_Recal_reordered.bam -o GATK_BaseRecalibrator/Test/Ind1_raw.g.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF
Thank you.
Magda
Hi again.
How is it possible that for the next step I have the same complains?
Yesterday I run HaplotypeCaller after reordering the Bam files using the ref from bundle, then using the dnsnp from bundle and bam files
1. Reorder_Bam using picard on a separate directory without any previous .idx file in the directory (otherwise it doesn't work)
do java -Xmx6g -jar picard.jar ReorderSam I= GATK_BaseRecalibrator/Reordered_Bam/Ind1_Recal.bam O= GATK_BaseRecalibrator/Reordered_Bam/Ind1_Recal_reordered.bam R= ucsc.hg19.fasta CREATE_INDEX=TRUE
java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta --dbsnp dbsnp_138.hg19.vcf -I GATK_BaseRecalibrator/Reordered_Bam/Ind1_Recal_reordered.bam -o GATK_HaploCaller/Ind1_raw.g.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF
Today. I want to use GenotypeGVCFs, so
ERROR MESSAGE: Input files dbsnp and reference have incompatible contigs: The contig order in dbsnp and referenceis not the same; to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files..
ERROR dbsnp contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16,
ERROR reference contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15,
Ok, I think of sorting the snps.vcf file
java -jar picard.jar SortVcf I=1000G_phase1.snps.high_confidence.hg19.sites.vcf O=1000G_phase1.snps.high_confidence.hg19.sites_sorted.vcf SEQUENCE_DICTIONARY=ucsc.hg19.dict
Try again:
java -Xmx6g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R ucsc.hg19.fasta -D 1000G_phase1.snps.high_confidence.hg19.sites_sorted.vcf -V Ind1_raw.g.vcf -V Ind24_raw.g.vcf -o Output_Joint.vcf
ERROR MESSAGE: Lexicographically sorted human genome sequence detected in dbsnp.
ERROR For safety's sake the GATK requires human contigs in karyotypic order: 1, 2, ..., 10, 11, ..., 20, 21, 22, X, Y with M either leading or trailing these contigs.
ERROR This is because all distributed GATK resources are sorted in karyotypic order, and your processing will fail when you need to use these files.
ERROR You can use the ReorderSam utility to fix this problem: http://gatkforums.broadinstitute.org/discussion/58/companion-utilities-reordersam
ERROR dbsnp contigs = [chr1, chr10, chr11, chr11_gl000202_random, chr12, chr13, chr14, chr15, chr16, chr17, chr17_ctg5_hap1, chr17_gl000203_random, chr17_gl000204_random, chr17_gl000205_random, chr17_gl000206_random, ch
OK; Should I also sort again the raw.g.vcf files? I do a test
java -jar picard.jar SortVcf I=Reordered_Bam/Ind1_raw.g.vcf O=Reordered_Bam/Ind1_raw_sorted.g.vcf SEQUENCE_DICTIONARY=ucsc.hg19.dict
java -jar picard.jar SortVcf I=Reordered_Bam/Ind24_raw.g.vcf O=Reordered_Bam/Ind24_raw_sorted.g.vcf SEQUENCE_DICTIONARY=ucsc.hg19.dict
Try again
java -Xmx6g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R ucsc.hg19.fasta -D 1000G_phase1.snps.high_confidence.hg19.sites_sorted.vcf -V Reordered_Bam/Ind1_raw_sorted.g.vcf -V Reordered_Bam/Ind24_raw_sorted.g.vcf -o Output_Joint.vcf
ERROR MESSAGE: Lexicographically sorted human genome sequence detected in variant.
ERROR For safety's sake the GATK requires human contigs in karyotypic order: 1, 2, ..., 10, 11, ..., 20, 21, 22, X, Y with M either leading or trailing these contigs.
ERROR This is because all distributed GATK resources are sorted in karyotypic order, and your processing will fail when you need to use these files.
ERROR You can use the ReorderSam utility to fix this problem: http://gatkforums.broadinstitute.org/discussion/58/companion-utilities-reordersam
ERROR variant contigs = [chr1, chr10, chr11, chr11_gl000202_random, chr12, chr13, chr14, chr15, chr16, chr17,
I just don't understand that if I'm using the ucsc.hg19.fasta file from bundle, the SNPs.vfc from bundle, and I reordered my bam files using that same ref file, it still complains in the nest step.
I am sorry, but I just don't get it.
Magda
Hi Magda,
The "Lexicographically sorted human genome sequence" error is our fault -- we added a check that is not necessary, by mistake. We are going to fix this in the next version. For now, if you rerun your last command with
-U ALLOW_SEQ_DICT_INCOMPATIBILITY
it should work.Ok, it's working. Thank you very much
Magda
I downloaded the file dbsnp_138.hg19.vcf from the resource bundle, and I got an error when running BQSQ step using this file as follows:
ERROR MESSAGE: Input files /home/p2010-217-gpfs/Arti/RNASeqFFPE/Software/dbsnp_138.hg19.vcf and reference have incompatible contigs. Please see http://gatkforums.broadinstitute.org/discussion/63/input-files-have-incompatible-contigsfor more information. Error details: The contig order in /home/dbsnp_138.hg19.vcf and referenceis not the same; to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files..
ERROR /home/dbsnp_138.hg19.vcf contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY]
ERROR reference contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM]
ERROR ------------------------------------------------------------------------------------------
To fix this I ran SortVcf, and now get the following error, which makes no sense since isn't it supposed to reorder it?:
Exception in thread "main" java.lang.IllegalArgumentException: java.lang.AssertionError: SAM dictionaries are not the same: SAMSequenceRecord(name=chrM,length=16571,dict_index=0,assembly=hg19) was found when SAMSequenceRecord(name=chr1,length=249250621,dict_index=0,assembly=null) was expected.
at picard.vcf.SortVcf.collectFileReadersAndHeaders(SortVcf.java:112)
at picard.vcf.SortVcf.doWork(SortVcf.java:81)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:206)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
Caused by: java.lang.AssertionError: SAM dictionaries are not the same: SAMSequenceRecord(name=chrM,length=16571,dict_index=0,assembly=hg19) was found when SAMSequenceRecord(name=chr1,length=249250621,dict_index=0,assembly=null) was expected.
at htsjdk.samtools.SAMSequenceDictionary.assertSameDictionary(SAMSequenceDictionary.java:165)
at picard.vcf.SortVcf.collectFileReadersAndHeaders(SortVcf.java:110)
... 4 more
Any help with this will be appreciated, thanks!
Issue · Github
by Sheila
@artitandon
Hi,
Are you using the latest version of Picard? Can you tell us the exact command you ran?
Thanks,
Sheila
@artitandon
Hi again,
You may also try running Picard's UpdateVcfSequenceDictionary.
-Sheila
Hi Sheila, I am using picard-tools-1.140 version, and the exact command is :
java -jar picard-tools-1.140/picard.jar SortVcf I=dbsnp_138.hg19.vcf O=dbsnp_138.hg19.sort.vcf SEQUENCE_DICTIONARY=hg19.dict
Thanks,
Arti
@artitandon
Hi Arti,
Okay. Can you confirm you get the same error with the latest version of Picard? Also, please try running Picard's UpdateVcfSequenceDictionary as I pointed you to above.
Thanks,
Sheila
Hello
I finally managed to have my reference files in place and tried to run GATK tools, but got the error about contigs and reads not being in same order. I can see my sample file has much simpler and more ordered contigs. Can you please advise how can I extract selected contigs from the reference hg19 fasta file and use it as reference for many more sample files rather than trying to reorder bam files for them all?
C:\Users\sg587\BartsData>java -jar GATK.jar -T HaplotypeCaller -R ref/hg19.fa -I 115N/115N-30129117/SG-115N_S1.bam -o results/SG-115N_S1_HCcalls.vcf
INFO 20:57:55,791 HelpFormatter - --------------------------------------------------------------------------------
INFO 20:57:55,798 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.6-0-g89b7209, Compiled 2016/06/01 22:27:29
INFO 20:57:55,799 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
INFO 20:57:55,800 HelpFormatter - For support and documentation go to https://www.broadinstitute.org/gatk
INFO 20:57:55,800 HelpFormatter - [Tue Jul 05 20:57:55 BST 2016] Executing on Windows 7 6.1 x86
INFO 20:57:55,801 HelpFormatter - Java HotSpot(TM) Client VM 1.8.0_91-b14 JdkDeflater
INFO 20:57:55,812 HelpFormatter - Program Args: -T HaplotypeCaller -R ref/hg19.fa -I 115N/115N-30129117/SG-115N_S1.bam -o results/SG-115N_S1_HCcalls.
vcf
INFO 20:58:01,530 HelpFormatter - Executing as [email protected] on Windows 7 6.1 x86; Java HotSpot(TM) Client VM 1.8.0_91-b14.
INFO 20:58:01,531 HelpFormatter - Date/Time: 2016/07/05 20:57:55
INFO 20:58:01,532 HelpFormatter - --------------------------------------------------------------------------------
INFO 20:58:01,533 HelpFormatter - --------------------------------------------------------------------------------
INFO 20:58:01,573 GenomeAnalysisEngine - Strictness is SILENT
INFO 20:58:02,172 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 500
INFO 20:58:02,194 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
WARNING: BAM index file C:\Users\sg587\BartsData\115N\115N-30129117\SG-115N_S1.bam.bai is older than BAM C:\Users\sg587\BartsData\115N\115N-30129117\S
G-115N_S1.bam
INFO 20:58:02,269 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.07
INFO 20:58:02,346 HCMappingQualityFilter - Filtering out reads with MAPQ < 20
ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 3.6-0-g89b7209):
ERROR
ERROR This means that one or more arguments or inputs in your command are incorrect.
ERROR The error message below tells you what is the problem.
ERROR
ERROR If the problem is an invalid argument, please check the online documentation guide
ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
ERROR
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://www.broadinstitute.org/gatk
ERROR
ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
ERROR
ERROR MESSAGE: Input files reads and reference have incompatible contigs. Please see https://www.broadinstitute.org/gatk/guide/article?id=63 formore information. Error details: The contig order in reads and reference is not the same; to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files..
ERROR reads contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY]
ERROR reference contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chrX, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr20, chrY, chr19, chr22, chr21, chr6_ssto_hap7, chr6_mcf_hap5, chr6_cox_hap2, chr6_mann_hap4, chr6_apd_hap1, chr6_qbl_hap6, chr6_dbb_hap3, chr17_ctg5_hap1, chr4_ctg9_hap1, chr1_gl000192_random, chrUn_gl000225, chr4_gl000194_random, chr4_gl000193_random, chr9_gl000200_random, chrUn_gl000222, chrUn_gl000212, chr7_gl000195_random, chrUn_gl000223, chrUn_gl000224, chrUn_gl000219, chr17_gl000205_random, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chr9_gl000199_random, chrUn_gl000211, chrUn_gl000213, chrUn_gl000220, chrUn_gl000218, chr19_gl000209_random, chrUn_gl000221, chrUn_gl000214,chrUn_gl000228, chrUn_gl000227, chr1_gl000191_random, chr19_gl000208_random, chr9_gl000198_random, chr17_gl000204_random, chrUn_gl000233, chrUn_gl000237, chrUn_gl000230, chrUn_gl000242, chrUn_gl000243, chrUn_gl000241, chrUn_gl000236, chrUn_gl000240, chr17_gl000206_random, chrUn_gl000232, chrUn_gl000
234, chr11_gl000202_random, chrUn_gl000238, chrUn_gl000244, chrUn_gl000248, chr8_gl000196_random, chrUn_gl000249, chrUn_gl000246, chr17_gl000203_random, chr8_gl000197_random, chrUn_gl000245, chrUn_gl000247, chr9_gl000201_random, chrUn_gl000235, chrUn_gl000239, chr21_gl000210_random, chrUn_gl000231,chrUn_gl000229, chrM, chrUn_gl000226, chr18_gl000207_random]
ERROR ------------------------------------------------------------------------------------------
Many thanks
Sumedha
You should use the same reference genome for both read-mapping and SNP-calling.
Alternatively you can remove unmapped scaffolds from the current SNP-calling reference genome but I don't recommend it.
@Will_Gilks
Thanks! It is mapped onto hg19 and am using the same for snp calling too, but could it be that my data is exome sequencing on Illumina NexteraExome assay and analysis doesnt really need all these random contigs? I do think the issue is the order of these contigs in reference file not matching my sample files. Is there a way to reorder contigs in reference fasta file?
@sumedhagarg If the reference file does not contain the same contigs, it's not the same reference, even if it has the same name. The extra contigs are included to improve the quality of the read mapping. It's better to use a reference that includes them even if you're only interested in exome regions. And if it's not possible for you to remap your samples, then you should at least use the reference file that you used for mapping in your analysis, rather than mess around with the content of files. That is the safest way to proceed. If you don't want to do it that way, we can't help you if something goes wrong.
Hi everyone,
I have some difficulties getting the vcf from HaplotypeCaller (it would take weeks to complete..) so I was wondering if it could be possible to use the hard-filters on a vcf I have obtained through platypus or other softwares (.bam files use for VCF come from GATK before HaplotypeCaller).
When I try to use "SelectVariants" to apply a hard filter on the VCF obtained, I got an error message (end of this message).
I applied several tips/potential corrections that I have found in the different GATK conversations like ordering my vcf with SortVcf, ordering the bam file with reorder and sortSam (coordinate) but none where conclusive, the names of my contigs in the VCF and in the reference are apparently still not matching. I'm confused about that because as I look at my reference and my vcf, the contigs seem to be in the same order.. (but the error message shows a different order for the sequence contigs..)
The error message says that the "variant contigs" and the "sequence contigs" are not matching, not the "variant contigs" and the "reference contigs" .. Could the problem come from the .bam file order? Or is it just that vcf from platypus that can't be used with gatk?
I have tried several approaches to solve this problem but I'm stuck so I would greatly appreciate any help!
Thanks
Guilaume
ERROR MESSAGE: Input files variant and sequence have incompatible contigs. Please see http://gatkforums.broadinstitute.org/discussion/63/input-files-have-incompatible-contigsfor more information. Error details: The contig order in variant and sequenceis not the same; to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files..
ERROR variant contigs = [scaffold_1_size161211, scaffold_2_size67455, scaffold_3_size80531, scaffold_4_size38253, scaffold_5_size36003, ==> all my contigs
ERROR sequence contigs = [scaffold_100000_size1030, scaffold_100001_size1030, scaffold_100002_size1030, scaffold_100003_size1030, ==> all my contigs
(the genome I'm working with has around 200'000 contigs, I can't attach any files as they are too big)
@daverdin
Hi Guilame,
I am a little confused. You were able to run the GATK pre-processing tools on your original BAM file, but the VCF from other tools is not compatible with GATK? Did you use the same reference you used in pre-processing when running the other tools?
-Sheila
Hi @Sheila,
Thank you for your response and sorry if this was confusing... I have followed the "Best Practices" workflow starting with pair end sequencing data (fastq) until the MergeBamAlignment part (I successfully obtained a .bam file). I then tried to use the HaplotypeCaller on this bam file but wasn't successful (I think the very high number of contigs in the genome I'm working with is the culprit here). So I decided to go back to the same bam file but this time use Platypus to get my VCF file. That worked well and wanted to see if I could go on in the "Best Practices" with this VCF by doing the hard filter (SelectVariants). That's where things get wrong with my contigs names not fitting..
I have use the same reference file and dictionary to obtain the bam file and the platypus VCF file.
I have tried different things like SortSam (coordinate) and SortVcf (with the same reference dictionary) but also reordersam but always got the same error message. If I don't do the sortVcf, the error is slightly different:
ERROR /home/jimp/Downloads/GATKexamples/selfing/CNJ14_6_6/batch_9_ordered_tassel_corrected.vcf contigs = [scaffold_100042_size1029, scaffold_10007_size7458, scaffold_100165_size1026, etc.....]
ERROR reference contigs = [scaffold_1_size161211, scaffold_2_size67455, scaffold_3_size80531, scaffold_4_size38253, etc.....]
the command line I've used:
java -Xmx7G -jar /home/jimp/Desktop/GATK/GenomeAnalysisTK.jar -T SelectVariants -R /home/jimp/Downloads/GATKexamples/selfing/new/crGenome_scaffolds.fasta -V VariantCalls_samsorted_vcfsorted.vcf -selectType SNP -o raw_snps.vcf &> error
Any idea on what is going on here? thanks a lot!
@daverdin No, if you use a completely different caller, you can't get back to best practices. The Best Practices are formulated based on how tools work very specifically, and each tool in the toolchain has specific expectations about upstream processes.
Hello,
After sorting my dbsnp vcf file as you suggested to I am still facing troubles with HaplotypeCaller as it throws the error message:
ERROR --
ERROR stack trace
java.lang.IllegalStateException: Rod span chr1:1-249250621 isn't contained within the data shard chr1:1-249250621, meaning we wouldn't get all of the data we need
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions$ActiveRegionIterator.(TraverseActiveRegions.java:307)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:271)
at org.broadinstitute.gatk.engine.traversals.TraverseActiveRegions.traverse(TraverseActiveRegions.java:78)
at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:98)
at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:316)
at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)
ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 3.7-0-gcfedb67):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions https://software.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: Rod span chr1:1-249250621 isn't contained within the data shard chr1:1-249250621, meaning we wouldn't get all of the data we need
ERROR -----------------------------------------------------------------------------------------
Here is the command line I run:
java -jar GenomeAnalysisTK/GenomeAnalysisTK.jar -R Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa -T HaplotypeCaller -I drug_tolerant_single_cell_1_MDA_MBA_231_RNA_seq.bam --dbsnp All_20151104.new.vcf -stand_call_conf 15 -o output.raw.snps.indels.vcf -U ALLOW_SEQ_DICT_INCOMPATIBILITY
Please any help would be greatly welcomed.
Davd S.
@vectorio
Hi Davd,
What happens if you run without
-U ALLOW_SEQ_DICT_INCOMPATIBILITY
?Thanks,
Sheila
Without -U ALLOW_SEQ_DICT_INCOMPATIBILITY I get:
ERROR MESSAGE: Lexicographically sorted human genome sequence detected in reads.
I run GATK 3.7-0-gcfedb67.
David S.
I finally found what was wrong, I just didn't run ReorderSam on my bam file. After I did that, GATK ran like a charm !
Thanks for the reply.
David S.
@vectorio
Hi David,
Thanks for sharing your solution!
-Sheila
Question about -U ALLOW_SEQ_DICT_INCOMPATIBILITY command. I saw that you mentioned this command will fix chromosome order issues. I am running the RNA-Seq pipeline (https://software.broadinstitute.org/gatk/guide/article?id=3891) using the latest build for the chicken genome and vcf file from Ensemble. When I got to step 5. Base Recalibration I get thrown the error message of "The contig order in knownSites and reference is not the same". Please note I aligned my RNA-seq data using the same reference, sorted the VCF file, indexed the genome build, SO is it appropriate to use the above command because all the data should be aligned using the same source (Ensembles builds for genome and vcf), or am I doing something horribly wrong when performing GATK's pipeline? It also confused me why the BAM file isn't sorting correctly too.
Any feedback on this issue would be greatly appreciated.
The most common stumbling block is that when you have sorted the vcf using Picard SortVcf, you need to delete the vcf index and regenerate it. This is because although the tool generates a new index, that index is actually not updated with the new sort order. There's a ticket to get this fixed in Picard but it's been hard to motivate anyone to fix it, as it's so easy to work around. But it does cause so much confusion, it is worth fixing.
Thank you for that suggestion, solved the entire problem!
Hi there,
I encounter some problems with contigs not sorted the same way bewteen reference and cosmic.vcf
My reference = Homo_sapiens_assembly38 (from GATK_bundle hg38)
My COSMIC file = CosmicCodingMuts.vcf (built for grch38)
Mutect2 command : java -jar $GATK -T MuTect2 -R GATK_bundle_h38/Homo_sapiens_assembly38.fasta -I:tumor BISCEm/VarCallingResults/R14_11_tcs_reordered.4.bam --dbsnp GATK_bundle_h38/dbsnp_146.hg38.vcf --cosmic Mutect2/CosmicCodingMuts_chr_M_sorted.vcf --artifact_detection_mode -o test.vcf
Yielding the following error
ERROR MESSAGE: Input files cosmic and reference have incompatible contigs. Please see https://software.broadinstitute.org/gatk/documentation/article?id=63for more information. Error details: The contig order in cosmic and reference is not the same; to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files..
ERROR cosmic contigs = [HLA-A01:01:01:01, HLA-A01:01:01:02N, ....
ERROR reference contigs = [chr1, chr2, ....
I did check the intersection between my reference & cosmic contigs names with a python script. Both contains 3366 unique contigs name, and both contains the same 3366 contigs names. They are just not ordered the same way.
The thing is, i tried "my best" to avoid this, given that in my Mutect2 command, the input files were both sorted against the same dictionary (Homo_sapiens_assembly38.dict) with the following commands
For the COSMIC.vcf
java -jar $PICARD SortVcf I=Mutect2/CosmicCodingMuts_chr_M.vcf O=Mutect2/CosmicCodingMuts_chr_M_sorted.vcf SEQUENCE_DICTIONARY=GATK_bundle_h38/Homo_sapiens_assembly38.dict
For the .bam
java -jar $PICARD ReorderSam I=BISCEm/VarCallingResults/R14_11_tcs.4.bam O=BISCEm/VarCallingResults/R14_11_tcs_reordered.4.bam R=GATK_bundle_h38/Homo_sapiens_assembly38.fasta CREATE_INDEX=TRUE
What's the next step to resolve this issue ?
Best regards
Right, removing the .idx associated with my sorted .vcf file did the trick.
(When i launched Mutect2 w-o any index for my cosmic vcf file, one was generated, and it was a good one ;-))
https://www.biostars.org/p/232994/#233135
Hi there,
I'm trying to follow your best practice guidelines for calling variants in RNAseq. At the "Split'N'Trim and reassign mapping qualities" step I get the following error message:
I have reordered the bam as suggested in the post and also recreated the index "manually" but keep getting the same error:
I have also tried to add
-U ALLOW_SEQ_DICT_INCOMPATIBILITY
but then I getDo you have any idea what the problem could be here and how to proceed?
Thanks a lot for your help in advance,
Maik
Issue · Github
by Sheila
@mcengeholm
Hi Maik,
Can you please post the BAM header (specifically the @SQ lines) and FASTA .dict file here?
Thanks,
Sheila
@Sheila
Hi Sheila,
thanks a lot for the quick reply. I have attached the header of the orignial bam, the reordered bam and the reference dict file below.
Thanks a lot for your help in advance,
Maik
@mcengeholm
Hi Maik
Sorry for the delay. Can you tell us what you mean by "I have reordered the bam as suggested in the post and also recreated the index "manually" but keep getting the same error:". What do you mean by manually?
-Sheila