The current GATK version is 3.6-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Last chance to register for the GATK workshop next week in Basel, Switzerland! http://www.sib.swiss/training/upcoming-training-events/training/gatk-workshop-lecture

# Errors about contigs in BAM or VCF files not being properly ordered or sorted

edited February 17

This is not as common as the "wrong reference build" problem, but it still pops up every now and then: a collaborator gives you a BAM or VCF file that's derived from the correct reference, but for whatever reason the contigs are not sorted in the same order. The GATK can be particular about the ordering BAM and VCF files so it will fail with an error in this case.

So what do you do?

### For BAM files

You run Picard's ReorderSam tool on your BAM file, using the reference genome dictionary as a template, like this:

java -jar picard.jar ReorderSam \
I=original.bam \
O=reordered.bam \
R=reference.fasta \
CREATE_INDEX=TRUE


Where reference.fasta is your genome reference, which must be accompanied by a valid *.dict dictionary file. The CREATE_INDEX argument is optional but useful if you plan to use the resulting file directly with GATK (otherwise you'll need to run another tool to create an index).

Be aware that this tool will drop reads that don't have equivalent contigs in the new reference (potentially bad or not, depending on what you want). If contigs have the same name in the BAM and the new reference, this tool assumes that the alignment of the read in the new BAM is the same. This is not a liftover tool!

### For VCF files

You run Picard's SortVcf tool on your VCF file, using the reference genome dictionary as a template, like this:

java -jar picard.jar SortVcf \
I=original.vcf \
O=sorted.vcf \
SEQUENCE_DICTIONARY=reference.dict


Where reference.dict is the sequence dictionary of your genome reference.

Note that you may need to delete the index file that gets created automatically for your new VCF by the Picard tool. GATK will automatically regenerate an index file for your VCF.

#### Version-specific alert for GATK 3.5

In version 3.5, we added some beefed-up VCF sequence dictionary validation. Unfortunately, as a side effect of the additional checks, some users have experienced an error that starts with "ERROR MESSAGE: Lexicographically sorted human genome sequence detected in variant." that is due to unintentional activation of a check that is not necessary. This will be fixed in the next release; in the meantime -U ALLOW_SEQ_DICT_INCOMPATIBILITY can be used (with caution) to override the check.

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Tagged:

• McGill UniversityPosts: 6Member

Hello,
Is there anyway to adjust/sort the genome fasta file instead of the Bam file?

I have hundreds of Bams taking up TBs of storage. To reorder them all and save double copies (since some information is lost in the process) is a huge endeavour. It is much more feasible for me to tweak the reference instead. Is that possible?

Thanks!
Rola

Yes, you could reorder the contigs within the reference file -- but if you use the same reference that was used to generate all these bams you shouldn't need to. What exactly is the error you're getting?

Geraldine Van der Auwera, PhD

• McGill UniversityPosts: 6Member

It might not be the exact reference genome file. The data is publicly available and it is aligned to hg19 but I dont have the exact genome file. What should I do? Do I look at the order of contigs in the Bam file and order the genome file the exact same way?

This is the error log:

INFO 23:29:03,978 GenomeAnalysisEngine - Strictness is SILENT
INFO 23:29:04,344 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
INFO 23:29:04,365 SAMDataSource$SAMReaders - Initializing SAMRecords in serial INFO 23:29:04,748 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.38
INFO 23:29:04,972 IntervalUtils - Processing 159138663 bp from intervals
INFO 23:29:08,711 GATKRunReport - Uploaded run statistics report to AWS S3

##### ERROR MESSAGE: Lexicographically sorted human genome sequence detected in reads. Please see http://gatkforums.broadinstitute.org/discussion/58/companion-utilities-reordersamfor more information. Error details: reads contigs = [chr1, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr2, chr20, chr21, chr22, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chrM, chrX, chrY]
• McGill UniversityPosts: 6Member

I tried to order the contigs in the genome fasta file the same way the Bams are sorted:

## pulled out contig order:

INFO 20:20:09,142 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.14 INFO 20:20:09,232 IntervalUtils - Processing 159138663 bp from intervals INFO 20:20:10,879 GATKRunReport - Uploaded run statistics report to AWS S3 ##### ERROR ------------------------------------------------------------------------------------------ ##### ERROR A USER ERROR has occurred (version 3.5-0-g36282e4): ##### ERROR ##### ERROR This means that one or more arguments or inputs in your command are incorrect. ##### ERROR The error message below tells you what is the problem. ##### ERROR ##### ERROR If the problem is an invalid argument, please check the online documentation guide ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool. ##### ERROR ##### ERROR Visit our website and forum for extensive documentation and answers to ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk ##### ERROR ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself. ##### ERROR ##### ERROR MESSAGE: Lexicographically sorted human genome sequence detected in reads. Please see http://gatkforums.broadinstitute.org/discussion/58/companion-utilities-reordersamfor more information. Error details: reads contigs = [chr1, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr2, chr20, chr21, chr22, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chrM, chrX, chrY] ##### ERROR ------------------------------------------------------------------------------------------ • Posts: 10,469Administrator, Dev admin Ah, what you're encountering is not the "contigs mismatch" error, it's a different error that complains specifically about the lexicographical order. We'll add some text to disambiguate that in the doc. This is something of a legacy requirement that contigs be ordered in pure alphanumeric order. Once you've made sure the dictionaries match between your bams and reference, and any VCF resources you plan to use, you can override this requirement by setting -U ALLOW_SEQ_DICT_INCOMPATIBILITY which will also allow lexicographical order. Note that -U stands for UNSAFE and that skipping dictionary validation is not recommended. However in your case I think it may be the only way to work without reordering all your bams. Be sure to test this on a full run of your pipeline on a subset of data to make sure it won't cause any fatal problems downstream. Geraldine Van der Auwera, PhD • Cambridge, MAPosts: 2Member, Broadie I am currently trying to resort contigs from a vcf relative to a reference with the picard task SortVCF; however, I am running into the error quoted below. This may be because of an incompatible dictionary but the error was repeated even after creating a new dict file by following the documentation here: http://gatkforums.broadinstitute.org/gatk/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference. Do you have any suggestions? I am using the reference genome as specified at the top of the vcf. Thank you! -- [Fri Jan 22 16:50:24 EST 2016] picard.vcf.SortVcf INPUT=[ALL.autosomes.phase3_shapeit2_mvncall_integrated_v4.20130502.sites.vcf] OUTPUT=sorted.vcf SEQUENCE_DICTIONARY=/xchip/cga_home/breardon/misc_projects/_travis_vcf/reference_hs37d5/hs37d5.dict VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=true CREATE_MD5_FILE=false [Fri Jan 22 16:50:24 EST 2016] Executing as breardon@cga02 on Linux 2.6.32-573.7.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_71-b14; Picard version: 1.834(b2a94f76e786204c2ea48814b4a4a1018ff9b338_1422570082) JdkDeflater [Fri Jan 22 16:50:25 EST 2016] picard.vcf.SortVcf done. Elapsed time: 0.03 minutes. Runtime.totalMemory()=2058354688 To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp Exception in thread "main" java.lang.IllegalArgumentException: java.lang.AssertionError: SAM dictionaries are not the same: SAMSequenceRecord(name=GL000191.1,length=106433,dict_index=22,assembly=b37) was found when SAMSequenceRecord(name=X,length=155270560,dict_index=22,assembly=null) was expected. at picard.vcf.SortVcf.collectFileReadersAndHeaders(SortVcf.java:112) at picard.vcf.SortVcf.doWork(SortVcf.java:81) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:187) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95) at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105) Caused by: java.lang.AssertionError: SAM dictionaries are not the same: SAMSequenceRecord(name=GL000191.1,length=106433,dict_index=22,assembly=b37) was found when SAMSequenceRecord(name=X,length=155270560,dict_index=22,assembly=null) was expected. at htsjdk.samtools.SAMSequenceDictionary.assertSameDictionary(SAMSequenceDictionary.java:142) at picard.vcf.SortVcf.collectFileReadersAndHeaders(SortVcf.java:110) ... 4 more #### Issue · Github January 26 by Sheila Issue Number 519 State closed Last Updated Assignee Array Milestone Array Closed By vdauwera • Broad InstitutePosts: 3,940Member, Broadie, Moderator, Dev admin @breardon Hi, I am going to have our Picard expert look at this. He will get back to you. Sheila • Broad InstitutePosts: 82Member, Broadie, Moderator, Dev admin @breardon, did you run the UpdateVcfSequenceDictionaryTool prior to running the SortVcf tool? • Posts: 5Member Hi, I'm starting to get crazy. 1. I run HaplotypeCaller without a dbsnp.vcf file and it worked well. java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19.fa -I GATK_BaseRecalibrator/Ind1_Recal.bam -o GATK_HaploCaller/Ind1_raw_g.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF 2. I wanted to include a dbsnp.vcf file and the problems started. java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19.fa --dbsnp dbsnp_138.hg19.vcf -I GATK_BaseRecalibrator/Ind1_Recal.bam -o GATK_HaploCaller/Ind1_raw.g.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF ##### ERROR MESSAGE: Input files dbsnp and reference have incompatible contigs: The contig order in dbsnp and referenceis not the same; to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files.. 1. Ok, So, I order it: java -jar picard.jar SortVcf I=.dbsnp_138.hg19.vcf O=dbsnp_138.hg19_sorted.vcf SEQUENCE_DICTIONARY=.ucsc.hg19.dict.gz 2. So, I try again with the sorted vcf file: java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19.fa --dbsnp dbsnp_138.hg19_sorted.vcf -I GATK_BaseRecalibrator/Ind1_Recal.bam -o GATK_HaploCaller/Ind1_raw.g.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF ##### ERROR MESSAGE: Lexicographically sorted human genome sequence detected in dbsnp. ##### ERROR For safety's sake the GATK requires human contigs in karyotypic order: 1, 2, ..., 10, 11, ..., 20, 21, 22, X, Y with M either leading or trailing these contigs. ##### ERROR This is because all distributed GATK resources are sorted in karyotypic order, and your processing will fail when you need to use these files. ##### ERROR You can use the ReorderSam utility to fix this problem: http://gatkforums.broadinstitute.org/discussion/58/companion-utilities-reordersam 1. Since it proposes to reorder the bam file, I do it: java -Xmx6g -jar picard.jar ReorderSam I= GATK_BaseRecalibrator/Ind1_Recal.bam O= GATK_BaseRecalibrator/Ind1_Recal_reordered.bam R= ucsc.hg19.fasta CREATE_INDEX=TRUE 2. So, now, using this bam file reordered, I try again. java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta --dbsnp dbsnp_138.hg19_sorted.vcf -I GATK_BaseRecalibrator/Ind1_Recal_reordered.bam -o GATK_HaploCaller/Ind1_raw.g.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF # it doesn't work. I'll try to use the original dbsnp.vcf java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta --dbsnp dbsnp_138.hg19.vcf -I GATK_BaseRecalibrator/Ind1_Recal_reordered.bam -o GATK_HaploCaller/Ind1_raw.g.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF #### YES!!!!!!! YES!!!!!!! YES!!!!!!! YES!!!!!!! YES!!!!!!! it seems to work..... But after 2h just before finishing it appears this error: ##### ERROR MESSAGE: File /Users/...GATK_BaseRecalibrator/Ind1_Recal_reordered.bai is malformed: Premature end-of-file while reading BAM index file GATK_BaseRecalibrator/Ind1_Recal_reordered.bai. It's likely that this file is truncated or corrupt -- Please try re-indexing the corresponding BAM file. ..... Well, I reindexed, re run, try that, and that and have been the last 2 days trying all combinations of vcf, reference, bam files, original, reordered... I do not what can I do. Magda • Broad InstitutePosts: 3,940Member, Broadie, Moderator, Dev admin @Magda Hi Magda, Where did you get the reference and dbsnp file you are using? We have a bundle with files you should be able to use with no problem. You can also try adding -U ALLOW_SEQ_DICT_INCOMPATIBILITY as Geraldine suggested above. -Sheila • Posts: 10,469Administrator, Dev admin Premature end of file means the file is truncated and will not be usable. Whatever process was used to generate it must be repeated. Geraldine Van der Auwera, PhD • Posts: 10,469Administrator, Dev admin Just realized my comment wasn't helpful -- but what happened after you re-indexed the file? You don't say if there is still an error or what it is. Geraldine Van der Auwera, PhD • Posts: 5Member Hi, Thanks for answering so quick. 1) to Sheila, I tried to use the -U ALLOW _QES_DICT_INCOMPATIBILITY and although it runs it gives a warn: java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta --dbsnp dbsnp_138.hg19_sorted.vcf -I GATK_BaseRecalibrator/Ind1_Recal_reordered.bam -o GATK_HaploCaller/Ind1_raw.g.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF WARN 11:29:52,095 SequenceDictionaryUtils - Input files /dbsnp_138.hg19_sorted.vcf and reference have incompatible contigs: The contig order in /dbsnp_138.hg19_sorted.vcf and referenceis not the same; to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files.. dbsnp_138.hg19_sorted.vcf contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY] reference contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM] It's weird, because precisely in this case I downloaded the reference (ucsc.hg19.fasta), the dbsnp (dbsnp_138.hg19_sorted.vcf ) from the bundle and the bam files were reordered using this reference from the bundle too (R= ../../ucsc.hg19.fasta). 2) To Geraldine, When I reindexed: java -jar -Xmx4g picard.jar BuildBamIndex I= GATK_BaseRecalibrator/Ind1_Recal_reordered.bam O= GATK_BaseRecalibrator/Ind1_Recal_reordered_reindex.bai VALIDATION_STRINGENCY=LENIENT I tried to rerun HaplotypeCaller: Ijava -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta --dbsnp dbsnp_138.hg19.vcf -I GATK_BaseRecalibrator/Ind1_Recal_reordered.bam -o GATK_HaploCaller/Ind1_raw.g2.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF ##### ERROR MESSAGE: GVCF output requires a specific indexing strategy. Please re-run including the arguments -variant_index_type LINEAR -variant_index_parameter 128000. I think I'm just going to desist of using a dbsnp file. Thank you • Posts: 5Member Hi, I try a last thing and it worked well. The only different thing is that now I run it on a Test directory where there were only the original Ind1.bam file, the Ind1_reordered.bam and Ind1_reordered.bai (and not the Ind1.bai file). So I think that the old indexed bam file even though the name is different interfere with the command. java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta --dbsnp dbsnp_138.hg19.vcf -I GATK_BaseRecalibrator/Test/Ind1_Recal_reordered.bam -o GATK_BaseRecalibrator/Test/Ind1_raw.g.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF Thank you. Magda • Posts: 5Member Hi again. How is it possible that for the next step I have the same complains? Yesterday I run HaplotypeCaller after reordering the Bam files using the ref from bundle, then using the dnsnp from bundle and bam files 1. Reorder_Bam using picard on a separate directory without any previous .idx file in the directory (otherwise it doesn't work) do java -Xmx6g -jar picard.jar ReorderSam I= GATK_BaseRecalibrator/Reordered_Bam/Ind1_Recal.bam O= GATK_BaseRecalibrator/Reordered_Bam/Ind1_Recal_reordered.bam R= ucsc.hg19.fasta CREATE_INDEX=TRUE 1. Run HaplotypeCaller using the ref, dnsnp from bundle and reordered bam files java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta --dbsnp dbsnp_138.hg19.vcf -I GATK_BaseRecalibrator/Reordered_Bam/Ind1_Recal_reordered.bam -o GATK_HaploCaller/Ind1_raw.g.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF Today. I want to use GenotypeGVCFs, so 1. java -Xmx6g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R ucsc.hg19.fasta -D 1000G_phase1.snps.high_confidence.hg19.sites.vcf -V Ind1_raw.g.vcf -V Ind24_raw.g.vcf -o Output_Joint.vcf ##### ERROR MESSAGE: Input files dbsnp and reference have incompatible contigs: The contig order in dbsnp and referenceis not the same; to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files.. ##### ERROR dbsnp contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, ##### ERROR reference contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, # Ok, I think of sorting the snps.vcf file java -jar picard.jar SortVcf I=1000G_phase1.snps.high_confidence.hg19.sites.vcf O=1000G_phase1.snps.high_confidence.hg19.sites_sorted.vcf SEQUENCE_DICTIONARY=ucsc.hg19.dict # Try again: java -Xmx6g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R ucsc.hg19.fasta -D 1000G_phase1.snps.high_confidence.hg19.sites_sorted.vcf -V Ind1_raw.g.vcf -V Ind24_raw.g.vcf -o Output_Joint.vcf ##### ERROR MESSAGE: Lexicographically sorted human genome sequence detected in dbsnp. ##### ERROR For safety's sake the GATK requires human contigs in karyotypic order: 1, 2, ..., 10, 11, ..., 20, 21, 22, X, Y with M either leading or trailing these contigs. ##### ERROR This is because all distributed GATK resources are sorted in karyotypic order, and your processing will fail when you need to use these files. ##### ERROR You can use the ReorderSam utility to fix this problem: http://gatkforums.broadinstitute.org/discussion/58/companion-utilities-reordersam ##### ERROR dbsnp contigs = [chr1, chr10, chr11, chr11_gl000202_random, chr12, chr13, chr14, chr15, chr16, chr17, chr17_ctg5_hap1, chr17_gl000203_random, chr17_gl000204_random, chr17_gl000205_random, chr17_gl000206_random, ch ### OK; Should I also sort again the raw.g.vcf files? I do a test java -jar picard.jar SortVcf I=Reordered_Bam/Ind1_raw.g.vcf O=Reordered_Bam/Ind1_raw_sorted.g.vcf SEQUENCE_DICTIONARY=ucsc.hg19.dict java -jar picard.jar SortVcf I=Reordered_Bam/Ind24_raw.g.vcf O=Reordered_Bam/Ind24_raw_sorted.g.vcf SEQUENCE_DICTIONARY=ucsc.hg19.dict # Try again java -Xmx6g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R ucsc.hg19.fasta -D 1000G_phase1.snps.high_confidence.hg19.sites_sorted.vcf -V Reordered_Bam/Ind1_raw_sorted.g.vcf -V Reordered_Bam/Ind24_raw_sorted.g.vcf -o Output_Joint.vcf ##### ERROR MESSAGE: Lexicographically sorted human genome sequence detected in variant. ##### ERROR For safety's sake the GATK requires human contigs in karyotypic order: 1, 2, ..., 10, 11, ..., 20, 21, 22, X, Y with M either leading or trailing these contigs. ##### ERROR This is because all distributed GATK resources are sorted in karyotypic order, and your processing will fail when you need to use these files. ##### ERROR You can use the ReorderSam utility to fix this problem: http://gatkforums.broadinstitute.org/discussion/58/companion-utilities-reordersam ##### ERROR variant contigs = [chr1, chr10, chr11, chr11_gl000202_random, chr12, chr13, chr14, chr15, chr16, chr17, # I just don't understand that if I'm using the ucsc.hg19.fasta file from bundle, the SNPs.vfc from bundle, and I reordered my bam files using that same ref file, it still complains in the nest step. I am sorry, but I just don't get it. Magda • Posts: 10,469Administrator, Dev admin Hi Magda, The "Lexicographically sorted human genome sequence" error is our fault -- we added a check that is not necessary, by mistake. We are going to fix this in the next version. For now, if you rerun your last command with -U ALLOW_SEQ_DICT_INCOMPATIBILITY it should work. Geraldine Van der Auwera, PhD • Posts: 5Member Ok, it's working. Thank you very much Magda • Posts: 22Member I downloaded the file dbsnp_138.hg19.vcf from the resource bundle, and I got an error when running BQSQ step using this file as follows: ##### ERROR MESSAGE: Input files /home/p2010-217-gpfs/Arti/RNASeqFFPE/Software/dbsnp_138.hg19.vcf and reference have incompatible contigs. Please see http://gatkforums.broadinstitute.org/discussion/63/input-files-have-incompatible-contigsfor more information. Error details: The contig order in /home/dbsnp_138.hg19.vcf and referenceis not the same; to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files.. ##### ERROR /home/dbsnp_138.hg19.vcf contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY] ##### ERROR reference contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM] ##### ERROR ------------------------------------------------------------------------------------------ To fix this I ran SortVcf, and now get the following error, which makes no sense since isn't it supposed to reorder it?: Exception in thread "main" java.lang.IllegalArgumentException: java.lang.AssertionError: SAM dictionaries are not the same: SAMSequenceRecord(name=chrM,length=16571,dict_index=0,assembly=hg19) was found when SAMSequenceRecord(name=chr1,length=249250621,dict_index=0,assembly=null) was expected. at picard.vcf.SortVcf.collectFileReadersAndHeaders(SortVcf.java:112) at picard.vcf.SortVcf.doWork(SortVcf.java:81) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:206) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95) at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105) Caused by: java.lang.AssertionError: SAM dictionaries are not the same: SAMSequenceRecord(name=chrM,length=16571,dict_index=0,assembly=hg19) was found when SAMSequenceRecord(name=chr1,length=249250621,dict_index=0,assembly=null) was expected. at htsjdk.samtools.SAMSequenceDictionary.assertSameDictionary(SAMSequenceDictionary.java:165) at picard.vcf.SortVcf.collectFileReadersAndHeaders(SortVcf.java:110) ... 4 more Any help with this will be appreciated, thanks! #### Issue · Github April 12 by Sheila Issue Number 808 State closed Last Updated Assignee Array Milestone Array Closed By chandrans • Broad InstitutePosts: 3,940Member, Broadie, Moderator, Dev admin @artitandon Hi, Are you using the latest version of Picard? Can you tell us the exact command you ran? Thanks, Sheila • Broad InstitutePosts: 3,940Member, Broadie, Moderator, Dev admin @artitandon Hi again, You may also try running Picard's UpdateVcfSequenceDictionary. -Sheila • Posts: 22Member Hi Sheila, I am using picard-tools-1.140 version, and the exact command is : java -jar picard-tools-1.140/picard.jar SortVcf I=dbsnp_138.hg19.vcf O=dbsnp_138.hg19.sort.vcf SEQUENCE_DICTIONARY=hg19.dict Thanks, Arti • Broad InstitutePosts: 3,940Member, Broadie, Moderator, Dev admin @artitandon Hi Arti, Okay. Can you confirm you get the same error with the latest version of Picard? Also, please try running Picard's UpdateVcfSequenceDictionary as I pointed you to above. Thanks, Sheila • CambridgePosts: 9Member Hello I finally managed to have my reference files in place and tried to run GATK tools, but got the error about contigs and reads not being in same order. I can see my sample file has much simpler and more ordered contigs. Can you please advise how can I extract selected contigs from the reference hg19 fasta file and use it as reference for many more sample files rather than trying to reorder bam files for them all? C:\Users\sg587\BartsData>java -jar GATK.jar -T HaplotypeCaller -R ref/hg19.fa -I 115N/115N-30129117/SG-115N_S1.bam -o results/SG-115N_S1_HCcalls.vcf INFO 20:57:55,791 HelpFormatter - -------------------------------------------------------------------------------- INFO 20:57:55,798 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.6-0-g89b7209, Compiled 2016/06/01 22:27:29 INFO 20:57:55,799 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute INFO 20:57:55,800 HelpFormatter - For support and documentation go to https://www.broadinstitute.org/gatk INFO 20:57:55,800 HelpFormatter - [Tue Jul 05 20:57:55 BST 2016] Executing on Windows 7 6.1 x86 INFO 20:57:55,801 HelpFormatter - Java HotSpot(TM) Client VM 1.8.0_91-b14 JdkDeflater INFO 20:57:55,812 HelpFormatter - Program Args: -T HaplotypeCaller -R ref/hg19.fa -I 115N/115N-30129117/SG-115N_S1.bam -o results/SG-115N_S1_HCcalls. vcf INFO 20:58:01,530 HelpFormatter - Executing as sg587@cppc118 on Windows 7 6.1 x86; Java HotSpot(TM) Client VM 1.8.0_91-b14. INFO 20:58:01,531 HelpFormatter - Date/Time: 2016/07/05 20:57:55 INFO 20:58:01,532 HelpFormatter - -------------------------------------------------------------------------------- INFO 20:58:01,533 HelpFormatter - -------------------------------------------------------------------------------- INFO 20:58:01,573 GenomeAnalysisEngine - Strictness is SILENT INFO 20:58:02,172 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 500 INFO 20:58:02,194 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
WARNING: BAM index file C:\Users\sg587\BartsData\115N\115N-30129117\SG-115N_S1.bam.bai is older than BAM C:\Users\sg587\BartsData\115N\115N-30129117\S
G-115N_S1.bam
INFO 20:58:02,269 SAMDataSource\$SAMReaders - Done initializing BAM readers: total time 0.07
INFO 20:58:02,346 HCMappingQualityFilter - Filtering out reads with MAPQ < 20

##### ERROR reference contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chrX, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr20, chrY, chr19, chr22, chr21, chr6_ssto_hap7, chr6_mcf_hap5, chr6_cox_hap2, chr6_mann_hap4, chr6_apd_hap1, chr6_qbl_hap6, chr6_dbb_hap3, chr17_ctg5_hap1, chr4_ctg9_hap1, chr1_gl000192_random, chrUn_gl000225, chr4_gl000194_random, chr4_gl000193_random, chr9_gl000200_random, chrUn_gl000222, chrUn_gl000212, chr7_gl000195_random, chrUn_gl000223, chrUn_gl000224, chrUn_gl000219, chr17_gl000205_random, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chr9_gl000199_random, chrUn_gl000211, chrUn_gl000213, chrUn_gl000220, chrUn_gl000218, chr19_gl000209_random, chrUn_gl000221, chrUn_gl000214,chrUn_gl000228, chrUn_gl000227, chr1_gl000191_random, chr19_gl000208_random, chr9_gl000198_random, chr17_gl000204_random, chrUn_gl000233, chrUn_gl000237, chrUn_gl000230, chrUn_gl000242, chrUn_gl000243, chrUn_gl000241, chrUn_gl000236, chrUn_gl000240, chr17_gl000206_random, chrUn_gl000232, chrUn_gl000

234, chr11_gl000202_random, chrUn_gl000238, chrUn_gl000244, chrUn_gl000248, chr8_gl000196_random, chrUn_gl000249, chrUn_gl000246, chr17_gl000203_random, chr8_gl000197_random, chrUn_gl000245, chrUn_gl000247, chr9_gl000201_random, chrUn_gl000235, chrUn_gl000239, chr21_gl000210_random, chrUn_gl000231,chrUn_gl000229, chrM, chrUn_gl000226, chr18_gl000207_random]

##### ERROR ------------------------------------------------------------------------------------------

Many thanks
Sumedha

• University of Sussex, UKPosts: 116Member ✭✭

You should use the same reference genome for both read-mapping and SNP-calling.

Alternatively you can remove unmapped scaffolds from the current SNP-calling reference genome but I don't recommend it.

• CambridgePosts: 9Member

@Will_Gilks
Thanks! It is mapped onto hg19 and am using the same for snp calling too, but could it be that my data is exome sequencing on Illumina NexteraExome assay and analysis doesnt really need all these random contigs? I do think the issue is the order of these contigs in reference file not matching my sample files. Is there a way to reorder contigs in reference fasta file?

@sumedhagarg If the reference file does not contain the same contigs, it's not the same reference, even if it has the same name. The extra contigs are included to improve the quality of the read mapping. It's better to use a reference that includes them even if you're only interested in exome regions. And if it's not possible for you to remap your samples, then you should at least use the reference file that you used for mapping in your analysis, rather than mess around with the content of files. That is the safest way to proceed. If you don't want to do it that way, we can't help you if something goes wrong.

Geraldine Van der Auwera, PhD

• rutgersPosts: 5Member

Hi everyone,
I have some difficulties getting the vcf from HaplotypeCaller (it would take weeks to complete..) so I was wondering if it could be possible to use the hard-filters on a vcf I have obtained through platypus or other softwares (.bam files use for VCF come from GATK before HaplotypeCaller).
When I try to use "SelectVariants" to apply a hard filter on the VCF obtained, I got an error message (end of this message).
I applied several tips/potential corrections that I have found in the different GATK conversations like ordering my vcf with SortVcf, ordering the bam file with reorder and sortSam (coordinate) but none where conclusive, the names of my contigs in the VCF and in the reference are apparently still not matching. I'm confused about that because as I look at my reference and my vcf, the contigs seem to be in the same order.. (but the error message shows a different order for the sequence contigs..)
The error message says that the "variant contigs" and the "sequence contigs" are not matching, not the "variant contigs" and the "reference contigs" .. Could the problem come from the .bam file order? Or is it just that vcf from platypus that can't be used with gatk?

I have tried several approaches to solve this problem but I'm stuck so I would greatly appreciate any help!
Thanks
Guilaume

##### ERROR sequence contigs = [scaffold_100000_size1030, scaffold_100001_size1030, scaffold_100002_size1030, scaffold_100003_size1030, ==> all my contigs

(the genome I'm working with has around 200'000 contigs, I can't attach any files as they are too big)

@daverdin
Hi Guilame,

I am a little confused. You were able to run the GATK pre-processing tools on your original BAM file, but the VCF from other tools is not compatible with GATK? Did you use the same reference you used in pre-processing when running the other tools?

-Sheila

• rutgersPosts: 5Member

Hi @Sheila,

Thank you for your response and sorry if this was confusing... I have followed the "Best Practices" workflow starting with pair end sequencing data (fastq) until the MergeBamAlignment part (I successfully obtained a .bam file). I then tried to use the HaplotypeCaller on this bam file but wasn't successful (I think the very high number of contigs in the genome I'm working with is the culprit here). So I decided to go back to the same bam file but this time use Platypus to get my VCF file. That worked well and wanted to see if I could go on in the "Best Practices" with this VCF by doing the hard filter (SelectVariants). That's where things get wrong with my contigs names not fitting..

I have use the same reference file and dictionary to obtain the bam file and the platypus VCF file.

I have tried different things like SortSam (coordinate) and SortVcf (with the same reference dictionary) but also reordersam but always got the same error message. If I don't do the sortVcf, the error is slightly different:

##### ERROR reference contigs = [scaffold_1_size161211, scaffold_2_size67455, scaffold_3_size80531, scaffold_4_size38253, etc.....]

the command line I've used:
java -Xmx7G -jar /home/jimp/Desktop/GATK/GenomeAnalysisTK.jar -T SelectVariants -R /home/jimp/Downloads/GATKexamples/selfing/new/crGenome_scaffolds.fasta -V VariantCalls_samsorted_vcfsorted.vcf -selectType SNP -o raw_snps.vcf &> error

Any idea on what is going on here? thanks a lot!