The current GATK version is 3.6-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Powered by Vanilla. Made with Bootstrap.
Last chance to register for the GATK workshop next week in Basel, Switzerland! http://www.sib.swiss/training/upcoming-training-events/training/gatk-workshop-lecture

Errors about contigs in BAM or VCF files not being properly ordered or sorted

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,469Administrator, Dev admin
edited February 17 in Common Problems

image This is not as common as the "wrong reference build" problem, but it still pops up every now and then: a collaborator gives you a BAM or VCF file that's derived from the correct reference, but for whatever reason the contigs are not sorted in the same order. The GATK can be particular about the ordering BAM and VCF files so it will fail with an error in this case.

So what do you do?


For BAM files

You run Picard's ReorderSam tool on your BAM file, using the reference genome dictionary as a template, like this:

java -jar picard.jar ReorderSam \
    I=original.bam \
    O=reordered.bam \
    R=reference.fasta \
    CREATE_INDEX=TRUE

Where reference.fasta is your genome reference, which must be accompanied by a valid *.dict dictionary file. The CREATE_INDEX argument is optional but useful if you plan to use the resulting file directly with GATK (otherwise you'll need to run another tool to create an index).

Be aware that this tool will drop reads that don't have equivalent contigs in the new reference (potentially bad or not, depending on what you want). If contigs have the same name in the BAM and the new reference, this tool assumes that the alignment of the read in the new BAM is the same. This is not a liftover tool!


For VCF files

You run Picard's SortVcf tool on your VCF file, using the reference genome dictionary as a template, like this:

java -jar picard.jar SortVcf \
    I=original.vcf \
    O=sorted.vcf \
    SEQUENCE_DICTIONARY=reference.dict 

Where reference.dict is the sequence dictionary of your genome reference.

Note that you may need to delete the index file that gets created automatically for your new VCF by the Picard tool. GATK will automatically regenerate an index file for your VCF.

Version-specific alert for GATK 3.5

In version 3.5, we added some beefed-up VCF sequence dictionary validation. Unfortunately, as a side effect of the additional checks, some users have experienced an error that starts with "ERROR MESSAGE: Lexicographically sorted human genome sequence detected in variant." that is due to unintentional activation of a check that is not necessary. This will be fixed in the next release; in the meantime -U ALLOW_SEQ_DICT_INCOMPATIBILITY can be used (with caution) to override the check.

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Comments

  • rdali094rdali094 McGill UniversityPosts: 6Member

    Hello,
    Is there anyway to adjust/sort the genome fasta file instead of the Bam file?

    I have hundreds of Bams taking up TBs of storage. To reorder them all and save double copies (since some information is lost in the process) is a huge endeavour. It is much more feasible for me to tweak the reference instead. Is that possible?

    Thanks!
    Rola

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,469Administrator, Dev admin

    Yes, you could reorder the contigs within the reference file -- but if you use the same reference that was used to generate all these bams you shouldn't need to. What exactly is the error you're getting?

    Geraldine Van der Auwera, PhD

  • rdali094rdali094 McGill UniversityPosts: 6Member

    It might not be the exact reference genome file. The data is publicly available and it is aligned to hg19 but I dont have the exact genome file. What should I do? Do I look at the order of contigs in the Bam file and order the genome file the exact same way?

    This is the error log:

    INFO 23:29:03,978 GenomeAnalysisEngine - Strictness is SILENT
    INFO 23:29:04,344 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO 23:29:04,365 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 23:29:04,748 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.38
    INFO 23:29:04,972 IntervalUtils - Processing 159138663 bp from intervals
    INFO 23:29:08,711 GATKRunReport - Uploaded run statistics report to AWS S3

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 3.5-0-g36282e4):
    ERROR
    ERROR This means that one or more arguments or inputs in your command are incorrect.
    ERROR The error message below tells you what is the problem.
    ERROR
    ERROR If the problem is an invalid argument, please check the online documentation guide
    ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ERROR
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ERROR
    ERROR MESSAGE: Lexicographically sorted human genome sequence detected in reads. Please see http://gatkforums.broadinstitute.org/discussion/58/companion-utilities-reordersamfor more information. Error details: reads contigs = [chr1, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr2, chr20, chr21, chr22, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chrM, chrX, chrY]
  • rdali094rdali094 McGill UniversityPosts: 6Member

    I tried to order the contigs in the genome fasta file the same way the Bams are sorted:

    pulled out contig order:

    samtools view $BAM | cut -f3 | uniq

    chr1
    chr10
    chr11
    chr12
    chr13
    chr14
    chr15
    chr16
    chr17
    chr18
    chr19
    chr2
    chr20
    chr21
    chr22
    chr3
    chr4
    chr5
    chr6
    chr7
    chr8
    chr9
    chrM
    chrX
    chrY
    *

    recreate genome file with same contigs in the same order:

    cat chr1.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chr2.fa chr20.fa chr21.fa chr22.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa chrM.fa chrX.fa chrY.fa > hg19.gatk.fa

    index genome and create dictionary:

    samtools faidx hg19.gatk.fa

    java -jar picard.jar CreateSequenceDictionary R= hg19.gatk.fa O= hg19.gatk.dict

    run muTect:

    java -jar GenomeAnalysisTK.jar -T MuTect2 -R hg19.gatk.fa -I:tumor $tumor -I:normal $normal --intervals "chr7" -o NormalTumorPair_chr7.muTect.vcf

    ######### Same error:

    INFO 20:20:08,335 HelpFormatter - --------------------------------------------------------------------------------
    INFO 20:20:08,339 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:56
    INFO 20:20:08,339 HelpFormatter - Copyright (c) 2010 The Broad Institute
    INFO 20:20:08,339 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO 20:20:08,344 HelpFormatter - Program Args: -T MuTect2 -R hg19.gatk.fa -I:tumor tumor.bam -I:normal normal.bam --intervals chr7 -o NormalTumorPair_chr7.muTect.vcf
    INFO 20:20:08,361 HelpFormatter - Executing as x on Linux 2.6.32-504.30.3.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_85-mockbuild_2015_07_15_12_57-b00.
    INFO 20:20:08,361 HelpFormatter - Date/Time: 2016/01/10 20:20:08
    INFO 20:20:08,361 HelpFormatter - --------------------------------------------------------------------------------
    INFO 20:20:08,361 HelpFormatter - --------------------------------------------------------------------------------
    INFO 20:20:08,595 GenomeAnalysisEngine - Strictness is SILENT
    INFO 20:20:08,980 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO 20:20:08,996 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 20:20:09,142 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.14
    INFO 20:20:09,232 IntervalUtils - Processing 159138663 bp from intervals
    INFO 20:20:10,879 GATKRunReport - Uploaded run statistics report to AWS S3

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 3.5-0-g36282e4):
    ERROR
    ERROR This means that one or more arguments or inputs in your command are incorrect.
    ERROR The error message below tells you what is the problem.
    ERROR
    ERROR If the problem is an invalid argument, please check the online documentation guide
    ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ERROR
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ERROR
    ERROR MESSAGE: Lexicographically sorted human genome sequence detected in reads. Please see http://gatkforums.broadinstitute.org/discussion/58/companion-utilities-reordersamfor more information. Error details: reads contigs = [chr1, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr2, chr20, chr21, chr22, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chrM, chrX, chrY]
    ERROR ------------------------------------------------------------------------------------------
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,469Administrator, Dev admin

    Ah, what you're encountering is not the "contigs mismatch" error, it's a different error that complains specifically about the lexicographical order. We'll add some text to disambiguate that in the doc.

    This is something of a legacy requirement that contigs be ordered in pure alphanumeric order. Once you've made sure the dictionaries match between your bams and reference, and any VCF resources you plan to use, you can override this requirement by setting -U ALLOW_SEQ_DICT_INCOMPATIBILITY which will also allow lexicographical order.

    Note that -U stands for UNSAFE and that skipping dictionary validation is not recommended. However in your case I think it may be the only way to work without reordering all your bams.

    Be sure to test this on a full run of your pipeline on a subset of data to make sure it won't cause any fatal problems downstream.

    Geraldine Van der Auwera, PhD

  • breardonbreardon Cambridge, MAPosts: 2Member, Broadie

    I am currently trying to resort contigs from a vcf relative to a reference with the picard task SortVCF; however, I am running into the error quoted below. This may be because of an incompatible dictionary but the error was repeated even after creating a new dict file by following the documentation here: http://gatkforums.broadinstitute.org/gatk/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference.

    Do you have any suggestions? I am using the reference genome as specified at the top of the vcf.

    Thank you!

    --

    [Fri Jan 22 16:50:24 EST 2016] picard.vcf.SortVcf INPUT=[ALL.autosomes.phase3_shapeit2_mvncall_integrated_v4.20130502.sites.vcf] OUTPUT=sorted.vcf SEQUENCE_DICTIONARY=/xchip/cga_home/breardon/misc_projects/_travis_vcf/reference_hs37d5/hs37d5.dict VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=true CREATE_MD5_FILE=false

    [Fri Jan 22 16:50:24 EST 2016] Executing as breardon@cga02 on Linux 2.6.32-573.7.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_71-b14; Picard version: 1.834(b2a94f76e786204c2ea48814b4a4a1018ff9b338_1422570082) JdkDeflater

    [Fri Jan 22 16:50:25 EST 2016] picard.vcf.SortVcf done. Elapsed time: 0.03 minutes.

    Runtime.totalMemory()=2058354688

    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp

    Exception in thread "main" java.lang.IllegalArgumentException: java.lang.AssertionError: SAM dictionaries are not the same: SAMSequenceRecord(name=GL000191.1,length=106433,dict_index=22,assembly=b37) was found when SAMSequenceRecord(name=X,length=155270560,dict_index=22,assembly=null) was expected.

    at picard.vcf.SortVcf.collectFileReadersAndHeaders(SortVcf.java:112)

    at picard.vcf.SortVcf.doWork(SortVcf.java:81)

    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:187)

    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)

    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

    Caused by: java.lang.AssertionError: SAM dictionaries are not the same: SAMSequenceRecord(name=GL000191.1,length=106433,dict_index=22,assembly=b37) was found when SAMSequenceRecord(name=X,length=155270560,dict_index=22,assembly=null) was expected.

    at htsjdk.samtools.SAMSequenceDictionary.assertSameDictionary(SAMSequenceDictionary.java:142)

    at picard.vcf.SortVcf.collectFileReadersAndHeaders(SortVcf.java:110)

    ... 4 more

    Issue · Github
    by Sheila

    Issue Number
    519
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • SheilaSheila Broad InstitutePosts: 3,940Member, Broadie, Moderator, Dev admin

    @breardon
    Hi,

    I am going to have our Picard expert look at this. He will get back to you.

    Sheila

  • deklingdekling Broad InstitutePosts: 82Member, Broadie, Moderator, Dev admin

    @breardon, did you run the UpdateVcfSequenceDictionaryTool prior to running the SortVcf tool?

  • MagdaMagda Posts: 5Member

    Hi, I'm starting to get crazy.

    1. I run HaplotypeCaller without a dbsnp.vcf file and it worked well.
      java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19.fa -I GATK_BaseRecalibrator/Ind1_Recal.bam -o GATK_HaploCaller/Ind1_raw_g.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF

    2. I wanted to include a dbsnp.vcf file and the problems started.
      java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19.fa --dbsnp dbsnp_138.hg19.vcf -I GATK_BaseRecalibrator/Ind1_Recal.bam -o GATK_HaploCaller/Ind1_raw.g.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF

    ERROR MESSAGE: Input files dbsnp and reference have incompatible contigs: The contig order in dbsnp and referenceis not the same; to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files..
    1. Ok, So, I order it:
      java -jar picard.jar SortVcf I=.dbsnp_138.hg19.vcf O=dbsnp_138.hg19_sorted.vcf SEQUENCE_DICTIONARY=.ucsc.hg19.dict.gz

    2. So, I try again with the sorted vcf file:
      java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19.fa --dbsnp dbsnp_138.hg19_sorted.vcf -I GATK_BaseRecalibrator/Ind1_Recal.bam -o GATK_HaploCaller/Ind1_raw.g.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF

    ERROR MESSAGE: Lexicographically sorted human genome sequence detected in dbsnp.
    ERROR For safety's sake the GATK requires human contigs in karyotypic order: 1, 2, ..., 10, 11, ..., 20, 21, 22, X, Y with M either leading or trailing these contigs.
    ERROR This is because all distributed GATK resources are sorted in karyotypic order, and your processing will fail when you need to use these files.
    ERROR You can use the ReorderSam utility to fix this problem: http://gatkforums.broadinstitute.org/discussion/58/companion-utilities-reordersam
    1. Since it proposes to reorder the bam file, I do it:
      java -Xmx6g -jar picard.jar ReorderSam I= GATK_BaseRecalibrator/Ind1_Recal.bam O= GATK_BaseRecalibrator/Ind1_Recal_reordered.bam R= ucsc.hg19.fasta CREATE_INDEX=TRUE

    2. So, now, using this bam file reordered, I try again.

      java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta --dbsnp dbsnp_138.hg19_sorted.vcf -I GATK_BaseRecalibrator/Ind1_Recal_reordered.bam -o GATK_HaploCaller/Ind1_raw.g.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF

    it doesn't work. I'll try to use the original dbsnp.vcf

    java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta --dbsnp dbsnp_138.hg19.vcf -I GATK_BaseRecalibrator/Ind1_Recal_reordered.bam -o GATK_HaploCaller/Ind1_raw.g.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF

    YES!!!!!!! YES!!!!!!! YES!!!!!!! YES!!!!!!! YES!!!!!!! it seems to work.....

    But after 2h just before finishing it appears this error:

    ERROR MESSAGE: File /Users/...GATK_BaseRecalibrator/Ind1_Recal_reordered.bai is malformed: Premature end-of-file while reading BAM index file

    GATK_BaseRecalibrator/Ind1_Recal_reordered.bai. It's likely that this file is truncated or corrupt -- Please try re-indexing the corresponding BAM file.

    ..... Well, I reindexed, re run, try that, and that and have been the last 2 days trying all combinations of vcf, reference, bam files, original, reordered...

    I do not what can I do.
    Magda

  • SheilaSheila Broad InstitutePosts: 3,940Member, Broadie, Moderator, Dev admin

    @Magda
    Hi Magda,

    Where did you get the reference and dbsnp file you are using? We have a bundle with files you should be able to use with no problem. You can also try adding -U ALLOW_SEQ_DICT_INCOMPATIBILITY as Geraldine suggested above.

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,469Administrator, Dev admin

    Premature end of file means the file is truncated and will not be usable. Whatever process was used to generate it must be repeated.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,469Administrator, Dev admin

    Just realized my comment wasn't helpful -- but what happened after you re-indexed the file? You don't say if there is still an error or what it is.

    Geraldine Van der Auwera, PhD

  • MagdaMagda Posts: 5Member

    Hi,
    Thanks for answering so quick.

    1) to Sheila,
    I tried to use the -U ALLOW _QES_DICT_INCOMPATIBILITY and although it runs it gives a warn:

    java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta --dbsnp dbsnp_138.hg19_sorted.vcf -I GATK_BaseRecalibrator/Ind1_Recal_reordered.bam -o GATK_HaploCaller/Ind1_raw.g.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF

    WARN 11:29:52,095 SequenceDictionaryUtils - Input files /dbsnp_138.hg19_sorted.vcf and reference have incompatible contigs: The contig order in /dbsnp_138.hg19_sorted.vcf and referenceis not the same; to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files..
    dbsnp_138.hg19_sorted.vcf contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY]
    reference contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM]

    It's weird, because precisely in this case I downloaded the reference (ucsc.hg19.fasta), the dbsnp (dbsnp_138.hg19_sorted.vcf ) from the bundle and the bam files were reordered using this reference from the bundle too (R= ../../ucsc.hg19.fasta).

    2) To Geraldine,
    When I reindexed:
    java -jar -Xmx4g picard.jar BuildBamIndex I= GATK_BaseRecalibrator/Ind1_Recal_reordered.bam O= GATK_BaseRecalibrator/Ind1_Recal_reordered_reindex.bai VALIDATION_STRINGENCY=LENIENT

    I tried to rerun HaplotypeCaller:
    Ijava -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta --dbsnp dbsnp_138.hg19.vcf -I GATK_BaseRecalibrator/Ind1_Recal_reordered.bam -o GATK_HaploCaller/Ind1_raw.g2.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF

    ERROR MESSAGE: GVCF output requires a specific indexing strategy. Please re-run including the arguments -variant_index_type LINEAR -variant_index_parameter 128000.

    I think I'm just going to desist of using a dbsnp file.
    Thank you

  • MagdaMagda Posts: 5Member

    Hi,
    I try a last thing and it worked well. The only different thing is that now I run it on a Test directory where there were only the original Ind1.bam file, the Ind1_reordered.bam and Ind1_reordered.bai (and not the Ind1.bai file). So I think that the old indexed bam file even though the name is different interfere with the command.

    java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta --dbsnp dbsnp_138.hg19.vcf -I GATK_BaseRecalibrator/Test/Ind1_Recal_reordered.bam -o GATK_BaseRecalibrator/Test/Ind1_raw.g.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF

    Thank you.
    Magda

  • MagdaMagda Posts: 5Member

    Hi again.
    How is it possible that for the next step I have the same complains?

    Yesterday I run HaplotypeCaller after reordering the Bam files using the ref from bundle, then using the dnsnp from bundle and bam files
    1. Reorder_Bam using picard on a separate directory without any previous .idx file in the directory (otherwise it doesn't work)
    do java -Xmx6g -jar picard.jar ReorderSam I= GATK_BaseRecalibrator/Reordered_Bam/Ind1_Recal.bam O= GATK_BaseRecalibrator/Reordered_Bam/Ind1_Recal_reordered.bam R= ucsc.hg19.fasta CREATE_INDEX=TRUE

    1. Run HaplotypeCaller using the ref, dnsnp from bundle and reordered bam files
      java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta --dbsnp dbsnp_138.hg19.vcf -I GATK_BaseRecalibrator/Reordered_Bam/Ind1_Recal_reordered.bam -o GATK_HaploCaller/Ind1_raw.g.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF

    Today. I want to use GenotypeGVCFs, so

    1. java -Xmx6g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R ucsc.hg19.fasta -D 1000G_phase1.snps.high_confidence.hg19.sites.vcf -V Ind1_raw.g.vcf -V Ind24_raw.g.vcf -o Output_Joint.vcf
    ERROR MESSAGE: Input files dbsnp and reference have incompatible contigs: The contig order in dbsnp and referenceis not the same; to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files..
    ERROR dbsnp contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16,
    ERROR reference contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15,

    Ok, I think of sorting the snps.vcf file

    java -jar picard.jar SortVcf I=1000G_phase1.snps.high_confidence.hg19.sites.vcf O=1000G_phase1.snps.high_confidence.hg19.sites_sorted.vcf SEQUENCE_DICTIONARY=ucsc.hg19.dict

    Try again:

    java -Xmx6g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R ucsc.hg19.fasta -D 1000G_phase1.snps.high_confidence.hg19.sites_sorted.vcf -V Ind1_raw.g.vcf -V Ind24_raw.g.vcf -o Output_Joint.vcf

    ERROR MESSAGE: Lexicographically sorted human genome sequence detected in dbsnp.
    ERROR For safety's sake the GATK requires human contigs in karyotypic order: 1, 2, ..., 10, 11, ..., 20, 21, 22, X, Y with M either leading or trailing these contigs.
    ERROR This is because all distributed GATK resources are sorted in karyotypic order, and your processing will fail when you need to use these files.
    ERROR You can use the ReorderSam utility to fix this problem: http://gatkforums.broadinstitute.org/discussion/58/companion-utilities-reordersam
    ERROR dbsnp contigs = [chr1, chr10, chr11, chr11_gl000202_random, chr12, chr13, chr14, chr15, chr16, chr17, chr17_ctg5_hap1, chr17_gl000203_random, chr17_gl000204_random, chr17_gl000205_random, chr17_gl000206_random, ch

    OK; Should I also sort again the raw.g.vcf files? I do a test

    java -jar picard.jar SortVcf I=Reordered_Bam/Ind1_raw.g.vcf O=Reordered_Bam/Ind1_raw_sorted.g.vcf SEQUENCE_DICTIONARY=ucsc.hg19.dict
    java -jar picard.jar SortVcf I=Reordered_Bam/Ind24_raw.g.vcf O=Reordered_Bam/Ind24_raw_sorted.g.vcf SEQUENCE_DICTIONARY=ucsc.hg19.dict

    Try again

    java -Xmx6g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R ucsc.hg19.fasta -D 1000G_phase1.snps.high_confidence.hg19.sites_sorted.vcf -V Reordered_Bam/Ind1_raw_sorted.g.vcf -V Reordered_Bam/Ind24_raw_sorted.g.vcf -o Output_Joint.vcf

    ERROR MESSAGE: Lexicographically sorted human genome sequence detected in variant.
    ERROR For safety's sake the GATK requires human contigs in karyotypic order: 1, 2, ..., 10, 11, ..., 20, 21, 22, X, Y with M either leading or trailing these contigs.
    ERROR This is because all distributed GATK resources are sorted in karyotypic order, and your processing will fail when you need to use these files.
    ERROR You can use the ReorderSam utility to fix this problem: http://gatkforums.broadinstitute.org/discussion/58/companion-utilities-reordersam
    ERROR variant contigs = [chr1, chr10, chr11, chr11_gl000202_random, chr12, chr13, chr14, chr15, chr16, chr17,

    I just don't understand that if I'm using the ucsc.hg19.fasta file from bundle, the SNPs.vfc from bundle, and I reordered my bam files using that same ref file, it still complains in the nest step.

    I am sorry, but I just don't get it.

    Magda

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,469Administrator, Dev admin

    Hi Magda,

    The "Lexicographically sorted human genome sequence" error is our fault -- we added a check that is not necessary, by mistake. We are going to fix this in the next version. For now, if you rerun your last command with -U ALLOW_SEQ_DICT_INCOMPATIBILITY it should work.

    Geraldine Van der Auwera, PhD

  • MagdaMagda Posts: 5Member

    Ok, it's working. Thank you very much
    Magda

  • artitandonartitandon Posts: 22Member

    I downloaded the file dbsnp_138.hg19.vcf from the resource bundle, and I got an error when running BQSQ step using this file as follows:

    ERROR MESSAGE: Input files /home/p2010-217-gpfs/Arti/RNASeqFFPE/Software/dbsnp_138.hg19.vcf and reference have incompatible contigs. Please see http://gatkforums.broadinstitute.org/discussion/63/input-files-have-incompatible-contigsfor more information. Error details: The contig order in /home/dbsnp_138.hg19.vcf and referenceis not the same; to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files..
    ERROR /home/dbsnp_138.hg19.vcf contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY]
    ERROR reference contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM]
    ERROR ------------------------------------------------------------------------------------------

    To fix this I ran SortVcf, and now get the following error, which makes no sense since isn't it supposed to reorder it?:
    Exception in thread "main" java.lang.IllegalArgumentException: java.lang.AssertionError: SAM dictionaries are not the same: SAMSequenceRecord(name=chrM,length=16571,dict_index=0,assembly=hg19) was found when SAMSequenceRecord(name=chr1,length=249250621,dict_index=0,assembly=null) was expected.
    at picard.vcf.SortVcf.collectFileReadersAndHeaders(SortVcf.java:112)
    at picard.vcf.SortVcf.doWork(SortVcf.java:81)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:206)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
    Caused by: java.lang.AssertionError: SAM dictionaries are not the same: SAMSequenceRecord(name=chrM,length=16571,dict_index=0,assembly=hg19) was found when SAMSequenceRecord(name=chr1,length=249250621,dict_index=0,assembly=null) was expected.
    at htsjdk.samtools.SAMSequenceDictionary.assertSameDictionary(SAMSequenceDictionary.java:165)
    at picard.vcf.SortVcf.collectFileReadersAndHeaders(SortVcf.java:110)
    ... 4 more

    Any help with this will be appreciated, thanks!

    Issue · Github
    by Sheila

    Issue Number
    808
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    chandrans
  • SheilaSheila Broad InstitutePosts: 3,940Member, Broadie, Moderator, Dev admin

    @artitandon
    Hi,

    Are you using the latest version of Picard? Can you tell us the exact command you ran?

    Thanks,
    Sheila

  • SheilaSheila Broad InstitutePosts: 3,940Member, Broadie, Moderator, Dev admin

    @artitandon
    Hi again,

    You may also try running Picard's UpdateVcfSequenceDictionary.

    -Sheila

  • artitandonartitandon Posts: 22Member

    Hi Sheila, I am using picard-tools-1.140 version, and the exact command is :
    java -jar picard-tools-1.140/picard.jar SortVcf I=dbsnp_138.hg19.vcf O=dbsnp_138.hg19.sort.vcf SEQUENCE_DICTIONARY=hg19.dict

    Thanks,
    Arti

  • SheilaSheila Broad InstitutePosts: 3,940Member, Broadie, Moderator, Dev admin

    @artitandon
    Hi Arti,

    Okay. Can you confirm you get the same error with the latest version of Picard? Also, please try running Picard's UpdateVcfSequenceDictionary as I pointed you to above.

    Thanks,
    Sheila

  • sumedhagargsumedhagarg CambridgePosts: 9Member

    Hello
    I finally managed to have my reference files in place and tried to run GATK tools, but got the error about contigs and reads not being in same order. I can see my sample file has much simpler and more ordered contigs. Can you please advise how can I extract selected contigs from the reference hg19 fasta file and use it as reference for many more sample files rather than trying to reorder bam files for them all?

    C:\Users\sg587\BartsData>java -jar GATK.jar -T HaplotypeCaller -R ref/hg19.fa -I 115N/115N-30129117/SG-115N_S1.bam -o results/SG-115N_S1_HCcalls.vcf
    INFO 20:57:55,791 HelpFormatter - --------------------------------------------------------------------------------
    INFO 20:57:55,798 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.6-0-g89b7209, Compiled 2016/06/01 22:27:29
    INFO 20:57:55,799 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
    INFO 20:57:55,800 HelpFormatter - For support and documentation go to https://www.broadinstitute.org/gatk
    INFO 20:57:55,800 HelpFormatter - [Tue Jul 05 20:57:55 BST 2016] Executing on Windows 7 6.1 x86
    INFO 20:57:55,801 HelpFormatter - Java HotSpot(TM) Client VM 1.8.0_91-b14 JdkDeflater
    INFO 20:57:55,812 HelpFormatter - Program Args: -T HaplotypeCaller -R ref/hg19.fa -I 115N/115N-30129117/SG-115N_S1.bam -o results/SG-115N_S1_HCcalls.
    vcf
    INFO 20:58:01,530 HelpFormatter - Executing as sg587@cppc118 on Windows 7 6.1 x86; Java HotSpot(TM) Client VM 1.8.0_91-b14.
    INFO 20:58:01,531 HelpFormatter - Date/Time: 2016/07/05 20:57:55
    INFO 20:58:01,532 HelpFormatter - --------------------------------------------------------------------------------
    INFO 20:58:01,533 HelpFormatter - --------------------------------------------------------------------------------
    INFO 20:58:01,573 GenomeAnalysisEngine - Strictness is SILENT
    INFO 20:58:02,172 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 500
    INFO 20:58:02,194 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    WARNING: BAM index file C:\Users\sg587\BartsData\115N\115N-30129117\SG-115N_S1.bam.bai is older than BAM C:\Users\sg587\BartsData\115N\115N-30129117\S
    G-115N_S1.bam
    INFO 20:58:02,269 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.07
    INFO 20:58:02,346 HCMappingQualityFilter - Filtering out reads with MAPQ < 20

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 3.6-0-g89b7209):
    ERROR
    ERROR This means that one or more arguments or inputs in your command are incorrect.
    ERROR The error message below tells you what is the problem.
    ERROR
    ERROR If the problem is an invalid argument, please check the online documentation guide
    ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ERROR
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions https://www.broadinstitute.org/gatk
    ERROR
    ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ERROR
    ERROR MESSAGE: Input files reads and reference have incompatible contigs. Please see https://www.broadinstitute.org/gatk/guide/article?id=63 formore information. Error details: The contig order in reads and reference is not the same; to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files..
    ERROR reads contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY]
    ERROR reference contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chrX, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr20, chrY, chr19, chr22, chr21, chr6_ssto_hap7, chr6_mcf_hap5, chr6_cox_hap2, chr6_mann_hap4, chr6_apd_hap1, chr6_qbl_hap6, chr6_dbb_hap3, chr17_ctg5_hap1, chr4_ctg9_hap1, chr1_gl000192_random, chrUn_gl000225, chr4_gl000194_random, chr4_gl000193_random, chr9_gl000200_random, chrUn_gl000222, chrUn_gl000212, chr7_gl000195_random, chrUn_gl000223, chrUn_gl000224, chrUn_gl000219, chr17_gl000205_random, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chr9_gl000199_random, chrUn_gl000211, chrUn_gl000213, chrUn_gl000220, chrUn_gl000218, chr19_gl000209_random, chrUn_gl000221, chrUn_gl000214,chrUn_gl000228, chrUn_gl000227, chr1_gl000191_random, chr19_gl000208_random, chr9_gl000198_random, chr17_gl000204_random, chrUn_gl000233, chrUn_gl000237, chrUn_gl000230, chrUn_gl000242, chrUn_gl000243, chrUn_gl000241, chrUn_gl000236, chrUn_gl000240, chr17_gl000206_random, chrUn_gl000232, chrUn_gl000

    234, chr11_gl000202_random, chrUn_gl000238, chrUn_gl000244, chrUn_gl000248, chr8_gl000196_random, chrUn_gl000249, chrUn_gl000246, chr17_gl000203_random, chr8_gl000197_random, chrUn_gl000245, chrUn_gl000247, chr9_gl000201_random, chrUn_gl000235, chrUn_gl000239, chr21_gl000210_random, chrUn_gl000231,chrUn_gl000229, chrM, chrUn_gl000226, chr18_gl000207_random]

    ERROR ------------------------------------------------------------------------------------------

    Many thanks
    Sumedha

  • Will_GilksWill_Gilks University of Sussex, UKPosts: 116Member ✭✭

    You should use the same reference genome for both read-mapping and SNP-calling.

    Alternatively you can remove unmapped scaffolds from the current SNP-calling reference genome but I don't recommend it.

  • sumedhagargsumedhagarg CambridgePosts: 9Member

    @Will_Gilks
    Thanks! It is mapped onto hg19 and am using the same for snp calling too, but could it be that my data is exome sequencing on Illumina NexteraExome assay and analysis doesnt really need all these random contigs? I do think the issue is the order of these contigs in reference file not matching my sample files. Is there a way to reorder contigs in reference fasta file?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,469Administrator, Dev admin

    @sumedhagarg If the reference file does not contain the same contigs, it's not the same reference, even if it has the same name. The extra contigs are included to improve the quality of the read mapping. It's better to use a reference that includes them even if you're only interested in exome regions. And if it's not possible for you to remap your samples, then you should at least use the reference file that you used for mapping in your analysis, rather than mess around with the content of files. That is the safest way to proceed. If you don't want to do it that way, we can't help you if something goes wrong.

    Geraldine Van der Auwera, PhD

  • daverdindaverdin rutgersPosts: 5Member

    Hi everyone,
    I have some difficulties getting the vcf from HaplotypeCaller (it would take weeks to complete..) so I was wondering if it could be possible to use the hard-filters on a vcf I have obtained through platypus or other softwares (.bam files use for VCF come from GATK before HaplotypeCaller).
    When I try to use "SelectVariants" to apply a hard filter on the VCF obtained, I got an error message (end of this message).
    I applied several tips/potential corrections that I have found in the different GATK conversations like ordering my vcf with SortVcf, ordering the bam file with reorder and sortSam (coordinate) but none where conclusive, the names of my contigs in the VCF and in the reference are apparently still not matching. I'm confused about that because as I look at my reference and my vcf, the contigs seem to be in the same order.. (but the error message shows a different order for the sequence contigs..)
    The error message says that the "variant contigs" and the "sequence contigs" are not matching, not the "variant contigs" and the "reference contigs" .. Could the problem come from the .bam file order? Or is it just that vcf from platypus that can't be used with gatk?

    I have tried several approaches to solve this problem but I'm stuck so I would greatly appreciate any help!
    Thanks
    Guilaume


    ERROR MESSAGE: Input files variant and sequence have incompatible contigs. Please see http://gatkforums.broadinstitute.org/discussion/63/input-files-have-incompatible-contigsfor more information. Error details: The contig order in variant and sequenceis not the same; to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files..
    ERROR variant contigs = [scaffold_1_size161211, scaffold_2_size67455, scaffold_3_size80531, scaffold_4_size38253, scaffold_5_size36003, ==> all my contigs
    ERROR sequence contigs = [scaffold_100000_size1030, scaffold_100001_size1030, scaffold_100002_size1030, scaffold_100003_size1030, ==> all my contigs

    (the genome I'm working with has around 200'000 contigs, I can't attach any files as they are too big)

  • SheilaSheila Broad InstitutePosts: 3,940Member, Broadie, Moderator, Dev admin

    @daverdin
    Hi Guilame,

    I am a little confused. You were able to run the GATK pre-processing tools on your original BAM file, but the VCF from other tools is not compatible with GATK? Did you use the same reference you used in pre-processing when running the other tools?

    -Sheila

  • daverdindaverdin rutgersPosts: 5Member

    Hi @Sheila,

    Thank you for your response and sorry if this was confusing... I have followed the "Best Practices" workflow starting with pair end sequencing data (fastq) until the MergeBamAlignment part (I successfully obtained a .bam file). I then tried to use the HaplotypeCaller on this bam file but wasn't successful (I think the very high number of contigs in the genome I'm working with is the culprit here). So I decided to go back to the same bam file but this time use Platypus to get my VCF file. That worked well and wanted to see if I could go on in the "Best Practices" with this VCF by doing the hard filter (SelectVariants). That's where things get wrong with my contigs names not fitting..

    I have use the same reference file and dictionary to obtain the bam file and the platypus VCF file.

    I have tried different things like SortSam (coordinate) and SortVcf (with the same reference dictionary) but also reordersam but always got the same error message. If I don't do the sortVcf, the error is slightly different:

    ERROR /home/jimp/Downloads/GATKexamples/selfing/CNJ14_6_6/batch_9_ordered_tassel_corrected.vcf contigs = [scaffold_100042_size1029, scaffold_10007_size7458, scaffold_100165_size1026, etc.....]
    ERROR reference contigs = [scaffold_1_size161211, scaffold_2_size67455, scaffold_3_size80531, scaffold_4_size38253, etc.....]

    the command line I've used:
    java -Xmx7G -jar /home/jimp/Desktop/GATK/GenomeAnalysisTK.jar -T SelectVariants -R /home/jimp/Downloads/GATKexamples/selfing/new/crGenome_scaffolds.fasta -V VariantCalls_samsorted_vcfsorted.vcf -selectType SNP -o raw_snps.vcf &> error

    Any idea on what is going on here? thanks a lot!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 10,469Administrator, Dev admin

    @daverdin No, if you use a completely different caller, you can't get back to best practices. The Best Practices are formulated based on how tools work very specifically, and each tool in the toolchain has specific expectations about upstream processes.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.