Errors about contigs in BAM or VCF files not being properly ordered or sorted

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 9,258Administrator, Dev admin
edited January 4 in Common Problems

This is not as common as the "wrong reference build" problem, but it still pops up every now and then: a collaborator gives you a BAM or VCF file that's derived from the correct reference, but for whatever reason the contigs are not sorted in the same order. The GATK can be particular about the ordering BAM and VCF files so it will fail with an error in this case.

image

So what do you do?


For BAM files

You run Picard's ReorderSam tool on your BAM file, using the reference genome dictionary as a template, like this:

java -jar picard.jar ReorderSam \
    I=original.bam \
    O=reordered.bam \
    R=reference.fasta \
    CREATE_INDEX=TRUE

Where reference.fasta is your genome reference, which must be accompanied by a valid *.dict dictionary file. The CREATE_INDEX argument is optional but useful if you plan to use the resulting file directly with GATK (otherwise you'll need to run another tool to create an index).

Be aware that this tool will drop reads that don't have equivalent contigs in the new reference (potentially bad or not, depending on what you want). If contigs have the same name in the BAM and the new reference, this tool assumes that the alignment of the read in the new BAM is the same. This is not a liftover tool!


For VCF files

You run Picard's SortVcf tool on your VCF file, using the reference genome dictionary as a template, like this:

java -jar picard.jar SortVcf \
    I=original.vcf \
    O=sorted.vcf \
    SEQUENCE_DICTIONARY=reference.dict 

Where reference.dict is the sequence dictionary of your genome reference.

Note that you may need to delete the index file that gets created automatically for your new VCF by the Picard tool. GATK will automatically regenerate an index file for your VCF.

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Comments

  • rdali094rdali094 McGill UniversityPosts: 6Member

    Hello,
    Is there anyway to adjust/sort the genome fasta file instead of the Bam file?

    I have hundreds of Bams taking up TBs of storage. To reorder them all and save double copies (since some information is lost in the process) is a huge endeavour. It is much more feasible for me to tweak the reference instead. Is that possible?

    Thanks!
    Rola

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 9,258Administrator, Dev admin

    Yes, you could reorder the contigs within the reference file -- but if you use the same reference that was used to generate all these bams you shouldn't need to. What exactly is the error you're getting?

    Geraldine Van der Auwera, PhD

  • rdali094rdali094 McGill UniversityPosts: 6Member

    It might not be the exact reference genome file. The data is publicly available and it is aligned to hg19 but I dont have the exact genome file. What should I do? Do I look at the order of contigs in the Bam file and order the genome file the exact same way?

    This is the error log:

    INFO 23:29:03,978 GenomeAnalysisEngine - Strictness is SILENT
    INFO 23:29:04,344 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO 23:29:04,365 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 23:29:04,748 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.38
    INFO 23:29:04,972 IntervalUtils - Processing 159138663 bp from intervals
    INFO 23:29:08,711 GATKRunReport - Uploaded run statistics report to AWS S3

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 3.5-0-g36282e4):
    ERROR
    ERROR This means that one or more arguments or inputs in your command are incorrect.
    ERROR The error message below tells you what is the problem.
    ERROR
    ERROR If the problem is an invalid argument, please check the online documentation guide
    ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ERROR
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ERROR
    ERROR MESSAGE: Lexicographically sorted human genome sequence detected in reads. Please see http://gatkforums.broadinstitute.org/discussion/58/companion-utilities-reordersamfor more information. Error details: reads contigs = [chr1, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr2, chr20, chr21, chr22, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chrM, chrX, chrY]
  • rdali094rdali094 McGill UniversityPosts: 6Member

    I tried to order the contigs in the genome fasta file the same way the Bams are sorted:

    pulled out contig order:

    samtools view $BAM | cut -f3 | uniq

    chr1
    chr10
    chr11
    chr12
    chr13
    chr14
    chr15
    chr16
    chr17
    chr18
    chr19
    chr2
    chr20
    chr21
    chr22
    chr3
    chr4
    chr5
    chr6
    chr7
    chr8
    chr9
    chrM
    chrX
    chrY
    *

    recreate genome file with same contigs in the same order:

    cat chr1.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chr2.fa chr20.fa chr21.fa chr22.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa chrM.fa chrX.fa chrY.fa > hg19.gatk.fa

    index genome and create dictionary:

    samtools faidx hg19.gatk.fa

    java -jar picard.jar CreateSequenceDictionary R= hg19.gatk.fa O= hg19.gatk.dict

    run muTect:

    java -jar GenomeAnalysisTK.jar -T MuTect2 -R hg19.gatk.fa -I:tumor $tumor -I:normal $normal --intervals "chr7" -o NormalTumorPair_chr7.muTect.vcf

    ######### Same error:

    INFO 20:20:08,335 HelpFormatter - --------------------------------------------------------------------------------
    INFO 20:20:08,339 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.5-0-g36282e4, Compiled 2015/11/25 04:03:56
    INFO 20:20:08,339 HelpFormatter - Copyright (c) 2010 The Broad Institute
    INFO 20:20:08,339 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO 20:20:08,344 HelpFormatter - Program Args: -T MuTect2 -R hg19.gatk.fa -I:tumor tumor.bam -I:normal normal.bam --intervals chr7 -o NormalTumorPair_chr7.muTect.vcf
    INFO 20:20:08,361 HelpFormatter - Executing as x on Linux 2.6.32-504.30.3.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_85-mockbuild_2015_07_15_12_57-b00.
    INFO 20:20:08,361 HelpFormatter - Date/Time: 2016/01/10 20:20:08
    INFO 20:20:08,361 HelpFormatter - --------------------------------------------------------------------------------
    INFO 20:20:08,361 HelpFormatter - --------------------------------------------------------------------------------
    INFO 20:20:08,595 GenomeAnalysisEngine - Strictness is SILENT
    INFO 20:20:08,980 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO 20:20:08,996 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 20:20:09,142 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.14
    INFO 20:20:09,232 IntervalUtils - Processing 159138663 bp from intervals
    INFO 20:20:10,879 GATKRunReport - Uploaded run statistics report to AWS S3

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 3.5-0-g36282e4):
    ERROR
    ERROR This means that one or more arguments or inputs in your command are incorrect.
    ERROR The error message below tells you what is the problem.
    ERROR
    ERROR If the problem is an invalid argument, please check the online documentation guide
    ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ERROR
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ERROR
    ERROR MESSAGE: Lexicographically sorted human genome sequence detected in reads. Please see http://gatkforums.broadinstitute.org/discussion/58/companion-utilities-reordersamfor more information. Error details: reads contigs = [chr1, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr2, chr20, chr21, chr22, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chrM, chrX, chrY]
    ERROR ------------------------------------------------------------------------------------------
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 9,258Administrator, Dev admin

    Ah, what you're encountering is not the "contigs mismatch" error, it's a different error that complains specifically about the lexicographical order. We'll add some text to disambiguate that in the doc.

    This is something of a legacy requirement that contigs be ordered in pure alphanumeric order. Once you've made sure the dictionaries match between your bams and reference, and any VCF resources you plan to use, you can override this requirement by setting -U ALLOW_SEQ_DICT_INCOMPATIBILITY which will also allow lexicographical order.

    Note that -U stands for UNSAFE and that skipping dictionary validation is not recommended. However in your case I think it may be the only way to work without reordering all your bams.

    Be sure to test this on a full run of your pipeline on a subset of data to make sure it won't cause any fatal problems downstream.

    Geraldine Van der Auwera, PhD

  • breardonbreardon Cambridge, MAPosts: 1Member, Broadie

    I am currently trying to resort contigs from a vcf relative to a reference with the picard task SortVCF; however, I am running into the error quoted below. This may be because of an incompatible dictionary but the error was repeated even after creating a new dict file by following the documentation here: http://gatkforums.broadinstitute.org/gatk/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference.

    Do you have any suggestions? I am using the reference genome as specified at the top of the vcf.

    Thank you!

    --

    [Fri Jan 22 16:50:24 EST 2016] picard.vcf.SortVcf INPUT=[ALL.autosomes.phase3_shapeit2_mvncall_integrated_v4.20130502.sites.vcf] OUTPUT=sorted.vcf SEQUENCE_DICTIONARY=/xchip/cga_home/breardon/misc_projects/_travis_vcf/reference_hs37d5/hs37d5.dict VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=true CREATE_MD5_FILE=false

    [Fri Jan 22 16:50:24 EST 2016] Executing as breardon@cga02 on Linux 2.6.32-573.7.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_71-b14; Picard version: 1.834(b2a94f76e786204c2ea48814b4a4a1018ff9b338_1422570082) JdkDeflater

    [Fri Jan 22 16:50:25 EST 2016] picard.vcf.SortVcf done. Elapsed time: 0.03 minutes.

    Runtime.totalMemory()=2058354688

    To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp

    Exception in thread "main" java.lang.IllegalArgumentException: java.lang.AssertionError: SAM dictionaries are not the same: SAMSequenceRecord(name=GL000191.1,length=106433,dict_index=22,assembly=b37) was found when SAMSequenceRecord(name=X,length=155270560,dict_index=22,assembly=null) was expected.

    at picard.vcf.SortVcf.collectFileReadersAndHeaders(SortVcf.java:112)

    at picard.vcf.SortVcf.doWork(SortVcf.java:81)

    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:187)

    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)

    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

    Caused by: java.lang.AssertionError: SAM dictionaries are not the same: SAMSequenceRecord(name=GL000191.1,length=106433,dict_index=22,assembly=b37) was found when SAMSequenceRecord(name=X,length=155270560,dict_index=22,assembly=null) was expected.

    at htsjdk.samtools.SAMSequenceDictionary.assertSameDictionary(SAMSequenceDictionary.java:142)

    at picard.vcf.SortVcf.collectFileReadersAndHeaders(SortVcf.java:110)

    ... 4 more

    Issue · Github
    by Sheila

    Issue Number
    519
    State
    open
    Last Updated
    Assignee
    Array
    Milestone
    Array
  • SheilaSheila Broad InstitutePosts: 2,640Member, Broadie, Moderator, Dev admin

    @breardon
    Hi,

    I am going to have our Picard expert look at this. He will get back to you.

    Sheila

  • deklingdekling Broad InstitutePosts: 20Member, Broadie, Moderator, Dev admin

    @breardon, did you run the UpdateVcfSequenceDictionaryTool prior to running the SortVcf tool?

  • MagdaMagda Posts: 5Member

    Hi, I'm starting to get crazy.

    1. I run HaplotypeCaller without a dbsnp.vcf file and it worked well.
      java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19.fa -I GATK_BaseRecalibrator/Ind1_Recal.bam -o GATK_HaploCaller/Ind1_raw_g.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF

    2. I wanted to include a dbsnp.vcf file and the problems started.
      java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19.fa --dbsnp dbsnp_138.hg19.vcf -I GATK_BaseRecalibrator/Ind1_Recal.bam -o GATK_HaploCaller/Ind1_raw.g.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF

    ERROR MESSAGE: Input files dbsnp and reference have incompatible contigs: The contig order in dbsnp and referenceis not the same; to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files..
    1. Ok, So, I order it:
      java -jar picard.jar SortVcf I=.dbsnp_138.hg19.vcf O=dbsnp_138.hg19_sorted.vcf SEQUENCE_DICTIONARY=.ucsc.hg19.dict.gz

    2. So, I try again with the sorted vcf file:
      java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hg19.fa --dbsnp dbsnp_138.hg19_sorted.vcf -I GATK_BaseRecalibrator/Ind1_Recal.bam -o GATK_HaploCaller/Ind1_raw.g.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF

    ERROR MESSAGE: Lexicographically sorted human genome sequence detected in dbsnp.
    ERROR For safety's sake the GATK requires human contigs in karyotypic order: 1, 2, ..., 10, 11, ..., 20, 21, 22, X, Y with M either leading or trailing these contigs.
    ERROR This is because all distributed GATK resources are sorted in karyotypic order, and your processing will fail when you need to use these files.
    ERROR You can use the ReorderSam utility to fix this problem: http://gatkforums.broadinstitute.org/discussion/58/companion-utilities-reordersam
    1. Since it proposes to reorder the bam file, I do it:
      java -Xmx6g -jar picard.jar ReorderSam I= GATK_BaseRecalibrator/Ind1_Recal.bam O= GATK_BaseRecalibrator/Ind1_Recal_reordered.bam R= ucsc.hg19.fasta CREATE_INDEX=TRUE

    2. So, now, using this bam file reordered, I try again.

      java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta --dbsnp dbsnp_138.hg19_sorted.vcf -I GATK_BaseRecalibrator/Ind1_Recal_reordered.bam -o GATK_HaploCaller/Ind1_raw.g.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF

    it doesn't work. I'll try to use the original dbsnp.vcf

    java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta --dbsnp dbsnp_138.hg19.vcf -I GATK_BaseRecalibrator/Ind1_Recal_reordered.bam -o GATK_HaploCaller/Ind1_raw.g.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF

    YES!!!!!!! YES!!!!!!! YES!!!!!!! YES!!!!!!! YES!!!!!!! it seems to work.....

    But after 2h just before finishing it appears this error:

    ERROR MESSAGE: File /Users/...GATK_BaseRecalibrator/Ind1_Recal_reordered.bai is malformed: Premature end-of-file while reading BAM index file

    GATK_BaseRecalibrator/Ind1_Recal_reordered.bai. It's likely that this file is truncated or corrupt -- Please try re-indexing the corresponding BAM file.

    ..... Well, I reindexed, re run, try that, and that and have been the last 2 days trying all combinations of vcf, reference, bam files, original, reordered...

    I do not what can I do.
    Magda

  • SheilaSheila Broad InstitutePosts: 2,640Member, Broadie, Moderator, Dev admin

    @Magda
    Hi Magda,

    Where did you get the reference and dbsnp file you are using? We have a bundle with files you should be able to use with no problem. You can also try adding -U ALLOW_SEQ_DICT_INCOMPATIBILITY as Geraldine suggested above.

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 9,258Administrator, Dev admin

    Premature end of file means the file is truncated and will not be usable. Whatever process was used to generate it must be repeated.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 9,258Administrator, Dev admin

    Just realized my comment wasn't helpful -- but what happened after you re-indexed the file? You don't say if there is still an error or what it is.

    Geraldine Van der Auwera, PhD

  • MagdaMagda Posts: 5Member

    Hi,
    Thanks for answering so quick.

    1) to Sheila,
    I tried to use the -U ALLOW _QES_DICT_INCOMPATIBILITY and although it runs it gives a warn:

    java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta --dbsnp dbsnp_138.hg19_sorted.vcf -I GATK_BaseRecalibrator/Ind1_Recal_reordered.bam -o GATK_HaploCaller/Ind1_raw.g.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF

    WARN 11:29:52,095 SequenceDictionaryUtils - Input files /dbsnp_138.hg19_sorted.vcf and reference have incompatible contigs: The contig order in /dbsnp_138.hg19_sorted.vcf and referenceis not the same; to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files..
    dbsnp_138.hg19_sorted.vcf contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY]
    reference contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM]

    It's weird, because precisely in this case I downloaded the reference (ucsc.hg19.fasta), the dbsnp (dbsnp_138.hg19_sorted.vcf ) from the bundle and the bam files were reordered using this reference from the bundle too (R= ../../ucsc.hg19.fasta).

    2) To Geraldine,
    When I reindexed:
    java -jar -Xmx4g picard.jar BuildBamIndex I= GATK_BaseRecalibrator/Ind1_Recal_reordered.bam O= GATK_BaseRecalibrator/Ind1_Recal_reordered_reindex.bai VALIDATION_STRINGENCY=LENIENT

    I tried to rerun HaplotypeCaller:
    Ijava -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta --dbsnp dbsnp_138.hg19.vcf -I GATK_BaseRecalibrator/Ind1_Recal_reordered.bam -o GATK_HaploCaller/Ind1_raw.g2.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF

    ERROR MESSAGE: GVCF output requires a specific indexing strategy. Please re-run including the arguments -variant_index_type LINEAR -variant_index_parameter 128000.

    I think I'm just going to desist of using a dbsnp file.
    Thank you

  • MagdaMagda Posts: 5Member

    Hi,
    I try a last thing and it worked well. The only different thing is that now I run it on a Test directory where there were only the original Ind1.bam file, the Ind1_reordered.bam and Ind1_reordered.bai (and not the Ind1.bai file). So I think that the old indexed bam file even though the name is different interfere with the command.

    java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta --dbsnp dbsnp_138.hg19.vcf -I GATK_BaseRecalibrator/Test/Ind1_Recal_reordered.bam -o GATK_BaseRecalibrator/Test/Ind1_raw.g.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF

    Thank you.
    Magda

  • MagdaMagda Posts: 5Member

    Hi again.
    How is it possible that for the next step I have the same complains?

    Yesterday I run HaplotypeCaller after reordering the Bam files using the ref from bundle, then using the dnsnp from bundle and bam files
    1. Reorder_Bam using picard on a separate directory without any previous .idx file in the directory (otherwise it doesn't work)
    do java -Xmx6g -jar picard.jar ReorderSam I= GATK_BaseRecalibrator/Reordered_Bam/Ind1_Recal.bam O= GATK_BaseRecalibrator/Reordered_Bam/Ind1_Recal_reordered.bam R= ucsc.hg19.fasta CREATE_INDEX=TRUE

    1. Run HaplotypeCaller using the ref, dnsnp from bundle and reordered bam files
      java -Xmx6g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta --dbsnp dbsnp_138.hg19.vcf -I GATK_BaseRecalibrator/Reordered_Bam/Ind1_Recal_reordered.bam -o GATK_HaploCaller/Ind1_raw.g.vcf -L nexterarapidcapture_exome_targetedregions_v1.2.bed -ERC GVCF

    Today. I want to use GenotypeGVCFs, so

    1. java -Xmx6g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R ucsc.hg19.fasta -D 1000G_phase1.snps.high_confidence.hg19.sites.vcf -V Ind1_raw.g.vcf -V Ind24_raw.g.vcf -o Output_Joint.vcf
    ERROR MESSAGE: Input files dbsnp and reference have incompatible contigs: The contig order in dbsnp and referenceis not the same; to fix this please see: (https://www.broadinstitute.org/gatk/guide/article?id=1328), which describes reordering contigs in BAM and VCF files..
    ERROR dbsnp contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16,
    ERROR reference contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15,

    Ok, I think of sorting the snps.vcf file

    java -jar picard.jar SortVcf I=1000G_phase1.snps.high_confidence.hg19.sites.vcf O=1000G_phase1.snps.high_confidence.hg19.sites_sorted.vcf SEQUENCE_DICTIONARY=ucsc.hg19.dict

    Try again:

    java -Xmx6g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R ucsc.hg19.fasta -D 1000G_phase1.snps.high_confidence.hg19.sites_sorted.vcf -V Ind1_raw.g.vcf -V Ind24_raw.g.vcf -o Output_Joint.vcf

    ERROR MESSAGE: Lexicographically sorted human genome sequence detected in dbsnp.
    ERROR For safety's sake the GATK requires human contigs in karyotypic order: 1, 2, ..., 10, 11, ..., 20, 21, 22, X, Y with M either leading or trailing these contigs.
    ERROR This is because all distributed GATK resources are sorted in karyotypic order, and your processing will fail when you need to use these files.
    ERROR You can use the ReorderSam utility to fix this problem: http://gatkforums.broadinstitute.org/discussion/58/companion-utilities-reordersam
    ERROR dbsnp contigs = [chr1, chr10, chr11, chr11_gl000202_random, chr12, chr13, chr14, chr15, chr16, chr17, chr17_ctg5_hap1, chr17_gl000203_random, chr17_gl000204_random, chr17_gl000205_random, chr17_gl000206_random, ch

    OK; Should I also sort again the raw.g.vcf files? I do a test

    java -jar picard.jar SortVcf I=Reordered_Bam/Ind1_raw.g.vcf O=Reordered_Bam/Ind1_raw_sorted.g.vcf SEQUENCE_DICTIONARY=ucsc.hg19.dict
    java -jar picard.jar SortVcf I=Reordered_Bam/Ind24_raw.g.vcf O=Reordered_Bam/Ind24_raw_sorted.g.vcf SEQUENCE_DICTIONARY=ucsc.hg19.dict

    Try again

    java -Xmx6g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R ucsc.hg19.fasta -D 1000G_phase1.snps.high_confidence.hg19.sites_sorted.vcf -V Reordered_Bam/Ind1_raw_sorted.g.vcf -V Reordered_Bam/Ind24_raw_sorted.g.vcf -o Output_Joint.vcf

    ERROR MESSAGE: Lexicographically sorted human genome sequence detected in variant.
    ERROR For safety's sake the GATK requires human contigs in karyotypic order: 1, 2, ..., 10, 11, ..., 20, 21, 22, X, Y with M either leading or trailing these contigs.
    ERROR This is because all distributed GATK resources are sorted in karyotypic order, and your processing will fail when you need to use these files.
    ERROR You can use the ReorderSam utility to fix this problem: http://gatkforums.broadinstitute.org/discussion/58/companion-utilities-reordersam
    ERROR variant contigs = [chr1, chr10, chr11, chr11_gl000202_random, chr12, chr13, chr14, chr15, chr16, chr17,

    I just don't understand that if I'm using the ucsc.hg19.fasta file from bundle, the SNPs.vfc from bundle, and I reordered my bam files using that same ref file, it still complains in the nest step.

    I am sorry, but I just don't get it.

    Magda

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 9,258Administrator, Dev admin

    Hi Magda,

    The "Lexicographically sorted human genome sequence" error is our fault -- we added a check that is not necessary, by mistake. We are going to fix this in the next version. For now, if you rerun your last command with -U ALLOW_SEQ_DICT_INCOMPATIBILITY it should work.

    Geraldine Van der Auwera, PhD

  • MagdaMagda Posts: 5Member

    Ok, it's working. Thank you very much
    Magda

Sign In or Register to comment.