Possible bug related to contig ordering in RealignerTargetCreator

Hi GATK Team,

I am trying to run the RealignerTargetCreator tool on a pair of tumor/normal bam files. The reads were aligned to a version of the hg19 reference genome that had the contigs in a different order, so I used Picard tools' ReorderSam to get the contigs in karyotypic order (using as reference genome the ucsc.hg19.fasta file from the GATK resource bundle) for both input .bam files.

Then I ran this command:

java -Xmx4g -jar /home/sxh615/jobs/scripts/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /REF_GENOME/ucsc.hg19.fasta -L chr21.intervals -I chr21.tumor.RG.reorder.bam -I chr21.normal.RG.reorder.bam -o chr21.RG.reorder.chr21.RTK.intervals -nt 2 -known /REF_VARs/1000G_phase1.indels.hg19.sites.vcf

I am getting this error:

Input files known and reference have incompatible contigs: Relative ordering of overlapping contigs differs, which is unsafe.

ERROR known contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM]
ERROR reference contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chr1_gl000191_random, chr1_gl000192_random, chr4_ctg9_hap1, chr4_gl000193_random, chr4_gl000194_random, chr6_apd_hap1, chr6_cox_hap2, chr6_dbb_hap3, chr6_mann_hap4, chr6_mcf_hap5, chr6_qbl_hap6, chr6_ssto_hap7, chr7_gl000195_random, chr8_gl000196_random, chr8_gl000197_random, chr9_gl000198_random, chr9_gl000199_random, chr9_gl000200_random, chr9_gl000201_random, chr11_gl000202_random, chr17_ctg5_hap1, chr17_gl000203_random, chr17_gl000204_random, chr17_gl000205_random, chr17_gl000206_random, chr18_gl000207_random, chr19_gl000208_random, chr19_gl000209_random, chr21_gl000210_random, chrUn_gl000211, chrUn_gl000212, chrUn_gl000213, chrUn_gl000214, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chrUn_gl000218, chrUn_gl000219, chrUn_gl000220, chrUn_gl000221, chrUn_gl000222, chrUn_gl000223, chrUn_gl000224, chrUn_gl000225, chrUn_gl000226, chrUn_gl000227, chrUn_gl000228, chrUn_gl000229, chrUn_gl000230, chrUn_gl000231, chrUn_gl000232, chrUn_gl000233, chrUn_gl000234, chrUn_gl000235, chrUn_gl000236, chrUn_gl000237, chrUn_gl000238, chrUn_gl000239, chrUn_gl000240, chrUn_gl000241, chrUn_gl000242, chrUn_gl000243, chrUn_gl000244, chrUn_gl000245, chrUn_gl000246, chrUn_gl000247, chrUn_gl000248, chrUn_gl000249]

I checked the headers of the input .bam files and of the known indels vcf file (1000G_phase1.indels.hg19.sites.vcf, which came from the GATK resource bundle) to make sure the contigs are ordered the same as in the reference. Neither is ordered the way the error reports for the known file, i.e. chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM.
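The header check described above can also be scripted. The following is only a minimal sketch: the function names are made up for illustration, and it assumes the contig lines look like the `contig=<ID=...,length=...>` lines of a VCF header (with or without the leading `##`). It reports whether the contigs shared by two lists appear in the same relative order, which is the condition the first error message complains about.

```python
def contig_ids_from_vcf_header(header_lines):
    """Return contig IDs in the order they appear in VCF header lines.

    Accepts lines like '##contig=<ID=chr1,length=249250621,assembly=hg19>'
    (the leading '##' is optional here). Other header lines are ignored.
    """
    ids = []
    for line in header_lines:
        line = line.strip().lstrip("#")
        if not line.startswith("contig=<"):
            continue
        fields = line[len("contig=<"):].rstrip(">").split(",")
        for field in fields:
            if field.startswith("ID="):
                ids.append(field[len("ID="):])
                break
    return ids


def same_relative_order(contigs_a, contigs_b):
    """True if the contigs present in both lists occur in the same order."""
    shared = set(contigs_a) & set(contigs_b)
    order_a = [c for c in contigs_a if c in shared]
    order_b = [c for c in contigs_b if c in shared]
    return order_a == order_b


# Shortened versions of the two lists from the error message above:
# chrM is last in the "known" list but first in the reference list.
known = ["chr1", "chr2", "chr21", "chrX", "chrY", "chrM"]
reference = ["chrM", "chr1", "chr2", "chr21", "chrX", "chrY"]
print(same_relative_order(known, reference))
```

Feeding it the header of the vcf file and the contig names from the reference dictionary would show whether the files themselves disagree or whether the tool is reading the order from somewhere else.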

I also noticed when I do not supply the file of known indels, I do not get an error, so it seems the issue is with the 1000G_phase1.indels.hg19.sites.vcf.

I tried PicardTools SortVcf on the 1000G_phase1.indels.hg19.sites.vcf file, using the dictionary file corresponding to ucsc.hg19.fasta to re-order that file in case something was out of order. Then I ran this command:

java -Xmx4g -jar /home/sxh615/jobs/scripts/GenomeAnalysisTK.jar -T RealignerTargetCreator -R /home/sxh615/jobs/REF_GENOME/ucsc.hg19.fasta -L all_CryptCRC_Enhancers.v4filtered.sortedMerged.chr21.intervals -I /scratch/pbsjobs/pbstmp.3446143.hpcmaster/LP6005/LP6005254-DNA_B01/LP6005254-DNA_B01.chr21.RG.reorder.bam -I /scratch/pbsjobs/pbstmp.3446143.hpcmaster/LP6005/LP6005203-DNA_B01/LP6005203-DNA_B01.chr21.RG.reorder.bam -o /scratch/pbsjobs/pbstmp.3446143.hpcmaster/LP6005/LP6005203-DNA_B01/LP6005203-DNA_B01.chr21.RG.reorder.chr21.RTK.intervals -nt 2 -known /home/sxh615/jobs/scripts/REF_VARs/1000G_phase1.indels.hg19.sites.sorted.vcf

And got this error message:

ERROR MESSAGE: Lexicographically sorted human genome sequence detected in known.
ERROR For safety's sake the GATK requires human contigs in karyotypic order: 1, 2, ..., 10, 11, ..., 20, 21, 22, X, Y with M either leading or trailing these contigs.
ERROR This is because all distributed GATK resources are sorted in karyotypic order, and your processing will fail when you need to use these files.
ERROR You can use the ReorderSam utility to fix this problem: http://gatkforums.broadinstitute.org/discussion/58/companion-utilities-reordersam
ERROR known contigs = [chr1, chr10, chr11, chr11_gl000202_random, chr12, chr13, chr14, chr15, chr16, chr17, chr17_ctg5_hap1, chr17_gl000203_random, chr17_gl000204_random, chr17_gl000205_random, chr17_gl000206_random, chr18, chr18_gl000207_random, chr19, chr19_gl000208_random, chr19_gl000209_random, chr1_gl000191_random, chr1_gl000192_random, chr2, chr20, chr21, chr21_gl000210_random, chr22, chr3, chr4, chr4_ctg9_hap1, chr4_gl000193_random, chr4_gl000194_random, chr5, chr6, chr6_apd_hap1, chr6_cox_hap2, chr6_dbb_hap3, chr6_mann_hap4, chr6_mcf_hap5, chr6_qbl_hap6, chr6_ssto_hap7, chr7, chr7_gl000195_random, chr8, chr8_gl000196_random, chr8_gl000197_random, chr9, chr9_gl000198_random, chr9_gl000199_random, chr9_gl000200_random, chr9_gl000201_random, chrM, chrUn_gl000211, chrUn_gl000212, chrUn_gl000213, chrUn_gl000214, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chrUn_gl000218, chrUn_gl000219, chrUn_gl000220, chrUn_gl000221, chrUn_gl000222, chrUn_gl000223, chrUn_gl000224, chrUn_gl000225, chrUn_gl000226, chrUn_gl000227, chrUn_gl000228, chrUn_gl000229, chrUn_gl000230, chrUn_gl000231, chrUn_gl000232, chrUn_gl000233, chrUn_gl000234, chrUn_gl000235, chrUn_gl000236, chrUn_gl000237, chrUn_gl000238, chrUn_gl000239, chrUn_gl000240, chrUn_gl000241, chrUn_gl000242, chrUn_gl000243, chrUn_gl000244, chrUn_gl000245, chrUn_gl000246, chrUn_gl000247, chrUn_gl000248, chrUn_gl000249, chrX, chrY]

But the order of the contigs in the 1000G_phase1.indels.hg19.sites.sorted.vcf file is not lexicographic; it is karyotypic.
I can't think of the next troubleshooting step. I've also searched the forum for similar issues but did not get much further. Any help on what could be causing these issues would be much appreciated.
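To tell the two orderings apart without reading a long list by eye, a quick check like the one below may help. It is a sketch only: `is_lexicographic` is a hypothetical helper, and it treats "lexicographic" as plain string sorting, which is an assumption about what the GATK check means rather than a copy of its logic.

```python
def is_lexicographic(contigs):
    """True if the contig names are in plain string-sorted order.

    A list with fewer than two names is reported as False, since its
    order carries no information.
    """
    contigs = list(contigs)
    return len(contigs) > 1 and contigs == sorted(contigs)


# Shortened version of the list printed in the second error message.
reported = ["chr1", "chr10", "chr11", "chr2", "chr20", "chr21",
            "chr3", "chrM", "chrX", "chrY"]
# Shortened karyotypic order, with chrM leading as in ucsc.hg19.fasta.
karyotypic = ["chrM", "chr1", "chr2", "chr3", "chr10", "chr11",
              "chr20", "chr21", "chrX", "chrY"]

print(is_lexicographic(reported))
print(is_lexicographic(karyotypic))
```

If the list taken from the vcf header passes as karyotypic while the list in the error message is string-sorted, the tool is getting its contig order from something other than the header itself.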

Thanks!
Steve

Comments

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie, admin)

    Hi Steve,

    Is the ordering shown in the "ERROR known contigs = [...]" line correct? If so, that is lexicographic order.

  • newGATKuser, Case (Member)

    Hi @Geraldine_VdAuwera

    Thanks for the fast reply. The header of the 1000G_phase1.indels.hg19.sites.sorted.vcf file has the order as this:

    contig=<ID=chrM,length=16571,assembly=hg19>
    contig=<ID=chr1,length=249250621,assembly=hg19>
    contig=<ID=chr2,length=243199373,assembly=hg19>
    contig=<ID=chr3,length=198022430,assembly=hg19>
    contig=<ID=chr4,length=191154276,assembly=hg19>
    contig=<ID=chr5,length=180915260,assembly=hg19>
    contig=<ID=chr6,length=171115067,assembly=hg19>
    contig=<ID=chr7,length=159138663,assembly=hg19>
    contig=<ID=chr8,length=146364022,assembly=hg19>
    contig=<ID=chr9,length=141213431,assembly=hg19>
    contig=<ID=chr10,length=135534747,assembly=hg19>
    contig=<ID=chr11,length=135006516,assembly=hg19>
    contig=<ID=chr12,length=133851895,assembly=hg19>
    contig=<ID=chr13,length=115169878,assembly=hg19>
    contig=<ID=chr14,length=107349540,assembly=hg19>
    contig=<ID=chr15,length=102531392,assembly=hg19>
    contig=<ID=chr16,length=90354753,assembly=hg19>
    contig=<ID=chr17,length=81195210,assembly=hg19>
    contig=<ID=chr18,length=78077248,assembly=hg19>
    contig=<ID=chr19,length=59128983,assembly=hg19>
    contig=<ID=chr20,length=63025520,assembly=hg19>
    contig=<ID=chr21,length=48129895,assembly=hg19>
    contig=<ID=chr22,length=51304566,assembly=hg19>
    contig=<ID=chrX,length=155270560,assembly=hg19>
    contig=<ID=chrY,length=59373566,assembly=hg19>
    contig=<ID=chr1_gl000191_random,length=106433,assembly=hg19>
    contig=<ID=chr1_gl000192_random,length=547496,assembly=hg19>
    contig=<ID=chr4_ctg9_hap1,length=590426,assembly=hg19>
    contig=<ID=chr4_gl000193_random,length=189789,assembly=hg19>
    contig=<ID=chr4_gl000194_random,length=191469,assembly=hg19>
    contig=<ID=chr6_apd_hap1,length=4622290,assembly=hg19>
    contig=<ID=chr6_cox_hap2,length=4795371,assembly=hg19>
    contig=<ID=chr6_dbb_hap3,length=4610396,assembly=hg19>
    .....

    I also manually sifted through the file to check that chr1 indels are before chr2 indels, which are before chr3 indels, etc.

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie, admin)

    Alright, that looks correct; maybe the index is out of date. Try deleting the vcf index file and then rerunning the RTC command. GATK will generate a new index, which might fix your problem.

  • newGATKuser, Case (Member)

    Thanks @Geraldine_VdAuwera !

    Re-making the index fixed the problem. I really appreciate your help as I was stuck on this for a couple of days.
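For anyone who hits the same problem, the fix from this thread (delete the vcf index, then rerun so GATK rebuilds it) can be sketched as a small helper. This is an illustration only: `remove_vcf_index` is a hypothetical function, and the `.idx` suffix is an assumption about how the index file is named next to the vcf (for example `1000G_phase1.indels.hg19.sites.sorted.vcf.idx`); check what is actually in your directory before deleting anything.

```python
import os


def remove_vcf_index(vcf_path, suffixes=(".idx",)):
    """Delete index files that sit next to a vcf file.

    The index is assumed to be named '<vcf_path><suffix>'. Returns the
    list of paths that were removed; the vcf itself is never touched.
    """
    removed = []
    for suffix in suffixes:
        index_path = vcf_path + suffix
        if os.path.isfile(index_path):
            os.remove(index_path)
            removed.append(index_path)
    return removed
```

After calling it on the sorted vcf, rerunning the same RealignerTargetCreator command should make GATK write a fresh index, as described above. Only the index is removed, so nothing is lost if the guess about the cause turns out to be wrong.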
