UnifiedGenotyper: reads vs. reference contig incompatibility
We are running UnifiedGenotyper (GATK 2.5-2) to call variants in 5.5 Mb of targeted capture sequence, using the following command:
java -jar /my.dir/GenomeAnalysisTK-2.5-2-gf57256b/GenomeAnalysisTK.jar \
-T UnifiedGenotyper \
-R /my.dir/hg19/ucsc.hg19.fasta \
-I /my.dir/bam/solid5500_FC1_20120227_01_08NA35454_F3.csfasta.ma.bam \
-I /my.dir/bam/solid5500_FC1_20120227_02_08NA35454_F3.csfasta.ma.bam \
-I /my.dir/bam/solid5500_FC1_20120227_03_08NA35454_F3.csfasta.ma.bam \
-o /my.dir/test3bam.vcf \
--dbsnp /my.dir/hg19/dbsnp_137.hg19.vcf \
-glm BOTH \
-L /my.dir/039087_D_BED_20120215_mod1.bed \
-stand_call_conf 30.0 \
-stand_emit_conf 30.0
We get the following error:
ERROR MESSAGE: Input files reads and reference have incompatible contigs: The following contigs included in the intervals to process have different indices in the sequence dictionaries for the reads vs. the reference: [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr14, chr15, chr17, chr19, chr20, chr22]. As a result, the GATK engine will not correctly process reads from these contigs. You should either fix the sequence dictionaries for your reads so that these contigs have the same indices as in the sequence dictionary for your reference, or exclude these contigs from your intervals. This error can be disabled via -U ALLOW_SEQ_DICT_INCOMPATIBILITY, however this is not recommended as the GATK engine will not behave correctly..
ERROR reads contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chr23, chr24, chr25]
ERROR reference contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chr1_gl000191_random, chr1_gl000192_random, chr4_ctg9_hap1, chr4_gl000193_random, chr4_gl000194_random, chr6_apd_hap1, chr6_cox_hap2, chr6_dbb_hap3, chr6_mann_hap4, chr6_mcf_hap5, chr6_qbl_hap6, chr6_ssto_hap7, chr7_gl000195_random, chr8_gl000196_random, chr8_gl000197_random, chr9_gl000198_random, chr9_gl000199_random, chr9_gl000200_random, chr9_gl000201_random, chr11_gl000202_random, chr17_ctg5_hap1, chr17_gl000203_random, chr17_gl000204_random, chr17_gl000205_random, chr17_gl000206_random, chr18_gl000207_random, chr19_gl000208_random, chr19_gl000209_random, chr21_gl000210_random, chrUn_gl000211, chrUn_gl000212, chrUn_gl000213, chrUn_gl000214, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chrUn_gl000218, chrUn_gl000219, chrUn_gl000220, chrUn_gl000221, chrUn_gl000222, chrUn_gl000223, chrUn_gl000224, chrUn_gl000225, chrUn_gl000226, chrUn_gl000227, chrUn_gl000228, chrUn_gl000229, chrUn_gl000230, chrUn_gl000231, chrUn_gl000232, chrUn_gl000233, chrUn_gl000234, chrUn_gl000235, chrUn_gl000236, chrUn_gl000237, chrUn_gl000238, chrUn_gl000239, chrUn_gl000240, chrUn_gl000241, chrUn_gl000242, chrUn_gl000243, chrUn_gl000244, chrUn_gl000245, chrUn_gl000246, chrUn_gl000247, chrUn_gl000248, chrUn_gl000249]
Are we correct in concluding that the problem is that the contig names (e.g., chr25 vs. chrM) and the contig order differ between our .bam files and the reference .fasta provided in GATK's hg19 bundle?
If so, would a workable solution be to pass the .fasta reference we used to generate the .bam files (which is also based on hg19) as UnifiedGenotyper's "-R" argument, and to generate the accompanying .fai and .dict files using the workflow described here (http://gatkforums.broadinstitute.org/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference)?
And if we use our own reference .fasta, can we still pass the GATK bundle's unaltered "dbsnp_137.hg19.vcf" to the "--dbsnp" argument, or will that file need to be modified as well?
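To make our reading of the error concrete, here is a minimal Python sketch of the comparison as we understand it. It is not part of GATK and may not match how the engine performs its check; `compare_contigs` is a hypothetical helper of our own. The two lists are taken from the error message above, with the reference list truncated after chrY for brevity, which does not change the result.

```python
def compare_contigs(reads, reference):
    """Compare two ordered contig lists.

    Returns (shifted, reads_only):
      shifted    - contigs present in both lists but at different positions
      reads_only - contigs in the reads list that the reference lacks
    """
    ref_index = {name: i for i, name in enumerate(reference)}
    shifted = [
        name
        for i, name in enumerate(reads)
        if name in ref_index and ref_index[name] != i
    ]
    reads_only = [name for name in reads if name not in ref_index]
    return shifted, reads_only


# Contig lists as reported in the error message (reference truncated after chrY).
reads_contigs = [f"chr{i}" for i in range(1, 26)]
reference_contigs = ["chrM"] + [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]

shifted, reads_only = compare_contigs(reads_contigs, reference_contigs)
print("different index:", shifted)
print("only in reads:  ", reads_only)
```

With these inputs, every shared contig (chr1 through chr22) sits one position later in the reference because chrM comes first there, and chr23, chr24 and chr25 have no counterpart by name in the reference. GATK's message lists only 18 of those contigs, presumably because it reports only the ones covered by our -L intervals.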
Any advice would be much appreciated.