Problem due to: "MESSAGE: Input files reads and reference have incompatible contigs"

I am trying to compute mean coverage (using GATK DepthOfCoverage) for a BAM file (targeted sequencing) aligned to the hg19 reference.
java -Xmx2g -jar GenomeAnalysisTK.jar \
-R ucsc.hg19.fasta \
-T DepthOfCoverage \
-I my_bam.list \
-L my_targets.bed \
-o coverage
The problem reported is:
##### ERROR MESSAGE: Input files reads and reference have incompatible contigs: Found contigs with the same name but different lengths:
##### ERROR contig reads = chrM / 16569
##### ERROR contig reference = chrM / 16571.
##### ERROR reads contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM]
##### ERROR reference contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chr1_gl000191_random, chr1_gl000192_random, chr4_ctg9_hap1, chr4_gl000193_random, chr4_gl000194_random, chr6_apd_hap1, chr6_cox_hap2, chr6_dbb_hap3, chr6_mann_hap4, chr6_mcf_hap5, chr6_qbl_hap6, chr6_ssto_hap7, chr7_gl000195_random, chr8_gl000196_random, chr8_gl000197_random, chr9_gl000198_random, chr9_gl000199_random, chr9_gl000200_random, chr9_gl000201_random, chr11_gl000202_random, chr17_ctg5_hap1, chr17_gl000203_random, chr17_gl000204_random, chr17_gl000205_random, chr17_gl000206_random, chr18_gl000207_random, chr19_gl000208_random, chr19_gl000209_random, chr21_gl000210_random, chrUn_gl000211, chrUn_gl000212, chrUn_gl000213, chrUn_gl000214, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chrUn_gl000218, chrUn_gl000219, chrUn_gl000220, chrUn_gl000221, chrUn_gl000222, chrUn_gl000223, chrUn_gl000224, chrUn_gl000225, chrUn_gl000226, chrUn_gl000227, chrUn_gl000228, chrUn_gl000229, chrUn_gl000230, chrUn_gl000231, chrUn_gl000232, chrUn_gl000233, chrUn_gl000234, chrUn_gl000235, chrUn_gl000236, chrUn_gl000237, chrUn_gl000238, chrUn_gl000239, chrUn_gl000240, chrUn_gl000241, chrUn_gl000242, chrUn_gl000243, chrUn_gl000244, chrUn_gl000245, chrUn_gl000246, chrUn_gl000247, chrUn_gl000248, chrUn_gl000249]
##### ERROR ------------------------------------------------------------------------------------------
Could you please help me to find a solution?
Many thanks in advance.
Best Answer
Geraldine_VdAuwera Cambridge, MA admin
@NicolaC You have to make sure that everything matches: the reference, the BAM contigs and any intervals file (like your BED file). Based on the sequence dictionary you posted above, it looks like your reads were aligned to the b37 reference, but modified to have the 'chr' prefix in the contig names. That is not a good sign -- that's the kind of modification that causes all these compatibility problems.
There are several ways to fix this problem.
1. The safest thing to do is realign your reads from scratch to the reference you want to use, and use all the matching files. It takes more time but it's the only way to be sure that nothing else can go wrong.
2. You could strip all 'chr' prefixes from both your BAM and BED file, and rename chrM to MT, to be able to use the real b37 reference as well as b37-aligned resources (known variants etc.). The coordinates will match up; the only contig that is slightly different between hg19 and b37 is the mitochondrial DNA, which you probably don't need to care about. This can work if you do it carefully and cleanly, but many things can go wrong during the editing process that can screw up your files even more.
3. You could add 'chr' prefixes to the contig names in the b37 reference and rename MT to chrM, which is probably what was used to align your reads. Same caveat as for option #2: it can work if you do it carefully and cleanly, but things can go wrong during the editing process that can screw up your files even more. It's the option I dislike the most because it perpetuates the false identification of the reference build, and you'll need to adapt all resource files (known variants etc.) accordingly.
My recommendation is to use option #1, which is the cleanest and safest option.
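For the BED half of option #2, the renaming can be sketched in a few lines of Python. This is only a minimal illustration and not part of GATK: the names `to_b37_name` and `rename_bed` are made up for this example, the input is assumed to be tab-delimited BED text, and only the 25 primary contigs listed in the reads' header above (chr1 to chr22, chrX, chrY, chrM) are handled. Any other name raises an error instead of being renamed silently. The BAM header needs the same renaming by other means, which this sketch does not cover.

```python
# Hypothetical helper names; assumes tab-delimited BED text.
PRIMARY = {f"chr{i}" for i in range(1, 23)} | {"chrX", "chrY", "chrM"}


def to_b37_name(contig: str) -> str:
    """Map a 'chr'-prefixed primary contig name to the b37 naming style."""
    if contig not in PRIMARY:
        raise ValueError(f"not a known primary contig: {contig!r}")
    if contig == "chrM":
        return "MT"  # b37 names the mitochondrial contig MT
    return contig[len("chr"):]


def rename_bed(text: str) -> str:
    """Rewrite the first column of every interval line in BED text."""
    out = []
    for line in text.splitlines():
        # Pass blank, comment, and header lines through unchanged.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            out.append(line)
            continue
        fields = line.split("\t")
        fields[0] = to_b37_name(fields[0])
        out.append("\t".join(fields))
    return "".join(line + "\n" for line in out)
```

Failing loudly on unknown names is a deliberate choice here: a half-renamed file is exactly the kind of thing that can go wrong during editing.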
Answers
@NicolaC Have you checked whether your BAMs are the same build as your reference?
I checked contigs in the header of the BAM file, that is:
And the LB tag is
LB:hg19/IonXpress
I assumed that hg19 was used as the reference. Am I wrong? Is there any way to verify it?
Thank you.
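One way to check this is to compare the contig names and lengths in the BAM header with those of the reference. The sketch below is an illustration under assumptions: it expects the header text as printed by `samtools view -H` and the contents of the reference `.fai` index as written by `samtools faidx`, and the function names are invented for this example.

```python
from __future__ import annotations


def parse_sam_sq(header_text: str) -> dict[str, int]:
    """Collect contig name -> length from the @SQ lines of a SAM header."""
    contigs = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each field after the record type looks like TAG:value.
        tags = dict(field.split(":", 1) for field in line.split("\t")[1:])
        contigs[tags["SN"]] = int(tags["LN"])
    return contigs


def parse_fai(fai_text: str) -> dict[str, int]:
    """Collect contig name -> length from the first two .fai columns."""
    contigs = {}
    for line in fai_text.splitlines():
        if line.strip():
            name, length = line.split("\t")[:2]
            contigs[name] = int(length)
    return contigs


def length_mismatches(bam: dict[str, int], ref: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Contigs present in both with different lengths: name -> (bam, ref)."""
    return {n: (bam[n], ref[n]) for n in bam if n in ref and bam[n] != ref[n]}


def missing_from_reference(bam: dict[str, int], ref: dict[str, int]) -> list[str]:
    """BAM contigs whose names do not occur in the reference at all."""
    return [n for n in bam if n not in ref]
```

With the lengths from the error message above, `length_mismatches` would report chrM as `(16569, 16571)`, which is the same disagreement GATK complains about.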
@NicolaC check this page (4. What is the canonical ordering of human reference contigs in a BAM file?):
http://gatkforums.broadinstitute.org/discussion/1317/collected-faqs-about-bam-files
Also check this page:
http://gatkforums.broadinstitute.org/discussion/2396/input-files-known-and-reference-have-incompatible-contigs
Excellent answer by @pmint:
"in hg19 version, chrM length = 16571; in b37 version, chrM length = 16569"
So switch from hg19 to b37 and your problem should/might be sorted. I hope that helps.
If you search for your error message, you will find that others have had the same problem.
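The chrM lengths quoted above can be turned into a quick hint about which build a header came from. This is a rough sketch, not a definitive test: the function name and the labels are invented for this example, and a matching length is only a hint, since other assemblies may share the same mitochondrial length.

```python
# Lengths quoted in this thread; the labels are this sketch's own wording.
MITO_LENGTH_HINTS = {
    16571: "hg19-style chrM",
    16569: "b37-style MT",
}


def mito_hint(contigs: dict) -> str:
    """Guess the mitochondrial sequence flavour from a name -> length map."""
    for name in ("chrM", "MT"):
        if name in contigs:
            return MITO_LENGTH_HINTS.get(contigs[name], "unrecognised length")
    return "no mitochondrial contig found"
```

For the reads in this thread (chrM with length 16569), the hint points at b37 despite the 'chr' prefix.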
@tommycarstensen your suggestions are very useful. Thank you for your help.
I tried using the b37 reference instead: human_g1k_v37.fasta
The error reported is:
MESSAGE: File associated with my_target.bed is malformed: Problem reading the interval file caused by Badly formed genome loc: Contig chr8 given as location, but this contig isn't present in the Fasta sequence dictionary
This happens because regions in the BED file are specified as:
Is simply removing the "chr" prefix from the BED file target regions enough to solve the problem without introducing any bias? I was wondering whether, for example, the region chr8:234370-234371 (hg19) corresponds exactly to 8:234370-234371 (b37).
Thank you.
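The check that fails here can be reproduced outside GATK before editing anything. The sketch below is a minimal illustration: it lists the contig names used in a BED file that are absent from a set of reference contig names. The function name is invented, the BED text is assumed to be tab-delimited, and building the set of reference names (for example from the reference's sequence dictionary) is left out.

```python
def contigs_missing_from_reference(bed_text: str, known: set) -> list:
    """Return BED contig names not in `known`, in order of first appearance."""
    missing = []
    for line in bed_text.splitlines():
        # Skip blank, comment, and header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        contig = line.split("\t")[0]
        if contig not in known and contig not in missing:
            missing.append(contig)
    return missing
```

An empty result only means the names match. It says nothing about whether the coordinates refer to the same sequence, which is the separate question asked above.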
Dear @Geraldine_VdAuwera, thank you for the more than exhaustive reply.
I agree with you that "it looks like your reads were aligned to the b37 reference, but modified to have the 'chr' prefix in the contig names". I am processing BAM files sequenced and aligned by IonTorrent. I found out that it uses a specific assembly of the human reference genome. I will either ask for that specific assembly to solve the compatibility problems or, as you proposed, re-align my reads.
Many thanks for all the help received!
Ah I see -- good to know that Ion uses their own reference build. Good luck!