Running MuTect on paired-end vs. single-end read data

max_shpak

I have been running MuTect on a number of TCGA .bam files recently. Many of the .bam files were aligned to hg18 and had to be realigned to hg19 for consistency with other files in the data set. In all cases, I used Homo_sapiens_assembly19 as the reference, and once the files were remapped, MuTect and other mutation callers ran without a problem.

However, one .bam file was created from unpaired (single-end) reads. We were able to modify our scripts to realign it to hg19 without significant problems, but when we attempt to run MuTect on it, we get the following error:

ERROR MESSAGE: Input files reads and reference have incompatible contigs: The following contigs included in the intervals to process have different indices in the sequence dictionaries for the reads vs. the reference: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,19, 20, 21, 22, X, Y]. As a result, the GATK engine will not correctly process reads from these contigs. You should either fix the sequence dictionaries for your reads so that these contigs have the same indices as in the sequence dictionary for your reference, or exclude these contigs from your intervals. This error can be disabled via -U ALLOW_SEQ_DICT_INCOMPATIBILITY, however this is not recommended as the GATK engine will not behave correctly.

We did not get this or any other error message while processing the .bam files that were originally generated from paired-end reads. Since the reads have already been mapped by this stage of the analysis (using the same reference file), I don't see why the single-end origin should be a source of error. Similarly, the interval (.bed) file and the SNP coordinate file dbsnp_138.b37.vcf are compatible with the reference genome.
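For reference, the check that the error message describes (the same contig sitting at a different index in the reads' sequence dictionary than in the reference's) can be reproduced by hand. Below is a minimal sketch, assuming the two dictionaries are available as SAM-style header text with tab-separated @SQ lines. The function names and the toy headers are made up for illustration, and this is not GATK's own implementation.

```python
def contig_order(header_text):
    """Return contig names in the order their @SQ lines appear in a SAM-style header."""
    names = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Fields after the record type are TAG:VALUE pairs; SN holds the contig name.
        for field in line.split("\t")[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
                break
    return names


def index_mismatches(reads_header, ref_dict):
    """List (contig, index_in_reads, index_in_reference) for shared contigs whose index differs."""
    reads_idx = {name: i for i, name in enumerate(contig_order(reads_header))}
    ref_idx = {name: i for i, name in enumerate(contig_order(ref_dict))}
    return [
        (name, reads_idx[name], ref_idx[name])
        for name in reads_idx
        if name in ref_idx and reads_idx[name] != ref_idx[name]
    ]


# Toy headers (illustrative only): same contigs, different order.
reads_header = (
    "@HD\tVN:1.4\n"
    "@SQ\tSN:MT\tLN:16569\n"
    "@SQ\tSN:1\tLN:249250621\n"
    "@SQ\tSN:2\tLN:243199373\n"
)
ref_dict = (
    "@HD\tVN:1.4\n"
    "@SQ\tSN:1\tLN:249250621\n"
    "@SQ\tSN:2\tLN:243199373\n"
    "@SQ\tSN:MT\tLN:16569\n"
)

for name, reads_i, ref_i in index_mismatches(reads_header, ref_dict):
    print(f"{name}: index {reads_i} in reads, index {ref_i} in reference")
```

With real data, the first argument would be the output of `samtools view -H` on the .bam file and the second the contents of the reference's .dict file. The sketch only shows where the two orderings disagree; it does not explain why they differ.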

Could the fact that this .bam file was originally mapped from single-end reads be causing the problem? If not, what are the possible sources of error here?



