SNP calling on inverted repeats
Dear GATK team,
I have encountered a problem when using HaplotypeCaller for variant detection on about 100 plastid genomes. The plastid genome is haploid and contains two large inverted repeats (which are presumably almost 100% identical, though inverted). However, no variants are detected in either of these regions; SNPs/indels are only reported in the non-repeated regions.
I would expect intra-individual polymorphisms in the inverted repeats to go undetected, since the mapping algorithm of BWA or a similar tool can't assign the reads accurately to one repeat copy or the other. However, there are variants between samples that are present in both inverted repeats, and I would expect HaplotypeCaller to find these. I ran VarScan on the same set of BAM files and it had no problem detecting variants in the inverted repeats.
I ran the following command:
java -jar ~/Prog/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -ploidy 1 -I "$i".recal.bam -o ../plastid_snp/gvcf_hap/"$i".g.vcf -ERC GVCF --variant_index_type LINEAR --variant_index_parameter 128000
The resulting GVCF files contain only polymorphisms in the non-repeated DNA, so the variants are not being lost at the variant filtering step.
I was wondering whether you have an idea why HaplotypeCaller doesn't call the variants in the inverted repeats. Have you encountered similar problems before? Any ideas or input would be highly appreciated. I could imagine that the problem has to do with BWA assigning a lower mapping quality to reads that are not uniquely mapped.
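One way to check the mapping-quality hypothesis would be to tabulate the MAPQ values of the reads that fall in one of the inverted repeats. This is only a minimal sketch: it assumes samtools is installed, and the contig name and coordinates in the usage line are placeholders, not values from my data. The helper name mapq_hist is made up for illustration.

```shell
# Tabulate MAPQ values (SAM column 5) from SAM records read on stdin.
# Prints "MAPQ<TAB>read count", sorted numerically by MAPQ.
mapq_hist() {
    awk -F '\t' '!/^@/ { n[$5]++ } END { for (q in n) print q "\t" n[q] }' | sort -n
}

# Hypothetical usage (contig name and coordinates are placeholders for one IR copy):
# samtools view "$i".recal.bam "plastid:85000-110000" | mapq_hist
```

If nearly all reads in the repeats turn out to have MAPQ 0 while reads in the single-copy regions have high values, that would be consistent with the reads being discarded by a mapping-quality filter before calling.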
Of course, a simple workaround is to delete one of the repeats from the reference before read mapping.
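A rough sketch of that workaround, assuming the reference is a single-record FASTA and the second repeat copy sits at the very end of the sequence, so that truncating is enough. The helper name trim_fasta and the cut-off position in the usage line are hypothetical.

```shell
# Keep only the first N bases of a single-record FASTA read from stdin.
# Assumes the second IR copy lies at the very end of the reference sequence.
trim_fasta() {
    awk -v last="$1" '
        /^>/ { print; next }            # copy the header line unchanged
        { seq = seq $0 }                # join wrapped sequence lines
        END {
            seq = substr(seq, 1, last)  # drop everything after position last
            for (i = 1; i <= length(seq); i += 60)
                print substr(seq, i, 60)
        }'
}

# Hypothetical usage (130000 is a placeholder for the last base before the second IR copy):
# trim_fasta 130000 < reference.fasta > reference_oneIR.fasta
```

The trimmed reference would then need to be re-indexed (BWA index, .fai and .dict files) and the reads re-mapped before running HaplotypeCaller again.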