We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

SNP calling on inverted repeats

Dear GATK team,

I have encountered a problem when I used the HaploTypeCaller for variant detection on about 100 plastid genomes. The plastid genome is haploid and contains two large inverted repeats (which are presumably almost 100% identical, though inverted). However, no variants are detected on either of these regions and the SNPs/indels are only reported in the non-repeated regions.
I would expect that intra-individual polymorphisms on the inverted repeats would not be detected, since the mapping algoritm from BWA or similar can't assign the reads accurately to either of the repeated regions. However, there are variants between samples that are present in both inverted repeats and I would expect that the haplotype caller should find these. I used VarScan on the same set of bam files and had no problem in detecting variants in the inverted repeats.

I ran the following command:
java -jar ~/Prog/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -ploidy 1 -I "$i".recal.bam -o ../plastid_snp/gvcf_hap/"$i".g.vcf -ERC GVCF --variant_index_type LINEAR --variant_index_parameter 128000

The resulting gvcf files contain only polymorphisms in the non-repeated DNA, thus it's not a problem of the variant filtering step.

I was wondering whether you have an idea why the haplotype caller doesn't call the variants in the inverted repeats? Did you ever encounter similar problems? Any ideas/inputs would be highly appreciated. I could imgaine that the problem has to do with BWA assigning a lower mapping score to reads that are not uniquely mapped.

Of course, a simple workaround is to delete one of the repeats from the reference before read mapping.

Kind regards,

Best Answer


  • SheilaSheila Broad InstituteMember, Broadie ✭✭✭✭✭

    Hi Marco,

    How large are the repeats? HaplotypeCaller may not be able to detect them if they are very large. You can also try increasing the active region size with --activeRegionMaxSize.

    You may also want to look into GenomeSTRiP, a tool for detecting structural variants.


  • Robur86Robur86 UKMember

    Hi Sheila,

    Thanks for your answer.

    The inverted repeats are large (~30'000bp each) and the genome is around 160'00bp. I increased the --activeRegionMaxSize for a start to 3000 but that didn't help. To be honest, I am unsure how this could improve the variant calling (and what an appropriate MaxSize would be) and why the haplotypecaller is not able to detect variants in the inverted repeats. Like I stated above, it could be possible to delete one of the inverted repeats from the reference before read mapping, but this seems not very elegant, since e.g. the coverage would be artificially influenced. What's the reason that the haplotypecaller doesn't recognize variation in those regions? I guess that there is no straightforward workaround?

    GenomeSTRiP seems and interesting option, but I am not mainly interested in structural variation.

    Many thanks,

  • SheilaSheila Broad InstituteMember, Broadie ✭✭✭✭✭

    Hi Marco,

    Wow, those are very large repeats. Can you post an IGV screenshot that shows the mapping in the region where the repeats are?


  • Robur86Robur86 UKMember

    Hi Sheila,

    Indeed, and that's a feature that is shared among most plastid genomes. I attach a screenshot from IGV. The inverted repeat starts basically exactly where the mapped reads are more transparent, which indicates mapping quality. I think that this gives the answer: The BWA assigns a low quality score if a read cannot be uniquely mapped and If I understood it correctly, the haplotypecaller takes the mapping quality into account and thus discards the variants because of a low quality score.

    I guess that the only option is to either go ahead without using variants from those regions or to use a different tool?

    Thanks for taking your time, this is a great forum!

  • frebiofrebio belgiumMember

    Dear Sheila,

    I would like to come back to this post, because I'm facing similar problems where I want to call SNPs in gene tandem arrays. I wonder whether it would be ok to set Haplotypecaller's MappingQualityReadFilter to a --minimum-mapping-quality of zero? Or are there better alternatives?

    Thanks for the help.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @frebio,

    For tandem genes, an issue you will encounter is multi-mapping reads that are flagged secondary (0x100). GATK also ignores secondary alignments from analyses. So, setting min MAPQ to zero is insufficient to analyze such reads.

Sign In or Register to comment.