High proportion of spanning deletions in a whole-genome callset
Hi GATK team,
I am working on a callset of ~150 high-coverage human genomes. My processing pipeline follows the GATK Best Practices with two exceptions: (1) I use IndelRealigner, because we started processing the data before you stopped recommending it, and (2) at the BQSR step I run additional steps involving variant calling on each sample (to avoid being too dependent on reference datasets). I am using GATK 3.5 and GATK 3.7 (3.7 for HaplotypeCaller and the steps afterwards).
Most of my samples are from Africa, but ~20 are from outside Africa.
In my callset before VQSR, I observe a very high proportion of variants with * as the alternate allele (spanning deletions): they represent more than 25% of my biallelic variants. I looked at some of these variants in the VCF file and, as far as I can tell, they seem to make sense (they are close to indels, etc.). However, I am worried about such a high proportion, and I am having difficulty finding information about these variants and what to do with them.
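For concreteness, the proportion could be computed along the lines below. This is only a minimal sketch, not the exact script behind the number above. The function names are hypothetical. It assumes a standard tab-delimited VCF (plain or gzip/bgzip-compressed) and treats a "biallelic variant" as any record with exactly one ALT allele.

```python
import gzip


def open_vcf(path):
    """Open a plain or gzip/bgzip-compressed VCF as text (hypothetical helper)."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def spanning_deletion_fraction(lines):
    """Count single-ALT records whose ALT is '*'.

    `lines` is any iterable of VCF lines (e.g. an open file handle).
    Returns (n_star, n_biallelic, fraction).
    """
    n_biallelic = 0
    n_star = 0
    for line in lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        alts = fields[4].split(",")  # column 5 of a VCF record is ALT
        # Assumption: "biallelic" means exactly one ALT allele;
        # records with ALT "." (no variant) are ignored.
        if len(alts) != 1 or alts[0] == ".":
            continue
        n_biallelic += 1
        if alts[0] == "*":
            n_star += 1
    fraction = n_star / n_biallelic if n_biallelic else 0.0
    return n_star, n_biallelic, fraction
```

It would be called as `spanning_deletion_fraction(open_vcf("callset.vcf.gz"))`, where the file name is a placeholder. Multiallelic records that include * among several ALT alleles are deliberately left out of both counts here.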
Do you think such a high proportion of spanning deletions is plausible, or could it be caused by something else in my pipeline? If the latter, which checks could I run? Do you know where I could find data to compare mine with?
Thanks in advance for your insights,