Variant calling: high (and strange) number of alternative allele
Deat GATK team,
I am calling variant on a trio (mother, father and offspring) of Macaca mulatta. I have whole genome sequencing 60X for each individual. I use GATK 126.96.36.199, I call variant with HaplotypeCaller BP-RESOLUTION, combine with GenomicDBimport per chromosomes and genotype with GenotypeGVCF.
I am interested in the number of sites where I have only reference allele (AD=0 for the alternative) and the number of sites where I have some reads supporting ALT allele (AD > 0) in the parents.
I found a lot of sites (for each individuals) where I have AD>0 in the gvcf file (per indiviuals, the combined one and after genotyping). I looked at each site that are HomRef and for each individuals less than 30% of the HomRef sites have AD=0 for the alternative allele. I know that HaplotypeCaller does a realignement step that may change the positions of the reads, but 70% of the sites that have AD>0 seems a lot. I looked back at the BAM file and those alternative alleles don’t seem to be there. I try to call again using the bam.out option, and here again I don’t see so many alternative alleles. However, I see that sometimes on a read where there were no alternative allele on the bam input there is an alternative allele on the output.
Also I have tried samtools mpileup and in this case almost 90% of the HomRef sites are AD=0 for alternative allele.
Just as an example bellow is the VCF output from HaplotypeCaller for one individual and then there is a picture of both the input bam file and the output bam file.
For chr1 pos 24203380 the ref is A and I have:
Vcf --> DP=96 AD=92,4
Bam input --> DP 93, 92,1 (N)
Bam out --> DP=80, 79,1 (N)
chr1 24203380 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:92,4:96:57:0,57,5771 chr1 24203381 . G <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:90,5:95:0:0,0,5897 chr1 24203382 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:92,3:95:78:0,78,6075 chr1 24203383 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:92,3:95:68:0,68,6127
Just in case here is my code:
gatk --java-options "-XX:ParallelGCThreads=16 -Xmx64g" HaplotypeCaller -R /PATH/rheMac8.fa -I /PATH/R01068_sorted.merged.addg.uniq.rmdup.bam -O /PATH/R01068_res.g.vcf -ERC BP_RESOLUTION \
I don’t know why I have this high number of alternative alleles and how to get read of them to have the 'real' number of alternative allele per position. The problem persists on the genotyping vcf files with some alternative alleles that are not present on any bam (input or HaplotypeCaller output).
I hope I gave you enough details so you have a clear idea of my problem and will be able to help me.