Variant calling: high (and strange) number of alternative allele

Deat GATK team,

I am calling variant on a trio (mother, father and offspring) of Macaca mulatta. I have whole genome sequencing 60X for each individual. I use GATK 4.0.7.0, I call variant with HaplotypeCaller BP-RESOLUTION, combine with GenomicDBimport per chromosomes and genotype with GenotypeGVCF.

I am interested in the number of sites where I have only reference allele (AD=0 for the alternative) and the number of sites where I have some reads supporting ALT allele (AD > 0) in the parents.

I found a lot of sites (for each individuals) where I have AD>0 in the gvcf file (per indiviuals, the combined one and after genotyping). I looked at each site that are HomRef and for each individuals less than 30% of the HomRef sites have AD=0 for the alternative allele. I know that HaplotypeCaller does a realignement step that may change the positions of the reads, but 70% of the sites that have AD>0 seems a lot. I looked back at the BAM file and those alternative alleles don’t seem to be there. I try to call again using the bam.out option, and here again I don’t see so many alternative alleles. However, I see that sometimes on a read where there were no alternative allele on the bam input there is an alternative allele on the output.
Also I have tried samtools mpileup and in this case almost 90% of the HomRef sites are AD=0 for alternative allele.

Just as an example bellow is the VCF output from HaplotypeCaller for one individual and then there is a picture of both the input bam file and the output bam file.
For chr1 pos 24203380 the ref is A and I have:
Vcf --> DP=96 AD=92,4
Bam input --> DP 93, 92,1 (N)
Bam out --> DP=80, 79,1 (N)

chr1 24203380 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:92,4:96:57:0,57,5771 chr1 24203381 . G <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:90,5:95:0:0,0,5897 chr1 24203382 . C <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:92,3:95:78:0,78,6075 chr1 24203383 . A <NON_REF> . . . GT:AD:DP:GQ:PL 0/0:92,3:95:68:0,68,6127

Just in case here is my code:
gatk --java-options "-XX:ParallelGCThreads=16 -Xmx64g" HaplotypeCaller -R /PATH/rheMac8.fa -I /PATH/R01068_sorted.merged.addg.uniq.rmdup.bam -O /PATH/R01068_res.g.vcf -ERC BP_RESOLUTION \

I don’t know why I have this high number of alternative alleles and how to get read of them to have the 'real' number of alternative allele per position. The problem persists on the genotyping vcf files with some alternative alleles that are not present on any bam (input or HaplotypeCaller output).

I hope I gave you enough details so you have a clear idea of my problem and will be able to help me.
Best,

Answers

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Hello @Lucie_B - The HaplotypeCaller is a good first pass method to use to find your variants. You might want to consider adding the Genotype Refinement Workflow to this process to make sure that your child genotype reflects the parental probabilities of observing an allele.

    For a sample for which both parent genotypes are available, the child’s genotype can be supported or invalidated by the parents’ genotypes based on Mendel’s laws of allele transmission. Even the confidence of the parents’ genotypes can be recalibrated, such as in cases where the genotypes output by HaplotypeCaller are apparent Mendelian violations.

    This should probably remove some of the alternative alleles currently seen in your analysis.

    Another procedure to consider is adjusting your hard filtering options.

    This will reduce the number of calls to those that fall within a certain threshold value.

  • Lucie_BLucie_B Member

    Dear @AdelaideR

    Thanks for the advice of using the Genotype Refinement Workflow. I have looked at it and it does reduce the number of Mendelian Violation seen in my dataset.

    However I am not sure this will be helpful on my problem. My issue is mostly on the sites for which all samples are HomRef. I have a non normal proportion of those sites with at least one alternative alleles seen. See below an example:

    chr1 34 . T <NON_REF> . . . GT:AD:DP:GQ:PL ./.:88,1:89:99:0,120,1800 ./.:65,2:67:65:0,65,4228 ./.:72,1:73:99:0,120,1800
    chr1 35 . T <NON_REF> . . . GT:AD:DP:GQ:PL ./.:86,1:87:99:0,120,1800 ./.:63,3:66:0:0,0,3997 ./.:72,1:73:99:0,120,1800
    chr1 36 . T <NON_REF> . . . GT:AD:DP:GQ:PL ./.:84,2:86:99:0,113,5502 ./.:63,2:65:58:0,58,4124 ./.:74,3:77:34:0,34,4380
    chr1 37 . A <NON_REF> . . . GT:AD:DP:GQ:PL ./.:87,1:88:99:0,120,1800 ./.:62,2:64:60:0,60,4034 ./.:77,1:78:99:0,120,1800
    chr1 38 . G <NON_REF> . . . GT:AD:DP:GQ:PL ./.:87,1:88:99:0,120,1800 ./.:63,3:66:0:0,0,3920 ./.:79,1:80:99:0,120,1800
    chr1 39 . G <NON_REF> . . . GT:AD:DP:GQ:PL ./.:90,1:91:99:0,120,1800 ./.:61,3:64:0:0,0,3895 ./.:77,2:79:91:0,91,4600
    chr1 40 . A <NON_REF> . . . GT:AD:DP:GQ:PL ./.:87,2:89:99:0,120,1800 ./.:57,4:61:0:0,0,3493 ./.:77,3:80:53:0,53,4302

    All those sites are HomRef but with few alternative alleles that I don't see on the Bam file (neither the input ones or the output ones). I have more than 70% of the sites with at least one alternative allele, but only 10 % of the sites have alternative allele in the Bam files.

    Would you have any idea where those alternatives allele could come from? As they are neither in the bam in or bam out I find it very strange. I wonder if this is normal or if something could have went wrong during variant calling.
    Best,

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @Lucie_B

    This is interesting behavior. I am going to do a little more research.

    My first hunch is that this is happening during the realignment phase of the HaplotypeCaller.

    It would be helpful to see a screenshot of the bams to make sure that the alternative alleles are not present.

    I will keep doing some research in the meantime.

  • gauthiergauthier Member, Broadie, Moderator, Dev admin

    Hi @Lucie_B ,

    A lot of different kinds of evidence get counted in favor of the non-ref allele -- deletions, mismatches, bases next to indels, and soft clips. Do you have soft clips turned on in IGV? It's an option under Preferences > Alignments (https://software.broadinstitute.org/software/igv/Preferences) I believe if they're not on, they won't get counted in the IGV pileup, which could be one source of discrepancy. Soft clips could also explain the depth discrepancy from your BWA bam and the bamout since we drop reads with soft clips if they can't be well explained by the haplotypes discovered in the assembly.

    There's a limit to how much we can help without your bam file, though, so this is just speculation. If you don't see anything weird nearby after adjusting your IGV options, someone from our outreach team can help you share a snippet of your data for debugging.

    -Laura

Sign In or Register to comment.