We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

variant calling in polyN region using HaplotypeCaller

asakiasaki chinaMember
edited November 2017 in Ask the GATK team

Hi all,

I am using GATK HaplotypeCaller to call the variant, which lies in polyN complex region.

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R /data/iGenomes/Homo_sapiens/NCBI/build37.2/Sequence/WholeGenomeFasta/genome.fa -D wellwise.vcf -L wellwise.vcf -I region.bam -o test.vcf -bamout region.bamout.bam

It is obvious that the most of the alleles represent T at the position in the attached image, either in original or bamout bam file.

However, the vcf file only shows that there are 16 reads representing C while there are only 7 reads for T, which is largely different against the IGV shots.

22  42528028    .   C   T   90.73   .   AC=1;AF=0.500;AN=2;BaseQRankSum=2.061;ClippingRankSum=0.000;DP=128;ExcessHet=3.0103;FS=4.365;MLEAC=1;MLEAF=0.500;MQ=59.93;MQRankSum=0.756;QD=3.94;ReadPosRankSum=1.554;SOR=1.785    GT:AD:DP:GQ:PL  0/1:16,7:23:99:128,0,418

Can @Sheila or others help me to fix the issue?


Best Answer


  • asakiasaki chinaMember

    I have fixed the issue now.

    Since I am using PCR to enrich the target region, the reads mapped in the figure above have many pverlapped gap regions, which make GATK fail to call the turth variant.

    I tried adding --max_deletion_fraction 0.8 in UG, and the variant is called with the right genotypes I expected. However, it seems that it can be only applied in UG, but not HC.

    5. When using UnifiedGenotyper, check for overlapping deletions
    The UnifiedGenotyper ignores sites if there are too many overlapping deletions. This parameter can be relaxed using the --max_deletion_fraction argument (see the UG's documentation page to find out what is the default value for this argument) but be aware that increasing its value could adversely affect the reliability of your results.

    And by IGV shots, it is obvious that many reads are mis-aligned, since the gap can be extended after my target SNP, otherwise, half of the reads will be neglected.
    Any recommendations for bwa or other aligners?


  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭
    edited November 2017

    Use a high fidelity polymease (Q5 is highly recommended). We had a similar issue for some small repeat regions in some critical genes such as the 5'UTR of UGT1A1. We are looking for TA repeat counts for and using a mainstream polymerase didn't cut the case. We switched to Phusion (We have a huge stock that's why when it is over we will switch to Q5) and now we are able to detect TA repeat count from amplicon sequencing.

    Another thing to help you is to use str-realigner to help you realign your reads at highly repetitive regions.


    This helped me in many cases to see the variant clearly.

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭
    edited November 2017

    Another suggestion. If you are hundred percent sure about the position of your SNP you may mask that nucleotide with bedtools with N and realign your reads again with BWA MEM. you will notice that TTC's will be aligned properly and gaps will be left normalized. Once the mapping is done you need to call variants using the unmasked fasta file.

    Thats what I do usually to get the clear picture.

    Here is an example I worked yesterday.

    I am sure that the splice site is mutated and T turned to G so I masked that T with N (Actually I cheated by masking with G also instead of N). All Gs were aligned properly and gaps due to polymerase error at the homopolymer is gone.

  • SheilaSheila Broad InstituteMember, Broadie ✭✭✭✭✭

    Hi Junfeng :smile:

    Are you using the latest version of HaplotypeCaller? Can you try adding --interval_padding 100 to your command?
    You may also check if GATK4 latest beta helps.


  • asakiasaki chinaMember

    @Sheila , both adding -ip 100 to GATK3.8 or using GATK4 latest beta did not help a lot.

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    As I said it is a PCR error and must be handled with a high fidelity polymerase with limited number of cycles to reduce the artifacts to a minimum. None of the callers can help solve this issue but the polymerase.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Thanks for your comments here @SkyWarrior.

    I think @asaki we must ask how do you know what is the truth in this case? Low complexity regions, especially those with the same consecutive base, are prone to polymerase error during PCR amplification as well as during sequencing by synthesis (SBS). It may be that your reads encompass true variation but also not.

    Remember that aligners handle each read independently and so calling directly on a pileup can result in multiple different variant representations for what could be a single variant even in a simple scenario. I go over this concept in the indel realignment tutorial. That is why we then developed processes such as indel realignment and then HaplotypeCaller's pairHMM scoring.

    I'm not sure if this is helpful but one thing you could do is provide HaplotypeCaller a VCF file containing the allele representation you are looking for. HaplotypeCaller's --genotyping_mode GENOTYPE_GIVEN_ALLELES mode takes the file given by the --alleles option and calls variants based on those in the alleles file.

  • asakiasaki chinaMember

    @SkyWarrior , thanks for your suggestion. Indeed, changing polymerase can largely fix the issue, although this is just a comercial kit which I can not change.

    @shlee , I agree with you. Other than the PCR or sequencing error, the most issue here is the aligment. It seem that the bwa tend to insert gaps that are left normalized.

    And this is a 1000G cell line sample bought from Coriell. This is a truth variant in CYP2D6, which has already been validated by Sanger and is genotyped as 1/1 in origianl VCF file from 1000G.

    Both GATK3 or GATK4 can not solve this issue, and local INDEL realignment did not help too.

    The only solution for me is either using --max_deletion_fraction 0.8, or just directly determine the genotype by clustering.

Sign In or Register to comment.