Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

HaplotypeCaller: Alternate allele get called or not depending on -ip option

slegrasslegras Member
edited May 2018 in Ask the GATK team

Hi, I'm currently analyzing some data (exome-seq) using HaplotypeCaller and get what seems to me an odd behaviour:
The problem is that I've got a position which is clearly bi-allelic in IGV and that is said to have only the reference allele in the gVCF I'm generating with HaplotypeCaller.

Here is the command line I used:

nohup java -jar PATH/GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R PATH/ucsc.hg19_noHaps.fasta \
-I PATH/JLCL254.realigned.recalibrated.bam \
-L PATH/merged.bed \
-ip 50 \
--emitRefConfidence GVCF \
--variant_index_type LINEAR \
--variant_index_parameter 128000 \
-o JLCL254.vcf 

Here is the gVCF line of the variant of interest:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT JLCL254

chr1 912049 . T NON_REF . . END=912049 GT:DP:GQ:MIN_DP:PL 0/0:68:0:68:0,0,0

The variant of interest is located 19bp away from the captured region but with "-ip 50" it should be detected.

To check what is really analyzed, I output the bamout for all analyzed regions (-L PATH/merged.bed, -ip 50) and saw that the location of the variant is not analyzed (If I'm correct: as there is no coverage, this is not an active region).

Then I forced the bamout at the location +/-20nt around my variant to check whether some reads with the alternate allele are still kept. I used:
-L chr1:912029-912069 \
-forceActive \
-disableOptimizations

Doing so I've being able to see that many reads with the alternate allele are indeed still kept. The gVCF file generated along with the bamout file contains now my variant of interest with the alternate allele:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT JLCL254

chr1 912049 rs9803103 T C,NON_REF 1235.77 . BaseQRankSum=-0.677;ClippingRankSum=-0.942;DB;DP=61;MLEAC=1,0;MLEAF=0.500,0.00;MQ=54.88;MQRankSum=-0.600;ReadPosRankSum=-0.195 GT:AD:DP:GQ:PL:SB 0/1:19,42,0:61:99:1264,0,448,1321,574,1895:3,16,6,36

Please see an IGV screenshot:

Tracks are (from top to bottom):
* the original bam file
* the bamout for all captured regions (known from the file -L PATH/merged.bed)
* the forced bamout (at the location of the variant i.e -L chr1:912029-912069)
* merged.bed is the file used with the -L option.

Finally, I tried to call variants changing the -ip option to 100 and got the alternate allele called.

Please note that:
If I manually add/subtract 50bp to the closest target region boundaries, I've got the same result as with -ip 50.
If I manually add/subtract 100bp to the closest target region boundaries, I've got the same result as with -ip 100.

I tried several versions of GATK (3.46, 3.7, 4.0.4.0) and always got the same results.

I may have miss something but so far I can't explain myself what's happening. Do you see any explanation for what I observe? Do you see any options I should use to overcome this?
Many thanks in advance for your help.

NB: java -version
openjdk version "1.8.0_171"
OpenJDK Runtime Environment (build 1.8.0_171-b10)
OpenJDK 64-Bit Server VM (build 25.171-b10, mixed mode)

Post edited by slegras on

Answers

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭
    edited May 2018

    This problem is issued in github as a bug and still not fixed. As a workaround @shlee proposed to run exomes without bed files in HC and filter variants based on bed later on. I am trying to find out my samples to represent this problem but it seems they have gone with the wind and they are somewhere in my piles of archived sequences. Once I find them I will test this proposed solution.

  • slegrasslegras Member

    Thanks for your answer! I'll do like this then.

    Cheers

Sign In or Register to comment.