Download the latest Picard release at
GATK version 4.beta.5 is out. See the GATK4 beta page for download and details.

rare homozygous indel bug in GATK 2.4-9 ?

vplagnolvplagnol Member
edited April 2013 in Ask the GATK team

Dear all,

I am seeing some weird (and rare) homozygous indel calls in a set of exome samples which are pretty reliably false calls.
I am using GATK 2.4-9, and calling approximately 500 exomes BAM jointly (most of them reduced).

In one sample I see this:
chrom 6
POS 137522059
ID .
QUAL 1892.58
SampleX 1/1:0,40:40:43:101,43,0

So there should be 40 reads supporting the alternate allele. However, the pileup is perfectly clean, no sign of the indel whatsoever (see pileup at the end of this email). I know that this indel is present in heterozygous form in another sample of the same batch, but quite likely not in homozygous form in any sample.

Do you guys have any idea re what may be causing the issue? It is happening in other samples at other positions, it's not a one time thing.
Any suggestions, including some ideas of data I could provide to debug, is welcome.

Thank you in advance and many thanks for the fantastic work with GATK.


(so pileup below, looking clean for that sample)

mpileup] 1 samples in 1 input files
Set max per-file depth to 8000
6 137522050 T 33 .....,,,....,....,.,.,,..,,,,.... ??????@?@>;>@????
6 137522051 C 35 .....,,,....,....,.,.,,..,,,,....^g.^g, ??????????????=????????????????????
6 137522052 A 36 .....,,,....,....,.,.,,..,,,,.....,^g. ?????>?>@?@????????@??@??@??@>;
6 137522054 T 37 .....,,,....,....,.,.,,..,,,,.....,.. ??????????@?@??????==
6 137522055 G 37 .....,,,....,....,.,.,,..,,,,.....,.. ??>?????????????????????????????????>
6 137522056 G 38 .....,,,....,....,.,.,,..,,,,.....,..^g. ????????????????????????????????????>>
6 137522058 T 40 .....,,,....,....,.,.,,..,,,,.....,...,. ??????@?@>;>@??@?????@<;<>?>
6 137522059 T 40 .....,,,....,....,.,.,,..,,,,.....,...,. ??????@???????>?@?>???@>;>@??@?????@<;<??>
6 137522062 T 42 .....,,,....,....,.,.,,..,,,,.....,...,.,^g, @?>???@???????=?@???????>@??@?????@<;<????? 6 137522063 T 42 .....,,,....,....,.,.,,..,,,,.....,...,.,, @?????@???@???=?????>@??>@??@?????@<;<????? 6 137522064 T 42 .....,,,....,....,.,.,,..,,,,.....,...,.,, ??>???????????>?@??@>;?@?????@=<;????@?????@?????@==????@ 6 137522066 C 42 ....,,,....,....,.,.,,..,,,,.....,...,.,,^g, ?????????????????????????????????????????>
6 137522067 T 42 ....,,,....,....,.,.,,..,,,,.....,...,.,,, ?????????????>@????@?@??@??@?????@==????@?
6 137522068 T 42 ....,,,....,....,.,.,,..,,,,.....,...,.,,, ?????????????=@????@?@??@??@?????@==????@?
6 137522069 A 42 .$...,,,....,....,.,.,,..,,,,.....,...,.,,, >???=?=@???@????@?????@??

Best Answers


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi Vincent,

    Can you please try running on this region with the latest nightly build? We hve made some improvments to the tools so I would like to check if this issue still occurs with our internal development version. See Downloads page for the nightlies.

  • vplagnolvplagnol Member
    edited April 2013

    Thanks Geraldine,

    I tried with the latest nightly build GenomeAnalysisTK-nightly-2013-04-20-gf2cba7e and I see the same thing.
    In fact I reduced the target region of interest to only call that single gene and I still see that homozygous call, so at least it is reproducible.

    Now this problematic BAM file that I used is reduced. If I use the unreduced BAM file and exactly the same code then I get:
    SampleX 0/0:40,0:40:99:0,120,2756

    So it really looks like a bug, probably linked to the use of the reduced BAM format but I cannot be sure.

    Post edited by vplagnol on
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Ok, we'll take a closer look at this. Could you please upload a file snippet that reproduces the issue to our FTP? Instructions are in the FAQ under "how can I submit a bug report?"

  • So the bug report has been uploaded. Just let me know if anything is not clear and I will add what is needed.

    If I am not mistaken and there is indeed a bug, I guess this issue is with the indel realignment, perhaps in combination with the reduced read format. And because in the reduced format all reads are compressed into a single one, all 40 reads become mistakenly assigned to the haplotype that contains the indel, which results in a homoyzgous non reference indel call.

    Let me know

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi Vincent,

    Finally got time to look at your files -- sorry it's taken so long. I was able to reproduce your issue -- thanks for the very clear bug report and files. Not sure at this point what's causing this, but interestingly the issue doesn't occur if you use HaplotypeCaller with the equivalent command line. You may want to use HC (which is much better at calling indels anyway) to move ahead while we figure this one out. I'll let you know when we have more.

  • OK thanks for that, let me know if/when you can figure it out.

    I was actually quite impressed by the indel calls with the UnifiedGenotyper, I thought they were excellent. I am trying the HaplotypeCaller right now but even after tuning some of the parameters (like minPruning) it is too slow for the scale of what I am trying to do (ie call ~ 1,000 BAMs or so jointly). A single gene takes ~ 30 minutes, it's just too slow for me even with many parallel jobs. In any case, it's a topic for another thread, but I am somewhat stuck if this UnifiedGenotyper issue does not get fixed.

    Thank you again

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Ah, fair point -- on that scale the HC does have severe performance issues. We'll try to get this resolved quickly.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Update: we've identified the cause of the problem and the developer is working on a fix.

  • You guys are incredibly helpful, thanks. I'd be curious to know the source of the problem at some point if that can be explained easily.
    Note that I am using GATK to reanalyze relatively large scale datasets that have been looked at carefully by several clinicians (using samtools and single sample calling prior to this) so I may have other issues like that to report, but hopefully I won't.

  • Thanks, I just tried GenomeAnalysisTK-nightly-2013-04-30-g950c10d and indeed it fixes the issue for the test set (2 BAM files, just the problematic region). I'll now go ahead with the larger production set but I am confident it will all work fine. I'll let you guys know otherwise.

    Many thanks!

Sign In or Register to comment.