
Confused by overlapping indels

Vanilla Member
edited October 2015 in Ask the GATK team

Hi all, I'm confused by the variant calls shown below. If I am not mistaken, the first row shows that GATK called a 34 bp insertion in sample 001 at position 3229753. It didn't call anything for sample 001 at position 3229754, but at position 3229756 it calls another 15 bp insertion for sample 001, which overlaps completely with the first insertion.

I have three questions about this.
1) Is my interpretation of the data shown below correct?
2) If this is correct, is this expected behaviour for GATK? What kind of circumstances are expected to generate these results?
3) How should I interpret these conflicting calls? Should I just pick the call with the highest confidence and ignore the other? What if a lower-confidence call is a substring of a previous call in another sample?

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 001 002 003 004
gi|ref| 3229753 0 A AACTTGCCTGCCACGCTTTTCTTTATACTTAACCC 9635.2 0 AC=3;AF=1.00;AN=3;DP=304;FS=0.000;MLEAC=3;MLEAF=1.00;MQ=59.86;QD=29.65;SOR=0.779 GT:AD:DP:GQ:PL 1:0,48:48:99:2153,0 1:0,84:84:99:3696,0 .:0,0 1:0,85:85:99:3813,0
gi|ref| 3229754 0 A ACTTGCCTGCCACGCTTTTCTTTATACTTAACCCAGGCGCTAATTCATCTGCAACG 3012.2 0 AC=1;AF=1.00;AN=1;DP=291;FS=0.000;MLEAC=1;MLEAF=1.00;MQ=59.91;QD=28.35;SOR=0.910 GT:AD:DP:GQ:PL .:0,0 .:0,0 1:0,69:69:99:3039,0 .:0,0
gi|ref| 3229756 0 G GCGCTAATTCATCTGC 3654.2 0 AC=3;AF=1.00;AN=3;DP=74;FS=0.000;MLEAC=3;MLEAF=1.00;MQ=60.00;QD=28.36;SOR=0.747 GT:AD:DP:GQ:PL 1:0,17:17:99:854,0 1:0,25:25:99:1213,0 .:0,0 1:0,32:32:99:1614,0
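
As a quick check on the reading of these rows, the sketch below tabulates each record's position, inserted length (ALT length minus REF length), and the samples with a non-missing genotype. It is only an illustration: the function name `summarise_calls` is made up here, the records are copied from the post, and it assumes whitespace-separated fields with a single ALT allele per record.

```shell
#!/bin/sh
# Records copied from the post, one per line (whitespace-separated).
records='#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 001 002 003 004
gi|ref| 3229753 0 A AACTTGCCTGCCACGCTTTTCTTTATACTTAACCC 9635.2 0 AC=3;AF=1.00;AN=3;DP=304;FS=0.000;MLEAC=3;MLEAF=1.00;MQ=59.86;QD=29.65;SOR=0.779 GT:AD:DP:GQ:PL 1:0,48:48:99:2153,0 1:0,84:84:99:3696,0 .:0,0 1:0,85:85:99:3813,0
gi|ref| 3229754 0 A ACTTGCCTGCCACGCTTTTCTTTATACTTAACCCAGGCGCTAATTCATCTGCAACG 3012.2 0 AC=1;AF=1.00;AN=1;DP=291;FS=0.000;MLEAC=1;MLEAF=1.00;MQ=59.91;QD=28.35;SOR=0.910 GT:AD:DP:GQ:PL .:0,0 .:0,0 1:0,69:69:99:3039,0 .:0,0
gi|ref| 3229756 0 G GCGCTAATTCATCTGC 3654.2 0 AC=3;AF=1.00;AN=3;DP=74;FS=0.000;MLEAC=3;MLEAF=1.00;MQ=60.00;QD=28.36;SOR=0.747 GT:AD:DP:GQ:PL 1:0,17:17:99:854,0 1:0,25:25:99:1213,0 .:0,0 1:0,32:32:99:1614,0'

# Hypothetical helper: reads VCF-like lines on stdin and prints
# POS <tab> inserted length <tab> samples whose GT is not "."
summarise_calls() {
  awk '
    /^#CHROM/ { for (i = 10; i <= NF; i++) name[i] = $i; next }  # sample names start at column 10
    {
      called = ""
      for (i = 10; i <= NF; i++) {
        split($i, f, ":")                 # f[1] is the GT subfield
        if (f[1] != ".") called = called (called == "" ? "" : ",") name[i]
      }
      printf "%s\t%d\t%s\n", $2, length($5) - length($4), called
    }
  '
}

printf '%s\n' "$records" | summarise_calls
```

On the rows above this reports a 34 bp insertion at 3229753 and a 15 bp insertion at 3229756, both in samples 001, 002 and 004, and a 55 bp insertion at 3229754 in sample 003 only.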

Best Answers


  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    Can you tell us a bit more about how these were produced in a stepwise manner? What commands were run, etc.?

  • I don't think I can edit my original question, so I'll just add the extra information here. Sorry for not including it in my original question. I've left in most of the $variables from the scripts I used; please let me know if that is a problem.

    I have paired-end reads from 100 Salmonella samples. I mapped them to the reference with bwa mem and marked duplicates with picard-tools MarkDuplicates. Then I performed realignment around indels for each sample individually:
    gatk -T IndelRealigner \
        -R $reference \
        -I $bamfile \
        -targetIntervals $bamfile.list \
        -o $output \
        --consensusDeterminationModel USE_SW
    Next, I ran the genotyping step for each individual sample with HaplotypeCaller:
    gatk -T HaplotypeCaller \
        --sample_ploidy 1 \
        -R $reference \
        -I $bamfile \
        -o $output \
        -ERC GVCF \
        -nct 7
    Finally, I called the variants using GenotypeGVCFs:
    gatk -T GenotypeGVCFs \
        -R $reference \
        -o $output \
        -nt 7 \
        --max_alternate_alleles 2 \
        -V 001.g.vcf -V 002.g.vcf -V 003.g.vcf -V 004.g.vcf
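
With 100 samples, the `-V` list for the GenotypeGVCFs step gets long to type by hand. A minimal sketch of building it in a loop follows. It assumes the per-sample GVCFs are named `<sample>.g.vcf` as in the command above, uses only the four sample IDs shown in the post, and prints the command as a dry run rather than executing it.

```shell
#!/bin/sh
# Build the repeated "-V <sample>.g.vcf" arguments from a list of sample IDs.
# The ID list is illustrative; extend it to cover all samples.
variant_args=""
for sample in 001 002 003 004; do
  # Append with a separating space, without a leading space on the first item.
  variant_args="${variant_args:+$variant_args }-V ${sample}.g.vcf"
done

# Dry run: print the command instead of running it.
# \$reference and \$output are left as literal placeholders, as in the post.
echo "gatk -T GenotypeGVCFs -R \$reference -o \$output -nt 7 --max_alternate_alleles 2 $variant_args"
```

Once the printed command looks right, the `echo` can be replaced by the actual call.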

  • Just to confirm, I'm using gatk 3.4-0-g7e26428 for all steps of the workflow I've posted above.

  • @Sheila
    Thanks for your response; good to know this is the intended behaviour of GATK. For now, I've decided to use only SNPs and not indels in my downstream analysis, since I'm not quite sure how to work with nested calls yet.
