
Confused by overlapping indels

Vanilla Member
edited October 2015 in Ask the GATK team

Hi all, I'm currently confused about the variants called as shown below. If I am not mistaken, the first row shows that GATK called a 34 bp insertion in sample 001 at position 3229753. It didn't call anything for sample 001 at position 3229754, but then at position 3229756 it calls another 15 bp insertion for sample 001, which overlaps completely with the first insertion.

I have three questions about this.
1) Is my interpretation of the data shown below correct?
2) If this is correct, is this expected behaviour for GATK? What kind of circumstances are expected to generate these results?
3) How should I interpret these conflicting calls? Should I just pick the call with the highest confidence and ignore the other? What if a lower-confidence call is a substring of a previous call in another sample?

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 001 002 003 004
gi|ref| 3229753 0 A AACTTGCCTGCCACGCTTTTCTTTATACTTAACCC 9635.2 0 AC=3;AF=1.00;AN=3;DP=304;FS=0.000;MLEAC=3;MLEAF=1.00;MQ=59.86;QD=29.65;SOR=0.779 GT:AD:DP:GQ:PL 1:0,48:48:99:2153,0 1:0,84:84:99:3696,0 .:0,0 1:0,85:85:99:3813,0
gi|ref| 3229754 0 A ACTTGCCTGCCACGCTTTTCTTTATACTTAACCCAGGCGCTAATTCATCTGCAACG 3012.2 0 AC=1;AF=1.00;AN=1;DP=291;FS=0.000;MLEAC=1;MLEAF=1.00;MQ=59.91;QD=28.35;SOR=0.910 GT:AD:DP:GQ:PL .:0,0 .:0,0 1:0,69:69:99:3039,0 .:0,0
gi|ref| 3229756 0 G GCGCTAATTCATCTGC 3654.2 0 AC=3;AF=1.00;AN=3;DP=74;FS=0.000;MLEAC=3;MLEAF=1.00;MQ=60.00;QD=28.36;SOR=0.747 GT:AD:DP:GQ:PL 1:0,17:17:99:854,0 1:0,25:25:99:1213,0 .:0,0 1:0,32:32:99:1614,0
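
One way to check the interpretation is to compute, for each record, the inserted length (ALT length minus REF length) and the samples whose genotype is 1. This is only a minimal sketch: it assumes the header and records above are saved one per line, whitespace-separated, in a hypothetical file `calls.vcf`, that each record has a single ALT allele, and that GT is the first FORMAT field of a haploid call. The function name `summarise_indels` is made up for illustration.

```shell
# Print "POS inserted_length carriers" for every record in a VCF-like file.
# Assumptions: one ALT allele per record, haploid GT as the first FORMAT field.
summarise_indels() {
  awk '
    # Remember the sample names from the header line (columns 10 onwards).
    /^#CHROM/ { for (i = 10; i <= NF; i++) name[i] = $i; next }
    /^#/      { next }
    {
      len = length($5) - length($4)      # inserted bases (negative for deletions)
      carriers = ""
      for (i = 10; i <= NF; i++) {
        split($i, f, ":")                # f[1] is the GT value
        if (f[1] == "1")
          carriers = carriers (carriers == "" ? "" : ",") name[i]
      }
      print $2, len, carriers
    }
  ' "$1"
}
```

Run as `summarise_indels calls.vcf`, the three records above should give `3229753 34 001,002,004`, `3229754 55 003` and `3229756 15 001,002,004`: samples 001, 002 and 004 carry both the 34 bp and the 15 bp insertion, while sample 003 carries a single 55 bp insertion.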

Issue · Github
by Geraldine_VdAuwera

Issue Number: 240
State: closed
Closed By: chandrans

Answers

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Can you tell us a bit more about how these were produced in a stepwise manner? What commands were run, etc.?

  • Vanilla Member

    I don't think I can edit my original question, so I'll just add the extra information here. Sorry for not including it in my original question. I've left in most of the $variables from the scripts I used; please let me know if that is a problem.

    I have paired-end reads from 100 Salmonella samples. I mapped them to the reference with bwa mem and marked duplicates with picard-tools MarkDuplicates. Then I performed realignment around indels for each sample individually.
        gatk -T IndelRealigner \
            -R $reference \
            -I $bamfile \
            -targetIntervals $bamfile.list \
            -o $output \
            --consensusDeterminationModel USE_SW
    Next I performed the genotyping step for each individual sample with HaplotypeCaller.
        gatk -T HaplotypeCaller \
            --sample_ploidy 1 \
            -R $reference \
            -I $bamfile \
            -o $output \
            -ERC GVCF \
            -nct 7
    Finally I called the variants using GenotypeGVCFs.
        gatk -T GenotypeGVCFs \
            -R $reference \
            -o $output \
            -nt 7 \
            --max_alternate_alleles 2 \
            -V 001.g.vcf -V 002.g.vcf -V 003.g.vcf -V 004.g.vcf

  • Vanilla Member

    Just to confirm, I'm using gatk 3.4-0-g7e26428 for all steps of the workflow I've posted above.

  • Vanilla Member

    @Sheila
    Thanks for your response, good to know this is the intended behaviour of GATK. For now, I've decided to use only SNPs and not indels in my downstream analysis, since I'm not quite sure how to work with nested variants yet.
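
The SNP-only restriction described in the last post can be sketched with plain awk. This is an illustrative filter under stated assumptions, not the command that was actually used in this thread: it treats a record as a SNP only when REF and every comma-separated ALT allele is a single base, and the function name `keep_snps` and the file names in the usage line are hypothetical. GATK's own SelectVariants tool is the more usual route for this kind of subsetting.

```shell
# Keep header lines plus records where REF and all ALT alleles are single bases.
# Anything else (indels, mixed sites, symbolic alleles) is dropped.
keep_snps() {
  awk '
    /^#/ { print; next }                     # pass header lines through
    {
      if ($4 !~ /^[ACGTN]$/) next            # REF must be one base
      n = split($5, alt, ",")
      for (i = 1; i <= n; i++)
        if (alt[i] !~ /^[ACGTN]$/) next      # every ALT must be one base
      print
    }
  ' "$1"
}
```

For example, `keep_snps all_calls.vcf > snps_only.vcf` would drop all three insertion records shown in the question, since each has a multi-base ALT allele.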
