Indel-Like Representation of SNP in VCF from HaplotypeCaller Causes Analysis Problems Downstream

gtiao | Cambridge, MA | Member

Hi, GATK Team

I've run into a strange case where a SNP called by HaplotypeCaller has been represented as if it were an indel:

6 16327909 . ATGCTGATGCTGC CTGCTGATGCTGC 1390.70 PASS AC=1;AC_Orig=2;AF=0.500;AF_Orig=0.040;AN=2;AN_Orig=50;BaseQRankSum=0.788;DP=10;FS=6.154;InbreedingCoeff=0.1807;MQ=59.86;MQ0=0;MQRankSum=0.406;QD=2.77;ReadPosRankSum=0.358;VQSLOD=2.78;culprit=FS GT:DP:GQ:PL 0/1:10:70:284,0,214

This VCF entry (for a single individual) comes from a multi-sample VCF that has multiple alternate "alleles" at that position:

6   16327909    .   ATGCTGATGCTGC   ATGC,CTGCTGATGCTGC,A    1390.70 .   AC=12,3,1;AF=0.024,6.048e-03,2.016e-03;AN=496;BaseQRankSum=0.788;DP=2791;FS=6.154;InbreedingCoeff=0.1807;MLEAC=13,3,1;MLEAF=0.026,6.048e-03,2.016e-03;MQ=59.86;MQ0=0;MQRankSum=0.406;QD=2.77;ReadPosRankSum=0.358   GT:AD:DP:GQ:PL

However, this mode of representing a SNP is causing processing and analysis problems further downstream after I've split the multi-sample VCF into individual files. Is there a way to fix this problem such that variants are listed in the most parsimonious (and hopefully standard) way?
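For readers hitting the same issue, here is a minimal Python sketch of what "most parsimonious" could mean for the two records above. It is not GATK's implementation; the function name `trim_alleles` is hypothetical, and it only removes bases shared by REF and every ALT allele (no left-alignment, which would need the reference sequence). It illustrates why the record cannot be shortened while all three alternate alleles are present, but collapses to a plain SNP once only one alternate allele remains.

```python
def trim_alleles(pos, ref, alts):
    """Trim bases shared by REF and every ALT, keeping at least one base per allele.

    Hypothetical helper for illustration only; not part of GATK.
    """
    alleles = [ref] + list(alts)
    # Drop shared trailing bases first, so POS is unaffected.
    while min(len(a) for a in alleles) > 1 and len({a[-1] for a in alleles}) == 1:
        alleles = [a[:-1] for a in alleles]
    # Then drop shared leading bases, shifting POS right by one each time.
    while min(len(a) for a in alleles) > 1 and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles[0], alleles[1:]


# Single-sample record: only one ALT is left, so the shared suffix can go.
single = trim_alleles(16327909, "ATGCTGATGCTGC", ["CTGCTGATGCTGC"])

# Multi-sample record: the alleles share no trimmable suffix or prefix.
multi = trim_alleles(16327909, "ATGCTGATGCTGC", ["ATGC", "CTGCTGATGCTGC", "A"])

print(single)
print(multi)
```

In this sketch the single-sample record reduces to REF=A, ALT=C at the same position, while the multi-allelic record is returned unchanged because the long REF is needed to express the deletions alongside the SNP.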



Best Answer


  • gtiao | Cambridge, MA | Member

    Thanks, Geraldine, that seemed to do the trick!


  • Geraldine_VdAuwera | Cambridge, MA | Member, Administrator, Broadie, admin

    Glad to hear it. I've put in a request to the devs to have SelectVariants automatically trim the variants so that the extra processing step won't be necessary in future.

  • brdido | São Paulo - Brazil | Member
    edited October 2014

    Hello @Geraldine_VdAuwera!
    I'm having a similar problem.

    We are updating our pipeline with GATK 2014.3-3.2.2-7-gf9cba99.

    We are now using the new GVCF file format and processing multiple samples with GenotypeGVCFs.

    Everything seems great, but I found some differences:

    In GATK 2.8:

    20 44520237 CCTG C PASS

    In GATK 3.2:

    20 44520237 CCTGCTG CCTG PASS

    I have tried VariantsToAllelicPrimitives and LeftAlignAndTrimVariants --trimAlleles, and neither solves the issue above.

    This is a tricky reference site (a CTG repeat), but the minimal form doesn't seem to be respected.

    In the end, the sample lost a CTG.

    Could you please help us to elucidate this?

    [edit] The image is from activeRegions output.

    Thanks in advance,
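As a rough check on the two records quoted in this post, the Python sketch below trims the bases shared by REF and ALT and compares the results. The helper name `trim_pair` is hypothetical and this is not how GATK does it internally; it assumes a biallelic record and does no left-alignment. If this reading is right, both records describe the same 3-base deletion and differ only in how much shared context is written out.

```python
def trim_pair(pos, ref, alt):
    """Trim bases shared by REF and ALT, keeping at least one base in each.

    Hypothetical helper for illustration only; biallelic records only.
    """
    # Shared trailing bases are removed first; POS does not change.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Shared leading bases are removed next; POS moves right each time.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


gatk_2_8 = (44520237, "CCTG", "C")
gatk_3_2 = (44520237, "CCTGCTG", "CCTG")

trimmed_3_2 = trim_pair(*gatk_3_2)
same_event = trimmed_3_2 == trim_pair(*gatk_2_8)

print(trimmed_3_2)
print(same_event)
```

Both forms remove three bases relative to the reference (REF is three bases longer than ALT in each), so the longer record is a padded version of the shorter one under these assumptions.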


  • Sheila | Broad Institute | Member, Broadie, Moderator, admin


    Hi Rodrigo,

    Please let us know exactly what commands you ran and what the results were at each step. What do the calls look like per-sample and then per cohort? Also, at what point did you try the two other tools?


  • brdido | São Paulo - Brazil | Member
    edited November 2014

    Hi Sheila,

    I just wrote up a bunch of details and realized I was mistaken: LeftAlignAndTrimVariants with --trimAlleles did the job.


  • Geraldine_VdAuwera | Cambridge, MA | Member, Administrator, Broadie, admin

    FYI, going forward (next nightly build and eventually release 3.4) SelectVariants will trim alleles by default.
