Indel-Like Representation of SNP in VCF from HaplotypeCaller Causes Analysis Problems Downstream

gtiao, Cambridge, MA (Member)

Hi, GATK Team

I've run into a strange case where a SNP called by HaplotypeCaller has been represented as if it were an indel:

6 16327909 . ATGCTGATGCTGC CTGCTGATGCTGC 1390.70 PASS AC=1;AC_Orig=2;AF=0.500;AF_Orig=0.040;AN=2;AN_Orig=50;BaseQRankSum=0.788;DP=10;FS=6.154;InbreedingCoeff=0.1807;MQ=59.86;MQ0=0;MQRankSum=0.406;QD=2.77;ReadPosRankSum=0.358;VQSLOD=2.78;culprit=FS GT:DP:GQ:PL 0/1:10:70:284,0,214

This VCF entry (for a single individual) comes from a multi-sample VCF that has multiple alternate "alleles" at that position:

6   16327909    .   ATGCTGATGCTGC   ATGC,CTGCTGATGCTGC,A    1390.70 .   AC=12,3,1;AF=0.024,6.048e-03,2.016e-03;AN=496;BaseQRankSum=0.788;DP=2791;FS=6.154;InbreedingCoeff=0.1807;MLEAC=13,3,1;MLEAF=0.026,6.048e-03,2.016e-03;MQ=59.86;MQ0=0;MQRankSum=0.406;QD=2.77;ReadPosRankSum=0.358   GT:AD:DP:GQ:PL

However, this mode of representing a SNP is causing processing and analysis problems further downstream after I've split the multi-sample VCF into individual files. Is there a way to fix this problem such that variants are listed in the most parsimonious (and hopefully standard) way?

Thanks,

Grace
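To make "most parsimonious" concrete, here is a minimal Python sketch of allele trimming: drop the bases shared by REF and every ALT from the right, then from the left, always leaving at least one base per allele. The function name `trim_alleles` is made up for this illustration; it shows the general idea only and is not GATK's implementation.

```python
def trim_alleles(pos, ref, alts):
    """Trim bases shared by REF and every ALT, keeping at least one base per allele."""
    alleles = [ref] + list(alts)
    # Strip the shared suffix first so POS does not move unless it has to.
    while min(len(a) for a in alleles) > 1 and len({a[-1] for a in alleles}) == 1:
        alleles = [a[:-1] for a in alleles]
    # Then strip any shared prefix, advancing POS by one for each base removed.
    while min(len(a) for a in alleles) > 1 and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles[0], alleles[1:]


# Single-sample record from the question: one ALT left after subsetting.
single = trim_alleles(16327909, "ATGCTGATGCTGC", ["CTGCTGATGCTGC"])
print(single)

# Multi-sample record: three ALTs that share no trailing or leading base with REF.
multi = trim_alleles(16327909, "ATGCTGATGCTGC", ["ATGC", "CTGCTGATGCTGC", "A"])
print(multi)
```

With only one ALT the record reduces to a plain A-to-C SNP at the same position. With all three ALTs present nothing can be trimmed, because the alleles have to share a common REF; that is why the SNP carries the long padding in the multi-sample file and keeps it when a sample is extracted without re-trimming.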

Best Answer

Answers

  • gtiao, Cambridge, MA (Member)

    Thanks, Geraldine, that seemed to do the trick!

    Grace

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    Glad to hear it. I've put in a request to the devs to have SelectVariants automatically trim the variants so that the extra processing step won't be necessary in future.

  • brdido, São Paulo - Brazil (Member)
    edited October 2014

    Hello @Geraldine_VdAuwera!
    I'm having a similar problem.

    We are updating our pipeline with GATK 2014.3-3.2.2-7-gf9cba99.

    We are now using the new GVCF file format and genotyping multiple samples with GenotypeGVCFs.

    Everything seems fine, but I found some differences:

    In GATK 2.8:

    20 44520237 CCTG C PASS

    In GATK 3.2:

    20 44520237 CCTGCTG CCTG PASS

    I have tried VariantsToAllelicPrimitives and LeftAlignAndTrimVariants --trimAlleles, and neither solves the issue above.

    This is a tricky reference site (a CTG repeat), but the point is that the minimal form doesn't seem to be respected.

    In the end, the sample lost a CTG.

    Could you please help us to elucidate this?

    [edit] The image is from activeRegions output.

    Thanks in advance,

    Rodrigo.

  • Sheila, Broad Institute (Member, Broadie, Moderator admin)

    @brdido

    Hi Rodrigo,

    Please let us know exactly what commands you ran and what the results were at each step. What do the calls look like per-sample and then per cohort? Also, at what point did you try the two other tools?

    -Sheila

  • brdido, São Paulo - Brazil (Member)
    edited November 2014

    Hi Sheila,

    I just wrote up a bunch of details and realized I was mistaken: LeftAlignAndTrimVariants with --trimAlleles did the job.

    Sorry.

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    FYI, going forward (next nightly build and eventually release 3.4) SelectVariants will trim alleles by default.
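On the chromosome 20 site discussed in this thread, the GATK 2.8 record (CCTG to C) and the GATK 3.2 record (CCTGCTG to CCTG) can be checked for equivalence by applying each to the reference and comparing the resulting haplotypes. The Python sketch below does that. The helper `apply_variant` is hypothetical, and the reference window is only partly known: its first seven bases are the REF allele from the 3.2 record, while the trailing `AAGT` is invented for illustration.

```python
def apply_variant(reference, ref_start, pos, ref_allele, alt_allele):
    """Return the haplotype obtained by substituting alt_allele for ref_allele at pos."""
    offset = pos - ref_start
    if reference[offset:offset + len(ref_allele)] != ref_allele:
        raise ValueError("REF allele does not match the reference sequence")
    return reference[:offset] + alt_allele + reference[offset + len(ref_allele):]


# Hypothetical reference window: "CCTGCTG" is the REF allele reported by GATK 3.2;
# the trailing "AAGT" is made up so there is some flanking sequence to compare.
window_start = 44520237
window = "CCTGCTG" + "AAGT"

hap_28 = apply_variant(window, window_start, 44520237, "CCTG", "C")
hap_32 = apply_variant(window, window_start, 44520237, "CCTGCTG", "CCTG")
print(hap_28, hap_32, hap_28 == hap_32)
```

Both records produce the same haplotype, three bases shorter than the reference window, so they describe the same single-CTG deletion. The 3.2 record simply carries three extra padding bases, which is what allele trimming removes.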
