We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

Is there any description or a marker of Indels in VCF file?

HideoHideo Member
edited January 2013 in Ask the GATK team

I'm maybe missing something obvious but it seems a GATK vcf file does not tell a given variant is SNP, Insertion or deletion. Did I missed some command when I called variations? I can easily classify variations by eye or a script from a given vcf entry but the entry does not explicitly say variation type.

Here are deletions:

d02 264482 . TT T 389.93 . AC=2;AF=1.00;AN=2;BaseQRankSum=-0.224;DP=25;FS=0.000;HaplotypeScore=58.0966;MLEAC=2;MLEAF=1.00;MQ=47.51;MQ0=0;MQRankSum=1.714;QD=15.60;ReadPosRankSum=0.075;SB=-1.112e+02 GT:AD:DP:GQ:PL 1/1:15,9:20:9:431,9,0

Ld04 26597 . CCC C 2923.96 . AC=2;AF=1.00;AN=2;BaseQRankSum=-2.217;DP=98;FS=2.954;HaplotypeScore=145.2957;MLEAC=2;MLEAF=1.00;MQ=59.82;MQ0=0;MQRankSum=0.283;QD=29.84;ReadPosRankSum=1.150;SB=-1.306e+03 GT:AD:DP:GQ:PL 1/1:58,31:98:99:2966,195,0

Here are insertions:

Ld04 23671 . G GAAAT 6952 . AC=2;AF=1.00;AN=2;DP=100;FS=0.000;HaplotypeScore=24.8695;MLEAC=2;MLEAF=1.00;MQ=59.54;MQ0=0;QD=69.52;SB=-3.355e+03 GT:AD:DP:GQ:PL 1/1:65,34:100:99:6952,301,0

Ld04 26880 . T TTATT 4973 . AC=2;AF=1.00;AN=2;BaseQRankSum=-0.552;DP=99;FS=2.935;HaplotypeScore=116.0082;MLEAC=2;MLEAF=1.00;MQ=59.46;MQ0=0;MQRankSum=0.192;QD=50.23;ReadPosRankSum=1.834;SB=-1.950e+03 GT:AD:DP:GQ:PL 1/1:49,38:98:99:4973,247,0

Here are SNPs

Ld04 19890 . T C 3276.01 . AC=2;AF=1.00;AN=2;DP=85;Dels=0.00;FS=0.000;HaplotypeScore=0.7887;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=38.54;SB=-1.614e+03 GT:AD:DP:GQ:PL 1/1:0,85:85:99:3276,253,0

Ld04 19963 . A G 3571.01 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.895;DP=96;Dels=0.00;FS=0.000;HaplotypeScore=2.4678;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;MQRankSum=1.769;QD=37.20;ReadPosRankSum=-0.316;SB=-1.523e+03 GT:AD:DP:GQ:PL 1/1:3,93:96:99:3571,205,0

I do not see any markers that tell me variation type. Nor the length of variations. Such information is given in a vcf of Samtools.

PS I used this command line of V 2.0-38:

java -jar ~/GenomeAnalysisTK-2.0-38/GenomeAnalysisTK.jar -T UnifiedGenotyper --genotype_likelihoods_model BOTH -R ref1 -I IN.bam -o OUT.GK2.vcf

Thank you.


Post edited by Geraldine_VdAuwera on


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Hideo,
    The information you want is not part of the basic VCF specification (see here for details). I also recommend reading the FAQ article about how to interpret VCF files generated by GATK tools.

  • HideoHideo Member

    Hi Geraldine,

    It might not be the basic VCF specification. But other programs like Freebayes (BC), Mpileup (Heng) and Cortex (Zam), which I think are used in the 1000G, provide the information shown below. GATK may have different strategy than these programs. But indel calling is notoriously unreliable so it is good to compare the results. Most of users would prefer to have such information. I cannot imagine users who do not want such information. Following the VCF specification is important to avoid chaos. But omitting essential information from GATK vcf files while identifying indels is not good for analysis.



    Ld01 46881 . G GC 7.95959 . AB=0.037037;ABP=53.2759;AC=1;AF=0.5;AN=2;AO=1;CIGAR=1M1I;DP=27;DPRA=0;EPP=5.18177;EPPR=13.5202;HWE=-0;LEN=1;MEANALT=2;MQM=60;MQMR=58.8;NS=1;NUMALT=1;ODDS=1.65844;PAIRED=1;PAIREDR=0.96;RO=25;RPP=5.18177;RPPR=57.2971;RUN=1;SAP=5.18177;SRP=13.5202;TYPE=ins;XAI=0.04;XAM=0.08;XAS=0.04;XRI=0.00402491;XRM=0.017469;XRS=0.013444;technology.C=1;BVAR

    Ld01 43823 . TCACA T 14.5579 . AB=0.0434783;ABP=44.6459;AC=1;AF=0.5;AN=2;AO=1;CIGAR=1M4D;DP=23;DPRA=0;EPP=5.18177;EPPR=3.12459;HWE=-0;LEN=4;MEANALT=4;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=3.30227;PAIRED=1;PAIREDR=1;RO=19;RPP=5.18177;RPPR=44.2683;RUN=1;SAP=5.18177;SRP=3.12459;TYPE=del;XAI=0;XAM=0;XAS=0;XRI=0.00314039;XRM=0.0059474;XRS=0.00280702;technology.C=1;BVAR GT:GQ:DP:RO:QR:AO:QA:GL 0/1:13.1337:23:19:673:1:83:-21.645,-18.8129,-74.5345

    Ld01 46405 . T C 0.0788024 . AB=0.0666667;ABP=27.4756;AC=1;AF=0.5;AN=2;AO=1;CIGAR=1X;DP=15;DPRA=0;EPP=5.18177;EPPR=5.49198;HWE=-0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=4.00028;PAIRED=1;PAIREDR=1;RO=14;RPP=5.18177;RPPR=33.4109;RUN=1;SAP=5.18177;SRP=5.49198;TYPE=snp;XAI=0;XAM=0.0133333;XAS=0.0133333;XRI=0.0021645;XRM=0.0229588;XRS=0.0207943;technology.C=1 GT:GQ:DP:RO:QR:AO:QA:GL 0/1:17.4518:15:14:514:1:30:-3,-3.33936,-46.6271


    Ld01 70292 . CAC C 12.7 . INDEL;DP=24;VDB=0.0535;AF1=0.5;AC1=1;DP4=9,10,3,0;MQ=60;FQ=15.6;PV4=0.22,0.27,1,1 GT:PL:GQ 0/1:50,0,255:53

    Ld01 75099 . T TAT 19.5 . INDEL;DP=33;VDB=0.0490;AF1=0.5;AC1=1;DP4=13,6,3,1;MQ=60;FQ=22.5;PV4=1,0.26,1,0.4 GT:PL:GQ 0/1:57,0,255:60

    Ld01 56004 . G A 222 . DP=43;VDB=0.0535;AF1=1;AC1=2;DP4=1,0,20,21;MQ=57;FQ=-111;PV4=1,0.12,0.38,0.19 GT:PL:GQ 1/1:255,84,0:99


    Ld04 283741 var_94 AAC A . PASS PV=17;SVLEN=-2;SVTYPE=DEL GT:COV:GT_CONF 0/1:30,4:11.77

    Ld04 294638 var_457 A AACACACACACAC . PASS PV=32;SVLEN=12;SVTYPE=INS GT:COV:GT_CONF 1/1:1,22:4.04

  • ebanksebanks Broad InstituteMember, Broadie, Dev ✭✭✭✭

    Hi Hideo,

    Ultimately, adding an 'indel' tag to the VCF record is completely superfluous as it's utterly obvious that a record represents an indel just by looking at it. If you are looking for some way to pull out just the indel records from a file then you can simply use Select Variants (the recommended method) or a trivial 'awk' command. Given that you are the first user ever to request it (and we've been making indel calls for several years), I can't imagine that this tag is "essential." Although we may well decide to add it in the future, we'd be more than happy to accept a patch if you (or another user) wanted to implement it sooner!

  • ebanksebanks Broad InstituteMember, Broadie, Dev ✭✭✭✭

    Incidentally, I just noticed that there's an IndelType annotation which you can use if you want. So you'd add '-A IndelType' to your Unified Genotyper command-line to get the indels annotated (but not SNPs).

  • HideoHideo Member

    Thank you.
    "the VCF record is completely superfluous as it's utterly obvious that a record represents an indel just by looking at it"
    Often we use script so it is not so obvious when we just want to grab them (grep/awk). Currently I just check the length of ref and alt bases to determine variant type.

  • grosscogrossco Member

    @ebanks said:
    Ultimately, adding an 'indel' tag to the VCF record is completely superfluous as it's utterly obvious that a record represents an indel just by looking at it. If you are looking for some way to pull out just the indel records from a file then you can simply use Select Variants (the recommended method) or a trivial 'awk' command.

    It does seem like a peculiar tag to add. At least the Freebayes example specifies ins or del.

    I usually have my VCF files split into separate files. One for INDELs and one for SNPs since they get used downstream in different ways. Python scripts that control most of my execution take care of this, but you could use the GATK tool and subsequently do a count of non-comment lines in the resulting files to automatically determine how many indels you had.

    GATK method to select only indels from a VCF (as per documentation):

     java -Xmx2g -jar GenomeAnalysisTK.jar \
       -R ref.fasta \
       -T SelectVariants \
       --variant input.vcf \
       -o output.vcf \
       -selectType INDEL
  • ebanksebanks Broad InstituteMember, Broadie, Dev ✭✭✭✭

    Hi everyone,

    I just caved in and implemented a new annotation for the Variant Annotator called VariantType. It should do exactly as you want with some minor differences: insertions and deletions are treated separately and have some extra meta information associated with them; and records are annotated as being multi-allelic if that's the case.

    The only catch here is that our 2.3 release has already been packaged up and is being rolled out today. So this new annotation will not be available until the 2.4 release.

  • HideoHideo Member

    Thank you very much. I thought this would never happen .., but flexibility shows your strength.

  • ebanksebanks Broad InstituteMember, Broadie, Dev ✭✭✭✭

    Glad to help. One more note: it is not enabled by default at this point (we want to see how useful it is first). You'll need to add "-A VariantType" to your Unified Genotyper / Haplotype Caller / Variant Annotator command-line in order to include it.

  • KurtKurt Member ✭✭✭

    Hi, I decided to give this a shot since some of the people in my group asked me about it. Anyways, just FYI. I get.

    ERROR MESSAGE: Invalid command line: Argument annotation has a bad value: Class VariantType is not found; please check that you have specified the class name correctly

    cmd line:

    java -jar $GATK_DIR/GenomeAnalysisTK.jar \ -T UnifiedGenotyper \ -R $REF_GENOME \ --input_file $BAM_list \ --dbsnp /isilon/sequencing/GATK_resource_bundle/2.2/b37/dbsnp_137.b37.vcf \ -L $BAIT_BED \ -stand_emit_conf 20.0 \ -stand_call_conf 20.0 \ -glm BOTH \ -A QualByDepth \ -A HaplotypeScore \ -A MappingQualityRankSumTest \ -A ReadPosRankSumTest \ -A FisherStrand \ -A GCContent \ -A AlleleBalanceBySample \ -A VariantType \ --baq CALCULATE_AS_NECESSARY \ -o $CORE_PATH/$OUTPATH/UNIFIED_GENOTYPER/$OUTFILE".raw.UG.vcf" \ -nt $THREADS

    I am using "GenomeAnalysisTK-2.3-9-ge5ebf34"

    I went back and used the --list argument in VariantAnnotator and didn't see this in the list.


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Kurt,

    Actually we haven't released this new annotation yet. It's in the pipe for 2.4, which should come out in a week or so.

  • KurtKurt Member ✭✭✭

    oh, yeah, i just saw that was already stated. sorry about the spam geraldine

  • gupta567varungupta567varun NYMember

    I used GATK version Version=2.6-5-gba531bd and generated my vcf files. When I used UnifiedGenotyper , I did not specify -A VariantType. In this version I am not getting clear distinction of indels and snps although I can separate them into 2 different files using awk or above method cited. For newer version also , should I have to mention -A VariantType??


  • SheilaSheila Broad InstituteMember, Broadie ✭✭✭✭✭


    Hi Varun,

    If you want explicit annotations of SNP/indel, you should use -A VariantType.

    There is also no need to use an awk script to separate indels and SNPs, you can simply use SelectVariants with the variant type selector. Please refer here: http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_SelectVariants.html


Sign In or Register to comment.