Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

Is there any description or a marker of Indels in VCF file?

HideoHideo Posts: 10Member
edited January 2013 in Ask the GATK team

Hello, I'm maybe missing something obvious but it seems a GATK vcf file does not tell a given variant is SNP, Insertion or deletion. Did I missed some command when I called variations? I can easily classify variations by eye or a script from a given vcf entry but the entry does not explicitly say variation type.

Here are deletions:

d02 264482 . TT T 389.93 . AC=2;AF=1.00;AN=2;BaseQRankSum=-0.224;DP=25;FS=0.000;HaplotypeScore=58.0966;MLEAC=2;MLEAF=1.00;MQ=47.51;MQ0=0;MQRankSum=1.714;QD=15.60;ReadPosRankSum=0.075;SB=-1.112e+02 GT:AD:DP:GQ:PL 1/1:15,9:20:9:431,9,0

Ld04 26597 . CCC C 2923.96 . AC=2;AF=1.00;AN=2;BaseQRankSum=-2.217;DP=98;FS=2.954;HaplotypeScore=145.2957;MLEAC=2;MLEAF=1.00;MQ=59.82;MQ0=0;MQRankSum=0.283;QD=29.84;ReadPosRankSum=1.150;SB=-1.306e+03 GT:AD:DP:GQ:PL 1/1:58,31:98:99:2966,195,0

Here are insertions:

Ld04 23671 . G GAAAT 6952 . AC=2;AF=1.00;AN=2;DP=100;FS=0.000;HaplotypeScore=24.8695;MLEAC=2;MLEAF=1.00;MQ=59.54;MQ0=0;QD=69.52;SB=-3.355e+03 GT:AD:DP:GQ:PL 1/1:65,34:100:99:6952,301,0

Ld04 26880 . T TTATT 4973 . AC=2;AF=1.00;AN=2;BaseQRankSum=-0.552;DP=99;FS=2.935;HaplotypeScore=116.0082;MLEAC=2;MLEAF=1.00;MQ=59.46;MQ0=0;MQRankSum=0.192;QD=50.23;ReadPosRankSum=1.834;SB=-1.950e+03 GT:AD:DP:GQ:PL 1/1:49,38:98:99:4973,247,0

Here are SNPs

Ld04 19890 . T C 3276.01 . AC=2;AF=1.00;AN=2;DP=85;Dels=0.00;FS=0.000;HaplotypeScore=0.7887;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=38.54;SB=-1.614e+03 GT:AD:DP:GQ:PL 1/1:0,85:85:99:3276,253,0

Ld04 19963 . A G 3571.01 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.895;DP=96;Dels=0.00;FS=0.000;HaplotypeScore=2.4678;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;MQRankSum=1.769;QD=37.20;ReadPosRankSum=-0.316;SB=-1.523e+03 GT:AD:DP:GQ:PL 1/1:3,93:96:99:3571,205,0

I do not see any markers that tell me variation type. Nor the length of variations. Such information is given in a vcf of Samtools.

PS I used this command line of V 2.0-38:

java -jar ~/GenomeAnalysisTK-2.0-38/GenomeAnalysisTK.jar -T UnifiedGenotyper --genotype_likelihoods_model BOTH -R ref1 -I IN.bam -o OUT.GK2.vcf

Thank you.

Hideo

Post edited by Geraldine_VdAuwera on

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,080Administrator, GATK Developer admin

    Hi Hideo, The information you want is not part of the basic VCF specification (see here for details). I also recommend reading the FAQ article about how to interpret VCF files generated by GATK tools.

    Geraldine Van der Auwera, PhD

  • HideoHideo Posts: 10Member

    Hi Geraldine,

    It might not be the basic VCF specification. But other programs like Freebayes (BC), Mpileup (Heng) and Cortex (Zam), which I think are used in the 1000G, provide the information shown below. GATK may have different strategy than these programs. But indel calling is notoriously unreliable so it is good to compare the results. Most of users would prefer to have such information. I cannot imagine users who do not want such information. Following the VCF specification is important to avoid chaos. But omitting essential information from GATK vcf files while identifying indels is not good for analysis.

    Examples:

    Freebayes:

    Ld01 46881 . G GC 7.95959 . AB=0.037037;ABP=53.2759;AC=1;AF=0.5;AN=2;AO=1;CIGAR=1M1I;DP=27;DPRA=0;EPP=5.18177;EPPR=13.5202;HWE=-0;LEN=1;MEANALT=2;MQM=60;MQMR=58.8;NS=1;NUMALT=1;ODDS=1.65844;PAIRED=1;PAIREDR=0.96;RO=25;RPP=5.18177;RPPR=57.2971;RUN=1;SAP=5.18177;SRP=13.5202;TYPE=ins;XAI=0.04;XAM=0.08;XAS=0.04;XRI=0.00402491;XRM=0.017469;XRS=0.013444;technology.C=1;BVAR

    Ld01 43823 . TCACA T 14.5579 . AB=0.0434783;ABP=44.6459;AC=1;AF=0.5;AN=2;AO=1;CIGAR=1M4D;DP=23;DPRA=0;EPP=5.18177;EPPR=3.12459;HWE=-0;LEN=4;MEANALT=4;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=3.30227;PAIRED=1;PAIREDR=1;RO=19;RPP=5.18177;RPPR=44.2683;RUN=1;SAP=5.18177;SRP=3.12459;TYPE=del;XAI=0;XAM=0;XAS=0;XRI=0.00314039;XRM=0.0059474;XRS=0.00280702;technology.C=1;BVAR GT:GQ:DP:RO:QR:AO:QA:GL 0/1:13.1337:23:19:673:1:83:-21.645,-18.8129,-74.5345

    Ld01 46405 . T C 0.0788024 . AB=0.0666667;ABP=27.4756;AC=1;AF=0.5;AN=2;AO=1;CIGAR=1X;DP=15;DPRA=0;EPP=5.18177;EPPR=5.49198;HWE=-0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=4.00028;PAIRED=1;PAIREDR=1;RO=14;RPP=5.18177;RPPR=33.4109;RUN=1;SAP=5.18177;SRP=5.49198;TYPE=snp;XAI=0;XAM=0.0133333;XAS=0.0133333;XRI=0.0021645;XRM=0.0229588;XRS=0.0207943;technology.C=1 GT:GQ:DP:RO:QR:AO:QA:GL 0/1:17.4518:15:14:514:1:30:-3,-3.33936,-46.6271

    Mpileup:

    Ld01 70292 . CAC C 12.7 . INDEL;DP=24;VDB=0.0535;AF1=0.5;AC1=1;DP4=9,10,3,0;MQ=60;FQ=15.6;PV4=0.22,0.27,1,1 GT:PL:GQ 0/1:50,0,255:53

    Ld01 75099 . T TAT 19.5 . INDEL;DP=33;VDB=0.0490;AF1=0.5;AC1=1;DP4=13,6,3,1;MQ=60;FQ=22.5;PV4=1,0.26,1,0.4 GT:PL:GQ 0/1:57,0,255:60

    Ld01 56004 . G A 222 . DP=43;VDB=0.0535;AF1=1;AC1=2;DP4=1,0,20,21;MQ=57;FQ=-111;PV4=1,0.12,0.38,0.19 GT:PL:GQ 1/1:255,84,0:99

    Cortex:

    Ld04 283741 var_94 AAC A . PASS PV=17;SVLEN=-2;SVTYPE=DEL GT:COV:GT_CONF 0/1:30,4:11.77

    Ld04 294638 var_457 A AACACACACACAC . PASS PV=32;SVLEN=12;SVTYPE=INS GT:COV:GT_CONF 1/1:1,22:4.04

  • ebanksebanks Posts: 679GATK Developer mod

    Hi Hideo,

    Ultimately, adding an 'indel' tag to the VCF record is completely superfluous as it's utterly obvious that a record represents an indel just by looking at it. If you are looking for some way to pull out just the indel records from a file then you can simply use Select Variants (the recommended method) or a trivial 'awk' command. Given that you are the first user ever to request it (and we've been making indel calls for several years), I can't imagine that this tag is "essential." Although we may well decide to add it in the future, we'd be more than happy to accept a patch if you (or another user) wanted to implement it sooner!

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • ebanksebanks Posts: 679GATK Developer mod

    Incidentally, I just noticed that there's an IndelType annotation which you can use if you want. So you'd add '-A IndelType' to your Unified Genotyper command-line to get the indels annotated (but not SNPs).

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • HideoHideo Posts: 10Member

    Thank you.
    "the VCF record is completely superfluous as it's utterly obvious that a record represents an indel just by looking at it" Often we use script so it is not so obvious when we just want to grab them (grep/awk). Currently I just check the length of ref and alt bases to determine variant type.

  • grosscogrossco Posts: 4Member

    @ebanks said: Ultimately, adding an 'indel' tag to the VCF record is completely superfluous as it's utterly obvious that a record represents an indel just by looking at it. If you are looking for some way to pull out just the indel records from a file then you can simply use Select Variants (the recommended method) or a trivial 'awk' command.

    It does seem like a peculiar tag to add. At least the Freebayes example specifies ins or del.

    I usually have my VCF files split into separate files. One for INDELs and one for SNPs since they get used downstream in different ways. Python scripts that control most of my execution take care of this, but you could use the GATK tool and subsequently do a count of non-comment lines in the resulting files to automatically determine how many indels you had.

    GATK method to select only indels from a VCF (as per documentation):

     java -Xmx2g -jar GenomeAnalysisTK.jar \
       -R ref.fasta \
       -T SelectVariants \
       --variant input.vcf \
       -o output.vcf \
       -selectType INDEL
    
  • ebanksebanks Posts: 679GATK Developer mod

    Hi everyone,

    I just caved in and implemented a new annotation for the Variant Annotator called VariantType. It should do exactly as you want with some minor differences: insertions and deletions are treated separately and have some extra meta information associated with them; and records are annotated as being multi-allelic if that's the case.

    The only catch here is that our 2.3 release has already been packaged up and is being rolled out today. So this new annotation will not be available until the 2.4 release.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • HideoHideo Posts: 10Member

    Thank you very much. I thought this would never happen .., but flexibility shows your strength.

  • ebanksebanks Posts: 679GATK Developer mod

    Glad to help. One more note: it is not enabled by default at this point (we want to see how useful it is first). You'll need to add "-A VariantType" to your Unified Genotyper / Haplotype Caller / Variant Annotator command-line in order to include it.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • KurtKurt Posts: 148Member ✭✭✭

    Hi, I decided to give this a shot since some of the people in my group asked me about it. Anyways, just FYI. I get.

    ERROR MESSAGE: Invalid command line: Argument annotation has a bad value: Class VariantType is not found; please check that you have specified the class name correctly

    cmd line:

    java -jar $GATK_DIR/GenomeAnalysisTK.jar \
    -T UnifiedGenotyper \
    -R $REF_GENOME \
    --input_file $BAM_list \
    --dbsnp /isilon/sequencing/GATK_resource_bundle/2.2/b37/dbsnp_137.b37.vcf \
    -L $BAIT_BED \
    -stand_emit_conf 20.0 \
    -stand_call_conf 20.0 \
    -glm BOTH \
    -A QualByDepth \
    -A HaplotypeScore \
    -A MappingQualityRankSumTest \
    -A ReadPosRankSumTest \
    -A FisherStrand \
    -A GCContent \
    -A AlleleBalanceBySample \
    -A VariantType \
    --baq CALCULATE_AS_NECESSARY \
    -o $CORE_PATH/$OUTPATH/UNIFIED_GENOTYPER/$OUTFILE".raw.UG.vcf" \
    -nt $THREADS

    I am using "GenomeAnalysisTK-2.3-9-ge5ebf34"

    I went back and used the --list argument in VariantAnnotator and didn't see this in the list.

    Kurt

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,080Administrator, GATK Developer admin

    Hi Kurt,

    Actually we haven't released this new annotation yet. It's in the pipe for 2.4, which should come out in a week or so.

    Geraldine Van der Auwera, PhD

  • KurtKurt Posts: 148Member ✭✭✭

    oh, yeah, i just saw that was already stated. sorry about the spam geraldine

  • gupta567varungupta567varun NYPosts: 16Member

    HI I used GATK version Version=2.6-5-gba531bd and generated my vcf files. When I used UnifiedGenotyper , I did not specify -A VariantType. In this version I am not getting clear distinction of indels and snps although I can separate them into 2 different files using awk or above method cited. For newer version also , should I have to mention -A VariantType??

    Regards Varun

  • SheilaSheila Broad InstitutePosts: 378Member, GATK Developer, Broadie, Moderator admin

    @gupta567varun

    Hi Varun,

    If you want explicit annotations of SNP/indel, you should use -A VariantType.

    There is also no need to use an awk script to separate indels and SNPs, you can simply use SelectVariants with the variant type selector. Please refer here: http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_SelectVariants.html

    -Sheila

Sign In or Register to comment.