The current GATK version is 3.2-2

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Bug Bulletin: The GenomeLocPArser error in SplitNCigarReads has been fixed; if you encounter it, use the latest nightly build.

# Is there any description or a marker of Indels in VCF file?

Posts: 10Member
edited January 2013

Hello, I'm maybe missing something obvious but it seems a GATK vcf file does not tell a given variant is SNP, Insertion or deletion. Did I missed some command when I called variations? I can easily classify variations by eye or a script from a given vcf entry but the entry does not explicitly say variation type.

Here are deletions:

Here are insertions:

Ld04 23671 . G GAAAT 6952 . AC=2;AF=1.00;AN=2;DP=100;FS=0.000;HaplotypeScore=24.8695;MLEAC=2;MLEAF=1.00;MQ=59.54;MQ0=0;QD=69.52;SB=-3.355e+03 GT:AD:DP:GQ:PL 1/1:65,34:100:99:6952,301,0

Here are SNPs

Ld04 19890 . T C 3276.01 . AC=2;AF=1.00;AN=2;DP=85;Dels=0.00;FS=0.000;HaplotypeScore=0.7887;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=38.54;SB=-1.614e+03 GT:AD:DP:GQ:PL 1/1:0,85:85:99:3276,253,0

I do not see any markers that tell me variation type. Nor the length of variations. Such information is given in a vcf of Samtools.

PS I used this command line of V 2.0-38:

java -jar ~/GenomeAnalysisTK-2.0-38/GenomeAnalysisTK.jar -T UnifiedGenotyper --genotype_likelihoods_model BOTH -R ref1 -I IN.bam -o OUT.GK2.vcf

Thank you.

Hideo

Post edited by Geraldine_VdAuwera on
Tagged:

Hi Hideo, The information you want is not part of the basic VCF specification (see here for details). I also recommend reading the FAQ article about how to interpret VCF files generated by GATK tools.

Geraldine Van der Auwera, PhD

• Posts: 10Member

Hi Geraldine,

It might not be the basic VCF specification. But other programs like Freebayes (BC), Mpileup (Heng) and Cortex (Zam), which I think are used in the 1000G, provide the information shown below. GATK may have different strategy than these programs. But indel calling is notoriously unreliable so it is good to compare the results. Most of users would prefer to have such information. I cannot imagine users who do not want such information. Following the VCF specification is important to avoid chaos. But omitting essential information from GATK vcf files while identifying indels is not good for analysis.

Examples:

Freebayes:

Ld01 46881 . G GC 7.95959 . AB=0.037037;ABP=53.2759;AC=1;AF=0.5;AN=2;AO=1;CIGAR=1M1I;DP=27;DPRA=0;EPP=5.18177;EPPR=13.5202;HWE=-0;LEN=1;MEANALT=2;MQM=60;MQMR=58.8;NS=1;NUMALT=1;ODDS=1.65844;PAIRED=1;PAIREDR=0.96;RO=25;RPP=5.18177;RPPR=57.2971;RUN=1;SAP=5.18177;SRP=13.5202;TYPE=ins;XAI=0.04;XAM=0.08;XAS=0.04;XRI=0.00402491;XRM=0.017469;XRS=0.013444;technology.C=1;BVAR

Ld01 43823 . TCACA T 14.5579 . AB=0.0434783;ABP=44.6459;AC=1;AF=0.5;AN=2;AO=1;CIGAR=1M4D;DP=23;DPRA=0;EPP=5.18177;EPPR=3.12459;HWE=-0;LEN=4;MEANALT=4;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=3.30227;PAIRED=1;PAIREDR=1;RO=19;RPP=5.18177;RPPR=44.2683;RUN=1;SAP=5.18177;SRP=3.12459;TYPE=del;XAI=0;XAM=0;XAS=0;XRI=0.00314039;XRM=0.0059474;XRS=0.00280702;technology.C=1;BVAR GT:GQ:DP:RO:QR:AO:QA:GL 0/1:13.1337:23:19:673:1:83:-21.645,-18.8129,-74.5345

Ld01 46405 . T C 0.0788024 . AB=0.0666667;ABP=27.4756;AC=1;AF=0.5;AN=2;AO=1;CIGAR=1X;DP=15;DPRA=0;EPP=5.18177;EPPR=5.49198;HWE=-0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=4.00028;PAIRED=1;PAIREDR=1;RO=14;RPP=5.18177;RPPR=33.4109;RUN=1;SAP=5.18177;SRP=5.49198;TYPE=snp;XAI=0;XAM=0.0133333;XAS=0.0133333;XRI=0.0021645;XRM=0.0229588;XRS=0.0207943;technology.C=1 GT:GQ:DP:RO:QR:AO:QA:GL 0/1:17.4518:15:14:514:1:30:-3,-3.33936,-46.6271

Mpileup:

Ld01 70292 . CAC C 12.7 . INDEL;DP=24;VDB=0.0535;AF1=0.5;AC1=1;DP4=9,10,3,0;MQ=60;FQ=15.6;PV4=0.22,0.27,1,1 GT:PL:GQ 0/1:50,0,255:53

Ld01 75099 . T TAT 19.5 . INDEL;DP=33;VDB=0.0490;AF1=0.5;AC1=1;DP4=13,6,3,1;MQ=60;FQ=22.5;PV4=1,0.26,1,0.4 GT:PL:GQ 0/1:57,0,255:60

Ld01 56004 . G A 222 . DP=43;VDB=0.0535;AF1=1;AC1=2;DP4=1,0,20,21;MQ=57;FQ=-111;PV4=1,0.12,0.38,0.19 GT:PL:GQ 1/1:255,84,0:99

Cortex:

Ld04 283741 var_94 AAC A . PASS PV=17;SVLEN=-2;SVTYPE=DEL GT:COV:GT_CONF 0/1:30,4:11.77

Ld04 294638 var_457 A AACACACACACAC . PASS PV=32;SVLEN=12;SVTYPE=INS GT:COV:GT_CONF 1/1:1,22:4.04

• Posts: 683GATK Developer mod

Hi Hideo,

Ultimately, adding an 'indel' tag to the VCF record is completely superfluous as it's utterly obvious that a record represents an indel just by looking at it. If you are looking for some way to pull out just the indel records from a file then you can simply use Select Variants (the recommended method) or a trivial 'awk' command. Given that you are the first user ever to request it (and we've been making indel calls for several years), I can't imagine that this tag is "essential." Although we may well decide to add it in the future, we'd be more than happy to accept a patch if you (or another user) wanted to implement it sooner!

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 683GATK Developer mod

Incidentally, I just noticed that there's an IndelType annotation which you can use if you want. So you'd add '-A IndelType' to your Unified Genotyper command-line to get the indels annotated (but not SNPs).

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 10Member

Thank you.
"the VCF record is completely superfluous as it's utterly obvious that a record represents an indel just by looking at it" Often we use script so it is not so obvious when we just want to grab them (grep/awk). Currently I just check the length of ref and alt bases to determine variant type.

• Posts: 4Member

@ebanks said: Ultimately, adding an 'indel' tag to the VCF record is completely superfluous as it's utterly obvious that a record represents an indel just by looking at it. If you are looking for some way to pull out just the indel records from a file then you can simply use Select Variants (the recommended method) or a trivial 'awk' command.

It does seem like a peculiar tag to add. At least the Freebayes example specifies ins or del.

I usually have my VCF files split into separate files. One for INDELs and one for SNPs since they get used downstream in different ways. Python scripts that control most of my execution take care of this, but you could use the GATK tool and subsequently do a count of non-comment lines in the resulting files to automatically determine how many indels you had.

GATK method to select only indels from a VCF (as per documentation):

 java -Xmx2g -jar GenomeAnalysisTK.jar \
-R ref.fasta \
-T SelectVariants \
--variant input.vcf \
-o output.vcf \
-selectType INDEL

• Posts: 683GATK Developer mod

Hi everyone,

I just caved in and implemented a new annotation for the Variant Annotator called VariantType. It should do exactly as you want with some minor differences: insertions and deletions are treated separately and have some extra meta information associated with them; and records are annotated as being multi-allelic if that's the case.

The only catch here is that our 2.3 release has already been packaged up and is being rolled out today. So this new annotation will not be available until the 2.4 release.

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 10Member

Thank you very much. I thought this would never happen .., but flexibility shows your strength.

• Posts: 683GATK Developer mod

Glad to help. One more note: it is not enabled by default at this point (we want to see how useful it is first). You'll need to add "-A VariantType" to your Unified Genotyper / Haplotype Caller / Variant Annotator command-line in order to include it.

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 152Member ✭✭✭

Hi, I decided to give this a shot since some of the people in my group asked me about it. Anyways, just FYI. I get.

ERROR MESSAGE: Invalid command line: Argument annotation has a bad value: Class VariantType is not found; please check that you have specified the class name correctly

cmd line:

java -jar $GATK_DIR/GenomeAnalysisTK.jar \ -T UnifiedGenotyper \ -R$REF_GENOME \
--input_file $BAM_list \ --dbsnp /isilon/sequencing/GATK_resource_bundle/2.2/b37/dbsnp_137.b37.vcf \ -L$BAIT_BED \
-stand_emit_conf 20.0 \
-stand_call_conf 20.0 \
-glm BOTH \
-A QualByDepth \
-A HaplotypeScore \
-A MappingQualityRankSumTest \
-A FisherStrand \
-A GCContent \
-A AlleleBalanceBySample \
-A VariantType \
--baq CALCULATE_AS_NECESSARY \
-o $CORE_PATH/$OUTPATH/UNIFIED_GENOTYPER/$OUTFILE".raw.UG.vcf" \ -nt$THREADS

I am using "GenomeAnalysisTK-2.3-9-ge5ebf34"

I went back and used the --list argument in VariantAnnotator and didn't see this in the list.

Kurt

Hi Kurt,

Actually we haven't released this new annotation yet. It's in the pipe for 2.4, which should come out in a week or so.

Geraldine Van der Auwera, PhD

• Posts: 152Member ✭✭✭

oh, yeah, i just saw that was already stated. sorry about the spam geraldine

• NYPosts: 16Member

HI I used GATK version Version=2.6-5-gba531bd and generated my vcf files. When I used UnifiedGenotyper , I did not specify -A VariantType. In this version I am not getting clear distinction of indels and snps although I can separate them into 2 different files using awk or above method cited. For newer version also , should I have to mention -A VariantType??

Regards Varun