
Using CAVA-Annotated VCF file for VariantsToTable

Hi all,

I generated a VCF file with the 5 dollar genome analysis pipeline, then annotated its variants with a script called CAVA (https://tinyurl.com/y6bjhskc). CAVA added some 18 extra ##INFO lines to the VCF header and appended extra annotations to the INFO column of each variant, as shown below. I then tried to run the GATK VariantsToTable tool on the annotated file. With GATK v3.3 I get the following error:

ERROR MESSAGE: Your input file has a malformed header: unexpected tag count 6 in line <ID=TYPE,Number=.,Type=String,Description="Variant type: Substitution, Insertion, Deletion or Complex",Source="CAVA",Version="1.2.2">

VCF file:

##fileformat=VCFv4.2
##fileDate=2019-03-04
##INFO=<ID=TYPE,Number=.,Type=String,Description="Variant type: Substitution, Insertion, Deletion or Complex",Source="CAVA",Version="1.2.2">
##INFO=<ID=GENE,Number=.,Type=String,Description="HGNC gene symbol",Source="CAVA",Version="1.2.2">
##INFO=<ID=TRANSCRIPT,Number=.,Type=String,Description="Transcript identifier",Source="CAVA",Version="1.2.2">
##INFO=<ID=GENEID,Number=.,Type=String,Description="Gene identifier",Source="CAVA",Version="1.2.2">
##INFO=<ID=TRINFO,Number=.,Type=String,Description="Transcript information: Strand/Length of transcript/Number of exons/Length of coding DNA + UTR/Protein length",Source="CAVA",Version="1.2.2">
##INFO=<ID=LOC,Number=.,Type=String,Description="Location of variant in transcript",Source="CAVA",Version="1.2.2">
##INFO=<ID=CSN,Number=.,Type=String,Description="CSN annotation",Source="CAVA",Version="1.2.2">
##INFO=<ID=PROTPOS,Number=.,Type=String,Description="Protein position",Source="CAVA",Version="1.2.2">
##INFO=<ID=PROTREF,Number=.,Type=String,Description="Reference amino acids",Source="CAVA",Version="1.2.2">
##INFO=<ID=PROTALT,Number=.,Type=String,Description="Alternate amino acids",Source="CAVA",Version="1.2.2">
##INFO=<ID=CLASS,Number=.,Type=String,Description="5PU: Variant in 5 prime untranslated region, 3PU: Variant in 3 prime untranslated region, INT: Intronic variant that does not alter splice site bases, SS: Intronic variant that alters a splice site base but not an ESS or SS5 base, ESS: Variant that alters essential splice site base (+1,+2,-1,-2), SS5: Variant that alters the +5 splice site base, but not an ESS base, SY: Synonymous change caused by a base substitution (i.e. does not alter amino acid), NSY: Nonsynonymous change (missense) caused by a base substitution (i.e. alters amino acid), IF: Inframe insertion and/or deletion (variant alters the length of coding sequence but not the frame), IM: Variant that alters the start codon, SG: Variant resulting in stop-gain (nonsense) mutation, SL: Variant resulting in stop-loss mutation, FS: Frameshifting insertion and/or deletion (variant alters the length and frame of coding sequence), EE: Inframe deletion, insertion or base substitution which affects the first or last three bases of the exon",Source="CAVA",Version="1.2.2">
##INFO=<ID=SO,Number=.,Type=String,Description="Sequence Ontology term",Source="CAVA",Version="1.2.2">
##INFO=<ID=ALTFLAG,Number=.,Type=String,Description="None: variant has the same CSN annotation regardless of its left or right-alignment, AnnNotClass/AnnNotSO/AnnNotClassNotSO: indel has an alternative CSN but the same CLASS and/or SO, AnnAndClass/AnnAndSO/AnnAndClassNotSO/AnnAndSONotClass/AnnAndClassAndSO: Multiple CSN with different CLASS and/or SO annotations",Source="CAVA",Version="1.2.2">
##INFO=<ID=ALTANN,Number=.,Type=String,Description="Alternate CSN annotation",Source="CAVA",Version="1.2.2">
##INFO=<ID=ALTCLASS,Number=.,Type=String,Description="Alternate CLASS annotation",Source="CAVA",Version="1.2.2">
##INFO=<ID=ALTSO,Number=.,Type=String,Description="Alternate SO annotation",Source="CAVA",Version="1.2.2">
##INFO=<ID=IMPACT,Number=.,Type=String,Description="Impact group the variant is stratified into",Source="CAVA",Version="1.2.2">
##INFO=<ID=DBSNP,Number=.,Type=String,Description="rsID from dbSNP",Source="CAVA",Version="1.2.2">

.
.
.

#CHROM   POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
chr1    21840336    .   T   C   40.74   PASS    AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=20.37;SOR=2.303;TYPE=Substitution;TRANSCRIPT=ENST00000374840;GENE=ALPL;GENEID=ENSG00000162551;TRINFO=+/69.0kb/12/2.6kb/524;LOC=In1/2;CSN=c.-105+4326T>C;PROTPOS=.;PROTREF=.;PROTALT=.;CLASS=5PU;SO=5_prime_UTR_variant;IMPACT=3;ALTANN=.;ALTCLASS=.;ALTSO=. GT:AD:DP:GQ:PL  1/1:0,2:2:6:68,6,0

Is there a simple way to fix this? I could remove the extra header lines manually for now, but since I want to automate this later, I would prefer not to put a parser or parser-like script between CAVA and GATK.
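For reference, the error message counts 6 tags per line, and the CAVA lines carry Source and Version on top of the usual ID, Number, Type and Description. If the parser in this GATK version only accepts those four keys (an assumption, not something I have confirmed), a stopgap would be to drop the two extra tags and leave the ##INFO definitions in place. A minimal Python sketch of that idea follows; the names strip_extra_tags and clean_vcf are hypothetical helpers, not part of CAVA or GATK.

```python
import re

# Matches one or more trailing ,Source="..." / ,Version="..." tags just before
# the closing ">" of a header line. Assumption: these are the only extra keys.
EXTRA_TAGS = re.compile(r'(?:,(?:Source|Version)="[^"]*")+>(\s*)$')


def strip_extra_tags(line: str) -> str:
    """Return a ##INFO header line without trailing Source/Version tags.

    Any other line is returned unchanged.
    """
    if not line.startswith("##INFO=<"):
        return line
    return EXTRA_TAGS.sub(r">\1", line)


def clean_vcf(in_path: str, out_path: str) -> None:
    """Copy a plain-text (uncompressed) VCF, rewriting only ##INFO lines."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(strip_extra_tags(line))
```

This keeps the ID, Number, Type and Description of every CAVA annotation, so the fields stay declared in the header. It is still a script in between, just a very small one.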

Secondly, is it possible to retrieve the annotations that CAVA added to the INFO column with VariantsToTable? For example "GENE=" or "GENEID=", which do not exist in the original VCF file.
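As far as I understand, VariantsToTable picks INFO annotations by name through its -F argument, so once the header is accepted, GENE and GENEID should be requestable like any other INFO key. To make clear what I am after, here is a rough Python sketch of the extraction. The helpers parse_info and info_to_row are hypothetical, the "NA" placeholder for absent keys is my own choice, and the INFO string is a shortened excerpt of the record above.

```python
def parse_info(info: str) -> dict:
    """Split a VCF INFO string into a dict; keys without '=' are flags (True)."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def info_to_row(info: str, wanted: list, missing: str = "NA") -> list:
    """Return the requested INFO values in order, using `missing` for absent keys."""
    parsed = parse_info(info)
    return [str(parsed.get(name, missing)) for name in wanted]


# Shortened INFO column taken from the chr1:21840336 record in this post.
info = "AC=2;AF=1.00;AN=2;TYPE=Substitution;GENE=ALPL;GENEID=ENSG00000162551"
row = info_to_row(info, ["GENE", "GENEID", "DBSNP"])
print("\t".join(row))
```

DBSNP is declared in the header but absent from this record, so it falls back to the placeholder.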

Thanks for your time and help.
