
How to handle insertions in a VCF file?

Hi All,
I am trying to annotate a VCF file I got from dbSNP. It includes information about INDELs present in the human genome. But when I try to annotate the file I keep getting the error:

ERROR MESSAGE: The provided VCF file is malformed at approximately line number 20: The reference allele cannot be missing

My assumption (correct me if I am wrong) is that the error occurs because the file includes insertions, which have an empty or null REF entry.
The initial lines in my file are:

##fileformat=VCFv4.0
##source=dbSNP (ftp://ftp.ncbi.nlm.nih.gov/snp/specs/00..VCF_README.txt)
##variationPropertyDocumentationUrl=ftp://ftp.ncbi.nlm.nih.gov/snp/specs/dbSNP_BitField_latest.pdf
##INFO=<ID=RSPOS,Number=1,Type=Integer,Description="Chr position reported in dbSNP">
##INFO=<ID=RV,Number=0,Type=Flag,Description="RS orientation is reversed">
##INFO=<ID=VP,Number=1,Type=String,Description="Variation Property. Documentation is at ftp://ftp.ncbi.nlm.nih.gov/snp/specs/dbSNP_BitField_latest.pdf">
##INFO=<ID=GENEINFO,Number=1,Type=String,Description="Pairs each of gene symbol:gene id. The gene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar (|)">
##INFO=<ID=dbSNPBuildID,Number=1,Type=Integer,Description="First dbSNP Build for RS">
##INFO=<ID=SAO,Number=1,Type=Integer,Description="Variant Allele Origin: 0 . unspecified, 1 . Germline, 2 . Somatic, 3 . Both">
##INFO=<ID=VC,Number=1,Type=String,Description="Variation Class">
##INFO=<ID=VLD,Number=0,Type=Flag,Description="Is Validated. This bit is set if the variant has 2+ minor allele count based on frequency or genotype data.">
#CHROM POS ID REF ALT QUAL FILTER INFO

1 1398673 rs3831366 TAGAG . . . RSPOS=1398673;GENEINFO=148413:LOC148413|81669:CCNL2;dbSNPBuildID=107;SAO=0;GMAF=0.0740815;VC=in.del;VLD;VP=0501282200051501003E0200
1 19270781 rs140586925 C . . . RSPOS=19270781;GENEINFO=246181:AKR7L;dbSNPBuildID=134;SAO=0;GMAF=0.0309505;VC=in.del;VLD;VP=0500000000051500003E0200
1 27358754 rs34008139 G . . . RSPOS=27358754;GENEINFO=9064:MAP3K6;dbSNPBuildID=126;SAO=0;VC=in.del;VP=050200001205000000020200
1 27373180 rs532781899 G . . . RSPOS=27373180;GENEINFO=8547:FCN3;dbSNPBuildID=136;SAO=1,1;GMAF=0.01877;VC=in.del;VLD;VP=0500680012051401113E0200
1 34761407 rs146812843 TGTC . . . RSPOS=34761407;GENEINFO=105378643:LOC105378643|127534:GJB4;dbSNPBuildID=134;SAO=0;VC=in.del;VLD;VP=0502000812050400001E0200
1 35926061 rs140864 GAA . . . RSPOS=35926061;RV;GENEINFO=192669:AGO3|26523:AGO1;dbSNPBuildID=78;SAO=0;GMAF=0.217452;VC=in.del;VLD;VP=0500000000051500003E0200
1 40515336 rs143142866 TG . . . RSPOS=40515336;GENEINFO=64789:EXO5;dbSNPBuildID=134;SAO=0;GMAF=0.0275559;VC=in.del;VLD;VP=0500000012051500003E0200
1 44268269 rs16626 . GGACTTCACG . . RSPOS=44268269;RV;GENEINFO=79033:ERI3;dbSNPBuildID=60;SAO=0;GMAF=0.428914;VC=in.del;VLD;VP=0501000800051701003E0200
1 46815075 rs3215983 AT . . . RSPOS=46815075;RV;GENEINFO=1580:CYP4B1;dbSNPBuildID=106;SAO=0;GMAF=0.135383;VC=in.del;VLD;VP=0503002012051501003E0200

Please let me know what I am doing wrong.

Answers

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    You cannot have lines like

    1 44268269 rs16626 . GGACTTCACG . . 
    

    The reference allele cannot be a dot (missing). The file is invalid.

    Generally speaking, all the other lines are odd too, because they have no ALTs.

    Is this what the file looked like when you got it from dbSNP, or did you modify it?
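
To locate every record of this kind before running GATK, a short script can scan the file. The following is a minimal Python sketch, not part of GATK: the function name `find_missing_alleles` is hypothetical, and it assumes a tab-delimited VCF in which a missing allele is written as `.` or left empty. A missing REF is what triggers the error above; a missing ALT is only reported so the record can be reviewed.

```python
def find_missing_alleles(lines):
    """Return (line_number, record_id, problem) for records lacking REF or ALT.

    Assumes tab-delimited VCF lines; header lines start with '#'.
    """
    problems = []
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        if len(fields) < 5:
            problems.append((line_number, None, "fewer than 5 columns"))
            continue
        record_id, ref, alt = fields[2], fields[3], fields[4]
        if ref in ("", "."):
            problems.append((line_number, record_id, "missing REF"))
        if alt in ("", "."):
            problems.append((line_number, record_id, "missing ALT"))
    return problems


# Two records taken from the question, re-typed with tab separators.
sample = [
    "##fileformat=VCFv4.0",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t19270781\trs140586925\tC\t.\t.\t.\tRSPOS=19270781",
    "1\t44268269\trs16626\t.\tGGACTTCACG\t.\t.\tRSPOS=44268269",
]

for problem in find_missing_alleles(sample):
    print(problem)
```

To check a real file, pass an open file handle instead of the list, for example `find_missing_alleles(open("my_batch.vcf"))`, where `my_batch.vcf` is a placeholder name.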

  • pranalis (Pune; Member)

    Yes, this is what the file looked like.
    I simply downloaded it from dbSNP.
    I used Batch dbSNP: I entered a list of IDs and extracted the data as a VCF file.
    I did not modify the file in any way.

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @pranalis
    Hi,

    It is better to simply download the dbSNP full file instead of using batch dbSNP. You can use -L with the full dbSNP file to restrict to the sites you are interested in. What reference are you using? We provide dbSNP files compatible with the b37 and hg19 human references.

    -Sheila
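
One way to prepare the input for `-L` is to turn the sites of interest into interval strings of the form `contig:start-stop`, one per line, and save them in a file that is passed to `-L`. The following is a minimal Python sketch under stated assumptions: the function name `sites_to_intervals` is hypothetical, each site is treated as a single-base interval, and the contig names must match the chosen reference (for example `1` in b37 versus `chr1` in hg19).

```python
def sites_to_intervals(sites):
    """Convert (contig, position) pairs into 'contig:start-stop' strings.

    Each site becomes a single-base interval; widen the range if the
    variant of interest spans more than one base.
    """
    return [f"{contig}:{position}-{position}" for contig, position in sites]


# Two positions taken from the records in the question (b37-style contig names).
sites = [("1", 1398673), ("1", 19270781)]

for interval in sites_to_intervals(sites):
    print(interval)
```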
