The frontline support team will be slow on the forum because we are occupied with the GATK Workshop on March 21st and 22nd 2019. We will be back and more available to answer questions on the forum on March 25th 2019.
Are overlapping calls legal in a g.vcf file?
Some background: I'm trying to directly convert haploid g.vcf files to fasta format, to utilise all sequencing information I have for downstream analysis, and not just the snips.
In one of my samples, I've encountered two records that overlap the same position, which makes parsing the g.vcf file a lot harder.
gi|194447306|ref|NC_011083.1| 346996 . CAACA C,<NON_REF> 3723.97 . BaseQRankSum=0.371;ClippingRankSum=1.279;DP=89;MLEAC=1,0;MLEAF=1.00,0.00;MQRankSum=0.536;RAW_MQ=266004.00;ReadPosRankSum=-0.701 GT:AD:DP:GQ:PL:SB 1:1,83,0:84:99:3763,0,3808:0,1,12,71 gi|194447306|ref|NC_011083.1| 347000 . A ATGTC,<NON_REF> 3723.97 . BaseQRankSum=0.825;ClippingRankSum=-1.732;DP=84;MLEAC=1,0;MLEAF=1.00,0.00;MQRankSum=-0.990;RAW_MQ=250875.00;ReadPosRankSum=-0.676 GT:AD:DP:GQ:PL:SB 1:1,83,0:84:99:3763,0,3808:0,1,12,71
The last 'A' of the ref at 346996 is the same position as the 'A' of the ref at position 347000. I was wondering if this construction is allowed in a vcf file?
I've looked at the vcf4.2 standard, but I couldn't find a definitive answer. One the one hand, multiple records with the same POS are explicitly allowed.
On the other hand, the following suggests to me that records should not overlap, at least not for single-sample haploid g.vcf files:
"ALT haplotypes are constructed from the REF haplotype by taking the REF allele bases at the POS in the reference genotype and replacing them with the ALT bases. In essence, the VCF record specifies a-REF-t and the alternative haplotypes are a-ALT-t for each alternative allele."
If I simply take the above records, and replace the REF with ALT, I won't get the true haplotype for my sample, since the two records overlap. So my haplotype reconstructed that way will be longer then the actual haplotype.