Are overlapping calls legal in a g.vcf file?
Some background: I'm trying to convert haploid g.vcf files directly to FASTA format, so that I can use all the sequencing information I have for downstream analysis, and not just the SNPs.
In one of my samples, I've encountered two records that overlap the same position, which makes parsing the g.vcf file a lot harder.
gi|194447306|ref|NC_011083.1| 346996 . CAACA C,<NON_REF> 3723.97 . BaseQRankSum=0.371;ClippingRankSum=1.279;DP=89;MLEAC=1,0;MLEAF=1.00,0.00;MQRankSum=0.536;RAW_MQ=266004.00;ReadPosRankSum=-0.701 GT:AD:DP:GQ:PL:SB 1:1,83,0:84:99:3763,0,3808:0,1,12,71
gi|194447306|ref|NC_011083.1| 347000 . A ATGTC,<NON_REF> 3723.97 . BaseQRankSum=0.825;ClippingRankSum=-1.732;DP=84;MLEAC=1,0;MLEAF=1.00,0.00;MQRankSum=-0.990;RAW_MQ=250875.00;ReadPosRankSum=-0.676 GT:AD:DP:GQ:PL:SB 1:1,83,0:84:99:3763,0,3808:0,1,12,71
The last 'A' of the REF allele at 346996 (CAACA covers positions 346996 to 347000) is the same reference base as the 'A' of the REF allele at position 347000. Is this construction allowed in a VCF file?
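To make the overlap concrete, here is a minimal Python sketch of how such records could be flagged while parsing. It assumes the records are already sorted by contig and position and have been reduced to (CHROM, POS, REF) tuples; the function name find_overlaps is made up for illustration, and reference-block records that use an END= tag in INFO are not handled.

```python
def find_overlaps(records):
    """Return pairs of records whose REF spans share at least one position.

    records: iterable of (chrom, pos, ref) tuples, sorted by chrom then pos,
    with 1-based POS as in VCF. A record covers pos .. pos + len(ref) - 1.
    """
    overlaps = []
    widest = None  # (chrom, pos, ref, end) of the record reaching furthest right so far
    for chrom, pos, ref in records:
        end = pos + len(ref) - 1
        if widest is not None and widest[0] == chrom and pos <= widest[3]:
            overlaps.append(((widest[1], widest[2]), (pos, ref)))
        if widest is None or widest[0] != chrom or end > widest[3]:
            widest = (chrom, pos, ref, end)
    return overlaps


# The two records from my sample, reduced to CHROM, POS and REF.
records = [
    ("NC_011083.1", 346996, "CAACA"),
    ("NC_011083.1", 347000, "A"),
]
overlaps = find_overlaps(records)
print(overlaps)
```

For the two records above, the first spans 346996 to 347000, so the second record's POS falls inside it and the pair is reported.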
I've looked at the VCF 4.2 standard, but I couldn't find a definitive answer. On the one hand, multiple records with the same POS are explicitly allowed.
On the other hand, the following suggests to me that records should not overlap, at least not for single-sample haploid g.vcf files:
"ALT haplotypes are constructed from the REF haplotype by taking the REF allele bases at the POS in the reference genotype and replacing them with the ALT bases. In essence, the VCF record specifies a-REF-t and the alternative haplotypes are a-ALT-t for each alternative allele."
If I simply take the above records and replace REF with ALT in each, I won't get the true haplotype for my sample, because the two records overlap. A haplotype reconstructed that way will be longer than the actual haplotype.
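The double counting can be illustrated with a toy example. The sketch below applies each record in order without checking for overlaps. The 9-base reference is invented so that the same two alleles sit at positions 3 and 7, and naive_apply is a hypothetical helper, not part of any tool. Which sequence is the "true" haplotype is exactly what I'm unsure about; the sketch only shows that the anchor base of the second record is emitted even though the first record deleted it.

```python
def naive_apply(reference, records):
    """Replace REF with ALT for each (pos, ref, alt) record, ignoring overlaps.

    pos is 1-based, as in VCF. Records are assumed to be sorted by pos.
    """
    out = []
    cursor = 0  # 0-based index of the next reference base not yet consumed
    for pos, ref, alt in records:
        start = pos - 1
        # Sanity check: the REF allele must match the reference at POS.
        assert reference[start:start + len(ref)] == ref
        out.append(reference[cursor:start])  # empty when the record starts before cursor
        out.append(alt)
        cursor = max(cursor, start + len(ref))
    out.append(reference[cursor:])
    return "".join(out)


# Invented reference: CAACA occupies positions 3-7, so the final A is at position 7.
toy_reference = "GGCAACATT"
toy_records = [(3, "CAACA", "C"), (7, "A", "ATGTC")]

naive = naive_apply(toy_reference, toy_records)
print(naive, len(naive))
```

Here the naive result is GGCATGTCTT (10 bases). If the deletion in the first record really removes the A at position 7, the inserted bases would follow the C directly, giving GGCTGTCTT (9 bases), one base shorter than the naive reconstruction.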