Are overlapping calls legal in a g.vcf file?

Some background: I'm trying to directly convert haploid g.vcf files to fasta format, to utilise all sequencing information I have for downstream analysis, and not just the snips.

In one of my samples, I've encountered two records that overlap the same position, which makes parsing the g.vcf file a lot harder.
gi|194447306|ref|NC_011083.1| 346996 . CAACA C,<NON_REF> 3723.97 . BaseQRankSum=0.371;ClippingRankSum=1.279;DP=89;MLEAC=1,0;MLEAF=1.00,0.00;MQRankSum=0.536;RAW_MQ=266004.00;ReadPosRankSum=-0.701 GT:AD:DP:GQ:PL:SB 1:1,83,0:84:99:3763,0,3808:0,1,12,71 gi|194447306|ref|NC_011083.1| 347000 . A ATGTC,<NON_REF> 3723.97 . BaseQRankSum=0.825;ClippingRankSum=-1.732;DP=84;MLEAC=1,0;MLEAF=1.00,0.00;MQRankSum=-0.990;RAW_MQ=250875.00;ReadPosRankSum=-0.676 GT:AD:DP:GQ:PL:SB 1:1,83,0:84:99:3763,0,3808:0,1,12,71
The last 'A' of the ref at 346996 is the same position as the 'A' of the ref at position 347000. I was wondering if this construction is allowed in a vcf file?

I've looked at the vcf4.2 standard, but I couldn't find a definitive answer. One the one hand, multiple records with the same POS are explicitly allowed.
On the other hand, the following suggests to me that records should not overlap, at least not for single-sample haploid g.vcf files:
"ALT haplotypes are constructed from the REF haplotype by taking the REF allele bases at the POS in the reference genotype and replacing them with the ALT bases. In essence, the VCF record specifies a-REF-t and the alternative haplotypes are a-ALT-t for each alternative allele."

If I simply take the above records, and replace the REF with ALT, I won't get the true haplotype for my sample, since the two records overlap. So my haplotype reconstructed that way will be longer then the actual haplotype.

Tagged:

Best Answer

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Redmar_van_den_Berg
    Hi,

    Yes, the records are weird. But, can you please post the original bam file and bamout file in the region? Also, how are you planning to generate the FASTA file?

    Thanks,
    Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    It is absolutely legal in a VCF to have overlapping records. Examples of such cases include: calls made separately for SNPs and indels at the same site by UnifiedGenotyper (suboptimal but legal); a deletion in one sample spanning a SNP in another; a deletion in one sample dovetailing a deletion further down in another sample (or on the other chromosome in the same sample).

  • @Sheila

    I have attached the files you requested. The exact command that I ran:

    bamfile=093_interval.bam
    output=093_interval.g.vcf
    bamout=093_interval_bamout.bam
    reference=NC_011083.fasta
    
    gatk -T HaplotypeCaller \
                --sample_ploidy 1 \
                -R $reference \
                -I $bamfile \
                -o $output \
                -ERC GVCF \
               -bamout $bamout 
    

    With gatk version 3.5-0-g36282e4

    I've also uploaded an igv screenshot of the region, its pretty messy.

    To generate the fasta file I've written a python script that parses the g.vcf file with pyvcf. For a homref block, it outputs the sequence from the reference, and it takes snips, insertions and deletions from the g.vcf file.

  • @Geraldine_VdAuwera

    Thank you for confirming constructions like this are legal in vcf files. However, I'm still confused as to why it has happened in my sample. Its a single-sample g.vcf file created with HaplotypeCaller, so in that sense I wouldn't expect overlapping record. Should I just consider it a quirk and ignore it, or could it be indicative of a deeper problem?

Sign In or Register to comment.