Attention:
The frontline support team will be unavailable to answer questions on April 15th and 17th 2019. We will be back soon after. Thank you for your patience and we apologize for any inconvenience!

Does CombineVariants support the gvcf format ?

seruseru BergenMember ✭✭

Hi,

I am merging several genome vcf files apparently created by gvcftools [1]. I noticed that CombineVariants is incorrectly summing allele counts (AC) in some corner cases, and was wondering if the format [2] was supported?

If the format should be supported by the CombineVariants tool (gvcf is VCF 4.1 conformant), I have isolated a minimal example that should be of help for bugfixing.

[1] https://sites.google.com/site/gvcftools/home
[2] https://sites.google.com/site/gvcftools/home/about-gvcf

Thanks,
Paweł

Best Answer

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Pawel,

    We have a separate tool for handling gVCFs called CombineGVCFs. I think it should work on the Illumina gVCFs but we haven't tested that explicitly.

    Note that there are a few major differences between Illumina's gVCFs and those produced by HaplotypeCaller, most notably that the Illumina gVCFs don't include an estimate of reference confidence, so they can't be used for joint genotyping.

  • seruseru BergenMember ✭✭

    Hi Geraldine,

    I think I didn't even try CombineGVCFs. The docs state that it is not ment for other GVCFs than these produced by HC, so I assumed they are quite specific. So what you are saying is that it could work fine for purely technical merge (my aim is to get correct allele frequencies), and the warning is just in case someone tries to joint genotype GVCFs from other source? I will give it a try and report back.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @seru,

    That's right, I'm hoping a purely technical merge should work -- I think only GenotypeGVCFs rejects non-HC gVCFs outright. But I could be wrong. Thanks for trying.

    Otherwise my worry about CombineVariants is that it's not equipped to do an appropriate merge of hom-ref blocks that differ in span between files, so I don't think that should be used at all.

    Does the gvcftools package itself not offer a file combining functionality?

  • seruseru BergenMember ✭✭

    Hi again,

    CombineGVCFs did not accept the illumina gvcf files. The error was:

    ERROR MESSAGE: The list of input alleles must contain as an allele but that is not the case at position 10184776; please use the Haplotype Caller with gVCF output to generate appropriate records

    I guess it might not be enough to replace the . that is representing hom-refs in illumina gvcf files with NON_REF, because the problematic record is not the first hom-ref record in the input. It is neither the first record that has hom-ref and variant call in the same position (as consecutive entries):

    chr2    10192851        .       G       .       2052.63 PASS    DP=685;MQ=49;MQ0=0      GT:AD:DP:GQ:MQ:GQX      0/0:680:685:99:49:99
    chr2    10192851        rs77171153      G       GT,GTT  17259.52        R8      BaseQRankSum=1.458;DP=685;FS=3.956;HaplotypeScore=2830.7693;MQ=49;MQ0=0;MQRankSum=-0.007;QD=25.20;ReadPosRankSum=0.718;SB=-2280.41;TI=NM_003597,NM_001177716,NM_001177718;GI=KLF11,KLF11,KLF11;FC=Noncoding,Noncoding,Noncoding;EXON    GT:AD:DP:GQ:PL:MQ:GQX:VF        1/2:65,147,353:685:99:17260,9855,8843,2829,0,2505:49:99:0.885
    

    Regarding your worry, these gvcf files have no bands/blocks, so in this respect they are very much like regular vcfs. The only difference I found is the ALT column containing "." for hom-ref calls, which you don't have in vcfs.

    The gvcftools package has a merging tool indeed, but the INFO COLUMN in merged vcf contains a dot for every record. I didn't find a way to force the software to recalculate the INFO field (if you recal, I am interested in AC and AN stats).

    Paweł

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Ah, it seems we have the same check on CombineGVCFs then, so that's not an option. But from your description this sounds like just an all-sites vcf (which is not what we call gVCF), so CombineVariants should handle it correctly after all. Sorry about the confusion then, this is just the first time we're confronted with that format.

    I guess it remains to determine whether the issue you're seeing is a bug or a limitation of the tool. Can you please describe exactly what the problem is, and post the records before and after combining?

  • seruseru BergenMember ✭✭

    I tested in on a following minimal example with GATK 3.2-2:
    Having 9 all-site vcf fiiles with only two entries each (the position/genotype for the entries are identical in all files/samples, but INFO fields vary):

    chr7    44198944    .   G   .   349.05  PASS    DP=111;MQ=50;MQ0=0  GT:AD:DP:GQ:MQ:GQX  0/0:111:111:99:50:99
    chr7    44198944    rs76717641  G   GC  4623.29 PASS    BaseQRankSum=0.577;DP=111;FS=0.000;HRun=1;HaplotypeScore=1414.9202;MQ=50;MQ0=0;MQRankSum=-0.921;QD=41.65;ReadPosRankSum=1.139;SB=-0.00;TI=NM_000162;GI=GCK;FC=Noncoding GT:AD:DP:GQ:PL:MQ:GQX:VF    1/1:1,110:111:99:4623,292,0:50:99:0.991
    

    I invoked

    java -jar $GATK -R $REFERENCE -T CombineVariants -env -o combined_s1-9.vcf -V S1_S1.genome.vcf ... -V S9_S9.genome.vcf 
    

    The resulting file has one variant with incorrect AC (should be AC=18, as all samples are 1/1). According to the merged vcf only samples 1 and 2 are hom-alt.

    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  2351-SJ-0001    2351-SJ-0002    2351-SJ-0003    2351-SJ-0004    2351-SJ-0005    2351-SJ-0006    2351-SJ-0007    2351-SJ-0008    2351-SJ-0009
    chr7    44198944    rs76717641  G   GC  4623.29 PASS    AC=4;AF=0.222;AN=18;DP=1500;FC=Noncoding;FS=0.000;GI=GCK;HRun=1;MQ=50;MQ0=0;SB=-0.00;TI=NM_000162;set=Intersection  GT:AD:DP:GQ:GQX:MQ:PL:VF    1/1:1,110:111:99:99:50:4623,292,0:0.991 1/1:2,64:66:99:99:50:2762,198,0:0.970   0/0:74:75:99:99:50  0/0:70:70:99:99:50  0/0:114:116:99:99:50    0/0:66:67:99:99:50  0/0:71:71:99:99:50  0/0:81:82:99:99:50  0/0:92:92:99:99:50
    

    Running the same command on a subset of samples (5 - 9), I also get AC=4, but now samples 5 and 6 are supposedly hom-alt:

    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  2351-SJ-0005    2351-SJ-0006    2351-SJ-0007    2351-SJ-0008    2351-SJ-0009
    chr7    44198944    rs76717641  G   GC  4826.17 PASS    AC=4;AF=0.400;AN=10;DP=856;FC=Noncoding;FS=0.000;GI=GCK;HRun=1;MQ=50;MQ0=0;SB=-0.00;TI=NM_000162;set=Intersection   GT:AD:DP:GQ:GQX:MQ:PL:VF    1/1:3,113:116:99:99:50:4826,307,0:0.974 1/1:1,66:67:99:99:50:2838,202,0:0.985   0/0:71:71:99:99:50  0/0:81:82:99:99:50  0/0:92:92:99:99:50
    

    Let me know if you need the files in order to reproduce it.

  • seruseru BergenMember ✭✭

    Sorry, I have left this thread for a bit too long and forgot to reply. You are right, this format is all-site vcf. I dropped the idea of combining the vcf for now, so can't report if it worked. I will mark your answer though. Thank you for the help.

    Paweł

Sign In or Register to comment.