Gzipped gVCF files

Is there a way to use gzipped gVCF files with HaplotypeCaller and GenotypeGVCFs?

If so, how do you make the index files? I can't seem to get it to work.

(I have searched the forum and can't seem to find a definitive answer. Sorry)

Answers

  • pdexheimer Member ✭✭✭✭

    Yes - make sure you use bgzip, not gzip. The indexes are made with tabix. Alternatively, you can specify the output from any of the steps as a .vcf.gz, and GATK will properly compress and index it.
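
To illustrate the second option in the answer above, here is a minimal sketch that only prints a HaplotypeCaller command whose output name ends in .g.vcf.gz. It assumes GATK 3.x syntax as used elsewhere in this thread; the function name `hc_gvcf_command` and all file names are hypothetical placeholders, and the jar path will differ per installation.

```shell
# Sketch: print (not run) a GATK 3.x HaplotypeCaller command that writes
# a compressed gVCF directly. Arguments: reference, BAM, output path.
# All file names used here are hypothetical placeholders.
hc_gvcf_command() {
  printf 'java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R %s -I %s --emitRefConfidence GVCF -o %s\n' \
    "$1" "$2" "$3"
}

# Dry run: inspect the command before executing it yourself.
hc_gvcf_command reference.fasta sample.bam sample.g.vcf.gz
```

Printing rather than executing keeps the sketch safe to try on a machine without GATK installed.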

  • Great, thanks. So do I do this:

    bgzip test.gVCF

    tabix -p vcf test.gVCF.gz
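
The two commands above can be wrapped with a few safety checks. This is a sketch only: the function name `compress_and_index_gvcf` is hypothetical, and it uses `bgzip -c` so that the uncompressed file is kept (plain `bgzip test.gVCF` replaces the input with test.gVCF.gz).

```shell
# Sketch: compress a gVCF with bgzip and index it with tabix.
# Writes <input>.gz and <input>.gz.tbi, leaving the input in place.
compress_and_index_gvcf() {
  gvcf=$1
  if [ ! -f "$gvcf" ]; then
    echo "input not found: $gvcf" >&2
    return 1
  fi
  # Both tools ship with htslib; fail early if either is missing.
  for tool in bgzip tabix; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool not found on PATH" >&2
      return 2
    fi
  done
  # -c streams the compressed data to stdout; -p vcf selects the VCF preset.
  bgzip -c "$gvcf" > "$gvcf.gz" && tabix -p vcf "$gvcf.gz"
}

# Usage (hypothetical file name):
#   compress_and_index_gvcf test.gVCF
```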

  • Hi - I tested this bgzip compression on a couple of small gVCF files and it all worked fine. When I tried to combine 19 large gVCF files I got this message for each file:

    "doesn't have a sequence dictionary built in, skipping dictionary validation"

    Is this OK? Will it still work properly? I used tabix to index the bgzipped files.

    The command line was:

    java -Xmx45g -Djava.io.tmpdir=/home/LANPARK/mboursnell/javatempdir -jar /opt/gatk/GenomeAnalysisTK.jar -R /home/genetics/canfam3/canfam3.fasta -T GenotypeGVCFs -nt 1 -V gVCF_14809_MS.gVCF.gz -V gVCF_1617_Dennis.gVCF.gz -V gVCF_17289_BGVP.gVCF.gz -V gVCF_23005_V.gVCF.gz -V gVCF_24093_BC.gVCF.gz -V gVCF_24604_WSS_AB.gVCF.gz -V gVCF_25078_WSS_AB.gVCF.gz -V gVCF_25314_SHY.gVCF.gz -V gVCF_25852_SBT.gVCF.gz -V gVCF_26042_BOT.gVCF.gz -V gVCF_26102_LR.gVCF.gz -V gVCF_26133_G.gVCF.gz -V gVCF_26569_FBD.gVCF.gz -V gVCF_7897_FCR.gVCF.gz -V gVCF_CKCS_17377.gVCF.gz -V gVCF_FCR_25382.gVCF.gz -V gVCF_FCR_25384.gVCF.gz -V gVCF_ISP_21897.gVCF.gz -V gVCF_SV_25951.gVCF.gz -o WGS_19_variants_gVCF.vcf -S STRICT
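
With this many inputs, the repeated -V arguments can be generated rather than typed. The helper below is a hedged sketch: `variant_args` is a hypothetical name, and it assumes file names contain no spaces.

```shell
# Sketch: turn a list of gVCF paths into "-V a -V b ..." for a GATK 3.x
# command line. Assumes paths contain no whitespace.
variant_args() {
  args=""
  for gvcf in "$@"; do
    # ${args:+ } adds a separating space only after the first pair.
    args="$args${args:+ }-V $gvcf"
  done
  printf '%s\n' "$args"
}

# Example with two of the files from the command above:
variant_args gVCF_14809_MS.gVCF.gz gVCF_1617_Dennis.gVCF.gz
```

The result can be spliced into the command with `$(variant_args *.gVCF.gz)` once the glob is known to match only the intended files.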

  • jrandall Member

    Geraldine,

    Just to clarify, when you say it should be fine, does that also mean that the issue with speed when handling bgzip / tabix compressed gVCFs has been resolved, or just that the output will be correct (but may take longer to produce)? Basically, I'd like to know whether the pinned post in [http://gatkforums.broadinstitute.org/discussion/5349/genotypegvcfs-warn-track-variant-doesnt-have-a-sequence-dictionary-built-in] still applies, or if this answer supersedes that.

    In my case I am getting the "Track variant<N> doesn't have a sequence dictionary built in, skipping dictionary validation" warning from CombineGVCFs (GATK v3.5) with .g.vcf.gz / .g.vcf.gz.tbi inputs (generated by HaplotypeCaller). I haven't yet tried it on uncompressed data, but it does seem like CombineGVCFs is running somewhat slowly.

    Is there still a known issue with runtime when using compressed gVCFs, or has that been resolved?

    Thanks!

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Sorry for the confusion, I meant it will be accurate/correct. We haven't actually done any work to profile or improve performance when compression is used because we don't use any in production.

  • Can I ask: what does that error message mean?

    Track variant doesn't have a sequence dictionary built in, skipping dictionary validation
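
The warning text itself says that GATK skipped dictionary validation for that input. A rough manual substitute, offered only as a sketch, is to compare the contig names declared in the gVCF header with those in the reference index. The function name `missing_contigs` is hypothetical; the sketch assumes the header carries `##contig=<ID=...>` lines and that a .fai index (as produced by `samtools faidx`) exists for the reference. It checks names only, not lengths or order.

```shell
# Sketch: list contigs named in a (b)gzipped VCF header that are absent
# from a FASTA index. Arguments: <file.vcf.gz> <reference.fasta.fai>
# bgzip output is gzip-compatible, so plain `gzip -dc` can read it.
missing_contigs() {
  gzip -dc "$1" | awk -v fai="$2" '
    BEGIN {
      # Column 1 of a .fai file is the sequence name.
      while ((getline line < fai) > 0) {
        split(line, f, "\t")
        known[f[1]] = 1
      }
    }
    /^#CHROM/ { exit }            # end of header, stop reading
    /^##contig=<ID=/ {
      id = $0
      sub(/^##contig=<ID=/, "", id)
      sub(/[,>].*$/, "", id)
      if (!(id in known)) print id
    }'
}

# Usage (file names taken from the command earlier in the thread):
#   missing_contigs gVCF_14809_MS.gVCF.gz /home/genetics/canfam3/canfam3.fasta.fai
```

Empty output means every contig named in the header also appears in the reference index.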

  • jrandall Member

    I did a few example runs with and without compressed input (input to CombineGVCFs was two samples, either bgzip/tabix compressed by HaplotypeCaller or uncompressed with an associated tribble index).

    A very small region of ~1.3Mbp (using --intervals), with 5 trials for each:

    uncompressed/tribble input, uncompressed output:      8.582(2.35) 8.415(2.33) 8.337(2.33) 8.754(2.44) 8.600(2.231)  median(real time): 8.5s  median(ProgressMeter): 2.3s
    hc-compressed/hc-tabix input, uncompressed output:    7.550(2.54) 7.450(2.51) 7.537(2.55) 7.829(2.67) 8.081(2.78)   median(real time): 7.6s  median(ProgressMeter): 2.6s
    hc-compressed/hc-tabix input, compressed output:      7.642(2.60) 7.717(2.59) 7.626(2.61) 7.497(2.51) 7.451(2.54)   median(real time): 7.6s  median(ProgressMeter): 2.6s
    bgzip-compressed/tabix input, uncompressed output:    7.646(2.55) 7.628(2.73) 7.752(2.68) 7.640(2.56) 7.326(2.38)   median(real time): 7.6s  median(ProgressMeter): 2.6s

    The difference between real time and ProgressMeter time is probably due to one-time startup costs of ~5s. Ignoring that, it looks like processing compressed, tabix-indexed data rather than uncompressed, tribble-indexed data is ~13% slower on average (2.6s vs 2.3s), which doesn't seem that bad.
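
The arithmetic behind the figures above can be reproduced with a short sketch. The helper names `median` and `slowdown_pct` are hypothetical; the input numbers are the ProgressMeter times and medians reported in this post.

```shell
# Sketch: median of the numbers passed as arguments.
median() {
  printf '%s\n' "$@" | LC_ALL=C sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR % 2) print v[(NR + 1) / 2]
      else print (v[NR / 2] + v[NR / 2 + 1]) / 2
    }'
}

# Sketch: relative slowdown of `slow` versus `fast`, as a whole percent.
slowdown_pct() {
  awk -v slow="$1" -v fast="$2" \
    'BEGIN { printf "%.0f\n", (slow - fast) / fast * 100 }'
}

median 2.35 2.33 2.33 2.44 2.231   # uncompressed ProgressMeter trials -> 2.33
slowdown_pct 2.6 2.3               # -> 13
```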