Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra

Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

GenotypeGVCFs with draft quality reference genome

lkw222lkw222 United StatesMember

I am using the GATK pipeline to call variants by aligning reads to a draft quality reference genome that is ~367000 scaffolds. I split the scaffolds up into 50 intervals and successfully (and pretty quickly) generated GVCFs for 25 individuals using the -L option. However, I am having the worst of times with GenotypeGVCFs. After running for nearly 2 days on the first interval list, GenotypeGVCFs has not even output a file. Based on another post in the forum, I removed the scaffolds that are NOT in the interval from the GVCF header, and that sped up the process slightly - I have a combined VCF file with just the header generated after about 18 hours. Not sure how much longer the process has as the progress meter doesn't seem to be making any sense.

Is there any known way(s) to optimize this process?

Currently using the following command:
java -Djava.io.tmpdir=/data/lwwvd/genoGVCF.tmp -XX:ParallelGCThreads=4 -Xmx15g -jar /usr/local/bin/GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -nt 16 -T GenotypeGVCFs -R ../ref_genomes/bbu_ref_UMD_CASPUR_WB_2.0.fa -L interval_lists/bbub.refctgs.49.interval_list -V ./1095/1095.49.g.vcf.gz -V ./189/189.49.g.vcf.gz -V ./190/190.49.g.vcf.gz -V ./196/196.49.g.vcf.gz -V ./246/246.49.g.vcf.gz -V ./337/337.49.g.vcf.gz -V ./581/581.49.g.vcf.gz -V ./583/583.49.g.vcf.gz -V ./662/662.49.g.vcf.gz -V ./701/701.49.g.vcf.gz -V ./850/850.49.g.vcf.gz -V ./92764/92764.49.g.vcf.gz -V ./92765/92765.49.g.vcf.gz -V ./92766/92766.49.g.vcf.gz -V ./92767/92767.49.g.vcf.gz -V ./92768/92768.49.g.vcf.gz -V ./92769/92769.49.g.vcf.gz -V ./92770/92770.49.g.vcf.gz -V ./92771/92771.49.g.vcf.gz -V ./92774/92774.49.g.vcf.gz -V ./92775/92775.49.g.vcf.gz -V ./92776/92776.49.g.vcf.gz -V ./92777/92777.49.g.vcf.gz -V ./92778/92778.49.g.vcf.gz -V ./92795/92795.49.g.vcf.gz -o BBUB.combined.49.vcf

Best Answer


  • lkw222lkw222 United StatesMember

    As a test I have also run genotypeGVCFs with only the first 10 scaffolds and just 2 individuals. After 24 hours of running, it has not completed. Unsure why, but in desparate need of a solution!

  • lkw222lkw222 United StatesMember

    Thanks, Geraldine. If that solves the problem that isn't too difficult, except I would have to redo all the previous steps, all the way back to aligning to these new super-scaffolds.

    I wonder, could I just chop my reference file up into the same intervals I used to make the GVCFs, remove the scaffolds that are NOT in the interval from the GVCF header and run GenotypeGVCFs on each interval that way?

  • SheilaSheila Broad InstituteMember, Broadie admin


    It is best to start over with the correct new reference. I'm a little confused about how you generated the GVCFs easily. What were the exact commands you ran?


  • lkw222lkw222 United StatesMember
    edited February 2016

    I attached the command log (from adaptor trimming all the way to GVCFs) for one individual. I also attached the time log (in seconds) for GATK. You will see GATK actually performed reasonably well (although much slower than what we see with the much better cow reference). Most intervals had 100-500 contigs (a couple had many more).

  • SheilaSheila Broad InstituteMember, Broadie admin



    It looks like you attached the time log twice and did not include the commands.


  • lkw222lkw222 United StatesMember

    Sorry, here's the log!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Actually it could be a memory limitation compounding the problem. GenotypeGVCFs is notoriously greedy for memory and 15G might not be nearly enough.

    Chopping up your reference as you describe might help. We're very leery of doctoring sequence dictionaries in BAM files but VCFs are more forgiving. At the very least it's worth a try -- but I do think you'll need to increase the memory you give to GenotypeGVCFs. You can also try decreasing the multithreading; sometimes it causes more problems than it solves.

    One final question -- is this a non-diploid organism?

  • lkw222lkw222 United StatesMember

    Thanks, I'll give it a shot. It is diploid though, with about a 3 Gbp genome.

  • lkw222lkw222 United StatesMember

    FYI - increasing the memory to 150G (may have been overkill) and using only one thread sped it up a lot! Did the first ten contigs in about 2 hours (previously this ran for almost a day without finishing). It is still probably be more efficient to stitch the contigs together with N's and run the pipeline that way, but adjusting those two parameters did have a significant impact.

  • SheilaSheila Broad InstituteMember, Broadie admin

    @lkw222 Thanks for reporting back!

Sign In or Register to comment.