GenotypeGVCFs with draft quality reference genome

lkw222lkw222 United StatesMember

I am using the GATK pipeline to call variants by aligning reads to a draft quality reference genome that is ~367000 scaffolds. I split the scaffolds up into 50 intervals and successfully (and pretty quickly) generated GVCFs for 25 individuals using the -L option. However, I am having the worst of times with GenotypeGVCFs. After running for nearly 2 days on the first interval list, GenotypeGVCFs has not even output a file. Based on another post in the forum, I removed the scaffolds that are NOT in the interval from the GVCF header, and that sped up the process slightly - I have a combined VCF file with just the header generated after about 18 hours. Not sure how much longer the process has as the progress meter doesn't seem to be making any sense.

Is there any known way(s) to optimize this process?

Currently using the following command:
java -Djava.io.tmpdir=/data/lwwvd/genoGVCF.tmp -XX:ParallelGCThreads=4 -Xmx15g -jar /usr/local/bin/GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -nt 16 -T GenotypeGVCFs -R ../ref_genomes/bbu_ref_UMD_CASPUR_WB_2.0.fa -L interval_lists/bbub.refctgs.49.interval_list -V ./1095/1095.49.g.vcf.gz -V ./189/189.49.g.vcf.gz -V ./190/190.49.g.vcf.gz -V ./196/196.49.g.vcf.gz -V ./246/246.49.g.vcf.gz -V ./337/337.49.g.vcf.gz -V ./581/581.49.g.vcf.gz -V ./583/583.49.g.vcf.gz -V ./662/662.49.g.vcf.gz -V ./701/701.49.g.vcf.gz -V ./850/850.49.g.vcf.gz -V ./92764/92764.49.g.vcf.gz -V ./92765/92765.49.g.vcf.gz -V ./92766/92766.49.g.vcf.gz -V ./92767/92767.49.g.vcf.gz -V ./92768/92768.49.g.vcf.gz -V ./92769/92769.49.g.vcf.gz -V ./92770/92770.49.g.vcf.gz -V ./92771/92771.49.g.vcf.gz -V ./92774/92774.49.g.vcf.gz -V ./92775/92775.49.g.vcf.gz -V ./92776/92776.49.g.vcf.gz -V ./92777/92777.49.g.vcf.gz -V ./92778/92778.49.g.vcf.gz -V ./92795/92795.49.g.vcf.gz -o BBUB.combined.49.vcf

Best Answer

Answers

  • lkw222lkw222 United StatesMember

    As a test I have also run genotypeGVCFs with only the first 10 scaffolds and just 2 individuals. After 24 hours of running, it has not completed. Unsure why, but in desparate need of a solution!

  • lkw222lkw222 United StatesMember

    Thanks, Geraldine. If that solves the problem that isn't too difficult, except I would have to redo all the previous steps, all the way back to aligning to these new super-scaffolds.

    I wonder, could I just chop my reference file up into the same intervals I used to make the GVCFs, remove the scaffolds that are NOT in the interval from the GVCF header and run GenotypeGVCFs on each interval that way?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @lkw222
    Hi,

    It is best to start over with the correct new reference. I'm a little confused about how you generated the GVCFs easily. What were the exact commands you ran?

    Thanks,
    Sheila

  • lkw222lkw222 United StatesMember
    edited February 2016

    I attached the command log (from adaptor trimming all the way to GVCFs) for one individual. I also attached the time log (in seconds) for GATK. You will see GATK actually performed reasonably well (although much slower than what we see with the much better cow reference). Most intervals had 100-500 contigs (a couple had many more).

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @lkw222

    Hi,

    It looks like you attached the time log twice and did not include the commands.

    -Sheila

  • lkw222lkw222 United StatesMember

    Sorry, here's the log!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Actually it could be a memory limitation compounding the problem. GenotypeGVCFs is notoriously greedy for memory and 15G might not be nearly enough.

    Chopping up your reference as you describe might help. We're very leery of doctoring sequence dictionaries in BAM files but VCFs are more forgiving. At the very least it's worth a try -- but I do think you'll need to increase the memory you give to GenotypeGVCFs. You can also try decreasing the multithreading; sometimes it causes more problems than it solves.

    One final question -- is this a non-diploid organism?

  • lkw222lkw222 United StatesMember

    Thanks, I'll give it a shot. It is diploid though, with about a 3 Gbp genome.

  • lkw222lkw222 United StatesMember

    FYI - increasing the memory to 150G (may have been overkill) and using only one thread sped it up a lot! Did the first ten contigs in about 2 hours (previously this ran for almost a day without finishing). It is still probably be more efficient to stitch the contigs together with N's and run the pipeline that way, but adjusting those two parameters did have a significant impact.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @lkw222 Thanks for reporting back!

Sign In or Register to comment.