GenotypeGVCFs fails to complete in 10 days

I have been following the "HAPLOTYPECALLER GVCF - Exome/Panel + Whole Genome" best practices on a set of 13 whole-genome sequences from a non-model organism (~10x coverage). I have successfully created my per-sample g.vcf files, but when I have attempted to perform the Joint Genotyping step (without merging files) it has failed to complete within the maximum time that I can allot to a job on the computing cluster I am using (10 days). This seems unusual as the step is noted as being fast! I have searched the forum and tried a number of different approaches based on questions which seem to address similar problems, but have so far not had any success. There are no program errors in the output file, it just slows down and then times out.

Notably, when I extracted just the mitochondrial data from my reads and ran the best practices on that, I was able to complete this step and get my list of variants. The file sizes of my g.vcfs range from 23G to 37G for the whole-genome data, though were only between 431K and 731K for the mitochondrial data.

I have been using GATK version 3.6 with java version 1.8. The command that I have been using is as follows:

java -jar ~/Programmes/GenomeAnalysisTK-3.8/GenomeAnalysisTK.jar -T GenotypeGVCFs -R /data/scratch/btw822/KwRp101_Broad_guided_min1000.fa -nt 8 --variant KwRp034.raw.snps.indels.g.vcf --variant KwRp037.raw.snps.indels.g.vcf --variant KwRp040.raw.snps.indels.g.vcf --variant KwRp044.raw.snps.indels.g.vcf --variant KwRp044b.raw.snps.indels.g.vcf --variant KwRp045.raw.snps.indels.g.vcf --variant KwRp050.raw.snps.indels.g.vcf --variant KwRp051.raw.snps.indels.g.vcf --variant KwRp053.raw.snps.indels.g.vcf --variant KwRp055.raw.snps.indels.g.vcf --variant KwRp097.raw.snps.indels.g.vcf --variant KwRp101.raw.snps.indels.g.vcf -o /data/omicsScratch/btw822/Rphil_Trimmed/Rphil_Pilot.vcf

Tagged:

Best Answers

Answers

  • KimWarrenKimWarren LondonMember

    Thank you for your answer.

    My initial runs of the command did not include the -nt argument, but still appeared subject to this same problem. Including that was one of my attempts to work around this issue!

    My genome assembly is not to chromosome level - this isn't currently possible with my study organism. I originally had issues getting HaplotypeCaller to run to completion, which I was able to rectify after finding a comment on the forum about it choking on small contigs as it was not designed to work with draft genomes. As such, I pruned my data set to remove anything below 1000bp, leaving me with 47,912 contigs between 1001 and 1,876,277 base pairs in length. Is it possible that this is still causing a problem at GenotypeGVCFs and, if so, do you have any idea what the size threshold is at which I should prune my data?

    For now, I'll try splitting things down and combining them - thank you again.

  • KimWarrenKimWarren LondonMember

    Thank you again. I did search the forum pretty extensively, but apparently didn't use the correct key words to find this particular answer!

    Do you have any ballpark idea for how many contigs is likely to work, so that I can aim for this when manipulating my data?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Yeah we should really just write a doc article about this since it comes up a lot.

    I would recommend aiming for hundreds -- that should go through alright. Try to balance aggregate sizes so if you parallelize over contigs, the segments have a chance of taking about the same time. In practice that doesn't always work out as expected but it's worth aiming for.

Sign In or Register to comment.