GenotypeGVCFs: Long runtime exclusively with a single sample

I have been having trouble with long runtimes in several GATK utilities, but until now it was manageable.
I was able to produce a g.vcf file (I used HaplotypeCaller instead of UnifiedGenotyper, following a suggestion made in a separate thread).

Now I have two g.vcf files for two different samples. For one of them I could get a VCF file using GenotypeGVCFs within 45 minutes or so.
However, with the other sample I am getting **an estimated runtime of about 40 weeks.**
The samples are from Aedes aegypti and Aedes albopictus (the latter is the one giving trouble).

The walker starts walking instantly with the Aedes aegypti sample and gives me the VCF without any errors. However, with the Aedes albopictus sample the walker itself only starts after an hour or so.

The command used is:

java -jar GenomeAnalysisTK-3.7-0-gcfedb6 -T GenotypeGVCFs -nt 12 -R ref-ab/GCA_001444175.2_A.albopictus_v1.1_genomic.fasta --variant output-AB.raw.snps.indels.g.vcf -o genotyped-ab.vcf

It should be noted that this exact command worked for the other sample (with only the relevant file names changed).

The log is as follows:
INFO 19:56:34,300 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 19:56:34,301 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
INFO 22:49:04,685 ProgressMeter - KQ560100.1:879201 0.0 2.9 h 15250.3 w 0.0% 37.4 w 37.4 w
INFO 22:50:04,687 ProgressMeter - KQ560100.1:879201 0.0 2.9 h 15250.3 w 0.0% 37.7 w 37.6 w
INFO 22:51:04,689 ProgressMeter - KQ560100.1:879201 0.0 2.9 h 15250.3 w 0.0% 37.9 w 37.9 w
INFO 22:52:04,690 ProgressMeter - KQ560100.1:879201 0.0 2.9 h 15250.3 w 0.0% 38.1 w 38.1 w
INFO 22:53:04,694 ProgressMeter - KQ560100.1:879201 0.0 2.9 h 15250.3 w 0.0% 38.3 w 38.3 w
(The estimated runtime is increasing instead of decreasing.)
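
One way to confirm that the walker is stuck, rather than merely slow, is to check whether the reported location and processed-site count ever change between ProgressMeter lines. The following is a minimal sketch in Python. The regular expression is an assumption inferred only from the log lines pasted above, not from any documented GATK log format, and the function names are hypothetical.

```python
import re

# Pattern inferred from the pasted log lines only (an assumption, not a
# documented format): timestamp, then "contig:position", then processed sites.
PROGRESS = re.compile(
    r"INFO\s+(\d{2}:\d{2}:\d{2}),\d+\s+ProgressMeter\s+-\s+(\S+:\d+)\s+(\S+)"
)

SAMPLE_LOG = """\
INFO 19:56:34,300 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 19:56:34,301 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
INFO 22:49:04,685 ProgressMeter - KQ560100.1:879201 0.0 2.9 h 15250.3 w 0.0% 37.4 w 37.4 w
INFO 22:50:04,687 ProgressMeter - KQ560100.1:879201 0.0 2.9 h 15250.3 w 0.0% 37.7 w 37.6 w
INFO 22:51:04,689 ProgressMeter - KQ560100.1:879201 0.0 2.9 h 15250.3 w 0.0% 37.9 w 37.9 w
"""


def parse_progress(lines):
    """Return (time, location, processed_sites) for each progress line.

    The two header lines do not match the pattern and are skipped.
    """
    rows = []
    for line in lines:
        match = PROGRESS.search(line)
        if match:
            rows.append((match.group(1), match.group(2), match.group(3)))
    return rows


def is_stalled(rows):
    """True if there are at least two updates and none of them moved."""
    distinct = {(location, sites) for _, location, sites in rows}
    return len(rows) >= 2 and len(distinct) == 1


if __name__ == "__main__":
    rows = parse_progress(SAMPLE_LOG.splitlines())
    print(len(rows), is_stalled(rows))  # prints "3 True"
```

In the log above every update reports the same position (KQ560100.1:879201) with 0.0 sites processed, so the growing estimate is consistent with elapsed time increasing while no progress is made.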

IMPORTANT NOTES:

1) The genome sizes are 1.9 G for A. albopictus and 1.4 G for A. aegypti.

2) I cannot blame it on a lack of resources. I have around 48 usable threads at the moment and enough RAM. I have also tried using different numbers of threads; it's not making any difference.

3) I have tried re-running the A. aegypti sample in parallel (to rule out that it was faster only because of some transient condition at that time), and it reproduces its behaviour, i.e. it finishes in 45 minutes or so. But the A. albopictus sample still shows the same problem.

Answers

  • shlee, Cambridge (Member, Broadie, Moderator)

    Hi @shubhra,

    It sounds like there is something peculiar about your A. albopictus sample. Generating its GVCF was not an issue, but you say genotyping with GenotypeGVCFs gives an extremely high estimated runtime.

    Here are some suggestions towards troubleshooting:

    • Compare the number of records in your A. aegypti and A. albopictus GVCFs. Are they within a similar range?
    • View the A. albopictus sample data on IGV to ensure it was aligned to the correct reference and the extent of the variation is within a normal range. Remember you can also view GVCFs in IGV. What you will see is mostly reference blocks.
    • Assuming a particular region is causing trouble, bisect the genome with -L to narrow down the section responsible. For example, if you have 8 contigs in the reference, run two GenotypeGVCFs commands, one with -L 1 -L 2 -L 3 -L 4 and the other with -L 5 -L 6 -L 7 -L 8. See which of these shows the long runtime, then divide that half again, and so on.
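
The first and third suggestions can be sketched in Python as below. This is only an illustration under stated assumptions: the helper names are hypothetical, the record counter simply treats every non-header line of a (G)VCF as one record, and the contig names 1 to 8 are the placeholder names from the example above, not the real scaffold names of either assembly.

```python
from collections import Counter


def count_records_per_contig(lines):
    """Count non-header records per contig from an iterable of (G)VCF lines.

    Header lines start with '#'. The contig name is the first
    tab-separated column of each record.
    """
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        counts[line.split("\t", 1)[0]] += 1
    return counts


def split_in_halves(contigs):
    """Split a list of contig names into two halves for bisection."""
    mid = (len(contigs) + 1) // 2
    return contigs[:mid], contigs[mid:]


def interval_args(contigs):
    """Build the repeated '-L <contig>' arguments for one half."""
    args = []
    for contig in contigs:
        args += ["-L", contig]
    return args


if __name__ == "__main__":
    # Placeholder contig names taken from the 8-contig example above.
    contigs = [str(i) for i in range(1, 9)]
    first, second = split_in_halves(contigs)
    print(" ".join(interval_args(first)))   # -L 1 -L 2 -L 3 -L 4
    print(" ".join(interval_args(second)))  # -L 5 -L 6 -L 7 -L 8
```

To compare the two GVCFs, the counter can be fed an open file, for example `count_records_per_contig(open("output-AB.raw.snps.indels.g.vcf"))`, and the totals compared between samples. Once one half shows the long runtime, calling `split_in_halves` on that half again continues the bisection.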

    I hope this is helpful.

  • Thanks @shlee
    I will try out your suggestions.
    Let's see what can make this work :)
