GenotypeGVCFs hanging with -L at some intervals

Hi All,

I am running GenotypeGVCFs [with GATK version 4.1.0.0, java version 1.8.0.111] on a dataset with 300+ samples.
I can run it with -L Chr1 without problems, but it is extremely slow. So I tried to parallelize the run over specific intervals, cutting the intervals at the Ns so that the split would not affect the result. However, some of the intervals never finished, while others did.

The intervals that worked were all at the end of the chromosome, from 166928923 to 192978400.
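For readers who want to try the same kind of split: the thread does not show how the intervals were cut, so the following is only a minimal sketch of one way to do it, not the poster's script. It reads a FASTA on standard input and prints every maximal stretch without N as a 1-based, inclusive contig:start-end interval, the form that -L accepts. The function name non_n_intervals is made up for illustration, and the sketch assumes Unix line endings and treats lowercase n as a gap as well.

```shell
# Print maximal non-N stretches of a FASTA (stdin) as contig:start-end,
# 1-based and inclusive. Hypothetical helper, not part of GATK.
non_n_intervals() {
  awk '
    # Emit the stretch that is currently open, if any, ending at pos.
    function flush_run() {
      if (start > 0) print contig ":" start "-" pos
      start = 0
    }
    /^>/ {
      flush_run()
      contig = substr($1, 2)   # contig name = header text up to first space
      pos = 0
      next
    }
    {
      n = length($0)
      for (i = 1; i <= n; i++) {
        base = substr($0, i, 1)
        if (base == "N" || base == "n") {
          flush_run()          # stretch ended at the previous position
          pos++
        } else {
          pos++
          if (start == 0) start = pos   # first base of a new stretch
        }
      }
    }
    END { flush_run() }
  '
}
```

Something like `non_n_intervals < "$ref" > intervals.txt` would then give one interval per line. Walking the sequence one character at a time in awk is slow on chromosomes of this size, and cutting at every single N can produce many tiny intervals, so in practice one would probably only cut at long N runs.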

I have attached the standard output for a hanging run.

03:09:34.447 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
03:09:36.463 INFO GenotypeGVCFs - ------------------------------------------------------------
03:09:36.463 INFO GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.1.0.0
03:09:36.463 INFO GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
03:09:36.464 INFO GenotypeGVCFs - Executing as [email protected] on Linux v3.10.107-1.el6.elrepo.x86_64 amd64
03:09:36.464 INFO GenotypeGVCFs - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_111-b14
03:09:36.464 INFO GenotypeGVCFs - Start Date/Time: April 27, 2019 3:09:34 AM PDT
03:09:36.464 INFO GenotypeGVCFs - ------------------------------------------------------------
03:09:36.465 INFO GenotypeGVCFs - ------------------------------------------------------------
03:09:36.465 INFO GenotypeGVCFs - HTSJDK Version: 2.18.2
03:09:36.465 INFO GenotypeGVCFs - Picard Version: 2.18.25
03:09:36.465 INFO GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
03:09:36.466 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
03:09:36.466 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
03:09:36.466 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
03:09:36.466 INFO GenotypeGVCFs - Deflater: IntelDeflater
03:09:36.466 INFO GenotypeGVCFs - Inflater: IntelInflater
03:09:36.466 INFO GenotypeGVCFs - GCS max retries/reopens: 20
03:09:36.466 INFO GenotypeGVCFs - Requester pays: disabled
03:09:36.466 INFO GenotypeGVCFs - Initializing engine
WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
03:10:00.529 INFO IntervalArgumentCollection - Processing 115310 bp from intervals
03:10:00.542 INFO GenotypeGVCFs - Done initializing engine
03:10:00.631 INFO ProgressMeter - Starting traversal
03:10:00.632 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records

A sample of the command I run:

interval=$chr:$Istart-$Iend
java -d64 -Xmx8g -XX:ParallelGCThreads=1 -jar $GATK GenotypeGVCFs \
-R $ref \
-V gendb://$idir/${REF}_$i \
-O $mdir/${REF}_$interval.vcf \
--max-genotype-count 2 \
--use-new-qual-calculator \
--verbosity DEBUG \
-L $interval

Do you have any idea why it hangs there? Or is there a better way to run the whole chromosome faster?

Thanks for your time,
Yuan

Answers

  • AdelaideRAdelaideR Member admin

    Hi @suestring

    The warnings are safe to ignore for best-practices workflows. It may just be that the process is running out of memory, so one thing you can try is to increase the Java heap size and the number of parallel garbage-collection threads:

    -Xmx64g -XX:ParallelGCThreads=8

  • suestringsuestring Member

    Hi @AdelaideR ,

    Thank you for the suggestion. It does make the program faster when no -L intervals are given. However, those intervals still hang without producing any output. Any idea why?

    Best,
    Yuan

  • AdelaideRAdelaideR Member admin

    @suestring

    What is your reference file ($ref)? I know that sometimes the phase3 data from our Resource Bundle will throw errors.

    Adelaide

  • suestringsuestring Member

    @AdelaideR

    The reference file is a HiC scaffolded assembly of a rodent... Why does the reference file matter? Do you mean the format or the data?

    Yuan

  • AdelaideRAdelaideR Member admin

    I was wondering if it was the phase3 genome from the human reference bundle, so it does not appear to be that issue.

    However, a Hi-C assembled genome may be very large. How do the chromosome sizes compare?

    It could be hanging because of the extremely large size of your files. The warning messages may also be clogging up standard error; this issue is being tracked here

    Or it is possible that GenotypeGVCFs is struggling with too many multiallelic sites; that error is discussed here

    @suestring

    There is a very good troubleshooting discussion on the forum here

    Try some of these recommendations to see if you can keep your workflow from hanging.

    Good luck.

  • suestringsuestring Member

    Hi @AdelaideR ,

    Thank you for sharing these posts. Sadly, I had read them before posting this, and I have already used --use-new-qual-calculator and lowered --max-alternate-alleles.

    The genome is 2.3 Gb in total. I run the chromosomes in parallel, and each chromosome is ~100 Mb.

    Also, the main problem is that although the whole chromosome runs to completion, running smaller regions of it causes the job to hang. The fact that the whole chromosome runs suggests the environment should be fine for a small region too.

    Yuan

  • suestringsuestring Member

    Also, even when I assigned more GC threads, it only used 1, so the speedup was something of an illusion.

  • suestringsuestring Member

    Hi @init_js ,

    Thank you for the help! It does make sense that too many reader threads could do something unexpected and deadlock the whole thing. But my job with -L set to the whole chromosome never hangs (even with all those small-region GenotypeGVCFs runs hanging alongside it). That doesn't support the theory very well.

    I am also now running one region at a time to see whether that is the issue. Unfortunately, it is still hanging so far (after 30 min), whereas a run on the whole chromosome starts after 1 min. I will come back and report whether it works given more time.

    The command and environment for the single small region and for the whole chromosome are identical, except that "-L Chr1:172966437-174037391" is added to the command for the region.

    java -d64 -Xmx32g -XX:ParallelGCThreads=8 -jar $GATK GenotypeGVCFs \
    -R $ref \
    -V gendb://$idir/${REF}_$i \
    -O try.vcf \
    --max-genotype-count 4 \
    --max-alternate-alleles 3 \
    -L Chr1:172966437-174037391

    Best,
    Yuan
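When testing regions one at a time like this, it can help to put a wall-clock limit on each run so that a hung interval is recorded instead of blocking the rest. Below is a small sketch, assuming GNU coreutils timeout is available (it exits with status 124 when the limit is reached). The function name run_with_limit, the labels and the log paths are made up for illustration.

```shell
# Run a command under a time limit and report OK / TIMEOUT / FAILED.
# Usage: run_with_limit LIMIT LABEL LOGFILE COMMAND [ARGS...]
# Assumes GNU coreutils `timeout`. Hypothetical helper, not part of GATK.
run_with_limit() {
  limit=$1; label=$2; log=$3
  shift 3
  status=0
  # The command's stdout and stderr go to the log file.
  timeout "$limit" "$@" >"$log" 2>&1 || status=$?
  case $status in
    0)   echo "OK $label" ;;
    124) echo "TIMEOUT $label" ;;      # limit reached, command was terminated
    *)   echo "FAILED($status) $label" ;;
  esac
}

# Hypothetical usage, reusing the variables from the command in the post:
# run_with_limit 30m "$interval" "$interval.log" \
#   java -Xmx32g -jar "$GATK" GenotypeGVCFs -R "$ref" \
#   -V "gendb://$idir/${REF}_$i" -O try.vcf -L "$interval"
```

Collecting the TIMEOUT lines over all intervals gives a list of the regions that hang, which is easier to compare against the N gaps or the chromosome coordinates than a set of stalled jobs.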

  • suestringsuestring Member
    Accepted Answer

    Hi,

    Although I still don't understand why adding a specific region made the program hang, I finally worked around the problem and sped things up by parallelizing at the GenomicsDBImport step instead of at GenotypeGVCFs.

    Thank you for the help here.

    Best,
    Yuan
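The thread does not show the final commands, so the following is only one possible reading of this workaround: build a separate GenomicsDB workspace per interval and genotype each workspace on its own. The sketch is a dry run that prints the commands rather than executing them. The function name print_interval_jobs, the workspace naming scheme and all paths are assumptions. The GATK options used (--sample-name-map, --genomicsdb-workspace-path, -L, -R, -V, -O) are standard GenomicsDBImport and GenotypeGVCFs arguments.

```shell
# Dry run: for each interval on stdin, print one GenomicsDBImport command
# and the matching GenotypeGVCFs command. Nothing is executed.
# Usage: print_interval_jobs GATK_JAR REFERENCE SAMPLE_MAP OUTDIR < intervals.txt
# Hypothetical helper; names and paths are placeholders, not from the thread.
print_interval_jobs() {
  gatk_jar=$1; ref=$2; sample_map=$3; outdir=$4
  while IFS= read -r interval; do
    [ -n "$interval" ] || continue                 # skip blank lines
    tag=$(printf '%s' "$interval" | tr ':' '_')    # file-name-friendly label
    echo "java -jar $gatk_jar GenomicsDBImport --sample-name-map $sample_map --genomicsdb-workspace-path $outdir/db_$tag -L $interval"
    echo "java -jar $gatk_jar GenotypeGVCFs -R $ref -V gendb://$outdir/db_$tag -O $outdir/$tag.vcf -L $interval"
  done
}
```

After checking the printed commands, they could be handed to a job scheduler, with each import finishing before its genotyping step starts. The per-interval VCFs would still need to be merged afterwards.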
