Performance troubleshooting tips for GenotypeGVCFs

Hi,
I am running a GATK4 variant calling analysis on a few hundred Solanum lycopersicum samples via bcbio (the species is diploid, with a 1 gigabase reference). Everything went fine up to and including importing into the Intel GenomicsDB (GenomicsDBImport) for chromosome region splits. The HaplotypeCaller step only gave warnings about missing AVX support.
The GenotypeGVCFs progress, however, is very slow. Even for very small genome chunks GenotypeGVCFs barely proceeds, let alone finishes. The progress log shows either (almost) no updates or a rate of something like 3 variants per minute. In total I expect to have 100M+ variants.
I am currently running on older hardware without any AVX support, so I am planning to try running on modern hardware with AVX support to see if that makes a difference.
In the meantime, I am wondering whether there are other things I could try to get the analysis to complete.
I noticed that GenotypeGVCFs was running single-threaded and using ca. 60+ GB of memory. The -nt multithreading option from GATK 3.x no longer exists. Is there another option for activating multithreading? And is 60 GB of memory normal for a few hundred samples?
I tried lowering --max-alternate-alleles from the default of 6 to 4. This did not seem to help.
Do you expect AVX support to make a large difference in the performance of GenotypeGVCFs?
In the past I successfully analyzed this data set using Freebayes single batch variant calling. So I don't think this is a data issue.
Is there anything else I could try to fix the performance of GenotypeGVCFs?
Thank you.
P.S.
Here is some log info
[2018-02-18T18:34Z] machine9: java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Xms500m -Xmx37265m -XX:+UseSerialGC -Djava.io.tmpdir=/data/run/Projects/DA_1080/DA_1080_bam_tables/work/bcbiotx/tmp7DOD_G -jar /data/prod/Tools/bcbio/1.0.8/anaconda/share/gatk4-4.0.1.1-0/gatk-package-4.0.1.1-local.jar GenotypeGVCFs --variant gendb:///data/run/Projects/DA_1080/DA_1080_bam_tables/work/joint/gatk-haplotype-joint/DA_1080/Chr_00/DA_1080-Chr_00_2336001_2741048_genomicsdb -R /data/prod/Tools/bcbio/1.0.8/genomes/reference/reference/seq/reference.fa --output /data/run/Projects/DA_1080/DA_1080_bam_tables/work/bcbiotx/tmp_bQJhv/DA_1080-Chr_00_2336001_2741048.vcf.gz -L Chr_00:2336002-2741048
[2018-02-18T18:34Z] machine9: 19:34:49.876 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/data/prod/Tools/bcbio/1.0.8/anaconda/share/gatk4-4.0.1.1-0/gatk-package-4.0.1.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
[2018-02-18T18:34Z] machine9: 19:34:50.385 INFO GenotypeGVCFs - ------------------------------------------------------------
[2018-02-18T18:34Z] machine9: 19:34:50.385 INFO GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.0.1.1
[2018-02-18T18:34Z] machine9: 19:34:50.385 INFO GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
[2018-02-18T18:34Z] machine9: 19:34:50.386 INFO GenotypeGVCFs - Executing as [email protected] on Linux v2.6.32-642.4.2.el6.x86_64 amd64
[2018-02-18T18:34Z] machine9: 19:34:50.386 INFO GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_121-b15
[2018-02-18T18:34Z] machine9: 19:34:50.386 INFO GenotypeGVCFs - Start Date/Time: February 18, 2018 7:34:49 PM CET
[2018-02-18T18:34Z] machine9: 19:34:50.386 INFO GenotypeGVCFs - ------------------------------------------------------------
[2018-02-18T18:34Z] machine9: 19:34:50.386 INFO GenotypeGVCFs - ------------------------------------------------------------
[2018-02-18T18:34Z] machine9: 19:34:50.388 INFO GenotypeGVCFs - HTSJDK Version: 2.14.1
[2018-02-18T18:34Z] machine9: 19:34:50.388 INFO GenotypeGVCFs - Picard Version: 2.17.2
[2018-02-18T18:34Z] machine9: 19:34:50.388 INFO GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 1
[2018-02-18T18:34Z] machine9: 19:34:50.388 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
[2018-02-18T18:34Z] machine9: 19:34:50.388 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
[2018-02-18T18:34Z] machine9: 19:34:50.388 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
[2018-02-18T18:34Z] machine9: 19:34:50.389 INFO GenotypeGVCFs - Deflater: IntelDeflater
[2018-02-18T18:34Z] machine9: 19:34:50.389 INFO GenotypeGVCFs - Inflater: IntelInflater
[2018-02-18T18:34Z] machine9: 19:34:50.389 INFO GenotypeGVCFs - GCS max retries/reopens: 20
[2018-02-18T18:34Z] machine9: 19:34:50.389 INFO GenotypeGVCFs - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
[2018-02-18T18:34Z] machine9: 19:34:50.389 INFO GenotypeGVCFs - Initializing engine
[2018-02-18T18:34Z] machine9: WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
[2018-02-18T18:34Z] machine9: WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
[2018-02-18T18:34Z] machine9: WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
[2018-02-18T18:34Z] machine9: WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
[2018-02-18T18:34Z] machine9: WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
[2018-02-18T18:34Z] machine9: WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
[2018-02-18T18:34Z] machine9: WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
[2018-02-18T18:34Z] machine9: WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
[2018-02-18T18:34Z] machine9: 19:34:51.245 INFO IntervalArgumentCollection - Processing 405047 bp from intervals
[2018-02-18T18:34Z] machine9: 19:34:51.251 INFO GenotypeGVCFs - Done initializing engine
[2018-02-18T18:34Z] machine9: 19:34:51.858 INFO ProgressMeter - Starting traversal
[2018-02-18T18:34Z] machine9: 19:34:51.858 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
[2018-02-18T18:34Z] machine9: WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
[2018-02-18T18:34Z] machine9: WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
[2018-02-18T18:34Z] machine9: WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
[2018-02-18T18:34Z] machine9: WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
[2018-02-18T18:35Z] machine9: 19:35:33.941 INFO ProgressMeter - Chr_00:2337001 0.7 1000 1425.8
[2018-02-18T19:12Z] machine9: 20:12:55.889 INFO ProgressMeter - Chr_00:2338001 38.1 2000 52.5
Here is some typical progress info
[2018-02-18T10:52Z] machine20: 11:52:25.520 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
[2018-02-18T10:52Z] machine20: WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
[2018-02-18T10:52Z] machine20: WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
[2018-02-18T10:52Z] machine20: WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
[2018-02-18T10:52Z] machine20: WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
[2018-02-18T13:37Z] machine16: 14:37:09.060 INFO ProgressMeter - Chr_00:879729 1400.7 1000 0.7
[2018-02-18T13:37Z] machine16: 14:37:54.815 INFO ProgressMeter - Chr_00:880729 1401.4 2000 1.4
[2018-02-18T13:51Z] machine16: 14:51:38.628 INFO ProgressMeter - Chr_00:881729 1415.1 3000 2.1
[2018-02-18T13:53Z] machine16: 14:53:04.278 INFO ProgressMeter - Chr_00:884120 1416.6 5000 3.5
[2018-02-18T13:54Z] machine16: 14:54:12.151 INFO ProgressMeter - Chr_00:886120 1417.7 7000 4.9
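For anyone who wants to quantify stalls like the ones in these logs, here is a minimal Python sketch that pulls the ProgressMeter rows out of a log and computes the rate between consecutive rows, rather than relying on the cumulative Variants/Minute column. It is not part of GATK or bcbio. The regular expression is an assumption based only on the ProgressMeter lines quoted in this thread, and the function names are made up for illustration.

```python
import re

# Assumed layout, taken from the ProgressMeter lines quoted above:
# "INFO ProgressMeter - <contig>:<pos> <elapsed min> <variants> <variants/min>"
PROGRESS_RE = re.compile(
    r"INFO\s+ProgressMeter\s+-\s+(\S+):(\d+)\s+([\d.]+)\s+(\d+)\s+([\d.]+)"
)


def parse_progress(log_text):
    """Return one dict per ProgressMeter data row found in log_text."""
    rows = []
    for match in PROGRESS_RE.finditer(log_text):
        contig, pos, minutes, variants, rate = match.groups()
        rows.append({
            "contig": contig,
            "pos": int(pos),
            "elapsed_min": float(minutes),
            "variants": int(variants),
            "reported_rate": float(rate),
        })
    return rows


def interval_rates(rows):
    """Variants per minute between each pair of consecutive rows."""
    rates = []
    for prev, cur in zip(rows, rows[1:]):
        dt = cur["elapsed_min"] - prev["elapsed_min"]
        if dt > 0:  # skip rows logged within the same 0.1-minute tick
            rates.append((cur["variants"] - prev["variants"]) / dt)
    return rates


# The two data rows from the machine9 log above.
sample = (
    "[2018-02-18T18:35Z] machine9: 19:35:33.941 INFO ProgressMeter - "
    "Chr_00:2337001 0.7 1000 1425.8\n"
    "[2018-02-18T19:12Z] machine9: 20:12:55.889 INFO ProgressMeter - "
    "Chr_00:2338001 38.1 2000 52.5\n"
)
rows = parse_progress(sample)
print([round(r, 1) for r in interval_rates(rows)])  # [26.7]
```

The header row ("Current Locus Elapsed Minutes ...") and the "Starting traversal" line do not match the pattern, so only data rows are returned. The same functions work on a log that has been flattened onto a single line.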
Best Answer
WimS
Hi @Sheila. The -new-qual option indeed improves the performance of GenotypeGVCFs a lot. With that option GenotypeGVCFs now processes a few thousand (ca. 4K) variants per minute for the full set of a few hundred samples. This is even with --max-alternate-alleles at the default of 6.
The performance effect of the -new-qual option is not clear from the GenotypeGVCFs tool documentation. Perhaps it should be added to the GenotypeGVCFs tool page. I am also wondering what the downsides of this option are (if any), since it is not the default.
Still, I am curious whether there are any other performance bottlenecks and improvements that you can identify for this kind of data. For example, a more dynamic up-front determination of which alleles to genotype (instead of all alleles that have any support, up to the --max-alternate-alleles number)?
Thank you very much for testing and creating the developer ticket! I can now create a nice square genotype table for my end users.
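For readers who want to try the same thing, here is a minimal Python sketch that assembles a GenotypeGVCFs command line like the one in the log above, with -new-qual added. Only the flags that appear in this thread are used. The function name, its parameters, and all paths are placeholders invented for illustration. The thread concerns GATK 4.0.1.1, so check the tool documentation for your own version before relying on -new-qual.

```python
import shlex


def genotype_gvcfs_command(gatk_jar, genomicsdb_workspace, reference,
                           output_vcf, interval, use_new_qual=True,
                           max_alternate_alleles=None, java_heap="37265m"):
    """Build (but do not run) a GenotypeGVCFs argument list.

    genomicsdb_workspace is expected to be an absolute path, so the
    resulting URI looks like gendb:///path/to/workspace as in the log.
    """
    cmd = [
        "java", f"-Xmx{java_heap}",
        "-jar", gatk_jar,
        "GenotypeGVCFs",
        "--variant", f"gendb://{genomicsdb_workspace}",
        "-R", reference,
        "--output", output_vcf,
        "-L", interval,
    ]
    if use_new_qual:
        # Flag discussed in this thread for GATK 4.0.x.
        cmd.append("-new-qual")
    if max_alternate_alleles is not None:
        cmd += ["--max-alternate-alleles", str(max_alternate_alleles)]
    return cmd


# Placeholder paths; substitute your own jar, workspace and reference.
cmd = genotype_gvcfs_command(
    gatk_jar="/path/to/gatk-package-4.0.1.1-local.jar",
    genomicsdb_workspace="/path/to/Chr_00_genomicsdb",
    reference="/path/to/reference.fa",
    output_vcf="/path/to/Chr_00.vcf.gz",
    interval="Chr_00:2336002-2741048",
)
print(shlex.join(cmd))
```

Returning a list instead of a single string keeps each argument separate, so the result can be passed directly to subprocess.run without shell quoting problems.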
Answers
Hi @Sheila. I can confirm that I have the same issue (very slow GenotypeGVCFs progress) on modern hardware.
This is the (first) cpu_info of the hardware on which I am now running. It supports AVX and AVX2.
I tried to run on a full chromosome and on a smaller public subset of the data.
Running GenomicsDBImport and GenotypeGVCFs on a full chromosome instead of a small chromosome region does not make a difference. Still (almost) no progress.
Running GenotypeGVCFs on a smaller subset of just 84 public domain samples shows at least some progress. Some progress output is produced, stating that a few hundred variants are processed per minute. Often no progress log statements are output for a long time.
I am assuming that this is not the speed at which GenotypeGVCFs is supposed to run for just a few hundred diploid samples.
@Sheila Can I send you the GVCF files for the 84 public samples and the reference genome so you can see whether this issue reproduces on your side? I could send the full GVCF files, a subset for a single chromosome, or even the small genome region from the log above.
Thank you.
Hi @Sheila, @Geraldine_VdAuwera
I figured out that by lowering --max-alternate-alleles for GenotypeGVCFs I can get the analysis to proceed. Modern or old hardware does not seem to make a big difference.
See these benchmarking results for running GATK 4.0.1.1 GenotypeGVCFs on 84 public Solanum lycopersicum samples.
This still seems like really low performance to me. Is this the performance that you would expect for just 84 samples?
I uploaded a zip file named WimS_GenotypeGVCFs_Solanum_lycopersicum_to_broad.zip to your FTP, following the guidelines from here: https://gatkforums.broadinstitute.org/gatk/discussion/1894/how-do-i-submit-a-detailed-bug-report
This zip file has a self-contained example:
4 shell scripts to run GenomicsDBImport and GenotypeGVCFs
The GATK jar that I am using
You should just be able to unzip the file and run one of the shell scripts to reproduce these low performance numbers (200 variants per minute). I just tested this.
In total I have a few hundred GVCF files for this species that I would like to merge with GenotypeGVCFs.
For this larger set GenotypeGVCFs (--max-alternate-alleles 2) processes ca. 2K variants per minute (instead of 40K variants per minute when processing just the 84 samples). So just adding a few hundred samples drops the performance again by ca. 20X.
Can you please have a look at why GenotypeGVCFs is running so slowly for just 84 Solanum lycopersicum samples? I hoped GATK4 GenotypeGVCFs would scale to at least a few thousand samples for all the species that we work with.
Thank you very much.
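To put those rates in perspective, here is a small back-of-the-envelope Python sketch. It assumes a constant processing rate and uses the 100M+ total variant estimate from the original question. The function name and the parallel_jobs parameter are made up for illustration; the 100-job figure is purely hypothetical.

```python
def wall_time_days(total_variants, variants_per_minute, parallel_jobs=1):
    """Estimated wall time in days, assuming a constant per-job rate
    and work that splits evenly across independent region jobs."""
    if variants_per_minute <= 0 or parallel_jobs < 1:
        raise ValueError("rate must be positive and parallel_jobs >= 1")
    minutes = total_variants / (variants_per_minute * parallel_jobs)
    return minutes / (60 * 24)


# Rates quoted above, both with --max-alternate-alleles 2.
rate_84_samples = 40_000   # variants per minute, 84 samples
rate_full_set = 2_000      # variants per minute, a few hundred samples

print(rate_84_samples / rate_full_set)                       # 20.0
print(round(wall_time_days(100_000_000, rate_full_set), 1))  # 34.7

# Hypothetical: the same work scattered over 100 region jobs.
print(round(wall_time_days(100_000_000, rate_full_set, parallel_jobs=100), 2))  # 0.35
```

At 2K variants per minute a single process would need roughly five weeks for 100M variants, which is why the per-region scatter and the per-variant rate both matter here.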
@WimS
Hi,
What is your sample ploidy? We have seen issues with high ploidy and large number of alternate alleles. EDIT: I see it is diploid. I will have a look at your bug report.
-Sheila
@Sheila Thank you for having a look. The species is diploid, yes, and the reference is ca. 1 gigabase. The SNP frequency and genetic diversity are higher than in human.
I did manage to jointly variant call the same set of samples in the past with Freebayes, without limiting the number of alternative alleles to consider.
These are the bcftools stats for the VCF file produced with Freebayes. They show the high SNP frequency, but the number of indels and multi-allelics is not extremely high (ca. 10%, not that different from human, I think).
The above variant calling stats are from the joint Freebayes variant calling of the few hundred samples. Variant calling of just the 84 public samples should result in ca. 78M short variants according to Ensembl, on the same reference as included in the zip file: http://plants.ensembl.org/Solanum_lycopersicum/Info/Annotation/
The 84 public samples (fastq+bam files) are also available online: https://www.ebi.ac.uk/ena/data/view/PRJEB5235
@WimS
Hi,
I just notified the developers of this issue. You can keep track of it here.
-Sheila
@WimS
Hi,
The newQual model was introduced in GATK3. It should become the default in GATK4 soon (I think the team is finishing up some testing). Once that becomes the default, I won't need to put in a document fix.
As for other flags you can add to speed this up, I am not sure. Let's wait until the team has had a chance to look at your data, and see what they say. I hope you will have some good news soon.
-Sheila
Do you know why this warning message occurs?
WARN InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
@Rosmaninho
Hi,
It means the annotation cannot be calculated unless you input at least ten samples. Or, it could be related to this post.
-Sheila