Performance troubleshooting tips for GenotypeGVCFs

WimSWimS Member
edited February 20 in Ask the GATK team

Hi,

I am running a GATK4 variant calling analysis on few hundred Solanum lycopersicum samples via bcbio (species is diploid, 1gigaBase reference) . Everything went fine up to and including importing to the Intel GenomicsDB (GenomicsDBImport) for chromosome region splits. The haplotype caller step just gave warnings about no AVX support.

The GenotypeGVCFs progress however is very slow. Even for very small genome chunks GenotypeGVCFs is not proceeding much let alone finishing. The progress log shows either (almost) no updates or a progress of something like 3 variants per minute. In total I expect to have 100M+ variants.

I am currently running on older hardware without any AVX support. So I am planning to try to run on modern hardware with AVX support to see if that makes a difference.

In the mean while I am wondering if there are other things that I could try to get the analysis to complete.

I noticed that the GenotypeGVCFs was running single threaded and using ca 60+ GB of memory. The -nt multi threading option from GATK3.X does not exist anymore. Is there another option for activating multi threading? And is the 60GB memory normal for few hundred samples?

I tried lowering --max-alternate-alleles from the default of 6 to 4. This did not seem to help.

Do you expect AVX support to make a large difference in performance of GenotypeGVCFs?

In the past I successfully analyzed this data set using Freebayes single batch variant calling. So I don't think this is a data issue.

Is there anything else I could try to fix the performance of GenotypeGVCFs?

Thank you.

P.S.

Here is some log info

[2018-02-18T18:34Z] machine9:     java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Xms500m -Xmx37265m -XX:+UseSerialGC -Djava.io.tmpdir=/data/run/Projects/DA_1080/DA_1080_bam_tables/work/bcbiotx/tmp7DOD_G -jar /data/prod/Tools/bcbio/1.0.8/anaconda/share/gatk4-4.0.1.1-0/gatk-package-4.0.1.1-local.jar GenotypeGVCFs --variant gendb:///data/run/Projects/DA_1080/DA_1080_bam_tables/work/joint/gatk-haplotype-joint/DA_1080/Chr_00/DA_1080-Chr_00_2336001_2741048_genomicsdb -R /data/prod/Tools/bcbio/1.0.8/genomes/reference/reference/seq/reference.fa --output /data/run/Projects/DA_1080/DA_1080_bam_tables/work/bcbiotx/tmp_bQJhv/DA_1080-Chr_00_2336001_2741048.vcf.gz -L Chr_00:2336002-2741048
[2018-02-18T18:34Z] machine9: 19:34:49.876 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/data/prod/Tools/bcbio/1.0.8/anaconda/share/gatk4-4.0.1.1-0/gatk-package-4.0.1.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
[2018-02-18T18:34Z] machine9: 19:34:50.385 INFO  GenotypeGVCFs - ------------------------------------------------------------
[2018-02-18T18:34Z] machine9: 19:34:50.385 INFO  GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.0.1.1
[2018-02-18T18:34Z] machine9: 19:34:50.385 INFO  GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
[2018-02-18T18:34Z] machine9: 19:34:50.386 INFO  GenotypeGVCFs - Executing as [email protected] on Linux v2.6.32-642.4.2.el6.x86_64 amd64
[2018-02-18T18:34Z] machine9: 19:34:50.386 INFO  GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_121-b15
[2018-02-18T18:34Z] machine9: 19:34:50.386 INFO  GenotypeGVCFs - Start Date/Time: February 18, 2018 7:34:49 PM CET
[2018-02-18T18:34Z] machine9: 19:34:50.386 INFO  GenotypeGVCFs - ------------------------------------------------------------
[2018-02-18T18:34Z] machine9: 19:34:50.386 INFO  GenotypeGVCFs - ------------------------------------------------------------
[2018-02-18T18:34Z] machine9: 19:34:50.388 INFO  GenotypeGVCFs - HTSJDK Version: 2.14.1
[2018-02-18T18:34Z] machine9: 19:34:50.388 INFO  GenotypeGVCFs - Picard Version: 2.17.2
[2018-02-18T18:34Z] machine9: 19:34:50.388 INFO  GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 1
[2018-02-18T18:34Z] machine9: 19:34:50.388 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
[2018-02-18T18:34Z] machine9: 19:34:50.388 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
[2018-02-18T18:34Z] machine9: 19:34:50.388 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
[2018-02-18T18:34Z] machine9: 19:34:50.389 INFO  GenotypeGVCFs - Deflater: IntelDeflater
[2018-02-18T18:34Z] machine9: 19:34:50.389 INFO  GenotypeGVCFs - Inflater: IntelInflater
[2018-02-18T18:34Z] machine9: 19:34:50.389 INFO  GenotypeGVCFs - GCS max retries/reopens: 20
[2018-02-18T18:34Z] machine9: 19:34:50.389 INFO  GenotypeGVCFs - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
[2018-02-18T18:34Z] machine9: 19:34:50.389 INFO  GenotypeGVCFs - Initializing engine
[2018-02-18T18:34Z] machine9: WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
[2018-02-18T18:34Z] machine9: WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
[2018-02-18T18:34Z] machine9: WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
[2018-02-18T18:34Z] machine9: WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
[2018-02-18T18:34Z] machine9: WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
[2018-02-18T18:34Z] machine9: WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
[2018-02-18T18:34Z] machine9: WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
[2018-02-18T18:34Z] machine9: WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
[2018-02-18T18:34Z] machine9: 19:34:51.245 INFO  IntervalArgumentCollection - Processing 405047 bp from intervals
[2018-02-18T18:34Z] machine9: 19:34:51.251 INFO  GenotypeGVCFs - Done initializing engine
[2018-02-18T18:34Z] machine9: 19:34:51.858 INFO  ProgressMeter - Starting traversal
[2018-02-18T18:34Z] machine9: 19:34:51.858 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
[2018-02-18T18:34Z] machine9: WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
[2018-02-18T18:34Z] machine9: WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
[2018-02-18T18:34Z] machine9: WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
[2018-02-18T18:34Z] machine9: WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
[2018-02-18T18:35Z] machine9: 19:35:33.941 INFO  ProgressMeter -       Chr_00:2337001              0.7                  1000           1425.8
[2018-02-18T19:12Z] machine9: 20:12:55.889 INFO  ProgressMeter -       Chr_00:2338001             38.1                  2000             52.5

Here is some typical progress info

[2018-02-18T10:52Z] machine20: 11:52:25.520 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
[2018-02-18T10:52Z] machine20: WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
[2018-02-18T10:52Z] machine20: WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
[2018-02-18T10:52Z] machine20: WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
[2018-02-18T10:52Z] machine20: WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
[2018-02-18T13:37Z] machine16: 14:37:09.060 INFO  ProgressMeter -        Chr_00:879729           1400.7                  1000              0.7
[2018-02-18T13:37Z] machine16: 14:37:54.815 INFO  ProgressMeter -        Chr_00:880729           1401.4                  2000              1.4
[2018-02-18T13:51Z] machine16: 14:51:38.628 INFO  ProgressMeter -        Chr_00:881729           1415.1                  3000              2.1
[2018-02-18T13:53Z] machine16: 14:53:04.278 INFO  ProgressMeter -        Chr_00:884120           1416.6                  5000              3.5
[2018-02-18T13:54Z] machine16: 14:54:12.151 INFO  ProgressMeter -        Chr_00:886120           1417.7                  7000              4.9

Issue · Github
by Sheila

Issue Number
4512
State
open
Last Updated
Assignee
Array
Milestone
Array

Answers

  • WimSWimS Member

    Hi @Sheila . I can confirm that I have the same issue (very slow GenotypeGVCFs progress ) on modern hardware.

    This is the (first) cpu_info of the hardware on which I am now running. It supports AVX and AVX2.

    processor       : 0
    vendor_id       : GenuineIntel
    cpu family      : 6
    model           : 79
    model name      : Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
    stepping        : 1
    microcode       : 0xb000021
    cpu MHz         : 2099.507
    cache size      : 46080 KB
    physical id     : 0
    siblings        : 36
    core id         : 0
    cpu cores       : 18
    apicid          : 0
    initial apicid  : 0
    fpu             : yes
    fpu_exception   : yes
    cpuid level     : 20
    wp              : yes
    flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local
    bogomips        : 4200.38
    clflush size    : 64
    cache_alignment : 64
    address sizes   : 46 bits physical, 48 bits virtual
    power management:
    

    I tried to run on a full chromosome and on a smaller public subset of the data.

    Running GenomicsDBImport and GenotypeGVCFs on a full chromosome instead of a small chromosome region does not make a difference. Still (almost) no progress.

    Running GenotypeGVCFs on a smaller subset of just 84 public domain samples shows at least some progress. Some progress output is being produces stating a few 100 variants to be processed per minute. Often no progress log statements are output for longer time.

    I am assuming that this is not the speed at which GenotypeGVCFs is supposed to run for just few hundred diploid samples.

    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Xms500m -Xmx200265m -XX:+UseSerialGC -Djava.io.tmpdir=/data/run/Projects/DA_1080/work/test_my_user_combine_and_genotype_gvcfs/tmp -jar /data/prod/Tools/bcbio/1.0.8/anaconda/share/gatk4-4.0.1.1-0/gatk-package-4.0.1.1-local.jar GenotypeGVCFs --variant gendb:///data/run/Projects/DA_1080/work/test_my_user_combine_and_genotype_gvcfs/DA_1080-Chr_00_1721154_2011618_public_genomicsdb -R /data/prod/Tools/bcbio/1.0.8/genomes/my_reference/my_reference/seq/my_reference.fa --output /data/run/Projects/DA_1080/work/test_my_user_combine_and_genotype_gvcfs/DA_1080-Chr_00_1721154_2011618_public.vcf.gz -L Chr_00:1721155-2011618
    13:44:14.027 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/data/prod/Tools/bcbio/1.0.8/anaconda/share/gatk4-4.0.1.1-0/gatk-package-4.0.1.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
    13:44:14.128 INFO  GenotypeGVCFs - ------------------------------------------------------------
    13:44:14.128 INFO  GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.0.1.1
    13:44:14.128 INFO  GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
    13:44:14.128 INFO  GenotypeGVCFs - Executing as [email protected]_modern_hardware_machine on Linux v3.10.0-514.16.1.el7.x86_64 amd64
    13:44:14.128 INFO  GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_121-b15
    13:44:14.128 INFO  GenotypeGVCFs - Start Date/Time: February 22, 2018 1:44:14 PM CET
    13:44:14.129 INFO  GenotypeGVCFs - ------------------------------------------------------------
    13:44:14.129 INFO  GenotypeGVCFs - ------------------------------------------------------------
    13:44:14.129 INFO  GenotypeGVCFs - HTSJDK Version: 2.14.1
    13:44:14.129 INFO  GenotypeGVCFs - Picard Version: 2.17.2
    13:44:14.129 INFO  GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 1
    13:44:14.129 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    13:44:14.129 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    13:44:14.129 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    13:44:14.129 INFO  GenotypeGVCFs - Deflater: IntelDeflater
    13:44:14.129 INFO  GenotypeGVCFs - Inflater: IntelInflater
    13:44:14.129 INFO  GenotypeGVCFs - GCS max retries/reopens: 20
    13:44:14.129 INFO  GenotypeGVCFs - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
    13:44:14.129 INFO  GenotypeGVCFs - Initializing engine
    WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
    WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
    WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
    WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
    WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
    WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
    WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
    WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
    13:44:14.810 INFO  IntervalArgumentCollection - Processing 290464 bp from intervals
    13:44:14.817 INFO  GenotypeGVCFs - Done initializing engine
    13:44:15.380 INFO  ProgressMeter - Starting traversal
    13:44:15.380 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
    WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
    WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
    WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
    WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
    14:01:59.807 INFO  ProgressMeter -       Chr_00:1722253             17.7                  1000             56.4
    14:02:43.518 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    14:02:43.518 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    14:02:43.519 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    14:02:43.520 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    14:02:43.521 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    14:02:43.560 INFO  ProgressMeter -       Chr_00:1729818             18.5                  7000            379.0
    14:03:07.758 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    14:03:07.758 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    14:03:14.505 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    14:03:14.667 INFO  ProgressMeter -       Chr_00:1730850             19.0                  8000            421.3
    14:04:08.845 INFO  ProgressMeter -       Chr_00:1731859             19.9                  9000            452.5
    14:33:08.225 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    15:19:52.531 INFO  ProgressMeter -       Chr_00:1735031             95.6                 12000            125.5
    15:19:52.567 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    15:19:52.569 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    15:19:52.569 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    15:19:52.570 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    15:19:56.556 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    15:19:56.558 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    15:19:56.558 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    15:19:56.560 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    15:19:56.567 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    15:19:56.568 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    15:19:56.569 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    15:19:56.571 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    15:19:56.572 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    15:19:56.572 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    15:19:56.578 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    15:19:56.579 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    15:19:56.580 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    15:19:56.580 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    15:19:56.593 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    15:19:56.594 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    15:19:56.594 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    15:19:56.595 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    15:19:56.596 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    15:19:56.596 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    15:19:56.598 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    15:19:56.598 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    15:19:56.599 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    15:19:56.602 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    15:19:56.603 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    15:19:58.321 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
    15:25:03.854 INFO  ProgressMeter -       Chr_00:1748110            100.8                 21000            208.3
    15:25:31.852 INFO  ProgressMeter -       Chr_00:1752212            101.3                 25000            246.9
    

    @Sheila Can I send you the GVCF files for the 84 public samples and the reference genome so you can have a look if this issue reproduces at your side? I could send the full GVCF files or subset for a single chromosome or even the small genome region from the log above.

    Thank you.

  • WimSWimS Member
    edited February 26

    Hi @Sheila , @Geraldine_VdAuwera

    I figured out that by lowering --max-alternate-alleles for GenotypeGVCFs I can get the analysis to proceed. Modern or old hardware does not seem to make a big difference.

    See these benchmarking results for running GATK4.0.1.1 GenotypeGVCFs on 84 public Solanum lycopersicum samples.

    --max-alternate-allele average variants per minute max memory
    6 200 80G
    3 4K 2G
    2 40K 1.5G

    This still seems like really low performance to me. Is this the performance that you would expect for just 84 samples?

    I uploaded a zip file name WimS_GenotypeGVCFs_Solanum_lycopersicum_to_broad.zip to your FTP.
    Following the guidelines from here: https://gatkforums.broadinstitute.org/gatk/discussion/1894/how-do-i-submit-a-detailed-bug-report

    This zip file has a self contained example:

    • The reference genome (fasta,fai and dict)
    • 84 GVCF files for 4 small regions
    • 4 shell scripts to run GenomicsDBImport and GenotypeGVCFs

      • (--max-alternate-allele = default) for each region.
      • deletes the output directory on (re)start
    • The GATK jar that I am using

    You should just be able to unzip the file and run one of the shell script to reproduce these low performance numbers (200 variants per minute). I just tested this.

    In total I have a few hundred GVCF files for this species that I would like to merge with GenotypeGVCFs.
    For this larger set GenotypeGVCFs (--max-alternate-allele 2) processes c.a. 2K variants per minute (instead of 40K variants per minute when processing just the 84 samples).
    So just adding a few hundred samples drops the performance again c.a. 20X.

    Can you please have a look why GenotypeGVCFs is running so slow for just 84 Solanum lycopersicum samples? I hoped GATK4 GenotypeGVCFs would scale to at least a few thousand samples for all the species that we work with.

    Thank you very much.

    Post edited by WimS on
  • SheilaSheila Broad InstituteMember, Broadie, Moderator
    edited February 26

    @WimS
    Hi,

    What is your sample ploidy? We have seen issues with high ploidy and large number of alternate alleles. EDIT: I see it is diploid. I will have a look at your bug report.

    -Sheila

  • WimSWimS Member

    @Sheila Thank you for having a look. The species is diploid yes, and the reference is c.a. 1 gigabase. The SNP frequency and genetic diversity is higher than in human.

    I did manage to jointly variant call the same set of samples in the past with Freebayes, without limiting the number of alternative alleles to consider.

    These are the bcftools stats for the VCF file produced with Freebayes. It shows the high SNP frequency, but the number of indels and multi-allelics is not extremely high (c.a. 10%, not that different from human I think).

    number of records:      144.747.986
    number of no-ALTs:      0
    number of SNPs: 134.631.496
    number of MNPs: 0
    number of indels:       12.624.854
    number of others:       32.599
    number of multiallelic sites:   16.096.631
    number of multiallelic SNP sites:       12.708.004
    
  • WimSWimS Member
    edited February 26

    The above variant calling stats are from the joint Freebayes variant calling of the few hundred samples. Variant calling of just the 84 public samples should result in c.a. 78M short variants according to ensembl on the same reference as included in the zip file http://plants.ensembl.org/Solanum_lycopersicum/Info/Annotation/

    The 84 public samples (fastq+bam files) are also available online: https://www.ebi.ac.uk/ena/data/view/PRJEB5235

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @WimS
    Hi,

    I just notified the developers of this issue. You can keep track of it here.

    -Sheila

    Issue · Github
    by Sheila

    Issue Number
    2985
    State
    closed
    Last Updated
    Assignee
    Array
    Closed By
    chandrans
  • WimSWimS Member

    Hi @Sheila The -new-qual option indeed improves performance of GenotypeGVCFs a lot. :)

    With that option GenotypeGVCFs now process a few thousand variants (c.a. 4K) variants per minute for the full set of a few hundred samples. This is even with --max-alternate-alleles at the default of 6.

    The performance effect of the -new-qual option is not clear from the GenotypeGVCFs tool documentation. This should maybe be added to the documentation of GenotypeGVCFs tool page. Also I am wondering what (if any) the downsides are of this option, since it is not the default option?

    Still I am curious if there are any other performance bottlenecks and improvements that you can identify for this kind of data. For example more dynamic up front determination of which alleles to genotype? (instead of all alleles that have any support up to the --max-alternate-alleles number )

    Thank you very much for testing and creating the developer ticket! I can now create a nice square genotype tabel for my end users.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @WimS
    Hi,

    The newQual model was introduced in GATK3. It should become the default in GATK4 soon (I think the team is finishing up some testing). Once that becomes the default, I won't need to put in a document fix :smiley:

    As for other flags you can add to speed this up, I am not sure. Let's wait until the team has had a chance to look at your data, and see what they say. I hope you will have some good news soon.

    -Sheila

  • Do you know why this warning emssage occurs?
    WARN InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Rosmaninho
    Hi,

    It means the annotation cannot be calculated unless you input at least ten samples. Or, it could be related to this post.

    -Sheila

Sign In or Register to comment.