Running GenotypeGVCFs on ~4000 human exomes: stuck on "ProgressMeter - Starting"

Hello,

I am running GenotypeGVCFs on ~4000 human exome samples. To speed up the process, I split exome.interval_list into sub-interval lists, each covering ~100 kb of regions, and then submitted one GenotypeGVCFs job per sub-interval list in parallel, e.g.

java -Xmx32g -jar /GATK/3.6/jar-bin/GenomeAnalysisTK.jar -T GenotypeGVCFs -nt 2 -L /home/jjduan/scatter_interval_list/interval_list.sub000000.interval_list -D /home/jjduan/ref_b37/dbsnp_138.b37.vcf -R /home/jjduan/ref_b37/human_g1k_v37.fasta --variant /home/jjduan/mergedGVCF/chr_19_mergedGVCF.list -o /home/jjduan/genotypedVCF/chr_19_sub000000.vcf

java -Xmx32g -jar /GATK/3.6/jar-bin/GenomeAnalysisTK.jar -T GenotypeGVCFs -nt 2 -L /home/jjduan/scatter_interval_list/interval_list.sub000001.interval_list -D /home/jjduan/ref_b37/dbsnp_138.b37.vcf -R /home/jjduan/ref_b37/human_g1k_v37.fasta --variant /home/jjduan/mergedGVCF/chr_19_mergedGVCF.list -o /home/jjduan/genotypedVCF/chr_19_sub000001.vcf

...
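
For completeness, here is roughly how I split and submit the jobs (the qsub invocation and memory request are specific to our cluster, so treat this as a sketch):

    # Submit one GenotypeGVCFs job per sub-interval list
    for ivl in /home/jjduan/scatter_interval_list/interval_list.sub*.interval_list; do
        tag=$(basename "$ivl" .interval_list)   # e.g. interval_list.sub000000
        tag=${tag#interval_list.}               # -> sub000000
        echo "java -Xmx32g -jar /GATK/3.6/jar-bin/GenomeAnalysisTK.jar -T GenotypeGVCFs -nt 2 \
            -L $ivl -D /home/jjduan/ref_b37/dbsnp_138.b37.vcf -R /home/jjduan/ref_b37/human_g1k_v37.fasta \
            --variant /home/jjduan/mergedGVCF/chr_19_mergedGVCF.list \
            -o /home/jjduan/genotypedVCF/chr_19_${tag}.vcf" | qsub -l mem=34gb,nodes=1:ppn=2
    done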

However, the jobs have kept printing "ProgressMeter - Starting" for hours without outputting any variants.

INFO  00:09:31,580 ProgressMeter -                 | processed |    time |    per 1M |           |   total | remaining
INFO  00:09:31,581 ProgressMeter -        Location |     sites | elapsed |     sites | completed | runtime |   runtime
WARN  00:09:32,292 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias.
WARN  00:09:32,295 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bias.
INFO  00:09:32,295 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant files.
INFO  00:10:01,605 ProgressMeter -        Starting         0.0    30.0 s      49.6 w      100.0%    30.0 s       0.0 s
INFO  00:10:31,606 ProgressMeter -        Starting         0.0    60.0 s      99.2 w      100.0%    60.0 s       0.0 s
INFO  00:11:01,608 ProgressMeter -        Starting         0.0    90.0 s     148.9 w      100.0%    90.0 s       0.0 s
INFO  00:11:31,611 ProgressMeter -        Starting         0.0   120.0 s     198.5 w      100.0%   120.0 s       0.0 s
INFO  00:12:01,613 ProgressMeter -        Starting         0.0     2.5 m     248.1 w      100.0%     2.5 m       0.0 s

I have read this thread and noticed that this can happen with reference genomes that have millions of contigs. But my data are human, with far fewer contigs, so I don't think it's the same case.

I know WDL/Cromwell supports the scatter/gather method to speed things up. However, as I understand it, the principle of scatter/gather is the same as what I did here, so even with WDL the parallelized jobs would face the same stall. Is that right?

Is there anything else I can do to get this to run at all (or faster), or should I just wait?

Thanks a lot for any input!

Best,
Jinjie


Answers

  • dbecker (Munich) Member

    I think you might have to use CombineGVCFs before you start genotyping (see the sketch below the quote).

    "This is an optional step that applies only to the DNA workflow. It consists of merging GVCF files hierarchically in order to reduce the number of files that must be handled simultaneously in the next step. This is only necessary if you are working with more than a few hundred samples. Note that this is NOT equivalent to the joint genotyping step; variants in the resulting merged GVCF cannot be considered to have been called jointly."

    https://software.broadinstitute.org/gatk/best-practices/bp_3step.php?case=GermShortWGS&p=2
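
    A minimal sketch of what that hierarchical merge can look like (GATK 3 syntax; the batch size of 200 and all file names are placeholders to adapt):

        # One list of ~200 GVCF paths per batch
        split -l 200 all_gvcfs.list batch_
        i=0
        for b in batch_*; do
            args=$(sed 's/^/--variant /' "$b" | tr '\n' ' ')
            java -Xmx32g -jar GenomeAnalysisTK.jar -T CombineGVCFs \
                -R human_g1k_v37.fasta $args \
                -o combined_$((i++)).g.vcf
        done
        # The combined_*.g.vcf files are then passed together to GenotypeGVCFs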

  • JJ_Duan (Denmark) Member

    @dbecker Oh, I forgot to say that I already performed the CombineGVCFs step (200 samples per batch) before GenotypeGVCFs, so that may not be the reason. But many thanks for your comment.

  • JJ_Duan (Denmark) Member

    @Geraldine_VdAuwera said:
    @JJ_Duan The solution to this is to use GenotypeGVCFs in GATK4 (currently in beta, but these tools are pretty mature). You will need to prepare the GVCFs for input using a tool called GenomicsDBImport, as described here: https://software.broadinstitute.org/gatk/documentation/article?id=10061

    Thanks, @Geraldine_VdAuwera. I have followed the instructions. At first, I tried to import the whole of chromosome 9 into a database with 4000 samples. That step had been running for 4 days using 850 GB of memory and appeared to be stuck, so I terminated it before it completed. Was it expected to consume so much time and memory?

    So I changed strategy: I imported only ~30 kb chunk regions and then ran joint genotyping. The import seemed to work (the tool returned true), but the joint-genotyping step failed with the error "VariantQueryProcessorException : Could not open array genomicsdb_array at workspace". I did not move the database after creation at all, as you cautioned in the instructions, so why does it report that it could not open the array? Could you please shed some light on this? Thanks.

    The commands I used:

    cd /test_gatk4
    
    gatk-launch --javaOptions '-Xmx1000G' GenomicsDBImport --genomicsDBWorkspace chr9-110062321-110094103DB --intervals 9:110062321-110094103 --sampleNameMap chr9.bgzipped.map
    
    gatk-launch GenotypeGVCFs -O /test_gatk4/chr_9_sub00436.vcf -R /ref_b37/human_g1k_v37.fasta -V gendb:///test_gatk4/chr9-110062321-110094103DB/ --dbsnp /ref_b37/dbsnp_138.b37.vcf -L /chr_9/chr_9_sub00436.interval_list
    

    The log from the import step:

    Using GATK jar /01software/gatk-4.beta.2/gatk-package-4.beta.2-local.jar
    Running:
        java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true -Xmx1000G -jar /01software/gatk-4.beta.2/gatk-package-4.beta.2-local.jar GenomicsDBImport --genomicsDBWorkspace chr9-110062321-110094103DB --intervals 9:110062321-110094103 --sampleNameMap chr9.bgzipped.map
    Picked up _JAVA_OPTIONS: -XX:ParallelGCThreads=1
    23:29:58.179 WARN  IntelGKLUtils - Error starting process to check for AVX support : grep -i avx /proc/cpuinfo
    23:29:58.191 WARN  IntelGKLUtils - Error starting process to check for AVX support : grep -i avx /proc/cpuinfo
    [September 10, 2017 11:29:58 PM CEST] GenomicsDBImport  --genomicsDBWorkspace chr9-110062321-110094103DB --sampleNameMap chr9.bgzipped.map --intervals 9:110062321-110094103  --genomicsDBSegmentSize 1048576 --genomicsDBVCFBufferSize 16384 --overwriteExistingGenomicsDBWorkspace false --batchSize 0 --consolidate false --validateSampleNameMap false --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --readValidationStringency SILENT --secondsBetweenProgressUpdates 10.0 --disableSequenceDictionaryValidation false --createOutputBamIndex true --createOutputBamMD5 false --createOutputVariantIndex true --createOutputVariantMD5 false --lenient false --addOutputSAMProgramRecord true --addOutputVCFCommandLine true --cloudPrefetchBuffer 0 --cloudIndexPrefetchBuffer 0 --disableBamIndexCaching false --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false --disableToolDefaultReadFilters false
    [September 10, 2017 11:29:58 PM CEST] Executing as [email protected] on Linux 2.6.32-642.1.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_20-b26; Version: 4.beta.2
    23:29:58.195 INFO  GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 1
    23:29:58.195 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    23:29:58.195 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    23:29:58.195 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    23:29:58.196 INFO  GenomicsDBImport - Deflater: JdkDeflater
    23:29:58.196 INFO  GenomicsDBImport - Inflater: JdkInflater
    23:29:58.196 INFO  GenomicsDBImport - Initializing engine
    23:29:58.335 WARN  IntelDeflaterFactory - IntelInflater is not supported, using Java.util.zip.Inflater
    23:29:58.336 WARN  IntelDeflaterFactory - IntelInflater is not supported, using Java.util.zip.Inflater
    23:29:58.376 WARN  IntelDeflaterFactory - IntelInflater is not supported, using Java.util.zip.Inflater
    23:29:59.199 INFO  IntervalArgumentCollection - Processing 31783 bp from intervals
    23:29:59.206 INFO  GenomicsDBImport - Done initializing engine
    23:29:59.207 INFO  GenomicsDBImport - Vid Map JSON file will be written to chr9-110062321-110094103DB/vidmap.json
    23:29:59.207 INFO  GenomicsDBImport - Callset Map JSON file will be written to chr9-110062321-110094103DB/callset.json
    23:29:59.207 INFO  GenomicsDBImport - Importing to array - chr9-110062321-110094103DB/genomicsdb_array
    23:29:59.561 INFO  ProgressMeter - Starting traversal
    23:29:59.562 INFO  ProgressMeter -        Current Locus  Elapsed Minutes     Batches Processed   Batches/Minute
    23:29:59.575 WARN  IntelDeflaterFactory - IntelInflater is not supported, using Java.util.zip.Inflater
    23:29:59.576 WARN  IntelDeflaterFactory - IntelInflater is not supported, using Java.util.zip.Inflater
    23:29:59.592 WARN  IntelDeflaterFactory - IntelInflater is not supported, using Java.util.zip.Inflater
    ......
    ......
    ......
    23:37:02.678 WARN  IntelDeflaterFactory - IntelInflater is not supported, using Java.util.zip.Inflater
    23:37:02.679 WARN  IntelDeflaterFactory - IntelInflater is not supported, using Java.util.zip.Inflater
    23:37:02.696 WARN  IntelDeflaterFactory - IntelInflater is not supported, using Java.util.zip.Inflater
    23:37:02.716 INFO  GenomicsDBImport - Importing batch 1 with 4377 samples
    01:20:36.462 INFO  ProgressMeter -          9:110062321            110.6                     1              0.0
    01:20:36.462 INFO  GenomicsDBImport - Done importing batch 1/1
    01:20:36.463 INFO  ProgressMeter -          9:110062321            110.6                     1              0.0
    01:20:36.463 INFO  ProgressMeter - Traversal complete. Processed 1 total batches in 110.6 minutes.
    01:20:36.463 INFO  GenomicsDBImport - Import completed!
    01:20:47.542 INFO  GenomicsDBImport - Shutting down engine
    [September 11, 2017 1:20:47 AM CEST] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 110.82 minutes.
    Runtime.totalMemory()=353828864000
    Tool returned:
    true
    

    The log from the joint-genotyping step:

    11:21:25.240 INFO  GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 1
    11:21:25.240 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    11:21:25.240 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    11:21:25.240 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    11:21:25.240 INFO  GenotypeGVCFs - Deflater: IntelDeflater
    11:21:25.240 INFO  GenotypeGVCFs - Inflater: IntelInflater
    11:21:25.240 INFO  GenotypeGVCFs - Initializing engine
    11:21:29.335 INFO  FeatureManager - Using codec VCFCodec to read file file:///ref_b37/dbsnp_138.b37.vcf
    terminate called after throwing an instance of 'VariantQueryProcessorException'
      what():  VariantQueryProcessorException : Could not open array genomicsdb_array at workspace: /test_gatk4/chr9-110062321-110094103DB
    
  • JJ_Duan (Denmark) Member

    @Geraldine_VdAuwera said:
    @JJ_Duan There was a bug in the beta.2 you are using that has since been fixed; please try again with the latest release (beta.5).

    Thanks, @Geraldine_VdAuwera. It worked with beta.5. However, importing a 30 kb chunk region for 4000 samples took 2 hours and 350 GB of memory. Is it expected to be so memory-consuming? To speed up the whole procedure, we would have to submit thousands of import jobs in parallel, but we have only 3 fat nodes for such memory-hungry jobs, which means it could take more than a month to import all exome regions. Does (or will) GenomicsDBImport support Spark? Do you have any practical suggestions for the import step with thousands of samples? Many thanks in advance.

    Jinjie

    Issue · GitHub — filed by Sheila: #2499 (closed by vdauwera)
  • Sheila (Broad Institute) Member, Broadie, Moderator

    @JJ_Duan
    Hi Jinjie,

    I will have Geraldine get back to you soon.

    -Sheila

  • JJ_Duan (Denmark) Member

    @Sheila said:
    @JJ_Duan
    Hi Jinjie,

    I will have Geraldine get back to you soon.

    -Sheila

    Thank you, @Sheila.

    I have run several tests importing chunk regions of different lengths (30 kb, 100 kb, 300 kb, 500 kb, 1 Mb, 5 Mb, 20 Mb), each followed by GenotypeGVCFs. 30 kb always works and produces correct VCF output. The other lengths sometimes succeed and sometimes fail with exactly the same error as before. Given that a failed job may succeed if I resubmit it again and again, I am wondering whether this is a GATK bug or an issue with our cluster. Here is an example of a failure:

    Using GATK jar /gatk-4.beta.5/gatk-package-4.beta.5-local.jar
    Running:
        java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true -jar /gatk-4.beta.5/gatk-package-4.beta.5-local.jar GenotypeGVCFs -O /test_gatk4/chr9-200kb.vcf -R /ref_b37/human_g1k_v37.fasta -V gendb:///test_gatk4/chr9-200kb --dbsnp /ref_b37/dbsnp_138.b37.vcf -L 9:110062371-110262371
    Picked up _JAVA_OPTIONS: -XX:ParallelGCThreads=1
    12:00:55.467 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk-4.beta.5/gatk-package-4.beta.5-local.jar!/com/intel/gkl/native/libgkl_compression.so
    [September 15, 2017 12:00:55 PM CEST] GenotypeGVCFs  --output /test_gatk4/chr9-200kb.vcf --dbsnp /ref_b37/dbsnp_138.b37.vcf --variant gendb:///test_gatk4/chr9-200kb --intervals 9:110062371-110262371 --reference /ref_b37/human_g1k_v37.fasta  --useNewAFCalculator false --annotateNDA false --heterozygosity 0.001 --indel_heterozygosity 1.25E-4 --heterozygosity_stdev 0.01 --standard_min_confidence_threshold_for_calling 10.0 --max_alternate_alleles 6 --max_genotype_count 1024 --sample_ploidy 2 --group StandardAnnotation --onlyOutputCallsStartingInIntervals false --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --interval_merging_rule ALL --readValidationStringency SILENT --secondsBetweenProgressUpdates 10.0 --disableSequenceDictionaryValidation false --createOutputBamIndex true --createOutputBamMD5 false --createOutputVariantIndex true --createOutputVariantMD5 false --lenient false --addOutputSAMProgramRecord true --addOutputVCFCommandLine true --cloudPrefetchBuffer 40 --cloudIndexPrefetchBuffer -1 --disableBamIndexCaching false --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false --gcs_max_retries 20 --disableToolDefaultReadFilters false
    [September 15, 2017 12:00:55 PM CEST] Executing as [email protected] on Linux 2.6.32-642.1.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_20-b26; Version: 4.beta.5
    12:00:55.687 INFO  GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 1
    12:00:55.687 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    12:00:55.687 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    12:00:55.687 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    12:00:55.687 INFO  GenotypeGVCFs - Deflater: IntelDeflater
    12:00:55.687 INFO  GenotypeGVCFs - Inflater: IntelInflater
    12:00:55.687 INFO  GenotypeGVCFs - GCS max retries/reopens: 20
    12:00:55.688 INFO  GenotypeGVCFs - Using google-cloud-java patch c035098b5e62cb4fe9155eff07ce88449a361f5d from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
    12:00:55.688 INFO  GenotypeGVCFs - Initializing engine
    12:00:59.863 INFO  FeatureManager - Using codec VCFCodec to read file file:///ref_b37/dbsnp_138.b37.vcf
    terminate called after throwing an instance of 'VariantQueryProcessorException'
      what():  VariantQueryProcessorException : Could not open array genomicsdb_array at workspace: /test_gatk4/chr9-200kb
    
  • Sheila (Broad Institute) Member, Broadie, Moderator
    Accepted Answer

    @JJ_Duan
    Hi Jinjie,

    I have heard back from the developers, and here is what they say:

    The existing tool does scale to tens of thousands of samples successfully, but you need to use smaller intervals (not a whole chromosome at once), and use the --batchSize argument to control how many samples are processed at once when there are many samples.

    From the tool documentation: "--batchSize controls the number of samples for which readers are open at once, and therefore provides a way to minimize memory consumption. However, it can take longer to complete. Use the consolidate flag if more than a hundred batches were used; this will improve feature read time. batchSize=0 means no batching (i.e., readers for all samples will be opened at once)." The team recommends using a value of 50.

    As for the intervals, they recommend roughly one interval per sample. So, for 4000 samples, you would split the 23 human chromosomes into about 4000 intervals.
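
    For example, adapting the import command you posted earlier (batchSize 50 is the developers' recommendation; the -Xmx value is only illustrative, since batching should bring memory down substantially):

        gatk-launch --javaOptions '-Xmx32g' GenomicsDBImport \
            --genomicsDBWorkspace chr9-110062321-110094103DB \
            --intervals 9:110062321-110094103 \
            --sampleNameMap chr9.bgzipped.map \
            --batchSize 50
        # add "--consolidate true" if a run ends up using more than ~100 batches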

    I hope this helps.

    -Sheila

  • Sheila (Broad Institute) Member, Broadie, Moderator

    @JJ_Duan
    Hi again Jinjie,

    The team has discovered a bug in GenomicsDBImport that affects sample names. If you are using --batchSize, the sample names may be swapped, and therefore incorrect genotypes may be assigned to them. The team is working on a fix ASAP.

    You do not need to fret if you have already run the workflow, however. Existing callsets will be repairable just by modifying the VCF header. The team will be releasing a script or tool to perform the repair :smiley:
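
    In the meantime, the repair amounts to rewriting the sample columns in the header, roughly along these lines, with bcftools reheader as a stand-in (this is not the official fix, and corrected_sample_names.txt is a hypothetical file listing the samples in their true order):

        # Hypothetical sketch: rewrite only the VCF header so the sample
        # names are back in the correct order (not the team's official tool)
        bcftools reheader --samples corrected_sample_names.txt \
            -o repaired.vcf.gz broken.vcf.gz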

    I will update you here when a fix is ready.

    -Sheila

  • JJ_Duan (Denmark) Member

    @Sheila said:
    @JJ_Duan
    Hi again Jinjie,

    The team has discovered a bug in GenomicsDBImport that affects sample names. If you are using --batchSize, the sample names may be swapped, and therefore incorrect genotypes may be assigned to them. The team is working on a fix ASAP.

    You do not need to fret if you have already run the workflow, however. Existing callsets will be repairable just by modifying the VCF header. The team will be releasing a script or tool to perform the repair :smiley:

    I will update you here when a fix is ready.

    -Sheila

    @Sheila Thanks a lot for informing me. :smile:

    Issue · GitHub — filed by Sheila: #2585 (closed by chandrans)
  • Sheila (Broad Institute) Member, Broadie, Moderator

    @JJ_Duan
    Hi Jinjie,

    GATK4 beta 6 is out now with a fix for this issue. You can read more about the fix here :smile:

    -Sheila
