GenotypeGVCFs on X chromosome stuck in the process progression

Maguelonne (Paris, Member)

Hi,

I am processing 1000 WGS with gatk-4.0.11.0.

I ran HaplotypeCaller by sample and by chromosome and I took into account the sex status of my samples for the calling of the X chromosome (ploidy=1 for male, ploidy=2 for female).
This step executed successfully.

I then ran CombineGVCFs by chromosome, and this step also executed successfully.

Finally, I ran GenotypeGVCFs by chromosome. It worked for the autosomes, but I have an issue with the X chromosome.
GenotypeGVCFs starts but gets stuck at the first "ProgressMeter" line:

"12:35:56.093 INFO ProgressMeter - chrX:60999 29.8 1000 33.5"

As GenotypeGVCFs worked for all the autosomes, my interpretation is that handling the differences in ploidy might take much more memory and time.
Could that be the cause? If so, how can I handle it? If not, could you help me find and solve the problem?

Thank you,

Maguelonne

Answers

  • bshifaw (Member, Broadie, Moderator, admin)

    @Maguelonne

    What commands did you use for your tools? Please also post the stack trace containing the error from GenotypeGVCFs. And what compute resources do the machine(s) running this workflow have?

  • Maguelonne (Paris, Member)
    edited July 15

    @bshifaw

    I am using an HPC cluster with 120G nodes.

    Again, GenotypeGVCFs worked for all the chromosomes except for X chromosome.
    For the X chromosome, I am not having an "error": the progression of GenotypeGVCFs is stuck at the first "ProgressMeter".

    "12:35:56.093 INFO ProgressMeter - chrX:60999 29.8 1000 33.5"

    For the X chromosome, here are my commands, using 2 samples as examples:

    1) HaplotypeCaller:

    gatk-4.0.2.1/gatk --java-options "-Xmx10g" HaplotypeCaller \
    -I Sample1.bam \
    -R hs37d5_all_chr.fasta \
    --emit-ref-confidence GVCF \
    -mbq 20 \
    -L chrX \
    --dbsnp dbsnp_138.b37.vcf \
    --pcr-indel-model NONE \
    --sample-ploidy 2 \
    -G StandardAnnotation \
    --output res_haplotypecaller_Sample1_chrX.g.vcf.gz \
    --bam-output Sample1_chrX_bamout.bam
    
    gatk-4.0.2.1/gatk --java-options "-Xmx10g" HaplotypeCaller \
    -I Sample2.bam \
    -R hs37d5_all_chr.fasta \
    --emit-ref-confidence GVCF \
    -mbq 20 \
    -L chrX \
    --dbsnp dbsnp_138.b37.vcf \
    --pcr-indel-model NONE \
    --sample-ploidy 1 \
    -G StandardAnnotation \
    --output res_haplotypecaller_Sample2_chrX.g.vcf.gz \
    --bam-output Sample2_chrX_bamout.bam
    

    2) CombineGVCFs:

    gatk-4.0.11.0/gatk --java-options "-Xmx100g" CombineGVCFs \
    -R hs37d5_all_chr.fasta \
    --variant res_haplotypecaller_Sample1_chrX.g.vcf.gz \
    --variant res_haplotypecaller_Sample2_chrX.g.vcf.gz \
    -L chrX \
    --dbsnp dbsnp_138.b37.vcf \
    -G StandardAnnotation \
    --output combine_chrX.g.vcf.gz
    

    3) GenotypeGVCFs:

    gatk-4.0.11.0/gatk --java-options "-Xmx120g" GenotypeGVCFs \
    -R hs37d5_all_chr.fasta \
    -L chrX \
    --dbsnp dbsnp_138.b37.vcf \
    -G StandardAnnotation \
    --variant combine_chrX.g.vcf.gz \
    -O genotype_chrX.vcf.gz
    

    Furthermore, for this last command, I tried adding "-Djava.io.tmpdir=TMPdir -XX:ParallelGCThreads=4" to the Java options, but the problem remains.

  • bshifaw (Member, Broadie, Moderator, admin)

    @Maguelonne,

    You may need to use smaller intervals for your analysis. Here is an answer to a similar post about GenotypeGVCFs being stuck at ProgressMeter (link). Instead of increasing the memory, the solution there was to break down the intervals on which the tool conducted its processing.
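
    As a rough illustration of the interval-splitting idea, the sketch below cuts chrX into fixed-size windows and prints one GenotypeGVCFs command per window. The chrX length (155270560 bp) comes from the "Processing 155270560 bp from intervals" log line in this thread. The 10 Mb window size, the `make_intervals` and `run` helpers, the `DRY_RUN` switch, and the `genotype_chrX.partN.vcf.gz` output names are assumptions made for the example, not something recommended in the thread. Other options from the original command (such as `--dbsnp` or the Java options) would be carried over as needed.

```shell
#!/bin/sh
# Print 1-based, inclusive, fixed-size intervals covering one contig.
# Usage: make_intervals CONTIG LENGTH CHUNK_SIZE   (hypothetical helper)
make_intervals() {
  mi_start=1
  while [ "$mi_start" -le "$2" ]; do
    mi_end=$((mi_start + $3 - 1))
    # Clip the last window to the end of the contig.
    if [ "$mi_end" -gt "$2" ]; then mi_end=$2; fi
    echo "$1:$mi_start-$mi_end"
    mi_start=$((mi_end + 1))
  done
}

# Echo the command when DRY_RUN=1 (the default in this sketch), otherwise run it.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then echo "$@"; else "$@"; fi
}

# chrX length from the log in this thread; 10 Mb windows are an arbitrary choice.
part=0
for interval in $(make_intervals chrX 155270560 10000000); do
  run gatk GenotypeGVCFs \
    -R hs37d5_all_chr.fasta \
    -L "$interval" \
    --variant combine_chrX.g.vcf.gz \
    -O "genotype_chrX.part${part}.vcf.gz"
  part=$((part + 1))
done
```

    With `DRY_RUN=0` the commands are executed instead of printed. The per-window VCFs would then need to be concatenated in genomic order, for example with a tool such as GatherVcfs.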

  • Maguelonne (Paris, Member)

    @bshifaw

    Correct me if I am wrong, but the "--batchSize" parameter is only available for GenomicsDBImport, isn't it?

    Also, isn't it a problem to generate subintervals of the X chromosome for GenotypeGVCFs if I used the whole chromosome for the HaplotypeCaller and CombineGVCFs steps?

  • bshifaw (Member, Broadie, Moderator, admin)

    I spoke with the dev team. Before you try the suggestion above, would you please try reducing the Xmx for GenotypeGVCFs? Currently the tool is running on a machine with 120G and is set to use 120G, which lets Java take 100% of the system memory and leaves none for regular system processes. Their suggestion is an Xmx of 16G.
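
    A minimal sketch of that memory setting, assuming the same file names as in the commands posted earlier in this thread. The variable names are illustrative, and the command is only printed so the snippet runs without GATK installed.

```shell
#!/bin/sh
NODE_MEM_GB=120   # total memory per node, as reported in the thread
XMX_GB=16         # Java heap size suggested by the dev team

# Memory left for the operating system and for JVM overhead outside the heap.
HEADROOM_GB=$((NODE_MEM_GB - XMX_GB))
echo "Java heap: ${XMX_GB}G, left for the system: ${HEADROOM_GB}G"

JAVA_OPTS="-Xmx${XMX_GB}g"
# Printed rather than executed; drop the leading "echo" to run it for real.
echo gatk --java-options "\"${JAVA_OPTS}\"" GenotypeGVCFs \
  -R hs37d5_all_chr.fasta -L chrX \
  --variant combine_chrX.g.vcf.gz -O genotype_chrX.vcf.gz
```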

    Also, please post the stack trace from the tool, listing everything that was shown on the terminal.

  • Maguelonne (Paris, Member)

    I tried and it's still stuck at the first step.

    Here's my logfile (I removed the paths):

    Using GATK jar /bin/gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx16g -Djava.io.tmpdir=TMPDIR -XX:ParallelGCThreads=1 -jar /bin/gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar GenotypeGVCFs -R hs37d5_all_chr.fasta -L chrX --dbsnp dbsnp_138.b37.vcf -G StandardAnnotation --variant combine_chrX.g.vcf.gz -O genotype_chrX.vcf.gz
    12:11:31.651 WARN GATKAnnotationPluginDescriptor - Redundant enabled annotation group (StandardAnnotation) is enabled for this tool by default
    12:11:31.855 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/bin/gatk-4.0.11.0/gatk-package-4.0.11.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    12:11:34.285 INFO GenotypeGVCFs - ------------------------------------------------------------
    12:11:34.286 INFO GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.0.11.0
    12:11:34.286 INFO GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
    12:11:34.287 INFO GenotypeGVCFs - Executing as xxx on Linux v2.6.32-754.11.1.el6.x86_64 amd64
    12:11:34.287 INFO GenotypeGVCFs - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_20-b26
    12:11:34.287 INFO GenotypeGVCFs - Start Date/Time: July 16, 2019 12:11:31 PM CEST
    12:11:34.287 INFO GenotypeGVCFs - ------------------------------------------------------------
    12:11:34.287 INFO GenotypeGVCFs - ------------------------------------------------------------
    12:11:34.288 INFO GenotypeGVCFs - HTSJDK Version: 2.16.1
    12:11:34.288 INFO GenotypeGVCFs - Picard Version: 2.18.13
    12:11:34.288 INFO GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    12:11:34.288 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    12:11:34.288 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    12:11:34.288 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    12:11:34.288 INFO GenotypeGVCFs - Deflater: IntelDeflater
    12:11:34.288 INFO GenotypeGVCFs - Inflater: IntelInflater
    12:11:34.288 INFO GenotypeGVCFs - GCS max retries/reopens: 20
    12:11:34.288 INFO GenotypeGVCFs - Requester pays: disabled
    12:11:34.288 INFO GenotypeGVCFs - Initializing engine
    12:11:34.872 INFO FeatureManager - Using codec VCFCodec to read file file:dbsnp_138.b37.vcf
    12:11:35.388 INFO FeatureManager - Using codec VCFCodec to read file file:combine_chrX.g.vcf.gz
    12:11:35.600 INFO IntervalArgumentCollection - Processing 155270560 bp from intervals
    12:11:35.606 INFO GenotypeGVCFs - Done initializing engine
    12:11:35.747 INFO ProgressMeter - Starting traversal
    12:11:35.747 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
    12:11:36.503 WARN RMSMappingQuality - MQ annotation data is not properly formatted. This GATK version expects key RAW_MQandDP with an int tuple of sum of squared MQ values and total reads over variant genotypes as the value. Attempting to use deprecated MQ calculation.
    12:37:43.826 INFO ProgressMeter - chrX:60999 26.1 1000 38.3

  • Maguelonne (Paris, Member)

    It's actually working with version 4.1.2.0!

    Thank you for your help!