Trouble with running GenomicsDBImport

I'm writing a pipeline using GATK4 for our local cluster, which uses Slurm as its job scheduler. The command below appears to run successfully, but it finishes in only a few seconds and the output files are very small. When I use the resulting GenomicsDB workspace as input for joint genotyping, the output VCF contains only the header section.

gatk --java-options "-Xmx8000M" GenomicsDBImport -V /gpfs/scratch/jw24/variant_discovery/raw_vcf/TEST/TEST_sample_2745_T_AS.g.vcf -V /gpfs/scratch/jw24/variant_discovery/raw_vcf/TEST/TEST_sample_2753_T_AS.g.vcf --genomicsdb-workspace-path /gpfs/scratch/jw24/variant_discovery/genomicsDB/chr1GenomicDB -L chr1
Using GATK jar /util/common/bioinformatics/GATK/gatk-4.0.0.0/gatk-package-4.0.0.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Xmx8000M -jar /util/common/bioinformatics/GATK/gatk-4.0.0.0/gatk-package-4.0.0.0-local.jar GenomicsDBImport -V /gpfs/scratch/jw24/variant_discovery/raw_vcf/TEST/TEST_MMRF_2745_T_AS.g.vcf -V /gpfs/scratch/jw24/variant_discovery/raw_vcf/TEST/TEST_MMRF_2753_T_AS.g.vcf --genomicsdb-workspace-path /gpfs/scratch/jw24/variant_discovery/genomicsDB/chr1GenomicDB -L chr1
11:47:00.370 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/util/common/bioinformatics/GATK/gatk-4.0.0.0/gatk-package-4.0.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
11:47:00.540 INFO GenomicsDBImport - ------------------------------------------------------------
11:47:00.541 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.0.0.0
11:47:00.541 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
11:47:00.541 INFO GenomicsDBImport - Executing as [email protected] on Linux v3.10.0-693.11.6.el7.x86_64 amd64
11:47:00.542 INFO GenomicsDBImport - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_45-b14
11:47:00.542 INFO GenomicsDBImport - Start Date/Time: February 5, 2018 11:47:00 AM EST
11:47:00.542 INFO GenomicsDBImport - ------------------------------------------------------------
11:47:00.542 INFO GenomicsDBImport - ------------------------------------------------------------
11:47:00.543 INFO GenomicsDBImport - HTSJDK Version: 2.13.2
11:47:00.543 INFO GenomicsDBImport - Picard Version: 2.17.2
11:47:00.543 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 1
11:47:00.543 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
11:47:00.543 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
11:47:00.543 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
11:47:00.543 INFO GenomicsDBImport - Deflater: IntelDeflater
11:47:00.544 INFO GenomicsDBImport - Inflater: IntelInflater
11:47:00.544 INFO GenomicsDBImport - GCS max retries/reopens: 20
11:47:00.544 INFO GenomicsDBImport - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
11:47:00.544 INFO GenomicsDBImport - Initializing engine
11:47:01.241 INFO IntervalArgumentCollection - Processing 248956422 bp from intervals
11:47:01.244 INFO GenomicsDBImport - Done initializing engine
Created workspace /gpfs/scratch/jw24/variant_discovery/genomicsDB/chr1GenomicDB
11:47:01.437 INFO GenomicsDBImport - Vid Map JSON file will be written to /gpfs/scratch/jw24/variant_discovery/genomicsDB/chr1GenomicDB/vidmap.json
11:47:01.437 INFO GenomicsDBImport - Callset Map JSON file will be written to /gpfs/scratch/jw24/variant_discovery/genomicsDB/chr1GenomicDB/callset.json
11:47:01.438 INFO GenomicsDBImport - Complete VCF Header will be written to /gpfs/scratch/jw24/variant_discovery/genomicsDB/chr1GenomicDB/vcfheader.vcf
11:47:01.438 INFO GenomicsDBImport - Importing to array - /gpfs/scratch/jw24/variant_discovery/genomicsDB/chr1GenomicDB/genomicsdb_array
11:47:01.456 INFO ProgressMeter - Starting traversal
11:47:01.457 INFO ProgressMeter - Current Locus Elapsed Minutes Batches Processed Batches/Minute
11:47:01.704 INFO GenomicsDBImport - Importing batch 1 with 2 samples
11:47:01.850 INFO GenomicsDBImport - Done importing batch 1/1
11:47:01.851 INFO ProgressMeter - chr1:1 0.0 1 152.3
11:47:01.852 INFO ProgressMeter - Traversal complete. Processed 1 total batches in 0.0 minutes.
11:47:01.852 INFO GenomicsDBImport - Import completed!
11:47:01.872 INFO GenomicsDBImport - Shutting down engine
[February 5, 2018 11:47:01 AM EST] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=2356150272
Tool returned:
true

Thanks

Jason

Answers

  • jianxinwang Member
    Accepted Answer

    Never mind. I figured out the cause of the problem.

  • Sheila Broad Institute Member, Broadie, Moderator, admin

    @jianxinwang
    Hi Jason,

    If you could post your solution, it may help others who run into the same issue :smiley:

    Thanks,
    Sheila

  • jianxinwang Member

    Right. So the problem was caused by operating on a whole-genome g.vcf, rather than on a g.vcf consisting of only a single contig/chr.
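
For readers hitting the same symptom, one quick check is to count how many records each input g.vcf holds per contig and compare that with the contig passed to -L. The following is only a minimal sketch, not part of GATK: the function name `count_records_per_contig` and the example file name are hypothetical, and it assumes a plain-text or gzip/bgzip-compressed VCF with tab-separated columns.

```python
import gzip
from collections import Counter


def count_records_per_contig(vcf_path):
    """Count non-header records per contig (first column) in a VCF or gVCF.

    Hypothetical helper for illustration; it only reads the file as text
    and does not validate the VCF in any way.
    """
    # bgzip output is gzip-compatible, so gzip.open can read .vcf.gz files.
    opener = gzip.open if vcf_path.endswith(".gz") else open
    counts = Counter()
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            # Skip meta-information lines (##...), the #CHROM header, and blanks.
            if line.startswith("#") or not line.strip():
                continue
            contig = line.split("\t", 1)[0]
            counts[contig] += 1
    return counts
```

For example, `count_records_per_contig("sample.g.vcf").get("chr1", 0)` (with a placeholder file name) returns 0 when the file has no records on chr1, which would be consistent with an import that finishes in seconds and a joint-genotyped VCF that contains only a header.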

  • manolis Member ✭✭
    edited March 13

    Hi,

    I have the same error. In my case, during the two "bqsrrd" steps I use the entire chromosomes (1-22, X, Y); during the HaplotypeCaller step I use the chromosomes excluding the gap regions.

    Now, if I run the GenomicsDBImport step with that last interval list (chromosomes minus the gap regions), everything works. If I use my exome target regions, the final VCF is empty.

    What should I do? It is very important for me to use my exome target interval list...

    My exome interval list consists of 234,000 regions/intervals. Do I have to use this list in the HaplotypeCaller step as well as in the GenomicsDBImport step?

    Thanks

  • manolis Member ✭✭
    edited March 13

    Using GATK jar /share/apps/bio/gatk-4.0.2.1/gatk-package-4.0.2.1-local.jar
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level
    15:23:03.246 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/share/apps/bio/gatk-4.0.2.1/gatk-package-4.0.2.1-local.jar!/com/intel/gkl/n
    15:23:03.444 INFO GenomicsDBImport - ------------------------------------------------------------
    15:23:03.444 INFO GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.0.2.1
    15:23:03.445 INFO GenomicsDBImport - For support and documentation go to https://software.broadinstitute.org/gatk/
    15:23:03.445 INFO GenomicsDBImport - Executing as [email protected] on Linux v3.5.0-36-generic amd64
    15:23:03.445 INFO GenomicsDBImport - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_91-b14
    15:23:03.446 INFO GenomicsDBImport - Start Date/Time: March 13, 2018 3:23:03 PM CET
    15:23:03.446 INFO GenomicsDBImport - ------------------------------------------------------------
    15:23:03.446 INFO GenomicsDBImport - ------------------------------------------------------------
    15:23:03.447 INFO GenomicsDBImport - HTSJDK Version: 2.14.3
    15:23:03.447 INFO GenomicsDBImport - Picard Version: 2.17.2
    15:23:03.448 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 1
    15:23:03.448 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    15:23:03.448 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    15:23:03.448 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    15:23:03.448 INFO GenomicsDBImport - Deflater: IntelDeflater
    15:23:03.449 INFO GenomicsDBImport - Inflater: IntelInflater
    15:23:03.449 INFO GenomicsDBImport - GCS max retries/reopens: 20
    15:23:03.449 INFO GenomicsDBImport - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tr
    15:23:03.449 INFO GenomicsDBImport - Initializing engine
    15:23:04.721 INFO IntervalArgumentCollection - Processing 1121 bp from intervals
    15:23:04.726 INFO GenomicsDBImport - Done initializing engine
    Created workspace /home/manolis/GATK4/IlluminaExomePairEnd/5.gVCF/mergedGVCFdb/006000
    15:23:05.025 INFO GenomicsDBImport - Vid Map JSON file will be written to /home/manolis/GATK4/IlluminaExomePairEnd/5.gVCF/mergedGVCFdb/006000/vidmap.json
    15:23:05.026 INFO GenomicsDBImport - Callset Map JSON file will be written to /home/manolis/GATK4/IlluminaExomePairEnd/5.gVCF/mergedGVCFdb/006000/callset.json
    15:23:05.026 INFO GenomicsDBImport - Complete VCF Header will be written to /home/manolis/GATK4/IlluminaExomePairEnd/5.gVCF/mergedGVCFdb/006000/vcfheader.vcf
    15:23:05.026 INFO GenomicsDBImport - Importing to array - /home/manolis/GATK4/IlluminaExomePairEnd/5.gVCF/mergedGVCFdb/006000/genomicsdb_array
    15:23:05.046 INFO ProgressMeter - Starting traversal
    15:23:05.047 INFO ProgressMeter - Current Locus Elapsed Minutes Batches Processed Batches/Minute
    15:23:05.047 INFO GenomicsDBImport - Starting batch input file preload
    15:23:05.680 INFO GenomicsDBImport - Finished batch preload
    15:23:05.680 INFO GenomicsDBImport - Importing batch 1 with 3 samples
    15:23:06.619 INFO GenomicsDBImport - Done importing batch 1/1
    15:23:06.621 INFO ProgressMeter - chr1:38081502 0.0 1 38.2
    15:23:06.621 INFO ProgressMeter - Traversal complete. Processed 1 total batches in 0.0 minutes.
    15:23:06.621 INFO GenomicsDBImport - Import of all batches to GenomicsDB completed!
    15:23:06.712 INFO GenomicsDBImport - Shutting down engine
    [March 13, 2018 3:23:06 PM CET] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.06 minutes.
    Runtime.totalMemory()=2397569024
    Tool returned:
    true

    An example of the command is:

    /share/apps/bio/gatk/gatk --java-options -Xmx4000m GenomicsDBImport --genomicsdb-workspace-path 006000 --batch-size 50 -L "chr1:38082002-38082122" --sample-name-map gVCF.list --reader-threads 5 -ip 500
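
As a sanity check on the log above, the "Processing 1121 bp from intervals" line matches the interval given with -L once the 500 bp padding from -ip is applied on both sides, and the reported locus chr1:38081502 is the interval start minus 500. A small sketch of that arithmetic, assuming 1-based inclusive coordinates; the function name `padded_interval_length` is made up for illustration and the sketch does not clip at the contig end:

```python
def padded_interval_length(start, end, padding=0):
    """Length in bp of a 1-based, inclusive interval after padding both sides.

    Illustrative only: the start is clamped at position 1, but the end is
    not clipped to the contig length.
    """
    padded_start = max(1, start - padding)
    padded_end = end + padding
    return padded_end - padded_start + 1


# Interval and padding taken from the command in this post.
print(padded_interval_length(38082002, 38082122, padding=500))  # 1121
```

So the tool appears to have read the interval as intended; what the log cannot show is whether the input g.vcf files hold any records inside that 1121 bp window.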

  • manolis Member ✭✭

    @jianxinwang said:
    Right. So the problem was caused by operating on a whole-genome g.vcf, rather than on a g.vcf consisting of only a single contig/chr.

    Sorry @jianxinwang, but I don't understand. Could you explain a little better? I'm new to GATK. Thank you

  • jianxinwang Member

    Hi Manolis,

    I don't know exactly what is causing your problem, because I used a different command line. However, I would run a test with more RAM; in my experience, 4 GB is too little. You may also want to try passing the input g.vcf files directly on the command line rather than listing them in a file. Just suggestions.
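
To try the second suggestion without retyping every path, the entries of a sample map can be turned into repeated -V arguments. This is a rough sketch under stated assumptions: the map file is assumed to hold one tab-separated "sample name, path" pair per line, both function names are hypothetical, and only options that already appear in this thread (-V, --genomicsdb-workspace-path, -L, -ip, --java-options) are used.

```python
def sample_map_to_variant_args(map_path):
    """Turn a sample map (assumed: name<TAB>path per line) into -V arguments."""
    args = []
    with open(map_path) as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.rstrip("\n")
            if not line.strip():
                continue  # ignore blank lines
            fields = line.split("\t")
            if len(fields) != 2:
                raise ValueError(f"line {line_number}: expected 2 tab-separated fields")
            args.extend(["-V", fields[1]])
    return args


def build_import_command(map_path, workspace, interval, padding, java_options="-Xmx8000M"):
    """Assemble a GenomicsDBImport command line as a list of arguments."""
    return [
        "gatk", "--java-options", java_options, "GenomicsDBImport",
        *sample_map_to_variant_args(map_path),
        "--genomicsdb-workspace-path", workspace,
        "-L", interval,
        "-ip", str(padding),
    ]
```

A call such as `build_import_command("gVCF.list", "006000", "chr1:38082002-38082122", 500)` returns a list that can be passed to `subprocess.run`. The 8000M heap mirrors the first post in this thread and is a starting point for the test, not a recommendation.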
