Attention: Want an end-to-end pipelining solution for GATK Best Practices?


Check out Terra here! For more details on whether this is the right fit for you checkout our blogs here.

Silent failure in CollectAllelicCounts

mowreywmowreyw Member
edited May 2018 in Ask the GATK team

I'm attempting to do CNV analysis in the latest GATK docker container (4.0.4.0), but have run into a problem at the CollectAllelicCounts stage. The function appears to fail silently after initializing engine. stdout is below:

Using GATK jar /gatk/build/libs/gatk-package-4.0.4.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx8g -jar /gatk/build/libs/gatk-package-4.0.4.0-local.jar CollectAllelicCounts -L /refgene/1000G_phase1_omni_hapmap_ENTIRE_dbsnp_146_PADDED_ONLY.interval_list -I /gatk4_test/aligned/CU-03135_S44_L001_001_180426_M05411_0064_000000000-BTHDG.bam -R /refgene/Homo_sapiens_assembly38.fasta -O /gatk4_test/variants/CU-03135_S44_L001_001_180426_M05411_0064_000000000-BTHDG.alleliccounts.tsv
13:46:15.196 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/build/libs/gatk-package-4.0.4.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
13:46:15.524 INFO  CollectAllelicCounts - ------------------------------------------------------------
13:46:15.525 INFO  CollectAllelicCounts - The Genome Analysis Toolkit (GATK) v4.0.4.0
13:46:15.526 INFO  CollectAllelicCounts - For support and documentation go to https://software.broadinstitute.org/gatk/
13:46:15.527 INFO  CollectAllelicCounts - Executing as [email protected] on Linux v4.9.87-linuxkit-aufs amd64
13:46:15.527 INFO  CollectAllelicCounts - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11
13:46:15.528 INFO  CollectAllelicCounts - Start Date/Time: May 25, 2018 1:46:15 PM UTC
13:46:15.529 INFO  CollectAllelicCounts - ------------------------------------------------------------
13:46:15.530 INFO  CollectAllelicCounts - ------------------------------------------------------------
13:46:15.531 INFO  CollectAllelicCounts - HTSJDK Version: 2.14.3
13:46:15.531 INFO  CollectAllelicCounts - Picard Version: 2.18.2
13:46:15.532 INFO  CollectAllelicCounts - HTSJDK Defaults.COMPRESSION_LEVEL : 2
13:46:15.532 INFO  CollectAllelicCounts - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
13:46:15.532 INFO  CollectAllelicCounts - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
13:46:15.533 INFO  CollectAllelicCounts - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
13:46:15.534 INFO  CollectAllelicCounts - Deflater: IntelDeflater
13:46:15.534 INFO  CollectAllelicCounts - Inflater: IntelInflater
13:46:15.535 INFO  CollectAllelicCounts - GCS max retries/reopens: 20
13:46:15.535 INFO  CollectAllelicCounts - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
13:46:15.535 WARN  CollectAllelicCounts - 

[1m[31m   !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

   Warning: CollectAllelicCounts is a BETA tool and is not yet ready for use in production

   !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!![0m


13:46:15.536 INFO  CollectAllelicCounts - Initializing engine
Using GATK jar /gatk/build/libs/gatk-package-4.0.4.0-local.jar

The last line is where it starts analyzing the next record, before producing any output. The filenames specified by the -O in my command are never created.

Answers

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @mowreyw,

    Can you post the error stacktrace please? These should be the last section of the standard output printed to your terminal screen.

  • mowreywmowreyw Member

    I'm attaching another example with DEBUG verbosity (includes stdout and stderr). I don't see any error messages. The function simply stops running without producing any output files.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    @mowreyw,

    Above, your output is

    -O /gatk4_test/variants/CU-03135_S44_L001_001_180426_M05411_0064_000000000-BTHDG.alleliccounts.tsv
    

    In your stdout you attach, at the very end we see a command printed with

    -O /gatk4_test/variants_test/GP10001-01-A_S1_L001_001_180426_M05411_0064_000000000-BTHDG.alleliccounts.tsv
    

    These files are different and so it appears another command is being launched. Are you by chance running these commands via a bash script? If so did you set set -e for the script to break if a command should fail?

  • mowreywmowreyw Member
    edited May 2018

    @shlee,

    Thanks for your fast reply. Sorry, I should have clarified that in the two examples CollectAllelicCounts were run against different input files. I am running this via a bash script, which is essentially the same as what I've used for all prior steps of CNV analysis (with a few minor modifications). The result of running the script seems to be the same regardless of whether set -e is included. I'm attaching the output again with bash debug information (set -xv) so you can see the gatk command that is executed. It seems that all paths exist and are correct.

    Perhaps something is wrong with one of the input files then? This would be strange because these files are all used in the earlier stages of the CNV analysis and seem to work fine for generating normalized coverage. It's odd because the output suggests that CollectAllelicCounts is actually carryout out the analysis, but no output file is generated.

    Thanks,
    Bill

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited May 2018

    Hi @mowreyw,

    Do you get the same hanging behavior when you run the command directly, outside of the bash script? I've run this tool without any issue, as I describe in Section 5 of Tutorial#11683.

    I'm just noticing, for your common germline variant sites, you are using an interval_list that may be padded--I see the label PADDED_ONLY in the file name:

    -L /refgene/1000G_phase1_omni_hapmap_ENTIRE_dbsnp_146_PADDED_ONLY.interval_list
    

    CollectAllelicCounts expects SNPs-only sites. Is this the case for your intervals?

    P.S. There is no need for you to run your own bash scripts. Are you aware that these workflows are already pipelined in WDL? Please check out these two sources of pipelined scripts:

  • mowreywmowreyw Member

    @shlee,

    I tried running the command directly and saw the same results. It runs for a minute or two and then finishes without writing the output file.

    The interval list is a Picard style interval list. I confirmed that all intervals are only SNPs (ie, interval start and end are the same position). A snippet of the interval list is below:

    @SQ SN:HLA-DRB1*15:03:01:01 LN:11567    M5:e6a26b767a62b282e0db211bbc9a7363 AS:38   UR:/seq/references/Homo_sapiens_assembly38/v0/Homo_sapiens_assembly38.fasta SP:Homo sapiens
    @SQ SN:HLA-DRB1*15:03:01:02 LN:11569    M5:4e0d459b9bd15bff8645de84334e3d25 AS:38   UR:/seq/references/Homo_sapiens_assembly38/v0/Homo_sapiens_assembly38.fasta SP:Homo sapiens
    @SQ SN:HLA-DRB1*16:02:01    LN:11005    M5:4a972df76bd3ee2857b87bd5be5ea00a AS:38   UR:/seq/references/Homo_sapiens_assembly38/v0/Homo_sapiens_assembly38.fasta SP:Homo sapiens
    chr1    51479   51479   +   rs116400033
    chr1    55299   55299   +   rs10399749
    chr1    55367   55367   +   .
    chr1    55388   55388   +   .
    chr1    55394   55394   +   rs2949420
    chr1    55550   55550   +   rs2949421
    chr1    55852   55852   +   .
    chr1    56981   56981   +   rs2691310
    chr1    61462   61462   +   rs56992750
    chr1    62157   62157   +   rs10399597
    chr1    82571   82571   +   rs4030303
    chr1    82609   82609   +   .
    chr1    82652   82652   +   rs4030300
    
    

    I will check out the WDL scripts. I was just running some quick bash scripts to see how the CNV tools differ in new GATK release compared to the previous version I was using (4.beta.5). I hit this snag somewhat unexpectedly.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Thanks for checking these things @mowreyw. At this point, I would say this is a bug in the tool and we need to file a bug report. Would it be possible for you to share enough of your data to recapitulate this hanging behavior? Instructions for bug report submission are at https://software.broadinstitute.org/gatk/guide/article?id=1894. Thanks.

  • mowreywmowreyw Member

    It seems that the size of the interval list is the issue. I tried chopping down the different files, and it turns out the function runs fine if the interval list is made smaller (decreasing BAM size makes no difference).

    The interval list I was trying to use was quite large (approaching 1 GB), so it is likely a memory issue. I tried running with bam index caching disabled, but this didn't help. The issue seems to be related to my laptop specifically, as I can run the same function call in an identical Docker container successfully on an EC2 instance. I'm guessing that my laptop's 16 GB of memory is not enough to support analysis with such a large set of intervals.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @mowreyw
    Hi,

    Thanks for letting us know. Glad it is not a bug :smile:

    -Sheila

Sign In or Register to comment.