
MarkDuplicates - 0 pairs never matched

jonahcullen · Member
edited July 29 in Ask the GATK team
Hello, I am trying to use MarkDuplicates in order to combine uBAMs generated from paired fastq files across two lanes (WGS on Illumina NovaSeq) using the GATK paired-fastq-to-unmapped-bam.wdl. I believe I have parsed the correct RG information in generating the uBAMs (ValidateSamFile passed) but I am a little confused by the MarkDuplicates output. I have three likely related questions.

1) What is the meaning of "Tracking 0 as yet unmatched pairs" and "0 pairs never matched"?
2) There was also a WARNING - what is meant by "Reading will be unbuffered"?

```
INFO 2019-07-28 10:15:16 MarkDuplicates Read 506,000,000 records. Elapsed time: 07:30:29s. Time for last 1,000,000: 43s. Last read position: chr1:102,335,300
INFO 2019-07-28 10:15:16 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2019-07-28 10:15:47 MarkDuplicates Read 506691598 records. 0 pairs never matched.
INFO 2019-07-28 10:16:24 MarkDuplicates After buildSortedReadEndLists freeMemory: 16451310808; totalMemory: 16540762112; maxMemory: 16540762112
INFO 2019-07-28 10:16:24 MarkDuplicates Will retain up to 516898816 duplicate indices before spilling to disk.
INFO 2019-07-28 10:16:25 MarkDuplicates Traversing read pair information and detecting duplicates.
INFO 2019-07-28 10:16:25 SortingCollection Creating merging iterator from 5 files
WARNING 2019-07-28 10:16:27 SortingCollection There is not enough memory per file for buffering. Reading will be unbuffered.
```

3) Finally, the workflow ends by transitioning to a terminal state (return code not 0), yet it appears MarkDuplicates is "done".

```
INFO 2019-07-28 12:27:32 MarkDuplicates Before output close freeMemory: 8443873968; totalMemory: 13359382528; maxMemory: 14944829440
INFO 2019-07-28 12:27:33 MarkDuplicates After output close freeMemory: 13288318136; totalMemory: 13332643840; maxMemory: 14944829440
[Sun Jul 28 12:27:33 CDT 2019] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 582.83 minutes.
Runtime.totalMemory()=13332643840

[2019-07-28 12:32:26,21] [info] WorkflowManagerActor WorkflowActor-69a61389-a401-4e87-af57-66a767c7b994 is in a terminal state: WorkflowFailedState
[2019-07-28 12:32:37,63] [info] SingleWorkflowRunnerActor workflow finished with status 'Failed'.
[2019-07-28 12:32:40,80] [info] Workflow polling stopped
[2019-07-28 12:32:40,84] [info] Shutting down WorkflowStoreActor - Timeout = 5 seconds
[2019-07-28 12:32:40,84] [info] 0 workflows released by cromid-95ee9e9
[2019-07-28 12:32:40,85] [info] Shutting down WorkflowLogCopyRouter - Timeout = 5 seconds
[2019-07-28 12:32:40,88] [info] Shutting down JobExecutionTokenDispenser - Timeout = 5 seconds
[2019-07-28 12:32:40,90] [info] JobExecutionTokenDispenser stopped
[2019-07-28 12:32:40,90] [info] Aborting all running workflows.
[2019-07-28 12:32:40,99] [info] WorkflowStoreActor stopped
[2019-07-28 12:32:41,00] [info] WorkflowLogCopyRouter stopped
[2019-07-28 12:32:41,00] [info] Shutting down WorkflowManagerActor - Timeout = 3600 seconds
[2019-07-28 12:32:41,00] [info] WorkflowManagerActor All workflows finished
[2019-07-28 12:32:41,00] [info] WorkflowManagerActor stopped
[2019-07-28 12:32:42,32] [info] Connection pools shut down
[2019-07-28 12:32:42,32] [info] Shutting down SubWorkflowStoreActor - Timeout = 1800 seconds
[2019-07-28 12:32:42,32] [info] Shutting down JobStoreActor - Timeout = 1800 seconds
[2019-07-28 12:32:42,32] [info] Shutting down CallCacheWriteActor - Timeout = 1800 seconds
[2019-07-28 12:32:42,32] [info] Shutting down ServiceRegistryActor - Timeout = 1800 seconds
[2019-07-28 12:32:42,32] [info] SubWorkflowStoreActor stopped
[2019-07-28 12:32:42,32] [info] Shutting down DockerHashActor - Timeout = 1800 seconds
[2019-07-28 12:32:42,32] [info] Shutting down IoProxy - Timeout = 1800 seconds
[2019-07-28 12:32:42,32] [info] KvWriteActor Shutting down: 0 queued messages to process
[2019-07-28 12:32:42,32] [info] WriteMetadataActor Shutting down: 0 queued messages to process
[2019-07-28 12:32:42,32] [info] CallCacheWriteActor Shutting down: 0 queued messages to process
[2019-07-28 12:32:42,32] [info] CallCacheWriteActor stopped
[2019-07-28 12:32:42,32] [info] IoProxy stopped
[2019-07-28 12:32:42,33] [info] JobStoreActor stopped
[2019-07-28 12:32:42,33] [info] ServiceRegistryActor stopped
[2019-07-28 12:32:42,33] [info] DockerHashActor stopped
[2019-07-28 12:32:42,35] [info] Database closed
[2019-07-28 12:32:42,35] [info] Stream materializer shut down
[2019-07-28 12:32:42,35] [info] WDL HTTP import resolver closed
```
------------------------------------------------
I am using the gatk4-data-processing workflow with Cromwell 40. I was unable to post the entirety of the workflow output, and the code blocks do not seem to be working for me - I am not sure why, as I have never posted before!

Answers

  • Tiffany_at_Broad · Cambridge, MA · Member, Administrator, Broadie, Moderator, admin

    Hi @jonahcullen , sorry for the delay. I will get back to you soon with more information.
    If anyone else can chime in, please do!

  • Tiffany_at_Broad · Cambridge, MA · Member, Administrator, Broadie, Moderator, admin

    Hi @jonahcullen ,

    What is the meaning of "Tracking 0 as yet unmatched pairs" and "0 pairs never matched"?

    The tool scans along looking for pairs, and sometimes it finds one read but its mate isn't nearby, so it builds up a collection of unmated reads that still need to be matched.
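    The bookkeeping behind those log lines can be sketched as a toy loop (an illustration only, not Picard's actual code; the read names are made up):

    ```shell
    # Hold each read name until its mate arrives; whatever is left at the end
    # is what the log reports as "pairs never matched".
    pending=" "
    matched=0
    for name in A B A C B C; do        # six records forming three proper pairs
      case "$pending" in
        *" $name "*)                   # mate already seen: count the matched pair
          pending=$(printf '%s' "$pending" | sed "s/ $name / /")
          matched=$((matched + 1))
          ;;
        *)                             # first read of a pair: keep it until the mate shows up
          pending="$pending$name "
          ;;
      esac
    done
    set -- $pending                    # reads whose mate never appeared
    unmatched=$#
    echo "matched pairs: $matched, never matched: $unmatched"
    ```

    With every mate present, this prints matched pairs: 3, never matched: 0 - the healthy state your log shows.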

    I see WARNING 2019-07-28 10:16:27 SortingCollection There is not enough memory per file for buffering. Reading will be unbuffered.
    Can you increase your memory in the WDL for this task to see if that makes this go away?
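    For reference, a minimal sketch of what that could look like in the task's runtime section (the size here is a placeholder to tune, not a recommendation):

    ```
    runtime {
      memory: "32 GB"   # per-task request; raise it if the unbuffered-reading warning persists
    }
    ```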

  • jonahcullen · Member

    Hi @Tiffany_at_Broad, thank you for your response. I reran this workflow with 250 GB of RAM and got the same WARNING. I am also including the error from the top of the log file hs_err_pid9437.log, as perhaps this could give more clues as to the issue?

    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  SIGBUS (0x7) at pc=0x00002b369802bc48, pid=9437, tid=0x00002b3542bcd200
    #
    # JRE version: Java(TM) SE Runtime Environment (8.0_101-b13) (build 1.8.0_101-b13)
    # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.101-b13 mixed mode linux-amd64 compressed oops)
    # Problematic frame:
    #
    [error occurred during error reporting (printing problematic frame), id 0x7]
    
    # Core dump written. Default location: /scratch.global/friedlab_PREPARE_TEST/mixb/D06403/cromwell-executions/PreProcessingForVariantDiscovery_GATK4/7d60aa8c-c853-4cbb-b320-029db4a1c502/call-MarkDuplicates/execution/core or core.9437
    #
    # If you would like to submit a bug report, please visit:
    #   http://bugreport.java.com/bugreport/crash.jsp
    #
    

    Interestingly, I was able to successfully run the entire workflow using only a random subset of reads from the original fastq files. It seems like the issue is Java- or memory-related, but I cannot imagine how it would require more than 250 GB?

  • jonahcullen · Member
    edited August 6

    Sorry I forgot to include the WARNING portion:

    ...
    INFO    2019-08-06 11:26:11     MarkDuplicates  Read   506,000,000 records.  Elapsed time: 08:44:15s.  Time for last 1,000,000:  401s.  Last read position: chr1:102,335,300                                   
    INFO    2019-08-06 11:26:11     MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
    INFO    2019-08-06 11:30:43     MarkDuplicates  Read 506691598 records. 0 pairs never matched.
    INFO    2019-08-06 11:32:16     MarkDuplicates  After buildSortedReadEndLists freeMemory: 28694895504; totalMemory: 28858384384; maxMemory: 28858384384                                                        
    INFO    2019-08-06 11:32:16     MarkDuplicates  Will retain up to 901824512 duplicate indices before spilling to disk.                                                                                         
    INFO    2019-08-06 11:32:19     MarkDuplicates  Traversing read pair information and detecting duplicates.                                                                                                     
    INFO    2019-08-06 11:32:19     SortingCollection       Creating merging iterator from 3 files
    WARNING 2019-08-06 11:32:21     SortingCollection       There is not enough memory per file for buffering. Reading will be unbuffered.                                                                         
    INFO    2019-08-06 11:34:33     MarkDuplicates  Traversing fragment information and detecting duplicates.                                                                                                      
    INFO    2019-08-06 11:34:33     SortingCollection       Creating merging iterator from 5 files
    WARNING 2019-08-06 11:34:35     SortingCollection       There is not enough memory per file for buffering. Reading will be unbuffered.                                                                         
    INFO    2019-08-06 11:38:06     MarkDuplicates  Sorting list of duplicate records.
    INFO    2019-08-06 11:38:10     MarkDuplicates  After generateDuplicateIndexes freeMemory: 19570729224; totalMemory: 26959937536; maxMemory: 28631367680                                                       
    INFO    2019-08-06 11:38:10     MarkDuplicates  Marking 14513620 records as duplicates.
    INFO    2019-08-06 11:38:10     MarkDuplicates  Found 10216414 optical duplicate clusters.
    INFO    2019-08-06 11:38:13     MarkDuplicates  Reads are assumed to be ordered by: queryname
    INFO    2019-08-06 11:38:13     MarkDuplicates  Output will not be re-sorted. Output header will state SO:unknown GO:query   
    INFO    2019-08-06 11:40:30     MarkDuplicates  Written    10,000,000 records.  Elapsed time: 00:02:17s.  Time for last 10,000,000:  137s.  Last read position: chr30:2,795,313
    INFO    2019-08-06 11:42:48     MarkDuplicates  Written    20,000,000 records.  Elapsed time: 00:04:34s.  Time for last 10,000,000:  137s.  Last read position: chr23:957,673  
    
    ...
    
    INFO    2019-08-06 13:31:43     MarkDuplicates  Written   500,000,000 records.  Elapsed time: 01:53:30s.  Time for last 10,000,000:  138s.  Last read position: chr37:18,829,626                               
    INFO    2019-08-06 13:33:12     MarkDuplicates  Before output close freeMemory: 12424962064; totalMemory: 20611858432; maxMemory: 28631367680                                                                  
    INFO    2019-08-06 13:33:13     MarkDuplicates  After output close freeMemory: 20764185200; totalMemory: 20811087872; maxMemory: 28631367680                                                                   
    [Tue Aug 06 13:33:13 CDT 2019] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 651.33 minutes.                                                                                                    
    Runtime.totalMemory()=20811087872
    
    [2019-08-06 13:35:55,78] [info] WorkflowManagerActor WorkflowActor-7d60aa8c-c853-4cbb-b320-029db4a1c502 is in a terminal state: WorkflowFailedState                                                            
    [2019-08-06 13:36:27,17] [info] SingleWorkflowRunnerActor workflow finished with status 'Failed'.
    [2019-08-06 13:36:30,23] [info] Workflow polling stopped
    [2019-08-06 13:36:30,28] [info] Shutting down WorkflowStoreActor - Timeout = 5 seconds
    [2019-08-06 13:36:30,28] [info] 0 workflows released by cromid-8c6b3ed
    [2019-08-06 13:36:30,29] [info] Shutting down WorkflowLogCopyRouter - Timeout = 5 seconds
    [2019-08-06 13:36:30,32] [info] Shutting down JobExecutionTokenDispenser - Timeout = 5 seconds
    [2019-08-06 13:36:30,32] [info] Aborting all running workflows.
    [2019-08-06 13:36:30,36] [info] JobExecutionTokenDispenser stopped
    [2019-08-06 13:36:30,43] [info] WorkflowStoreActor stopped
    [2019-08-06 13:36:30,43] [info] WorkflowLogCopyRouter stopped
    [2019-08-06 13:36:30,44] [info] Shutting down WorkflowManagerActor - Timeout = 3600 seconds
    [2019-08-06 13:36:30,44] [info] WorkflowManagerActor All workflows finished
    [2019-08-06 13:36:30,44] [info] WorkflowManagerActor stopped
    [2019-08-06 13:36:31,34] [info] Connection pools shut down
    [2019-08-06 13:36:31,34] [info] Shutting down SubWorkflowStoreActor - Timeout = 1800 seconds
    [2019-08-06 13:36:31,34] [info] Shutting down JobStoreActor - Timeout = 1800 seconds
    [2019-08-06 13:36:31,34] [info] Shutting down CallCacheWriteActor - Timeout = 1800 seconds
    [2019-08-06 13:36:31,34] [info] Shutting down ServiceRegistryActor - Timeout = 1800 seconds
    [2019-08-06 13:36:31,34] [info] Shutting down DockerHashActor - Timeout = 1800 seconds
    [2019-08-06 13:36:31,34] [info] Shutting down IoProxy - Timeout = 1800 seconds
    [2019-08-06 13:36:31,34] [info] SubWorkflowStoreActor stopped
    [2019-08-06 13:36:31,35] [info] KvWriteActor Shutting down: 0 queued messages to process
    [2019-08-06 13:36:31,35] [info] WriteMetadataActor Shutting down: 0 queued messages to process
    [2019-08-06 13:36:31,35] [info] CallCacheWriteActor Shutting down: 0 queued messages to process
    [2019-08-06 13:36:31,35] [info] CallCacheWriteActor stopped
    [2019-08-06 13:36:31,35] [info] JobStoreActor stopped
    [2019-08-06 13:36:31,35] [info] IoProxy stopped
    [2019-08-06 13:36:31,35] [info] DockerHashActor stopped
    [2019-08-06 13:36:31,35] [info] ServiceRegistryActor stopped
    [2019-08-06 13:36:31,37] [info] Database closed
    [2019-08-06 13:36:31,37] [info] Stream materializer shut down
    [2019-08-06 13:36:31,37] [info] WDL HTTP import resolver closed                                                                                
    
  • Tiffany_at_Broad · Cambridge, MA · Member, Administrator, Broadie, Moderator, admin

    Are you able to attach the log file and WDL script you are using (or point me to the right version of the WDL from wherever you got it online)? If you are running this on Terra or FireCloud, let me know because you could just share the workspace.

  • jonahcullen · Member

    Hi @Tiffany_at_Broad, I am attaching the log file and WDL script which is a very lightly modified (e.g. removed docker runtime params) version of processing-for-variant-discovery-gatk4.wdl from https://github.com/gatk-workflows/. I had to change the file extension to .wdl.txt as the file format was not allowed for upload. Currently I am only testing this workflow on a single node on a HPC cluster.

  • jonahcullen · Member
    edited August 8

    Hi again @Tiffany_at_Broad, I decided to check if there was something wrong with the MarkDuplicates of the BAMs using GATK directly (i.e. outside of cromwell execution) via a job submission. I used the BAM files generated during the previous attempt that transitioned to a failed state immediately after MarkDuplicates (above).

    /home/fried255/cull0084/projects/gatk4_pipeline/gatk-4.1.1.0/gatk --java-options "-Dsamjdk.compression_level=5 -Xms4000m" \
        MarkDuplicates \
        --INPUT ${b1} --INPUT ${b2} \
        --OUTPUT D06403.canfam3.aligned.unsorted.duplicates_marked.bam \
        --METRICS_FILE D06403.canfam3.duplicate_metrics \
        --VALIDATION_STRINGENCY SILENT \
        --OPTICAL_DUPLICATE_PIXEL_DISTANCE 2500 \
        --ASSUME_SORT_ORDER "queryname" \
        --CREATE_MD5_FILE true
    

    This worked fine with return code 0 and a valid BAM. Additionally, the D06403.canfam3.duplicate_metrics file was identical between the failed workflow run and this MarkDuplicates-only run. Does this suggest something is failing when cromwell attempts to start the next task, SortAndFixTasks? I've attached the stdout.

    Post edited by jonahcullen on
  • jonahcullen · Member

    Apologies for the continued posts!

    I just noticed something in the stdout where it appears MarkDuplicates actually completed successfully ("Tool returned: 0") prior to a Java error. I'm not sure how to interpret or address the error.

    Tool returned:
    0
    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  SIGBUS (0x7) at pc=0x00002b9cb402bc48, pid=1649, tid=0x00002b9b5eae4200
    #
    # JRE version: Java(TM) SE Runtime Environment (8.0_101-b13) (build 1.8.0_101-b13)
    # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.101-b13 mixed mode linux-amd64 compressed oops)
    # Problematic frame:
    # 
    [error occurred during error reporting (printing problematic frame), id 0x7]
    
    # Core dump written. Default location: /scratch.global/friedlab_PREPARE_TEST/mixb/D06403/cromwell-executions/PreProcessingForVariantDiscovery_GATK4/537da21c-9e11-4255-9140-49b01fa88985/call-MarkDuplicates/execution/core or core.1649
    #
    # An error report file with more information is saved as:
    # /scratch.global/friedlab_PREPARE_TEST/mixb/D06403/cromwell-executions/PreProcessingForVariantDiscovery_GATK4/537da21c-9e11-4255-9140-49b01fa88985/call-MarkDuplicates/execution/hs_err_pid1649.log
    #
    # If you would like to submit a bug report, please visit:
    #   http://bugreport.java.com/bugreport/crash.jsp
    
  • Tiffany_at_Broad · Cambridge, MA · Member, Administrator, Broadie, Moderator, admin

    Hi @jonahcullen can you share the log with the error in it? It would be good to see what is happening before the error.

  • jonahcullen · Member

    Hi @Tiffany_at_Broad, I have attached the full log file (processing-for-variant-discovery-gatk4.stdout.txt). Line 202 is where the error appears to begin.

    I am also attaching the stderr (stderr.txt), stdout from MarkDuplicates (MarkDuplicates.stdout.txt), and the java error report (hs_err_pid1649.log) referenced in the stdout.

    Thank you again for your help! Please let me know if there is anything I can provide or do to help figure this out!

  • Tiffany_at_Broad · Cambridge, MA · Member, Administrator, Broadie, Moderator, admin

    I am sorry this is taking so long for us to get back to you.
    Something I noticed in the log was Job PreProcessingForVariantDiscovery_GATK4.MarkDuplicates:NA:1 exited with return code 250 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.
    You could use continueOnReturnCode in the WDL. I am trying to find out more about what the 250 means. Stay tuned.
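    In the WDL, that attribute would sit in the task's runtime section, something like this (note that treating 250 as success can mask real failures, so this is best kept to debugging):

    ```
    runtime {
      continueOnReturnCode: [0, 250]   # accept 250 in addition to the default 0
    }
    ```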

    Also here is more info on MarkDuplicates memory requirements
    Q: MarkDuplicates takes a long time. Is something wrong?
    A: Not necessarily. MarkDuplicates is very memory-intensive. This is required in order to detect interchromosomal duplication. At Broad, we run MarkDuplicates with 2GB Java heap (java -Xmx2g) and 10GB hard memory limit. Some example run times:

    An 8.6GB file (63M records) took about 1 hour
    A 20GB file (133M records) took about 2.5 hours
    Increasing the amount of RAM allocated to MarkDuplicates (the -Xmx value) will improve the speed somewhat. Note that if you have a hard memory limit, this needs to be increased also. One workaround for large datasets involves marking duplicates for different libraries independently and subsequently merging the marked files.
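    That per-library workaround could be sketched like this (the file names are hypothetical):

    ```
    gatk MarkDuplicates --INPUT libA.bam --OUTPUT libA.md.bam --METRICS_FILE libA.metrics
    gatk MarkDuplicates --INPUT libB.bam --OUTPUT libB.md.bam --METRICS_FILE libB.metrics
    gatk MergeSamFiles --INPUT libA.md.bam --INPUT libB.md.bam --OUTPUT merged.md.bam
    ```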

  • jonahcullen · Member

    Hi @Tiffany_at_Broad, no problem at all!

    I had considered using the continueOnReturnCode: 250 runtime option; I was just a little hesitant as I could not find any information on what 250 is indicating... However, I just started it anyway, so I'll let you know how that shakes out.

  • Tiffany_at_Broad · Cambridge, MA · Member, Administrator, Broadie, Moderator, admin

    Curious to see your results :)

  • jonahcullen · Member

    Unfortunately it did not work...the next step in the workflow also had a return code of 250.

  • jonahcullen · Member

    I'm rerunning the workflow with continueOnReturnCode: [0, 250] for all tasks...we'll see what that does!

  • Tiffany_at_Broad · Cambridge, MA · Member, Administrator, Broadie, Moderator, admin

    Are the results for MarkDuplicates the same as when you ran it locally? Are there any runtime limits you are coming up against imposed by the HPC cluster? Can you share the configuration file you are using? This doc is about specifying runtime attributes for your hpc tasks (in case you haven't seen it). I'm hoping we can help you figure this out soon!

  • jonahcullen · Member

    Hi @Tiffany_at_Broad, I did use that doc (and others) to put together the configuration file. I attempted to run the workflow again and specifically ran MarkDuplicates using the backend I defined in the configuration file. This resulted in the same error where the actual tool returned 0 but a return code of 250 was reported to the workflow, thus ending the workflow. Has there been any luck in determining what 250 is indicating? This issue only occurs when running MarkDuplicates as part of a workflow and not when run as a standalone tool...

    Tool returned:
    0
    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  SIGBUS (0x7) at pc=0x00002b8d10e39c48, pid=11958, tid=0x00002b8c151cd200
    #
    # JRE version: Java(TM) SE Runtime Environment (8.0_101-b13) (build 1.8.0_101-b13)
    # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.101-b13 mixed mode linux-amd64 compressed oops)
    # Problematic frame:
    # 
    [error occurred during error reporting (printing problematic frame), id 0x7]
    
    # Core dump written. Default location: /scratch.global/friedlab_PREPARE_TEST/mixb/D06403/BACKEND_FULL/cromwell-executions/PreProcessingForVariantDiscovery_GATK4/ce6e2b14-513e-4367-8c62-a16349b3582c/call-MarkDuplicates/execution/core or core.11958
    #
    # An error report file with more information is saved as:
    # /scratch.global/friedlab_PREPARE_TEST/mixb/D06403/BACKEND_FULL/cromwell-executions/PreProcessingForVariantDiscovery_GATK4/ce6e2b14-513e-4367-8c62-a16349b3582c/call-MarkDuplicates/execution/hs_err_pid11958.log
    #
    # If you would like to submit a bug report, please visit:
    #   http://bugreport.java.com/bugreport/crash.jsp
    
  • Tiffany_at_Broad · Cambridge, MA · Member, Administrator, Broadie, Moderator, admin

    Hi @jonahcullen
    This Java error means the Java VM dies with SIGBUS (0x7) when the temp directory is full on Linux.
    So Java is crashing, and that is why it is returning the 250.
    1.) You can try pointing to a bigger temp directory on a disk with ample space by adding the --TMP_DIR argument to MarkDuplicates.

    2.) Try reducing your heap size (set -Xmx in addition to -Xms) so the heap stays comfortably below the memory assigned to your node.
    Let me know if one of these options helps!
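    Adapting the MarkDuplicates command posted earlier in this thread, that could look like the following (the temp path and heap sizes are illustrative placeholders, not recommendations):

    ```
    gatk --java-options "-Dsamjdk.compression_level=5 -Xms4000m -Xmx16g" \
        MarkDuplicates \
        --INPUT ${b1} --INPUT ${b2} \
        --OUTPUT D06403.canfam3.aligned.unsorted.duplicates_marked.bam \
        --METRICS_FILE D06403.canfam3.duplicate_metrics \
        --TMP_DIR /path/on/a/big/disk/tmp
    ```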

  • jonahcullen · Member

    Hi @Tiffany_at_Broad, thank you very much for that information - that makes complete sense! I started a workflow with the --TMP_DIR argument and will report back when it finishes.

    Could you provide input on how I should allocate the heap size? For example, when using MarkDuplicates to process two bam files (each ~30 GB), it took ~11 hours with -Xms4000m and -Xmx32g on a node with 250 GB of memory. This attempt was prior to using --TMP_DIR. I don't mind the amount of time it takes per se, but I worry I'm going to have memory issues as I attempt to process more than two bam files.

    Thanks again for all your help - I really appreciate it!

  • jonahcullen · Member

    The workflow that included --TMP_DIR resulted in the same return code of 250. However, there were a couple of differences now in the stderr and stdout (both attached). The stdout no longer contained the # A fatal error has been detected by the Java Runtime Environment: block of text, just Tool returned: 0. Additionally, the stderr now contains a different error I have not seen from MarkDuplicates before:

    pure virtual method called
    terminate called without an active exception
    
  • Tiffany_at_Broad · Cambridge, MA · Member, Administrator, Broadie, Moderator, admin

    Hi @jonahcullen - I will follow up with the GATK developers on this next week.
    FYI, I looked at our preprocessing pipeline run in Terra, which uses individual GCP virtual machines per task (so MarkDuplicates has its own): we ran a 64 GB bam through the pipeline, requested 7 GB of memory, set the heap with -Xms to 4 GB, and that ran just fine. In general, you can set -Xmx equal to -Xms or a bit larger than -Xms, but not too close to the total memory you requested.
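    As a concrete (illustrative) pairing of those numbers, assuming a 7 GB memory request for the task:

    ```
    # runtime memory request for the task: 7 GB
    # heap: start at 4 GB, cap it a bit above that, well below the 7 GB request
    gatk --java-options "-Xms4g -Xmx5g" MarkDuplicates ...   # remaining arguments unchanged
    ```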

  • jonahcullen · Member

    Thank you! I will try -Xms4g and -Xmx16g and see how that runs.

  • Tiffany_at_Broad · Cambridge, MA · Member, Administrator, Broadie, Moderator, admin

    Hi @jonahcullen You didn't attach the logs in your last message. What is the latest?

  • jonahcullen · Member

    Hi @Tiffany_at_Broad, sorry about that.

    I'm attaching the stderr (stderr.TMP_DIR.txt) from running MarkDuplicates as part of the processing workflow with -Xms4000m and -Xmx32g. This is the run I mentioned where the tool returned 0 and the workflow return code was 250 (as per usual, it seems!), but the stdout (stdout.TMP_DIR.txt) did not include the # A fatal error has been detected by the Java Runtime Environment: block. This was also the one where the stderr contained

    pure virtual method called
    terminate called without an active exception
    

    I'm also attaching the stderr (stderr.MARK_DUP.txt) and stdout (stdout.MARK_DUP.txt) from running MarkDuplicates as a standalone with the same bam files used in the processing above. This was run with --TMP_DIR pointing to a location with TBs of disk space, on a 62 GB node, with -Xms4g and -Xmx10g - I meant to do -Xmx16g but must have mistyped. This run ALSO produced the # A fatal error has been detected by the Java Runtime Environment: message; it had a return code of 250 but the tool returned 0.

    Thank you again so much for your continued help with this!

  • Tiffany_at_Broad · Cambridge, MA · Member, Administrator, Broadie, Moderator, admin

    Are you running inside a Docker on the node?

  • jonahcullen · Member

    Not as of yet, just "locally" (or by spawning jobs as per the backend configuration). Eventually I will get it set up to run using Singularity but have not attempted that yet.

  • Tiffany_at_Broad · Cambridge, MA · Member, Administrator, Broadie, Moderator, admin

    Hmm, we are a little bit stumped. I am following up with a few folks to see if we can get any more ideas.

  • jonahcullen · Member

    No problem @Tiffany_at_Broad, totally understand! I have some additional information...

    I thought I'd try to run the HaplotypeCaller workflow using the bam generated from the workflow we've been discussing with the MarkDuplicates/JRE issue. I did this by submitting each of ten intervals to the job scheduler (PBS) of my HPC. One of the ten failed with a familiar-looking error in the stderr and stdout (both attached)...

    stderr

    pure virtual method called
    terminate called without an active exception
    

    stdout

    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  SIGBUS (0x7) at pc=0x00002ad7c62c8ce2, pid=24785, tid=0x00002ad890af0700
    #
    # JRE version: Java(TM) SE Runtime Environment (8.0_101-b13) (build 1.8.0_101-b13)
    # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.101-b13 mixed mode linux-amd64 compressed oops)
    # Problematic frame:
    # C  [ld-linux-x86-64.so.2+0x19ce2]
    
  • Tiffany_at_Broad · Cambridge, MA · Member, Administrator, Broadie, Moderator, admin

    Hi @jonahcullen
    Our devs think you may be hitting a weird bug (some sort of C++ bug). Can you try using a newer version of Java? If that doesn't help, can you try disabling the Intel inflater/deflater by using these arguments when running MarkDuplicates:
    --USE_JDK_DEFLATER
    --USE_JDK_INFLATER
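    Appended to the command from earlier in the thread, that would look something like this (in GATK4's argument syntax these booleans are usually given an explicit value; double-check against --help for your version):

    ```
    gatk MarkDuplicates \
        --INPUT ${b1} --INPUT ${b2} \
        --OUTPUT D06403.canfam3.aligned.unsorted.duplicates_marked.bam \
        --METRICS_FILE D06403.canfam3.duplicate_metrics \
        --USE_JDK_DEFLATER true \
        --USE_JDK_INFLATER true
    ```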

  • jonahcullen · Member

    Hi @Tiffany_at_Broad, thank you very much for that information - I will absolutely try that. I'm not sure if the following will help support that theory, but if I take a bam that made it past MarkDuplicates - one that appears okay but had the return code of 250 - and run it through BaseRecalibrator, I get the following stdout, and all the return codes are 0:

    Tool returned:
    28225319
    Tool returned:
    20861088
    Tool returned:
    27746126
    Tool returned:
    13981035
    Tool returned:
    16130505
    Tool returned:
    14229474
    Tool returned:
    14525442
    Tool returned:
    20417693
    Tool returned:
    29001339
    Tool returned:
    16685929
    Tool returned:
    19602263
    Tool returned:
    25380428
    Tool returned:
    17039842
    Tool returned:
    26102788
    Tool returned:
    19701276
    Tool returned:
    26988198
    Tool returned:
    21619150
    Tool returned:
    18039419
    Tool returned:
    19194606
    Tool returned:
    23042662
    Tool returned:
    18847334
    Tool returned:
    17235495
    Tool returned:
    12298821
    Tool returned:
    19725986
    Tool returned:
    23049870
    Tool returned:
    17743470
    Tool returned:
    28352083
    
  • Tiffany_at_Broad · Cambridge, MA · Member, Administrator, Broadie, Moderator, admin

    Hi @jonahcullen, the return value of the tool is the number of reads processed. If this is the output from multiple runs on different parts of the same file, then it is expected. If it is output from repeated runs on the same parts of the same file, that wouldn't be so good, because it would mean the tool is giving different results each time.

  • jonahcullen · Member

    Hi @Tiffany_at_Broad, I tried using a newer version of Java as well as the --USE_JDK_DEFLATER and --USE_JDK_INFLATER arguments, but unfortunately got the same 250 return code. It seems strange that it is just this step that is getting messed up. Also, even with the error, it seems to take a really long time, multiple days in some cases.

  • Tiffany_at_Broad · Cambridge, MA · Member, Administrator, Broadie, Moderator, admin

    Hi @jonahcullen ,
    Can you recreate the issue and provide the hs log file (file path beneath the core dump one), exact commands you're running (or just confirm it is the same WDL if already attached), along with any other error logs? Are you open to trying this out on Terra?

  • jonahcullen · Member

    Hi @Tiffany_at_Broad, I want to express my sincere gratitude for all the help you've provided attempting to solve this little issue.

    I am attaching the following:
    1. the complete (i.e. fastqs to gVCF) wdl file
    2. json
    3. cromwell.conf
    4. the stderr and stdout from MarkDuplicates (and stdout.background and stderr.background from running via job submission to an HPC)
    5. duplicate metrics

    I started the workflow with

    java -Dconfig.file=/scratch.global/friedlab_FrankFlow/gldr/D06271/cromwell.conf \
     -jar /home/fried255/cull0084/projects/gatk4_pipeline/cromwell-40.jar \
     run ./fastq-to-gvcf.wdl \
     -i ./fastq-to-gvcf.gldr.D06271.json
    

    I am using gatk-4.1.1.0.

    This particular iteration ran from fastqs to gVCF, albeit with a return code of 250 from the MarkDuplicates step and without producing an hs log file (I could run a different example with that output if needed). There was, however, a 16GB core.25133 dump.

    The java version was

    java version "1.8.0_101"
    Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
    Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)
    

    Although not for this particular sample, I have attempted to run this complete workflow using

    openjdk version "11.0.2" 2019-01-15
    OpenJDK Runtime Environment 18.9 (build 11.0.2+9)
    OpenJDK 64-Bit Server VM 18.9 (build 11.0.2+9, mixed mode)
    

    as well as with the --USE_JDK_DEFLATER and --USE_JDK_INFLATER options, both attempts resulted in the 250 return code and core dump.

  • Tiffany_at_Broad · Cambridge, MA · Member, Administrator, Broadie, Moderator, admin

    Hi @jonahcullen - I hope to dig into this more tomorrow. Have you considered using MarkDuplicatesSpark over MarkDuplicates ?

  • jonahcullen · Member

    Hi @Tiffany_at_Broad, I have not tried MarkDuplicatesSpark but I am willing to try!

  • Tiffany_at_Broad · Cambridge, MA · Member, Administrator, Broadie, Moderator, admin
    edited October 25

    Hi @jonahcullen , sorry for not getting back to you earlier. I hope you were able to make progress, either troubleshooting this or running with MarkDuplicatesSpark. I checked out your Cromwell config file and all looked good. You could check the RC.txt file in the execution directory to see how Cromwell worked out what the return code for the task was. We still think MarkDuplicates is hitting some sort of C++ issue on your HPC related to how the node itself is configured. You could try using Docker with this pipeline or replacing the tool with MarkDuplicatesSpark. Here's a useful doc. I hope this helps!
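    A quick way to inspect what Cromwell recorded is to read the return-code file it writes in the call's execution directory; the sketch below simulates that with a temp directory (the real path and the file name on your backend are assumptions to verify):

    ```shell
    # Simulate a call's execution directory and the return code Cromwell recorded.
    exec_dir=$(mktemp -d)          # stands in for .../call-MarkDuplicates/execution
    echo 250 > "$exec_dir/rc"      # what Cromwell recorded for the failing task
    rc=$(cat "$exec_dir/rc")
    if [ "$rc" -ne 0 ]; then
      echo "Cromwell recorded return code $rc for this call"
    fi
    ```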

    Post edited by Tiffany_at_Broad on
  • jonahcullen · Member

    Hi @Tiffany_at_Broad, thank you for checking in! It's the weirdest thing but for the past few weeks, MarkDuplicates appears to be working - at least based on seeing 0 return codes. I started running samples in a new directory in scratch. It is still taking an excessive amount of time - anywhere from 2 days up to 10 days for some samples...I may try MarkDuplicatesSpark and see if that is a viable option for us.

  • Tiffany_at_Broad · Cambridge, MA · Member, Administrator, Broadie, Moderator, admin

    Awesome to hear! Good luck!
