GATK 4.1.4 DenoiseReadCounts: Sample intervals must be identical to the original intervals ...

Chip (Member, Broadie)

Hi

I've been getting failures in Terra from the most recent 2-CNV_Somatic_Pair workflow (copied from help-gatk/Somatic-CNVs-GATK4). The error message is:

18:58:30.763 INFO SVDDenoisingUtils - Validating sample intervals against original intervals used to build panel of normals...
18:58:31.210 INFO DenoiseReadCounts - Shutting down engine
[December 9, 2019 6:58:31 PM UTC] org.broadinstitute.hellbender.tools.copynumber.DenoiseReadCounts done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=1198522368
java.lang.IllegalArgumentException: Sample intervals must be identical to the original intervals used to build the panel of normals.
at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:725)
at org.broadinstitute.hellbender.tools.copynumber.denoising.SVDDenoisingUtils.denoise(SVDDenoisingUtils.java:119)

Historical forum posts about this error message traced it back to differences in interval lists, but my hunt for discrepant interval lists hasn't yet turned up a clue. I must be missing something.

This task runs on WGS data, so I modified the interval list to correspond to 1 kb bins across the genome, with blacklist intervals gs://gatk-best-practices/somatic-b37/CNV_and_centromere_blacklist.hg19.list.
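
For reference, that binning corresponds to a PreprocessIntervals call roughly like the sketch below (local file names assumed; the workflow makes the actual call and also passes a --padding argument):

gatk PreprocessIntervals \
    -R Homo_sapiens_assembly19.fasta \
    -L Homo_sapiens_assembly19.fasta.wgs_intervals.1_22.interval_list \
    -XL CNV_and_centromere_blacklist.hg19.list \
    --bin-length 1000 \
    --interval-merging-rule OVERLAPPING_ONLY \
    -O preprocessed.interval_list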

Relevant inputs to the PoN-building task 1-CNV_Somatic_Panel were:

  • blacklist_intervals
    gs://gatk-best-practices/somatic-b37/CNV_and_centromere_blacklist.hg19.list

  • intervals
    gs://fc-9c84e685-79f8-4d84-9e52-640943257a9b/reference/Homo_sapiens_assembly19.fasta.wgs_intervals.1_22.interval_list

which produced an output interval list file and a PoN:

  • preprocessed_intervals
    gs://fc-035f5652-acf7-4642-abb7-e8c10848c8ed/7615f132-7160-4ff8-a335-4c529790607b/CNVSomaticPanelWorkflow/5da4afe1-7342-4b3f-85cf-4343c6edd8fe/call-PreprocessIntervals/Homo_sapiens_assembly19.fasta.wgs_intervals.1_22.preprocessed.interval_list

  • read_count_pon
    gs://fc-035f5652-acf7-4642-abb7-e8c10848c8ed/7615f132-7160-4ff8-a335-4c529790607b/CNVSomaticPanelWorkflow/5da4afe1-7342-4b3f-85cf-4343c6edd8fe/call-CreateReadCountPanelOfNormals/attempt-3/REBC-WGS-do-gc.pon.hdf5

The relevant inputs to 2-CNV_Somatic_Pair were:

  • blacklist_intervals
    gs://gatk-best-practices/somatic-b37/CNV_and_centromere_blacklist.hg19.list

  • intervals
    gs://fc-035f5652-acf7-4642-abb7-e8c10848c8ed/7615f132-7160-4ff8-a335-4c529790607b/CNVSomaticPanelWorkflow/5da4afe1-7342-4b3f-85cf-4343c6edd8fe/call-PreprocessIntervals/Homo_sapiens_assembly19.fasta.wgs_intervals.1_22.preprocessed.interval_list

  • read_count_pon
    gs://fc-035f5652-acf7-4642-abb7-e8c10848c8ed/7615f132-7160-4ff8-a335-4c529790607b/CNVSomaticPanelWorkflow/5da4afe1-7342-4b3f-85cf-4343c6edd8fe/call-CreateReadCountPanelOfNormals/attempt-3/REBC-WGS-do-gc.pon.hdf5

which match the PoN-building intervals as far as I can tell.
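
In case it helps, the kind of comparison I have in mind is sketched below, assuming local copies of the two preprocessed interval lists (hypothetical file names) and the stock HDF5 command-line tools for peeking inside the PoN:

# compare the non-header rows of the two interval lists
diff <(grep -v '^@' pon_preprocessed.interval_list) \
     <(grep -v '^@' pair_preprocessed.interval_list)

# list the datasets stored in the PoN, to locate its copy of the intervals
h5dump -n REBC-WGS-do-gc.pon.hdf5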

I've been running this as task 2-CNV_Somatic_Pair_gatk414 in workspace rebc-oct16/rebc_analysis, a very old workspace that the Terra and GATK teams should already have access to. If not, let me know. An example failed job is e6f01225-5db5-4f61-99f8-23689a32d42f.

Thanks,

Chip

P.S.
The complete error message is:
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/cromwell_root/tmp.95676adf
18:58:29.302 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.1.4.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
18:58:29.522 INFO DenoiseReadCounts - ------------------------------------------------------------
18:58:29.523 INFO DenoiseReadCounts - The Genome Analysis Toolkit (GATK) v4.1.4.0
18:58:29.523 INFO DenoiseReadCounts - For support and documentation go to https://software.broadinstitute.org/gatk/
18:58:29.523 INFO DenoiseReadCounts - Executing as [email protected] on Linux v4.19.72+ amd64
18:58:29.524 INFO DenoiseReadCounts - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_212-8u212-b03-0ubuntu1.16.04.1-b03
18:58:29.524 INFO DenoiseReadCounts - Start Date/Time: December 9, 2019 6:58:29 PM UTC
18:58:29.524 INFO DenoiseReadCounts - ------------------------------------------------------------
18:58:29.524 INFO DenoiseReadCounts - ------------------------------------------------------------
18:58:29.525 INFO DenoiseReadCounts - HTSJDK Version: 2.20.3
18:58:29.525 INFO DenoiseReadCounts - Picard Version: 2.21.1
18:58:29.525 INFO DenoiseReadCounts - HTSJDK Defaults.COMPRESSION_LEVEL : 2
18:58:29.525 INFO DenoiseReadCounts - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
18:58:29.525 INFO DenoiseReadCounts - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
18:58:29.525 INFO DenoiseReadCounts - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
18:58:29.525 INFO DenoiseReadCounts - Deflater: IntelDeflater
18:58:29.525 INFO DenoiseReadCounts - Inflater: IntelInflater
18:58:29.525 INFO DenoiseReadCounts - GCS max retries/reopens: 20
18:58:29.525 INFO DenoiseReadCounts - Requester pays: disabled
18:58:29.525 INFO DenoiseReadCounts - Initializing engine
18:58:29.525 INFO DenoiseReadCounts - Done initializing engine
log4j:WARN No appenders could be found for logger (org.broadinstitute.hdf5.HDF5Library).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
18:58:29.603 INFO DenoiseReadCounts - Reading read-counts file (/cromwell_root/fc-035f5652-acf7-4642-abb7-e8c10848c8ed/91a0cdd9-276f-40de-817f-ee4055054c5f/CNVSomaticPairWorkflow/e6f01225-5db5-4f61-99f8-23689a32d42f/call-CollectCountsNormal/SC217007.counts.hdf5)...
18:58:30.763 INFO SVDDenoisingUtils - Validating sample intervals against original intervals used to build panel of normals...
18:58:31.210 INFO DenoiseReadCounts - Shutting down engine
[December 9, 2019 6:58:31 PM UTC] org.broadinstitute.hellbender.tools.copynumber.DenoiseReadCounts done. Elapsed time: 0.03 minutes.
Runtime.totalMemory()=1198522368
java.lang.IllegalArgumentException: Sample intervals must be identical to the original intervals used to build the panel of normals.
at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:725)
at org.broadinstitute.hellbender.tools.copynumber.denoising.SVDDenoisingUtils.denoise(SVDDenoisingUtils.java:119)
at org.broadinstitute.hellbender.tools.copynumber.denoising.SVDReadCountPanelOfNormals.denoise(SVDReadCountPanelOfNormals.java:88)
at org.broadinstitute.hellbender.tools.copynumber.DenoiseReadCounts.doWork(DenoiseReadCounts.java:200)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
at org.broadinstitute.hellbender.Main.main(Main.java:292)
Using GATK jar /root/gatk.jar defined in environment variable GATK_LOCAL_JAR
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx62000m -jar /root/gatk.jar DenoiseReadCounts --input /cromwell_root/fc-035f5652-acf7-4642-abb7-e8c10848c8ed/91a0cdd9-276f-40de-817f-ee4055054c5f/CNVSomaticPairWorkflow/e6f01225-5db5-4f61-99f8-23689a32d42f/call-CollectCountsNormal/SC217007.counts.hdf5 --count-panel-of-normals /cromwell_root/fc-035f5652-acf7-4642-abb7-e8c10848c8ed/7615f132-7160-4ff8-a335-4c529790607b/CNVSomaticPanelWorkflow/5da4afe1-7342-4b3f-85cf-4343c6edd8fe/call-CreateReadCountPanelOfNormals/attempt-3/REBC-WGS-do-gc.pon.hdf5 --standardized-copy-ratios SC217007.standardizedCR.tsv --denoised-copy-ratios SC217007.denoisedCR.tsv


Answers

  • slee (Member, Broadie, Dev)

    Hi @Chip, thanks for the detailed report! I'd be happy to take a look. Unfortunately, I don't think I have access to that workspace yet; mind sharing it with me?

  • slee (Member, Broadie, Dev)
    edited December 2019

    Actually, I think I see a possible source of error in your description. You'll want to make sure that the inputs to both blacklist_intervals and intervals are identical for both workflows, since both workflows use them to perform the PreprocessIntervals step (which is typically call-cached once you've run it in the PoN workflow).

    However, from your description, it looks like you passed the output of PreprocessIntervals in the PoN workflow (Homo_sapiens_assembly19.fasta.wgs_intervals.1_22.preprocessed.interval_list) to the intervals input in the pair workflow, instead of the original intervals passed to the PoN workflow (Homo_sapiens_assembly19.fasta.wgs_intervals.1_22.interval_list). Can you confirm?
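
    A quick way to check which list you passed is to count the non-header rows of each list; the preprocessed list is binned, so it will have far more rows than the original. A sketch, assuming local copies of the files:

    grep -vc '^@' Homo_sapiens_assembly19.fasta.wgs_intervals.1_22.interval_list
    grep -vc '^@' Homo_sapiens_assembly19.fasta.wgs_intervals.1_22.preprocessed.interval_list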

  • Chip (Member, Broadie)

    Hi Sam,

    I tried many variations on this theme before I submitted the post; they were not included in the original post (it was lengthy already). All of these variations resulted in the same error message, including jobs in which the inputs to both blacklist_intervals and intervals were identical for both workflows. In the post I only described the most recent variation.

    An example of a job variation consistent with your suggestion is Workflow ID:5a1ae923-a622-4c24-9384-606d0fd7d593:

    inputs to the PoN building task 1-CNV_Somatic_Panel were:

    • blacklist_intervals
      gs://gatk-best-practices/somatic-b37/CNV_and_centromere_blacklist.hg19.list

    • intervals
      gs://fc-9c84e685-79f8-4d84-9e52-640943257a9b/reference/Homo_sapiens_assembly19.fasta.wgs_intervals.1_22.interval_list

    inputs to 2-CNV_Somatic_Pair:

    • blacklist_intervals
      gs://gatk-best-practices/somatic-b37/CNV_and_centromere_blacklist.hg19.list

    • intervals
      gs://fc-9c84e685-79f8-4d84-9e52-640943257a9b/reference/Homo_sapiens_assembly19.fasta.wgs_intervals.1_22.interval_list

    This variation resulted in an error message from task CNVSomaticPairWorkflow.DenoiseReadCountsNormal:

    Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/cromwell_root/tmp.1e8c618c
    14:26:50.529 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.1.4.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
    14:26:50.734 INFO DenoiseReadCounts - ------------------------------------------------------------
    14:26:50.735 INFO DenoiseReadCounts - The Genome Analysis Toolkit (GATK) v4.1.4.0
    14:26:50.735 INFO DenoiseReadCounts - For support and documentation go to https://software.broadinstitute.org/gatk/
    14:26:50.736 INFO DenoiseReadCounts - Executing as [email protected] on Linux v4.19.72+ amd64
    14:26:50.736 INFO DenoiseReadCounts - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_212-8u212-b03-0ubuntu1.16.04.1-b03
    14:26:50.736 INFO DenoiseReadCounts - Start Date/Time: December 11, 2019 2:26:50 PM UTC
    14:26:50.736 INFO DenoiseReadCounts - ------------------------------------------------------------
    14:26:50.736 INFO DenoiseReadCounts - ------------------------------------------------------------
    14:26:50.737 INFO DenoiseReadCounts - HTSJDK Version: 2.20.3
    14:26:50.737 INFO DenoiseReadCounts - Picard Version: 2.21.1
    14:26:50.737 INFO DenoiseReadCounts - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    14:26:50.737 INFO DenoiseReadCounts - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    14:26:50.737 INFO DenoiseReadCounts - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    14:26:50.737 INFO DenoiseReadCounts - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    14:26:50.737 INFO DenoiseReadCounts - Deflater: IntelDeflater
    14:26:50.738 INFO DenoiseReadCounts - Inflater: IntelInflater
    14:26:50.738 INFO DenoiseReadCounts - GCS max retries/reopens: 20
    14:26:50.738 INFO DenoiseReadCounts - Requester pays: disabled
    14:26:50.738 INFO DenoiseReadCounts - Initializing engine
    14:26:50.738 INFO DenoiseReadCounts - Done initializing engine
    log4j:WARN No appenders could be found for logger (org.broadinstitute.hdf5.HDF5Library).
    log4j:WARN Please initialize the log4j system properly.
    log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
    14:26:50.829 INFO DenoiseReadCounts - Reading read-counts file (/cromwell_root/fc-035f5652-acf7-4642-abb7-e8c10848c8ed/cfdc1520-c642-499c-a010-5a6ea8abf0d1/CNVSomaticPairWorkflow/5a1ae923-a622-4c24-9384-606d0fd7d593/call-CollectCountsNormal/SC217007.counts.hdf5)...
    14:26:51.988 INFO SVDDenoisingUtils - Validating sample intervals against original intervals used to build panel of normals...
    14:26:52.419 INFO DenoiseReadCounts - Shutting down engine
    [December 11, 2019 2:26:52 PM UTC] org.broadinstitute.hellbender.tools.copynumber.DenoiseReadCounts done. Elapsed time: 0.03 minutes.
    Runtime.totalMemory()=1200095232
    java.lang.IllegalArgumentException: Sample intervals must be identical to the original intervals used to build the panel of normals.
    at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:725)
    at org.broadinstitute.hellbender.tools.copynumber.denoising.SVDDenoisingUtils.denoise(SVDDenoisingUtils.java:119)
    at org.broadinstitute.hellbender.tools.copynumber.denoising.SVDReadCountPanelOfNormals.denoise(SVDReadCountPanelOfNormals.java:88)
    at org.broadinstitute.hellbender.tools.copynumber.DenoiseReadCounts.doWork(DenoiseReadCounts.java:200)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
    at org.broadinstitute.hellbender.Main.main(Main.java:292)
    Using GATK jar /root/gatk.jar defined in environment variable GATK_LOCAL_JAR
    Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx62000m -jar /root/gatk.jar DenoiseReadCounts --input /cromwell_root/fc-035f5652-acf7-4642-abb7-e8c10848c8ed/cfdc1520-c642-499c-a010-5a6ea8abf0d1/CNVSomaticPairWorkflow/5a1ae923-a622-4c24-9384-606d0fd7d593/call-CollectCountsNormal/SC217007.counts.hdf5 --count-panel-of-normals /cromwell_root/fc-035f5652-acf7-4642-abb7-e8c10848c8ed/7615f132-7160-4ff8-a335-4c529790607b/CNVSomaticPanelWorkflow/5da4afe1-7342-4b3f-85cf-4343c6edd8fe/call-CreateReadCountPanelOfNormals/attempt-3/REBC-WGS-do-gc.pon.hdf5 --standardized-copy-ratios SC217007.standardizedCR.tsv --denoised-copy-ratios SC217007.denoisedCR.tsv

    Let me know if you have questions or suggestions!

    Thanks

  • slee (Member, Broadie, Dev)
    edited December 2019

    @Chip it looks like you may have set --padding 250 in the pair workflow and --padding 0 in the PoN workflow. This causes the results of PreprocessIntervals to differ across the workflows (intervals will be padded into the blacklisted regions prior to binning in the former), ultimately leading to the message you see in DenoiseReadCounts. Can you try setting --padding 0 in your pair workflow and let me know if that resolves things?
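
    In Terra this should just be a matter of setting the pair workflow's padding input, along these lines (the fully qualified input name is my best guess from the gatk4-somatic-cnvs WDL; please verify against your copy):

    "CNVSomaticPairWorkflow.padding": 0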

  • Chip (Member, Broadie)

    I'll try that suggestion (not yet among the tested variations). Seems plausible!

  • Chip (Member, Broadie)

    Hi Sam,

    Setting --padding 0 solved the mismatched-intervals failure. Thanks for spotting the inconsistency.

    The reason I've been testing the latest GATK CNV pipeline is an over-segmentation issue in an older 2-CNV_Somatic_Pair workflow (based on us.gcr.io/broad-gatk/gatk:4.1.*), apparent in ~20 of 400 WGS samples in the REBC cohort. One tumor sample (AF8T) had severe coverage dropouts resulting in more than 10k tiny segments. Other over-segmented tumors didn't have an obvious coverage-normalization problem but were over-segmented anyway. Chromosomes with arm-level SCNAs seem to have a much higher propensity to be over-segmented than chromosomes without them. I had expected that a larger PoN would remedy coverage-normalization issues, so I tried a PoN with more than 400 normals, but that didn't put much of a dent in the problem.

    I then tried varying num_changepoints_penalty_factor (default 1.0 -> 2.0) and max_num_segments_per_chromosome (default 1000 -> 100); the corresponding GATK arguments are sketched below. The modified values appear to clearly reduce over-segmentation. But thyroid cancer is known to have very few SCNAs, and simply dialing down the number of segments by raising the penalty and lowering the per-chromosome maximum isn't really an optimization. What do you think?
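
    As far as I can tell, those WDL inputs map onto the following ModelSegments arguments (a sketch; argument names as I understand them in GATK 4.1.4, run on the denoised copy ratios from the pair workflow):

    gatk ModelSegments \
        --denoised-copy-ratios SC217007.denoisedCR.tsv \
        --number-of-changepoints-penalty-factor 2.0 \
        --maximum-number-of-segments-per-chromosome 100 \
        --output out \
        --output-prefix SC217007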

    Last year I was working with Lee to test the GATK4 2-CNV_Somatic_Pair workflow on REBC data and I was happy with most of the tumors, but these ~20 over-segmented tumors are tougher nuts to crack.

    We can take this thread out of the GATK forum or leave it here, whatever you suggest.

    Chip
