DenoiseReadCounts: Sample does not have a non-negative sample median

Hi! I tried to run DenoiseReadCounts and I got the following error:

08:16:50.133 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/build/libs/gatk-package-4.0.3.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
08:16:50.531 INFO DenoiseReadCounts - ------------------------------------------------------------
08:16:50.532 INFO DenoiseReadCounts - The Genome Analysis Toolkit (GATK) v4.0.3.0
08:16:50.532 INFO DenoiseReadCounts - For support and documentation go to https://software.broadinstitute.org/gatk/
08:16:50.533 INFO DenoiseReadCounts - Executing as [email protected]_2 on Linux v4.4.0-97-generic amd64
08:16:50.534 INFO DenoiseReadCounts - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11
08:16:50.535 INFO DenoiseReadCounts - Start Date/Time: April 2, 2018 8:16:50 AM UTC
08:16:50.535 INFO DenoiseReadCounts - ------------------------------------------------------------
08:16:50.535 INFO DenoiseReadCounts - ------------------------------------------------------------
08:16:50.537 INFO DenoiseReadCounts - HTSJDK Version: 2.14.3
08:16:50.537 INFO DenoiseReadCounts - Picard Version: 2.17.2
08:16:50.537 INFO DenoiseReadCounts - HTSJDK Defaults.COMPRESSION_LEVEL : 2
08:16:50.537 INFO DenoiseReadCounts - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
08:16:50.537 INFO DenoiseReadCounts - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
08:16:50.537 INFO DenoiseReadCounts - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
08:16:50.537 INFO DenoiseReadCounts - Deflater: IntelDeflater
08:16:50.537 INFO DenoiseReadCounts - Inflater: IntelInflater
08:16:50.537 INFO DenoiseReadCounts - GCS max retries/reopens: 20
08:16:50.537 INFO DenoiseReadCounts - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
08:16:50.538 WARN DenoiseReadCounts -

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Warning: DenoiseReadCounts is a BETA tool and is not yet ready for use in production

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

08:16:50.538 INFO DenoiseReadCounts - Initializing engine
08:16:50.538 INFO DenoiseReadCounts - Done initializing engine
log4j:WARN No appenders could be found for logger (org.broadinstitute.hdf5.HDF5Library).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
08:16:50.633 INFO DenoiseReadCounts - Reading read-counts file (SM2_counts.hdf5)...
08:16:53.203 INFO DenoiseReadCounts - No GC-content annotations for intervals found; explicit GC-bias correction will not be performed...
08:16:53.203 WARN DenoiseReadCounts - Neither a panel of normals nor GC-content annotations were provided, so only standardization will be performed...
08:16:55.244 INFO SVDDenoisingUtils - Preprocessing read counts...
08:16:55.244 INFO SVDDenoisingUtils - Transforming read counts to fractional coverage...
08:16:55.331 INFO SVDDenoisingUtils - Sample read counts preprocessed.
08:16:55.332 INFO SVDDenoisingUtils - Standardizing read counts...
08:16:55.332 INFO SVDDenoisingUtils - Dividing by sample medians and transforming to log2 space...
08:16:55.707 INFO DenoiseReadCounts - Shutting down engine
[April 2, 2018 8:16:55 AM UTC] org.broadinstitute.hellbender.tools.copynumber.DenoiseReadCounts done. Elapsed time: 0.09 minutes.
Runtime.totalMemory()=2544893952
java.lang.IllegalArgumentException: Sample does not have a non-negative sample median.
at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:681)
at org.broadinstitute.hellbender.utils.param.ParamUtils.isPositive(ParamUtils.java:165)
at org.broadinstitute.hellbender.tools.copynumber.denoising.SVDDenoisingUtils.lambda$divideBySampleMedianAndTransformToLog2$26(SVDDenoisingUtils.java:481)
at java.util.stream.Streams$RangeIntSpliterator.forEachRemaining(Streams.java:110)
at java.util.stream.IntPipeline$Head.forEach(IntPipeline.java:557)
at org.broadinstitute.hellbender.tools.copynumber.denoising.SVDDenoisingUtils.divideBySampleMedianAndTransformToLog2(SVDDenoisingUtils.java:480)
at org.broadinstitute.hellbender.tools.copynumber.denoising.SVDDenoisingUtils.preprocessAndStandardizeSample(SVDDenoisingUtils.java:169)
at org.broadinstitute.hellbender.tools.copynumber.DenoiseReadCounts.doWork(DenoiseReadCounts.java:219)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:134)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:179)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:198)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
Using GATK jar /gatk/build/libs/gatk-package-4.0.3.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/build/libs/gatk-package-4.0.3.0-local.jar DenoiseReadCounts -I SM2_counts.hdf5 --standardized-copy-ratios SM2_standardizedCR.tsv --denoised-copy-ratios SM2_denoisedCR.tsv

I am not using a PoN or annotated intervals (although I did try AnnotateIntervals with wgs_calling_regions_hg38.interval_list from the GATK resource bundle and the hg38 FASTA file).

Any thoughts?
Thanks!

Issue · GitHub #3034 (opened by Sheila; state: closed; closed by chandrans)

Answers

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @JoanGibert
    Hi,

    I have to check with the team and get back to you.

    -Sheila

  • slee (Member, Broadie, Dev ✭✭✭)

    @JoanGibert That error indicates that the median of the counts across your intervals is not positive (most likely, it is zero). You can examine the counts by opening SM2_counts.hdf5 to check whether this is the case.

    Let's step back a bit: what are the bins/intervals you collected coverage over to generate SM2_counts.hdf5? Are you sure your BAM has coverage in those bins?

    You mention wgs_calling_regions_hg38.interval_list, but this is not an appropriate resource for CNV analyses. The bins/intervals that you pass to AnnotateIntervals should be identical to those you collect coverage on. You can create appropriate bins/intervals for both annotation and coverage collection using PreprocessIntervals; you may find the documentation for that tool helpful.
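    For example, one minimal way to look inside the counts file (a sketch assuming the standard HDF5 command-line tools, h5ls/h5dump, are installed on your system):

    # List the datasets stored in the counts file
    h5ls -r SM2_counts.hdf5
    # Dump the contents (can be large; page through it)
    h5dump SM2_counts.hdf5 | less

    If most of the per-interval counts you see there are zero, the sample median will be zero and DenoiseReadCounts will fail with exactly this error.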

  • JoanGibert (Member)

    Hi @slee,

    I used the GATK hg38 FASTA file to perform the mappings and, since I have WGS data, I ran PreprocessIntervals without any interval list, using the following command:

    gatk PreprocessIntervals -R Homo_sapiens_assembly38.fasta --bin-length 1000 --padding 0 -O preprocess_interval.interval_list

    As explained in the GATK workshop here: https://drive.google.com/open?id=1rL7-9zq_HvW9Fw-mjnTPCgOZSmQTV-SU

    OK regarding wgs_calling_regions_hg38.interval_list; I guess I misunderstood that point of the pipeline.

    So I should use preprocess_interval.interval_list for DenoiseReadCounts, right? And what about the non-negative median error?

    Best,
    Joan


  • slee (Member, Broadie, Dev ✭✭✭)
    edited April 2018

    Hi @JoanGibert,

    Yes, that PreprocessIntervals call will generate preprocess_interval.interval_list, which will consist of 1000bp bins across the genome. You should pass preprocess_interval.interval_list via -L to the coverage collection tool (CollectFragmentCounts or CollectReadCounts, depending on which version you are using). This will collect counts in these bins.

    If your BAM is indeed well covered and 1000bp is an appropriate bin size for your depth of coverage, then you should get sensible counts; you should check that this is the case.

    You may then pass the resulting counts to DenoiseReadCounts. Without a PoN or annotated intervals, however, you will only be able to perform a very basic standardization of the data. You can generate annotated intervals by passing preprocess_interval.interval_list to the AnnotateIntervals tool; using the resulting annotated intervals with DenoiseReadCounts will allow you to perform GC-bias correction as well.

    Hope this helps!
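    As a concrete sketch of that sequence (sample.bam and the output file names below are placeholders; option names follow the GATK 4.0.x CNV tutorial and may differ slightly between versions):

    # 1. Bin the genome into 1000bp intervals (as already done)
    gatk PreprocessIntervals -R Homo_sapiens_assembly38.fasta --bin-length 1000 --padding 0 -O preprocess_interval.interval_list
    # 2. Collect coverage in exactly those bins (CollectFragmentCounts in GATK 4.0.3)
    gatk CollectReadCounts -I sample.bam -L preprocess_interval.interval_list --interval-merging-rule OVERLAPPING_ONLY -O sample_counts.hdf5
    # 3. Annotate the same bins with GC content
    gatk AnnotateIntervals -R Homo_sapiens_assembly38.fasta -L preprocess_interval.interval_list --interval-merging-rule OVERLAPPING_ONLY -O annotated_intervals.tsv
    # 4. Standardize and denoise with explicit GC-bias correction
    gatk DenoiseReadCounts -I sample_counts.hdf5 --annotated-intervals annotated_intervals.tsv --standardized-copy-ratios sample_standardizedCR.tsv --denoised-copy-ratios sample_denoisedCR.tsv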

  • JoanGibert (Member)

    Hi @slee,

    My data is heavily duplicated and has low coverage (around 3x). Looking at the documentation and your answers, I guess the bin size may be causing the error, since I could have many bins with zero counts (as you mention in your first answer). Am I right?

    If this is the case, should I increase the bin size in order to increase the counts per bin?

    Thank you!

  • JoanGibert (Member)

    Hi @slee,

    When I run QC (and MarkDuplicates), I get around 80-90% duplicate reads, which I guess explains this coverage. OK, I will try increasing the bin size. Is there any way to know which bin size would be enough, or is it just trial and error?

    Thanks!
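    For a rough sense of the bin size needed, a back-of-envelope sketch (the ~150bp read length is an assumption, not stated in this thread):

    # expected reads per bin ≈ depth × bin_length / read_length
    #   at 3x coverage: 3 × 1000 / 150 ≈ 20 reads per 1000bp bin
    # with 80-90% of reads marked as duplicates, only ~2-4 usable reads
    # remain per bin, so many bins can easily end up with zero counts;
    # scaling the bin length up by roughly 10x (e.g. --bin-length 10000)
    # would be one starting point to test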
