
4.1.4.0 germline copy number - missing denoising_config.json file?

rcorbett (BC Cancer Agency, Member)
edited November 2019 in Ask the GATK team

Hi all,

I have a cohort of germline genome samples that I am analyzing by following the steps in this tutorial:
https://gatkforums.broadinstitute.org/gatk/discussion/11684/how-to-call-common-and-rare-germline-copy-number-variants

I am running the following commands:

/gatk-4.1.4.0/gatk PreprocessIntervals -R hg19a.fa --bin-length 500 --interval-merging-rule OVERLAPPING_ONLY -O hg19a.interval_list

#This is run on ~200 germline genome BAM files
/gatk-4.1.4.0/gatk CollectReadCounts  -I myBam -L hg19a.interval_list --interval-merging-rule OVERLAPPING_ONLY -O myBam.hdf5

#Annotate and filter the interval list
/home/rcorbett/bin/gatk-4.1.4.0/gatk AnnotateIntervals -L hg19a.interval_list -R hg19a.fa -imr OVERLAPPING_ONLY -O hg19a.interval_list.annotated
/home/rcorbett/bin/gatk-4.1.4.0/gatk FilterIntervals -L hg19a.interval_list -imr OVERLAPPING_ONLY --annotated-intervals hg19a.interval_list.annotated -I myBam.hdf5 -I P01447_1_lane_dupsFlagged.bam.hdf5 -I myBam2.hdf5 -I myBam3.hdf5 -I myBam4.hdf5 ... -O hg19a.interval_list.annotated.filtered

#Get the ploidy estimates
gatk-4.1.4.0/gatk DetermineGermlineContigPloidy -L hg19a.interval_list.annotated.filtered.list -imr OVERLAPPING_ONLY --contig-ploidy-priors ploidy_priors.tsv --output . --output-prefix ploidy --verbosity DEBUG -I myBam.hdf5 -I myBam2.hdf5 -I myBam2.hdf5 ...

#Get the ploidy estimates for a single sample
gatk-4.1.4.0/gatk DetermineGermlineContigPloidy --model ploidy-model -I myBam2.hdf5 --output . --output-prefix ploidy-case --verbosity DEBUG

The last command listed above gives me an error before failing. Here's the full trace (with some IDs and paths changed):

Using GATK jar /gatk-4.1.4.0/gatk-package-4.1.4.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/rcorbett/bin/gatk-4.1.4.0/gatk-package-4.1.4.0-local.jar GermlineCNVCaller --run-mode CASE --model ploidy-model -I P01447_1_lane_dupsFlagged.bam.hdf5 --contig-ploidy-calls polidy-case-calls --output case_output --output-prefix case_calls --verbosity DEBUG
13:17:59.610 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/rcorbett/bin/gatk-4.1.4.0/gatk-package-4.1.4.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
13:17:59.624 DEBUG NativeLibraryLoader - Extracting libgkl_compression.so to /tmp/libgkl_compression2553444245973287106.so
13:17:59.885 INFO GermlineCNVCaller - ------------------------------------------------------------
13:17:59.885 INFO GermlineCNVCaller - The Genome Analysis Toolkit (GATK) v4.1.4.0
13:17:59.885 INFO GermlineCNVCaller - For support and documentation go to https://software.broadinstitute.org/gatk/
13:17:59.886 INFO GermlineCNVCaller - Executing as [email protected] on Linux v2.6.32-573.8.1.el6.x86_64 amd64
13:17:59.886 INFO GermlineCNVCaller - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_66-b17
13:17:59.886 INFO GermlineCNVCaller - Start Date/Time: November 26, 2019 1:17:59 PM PST
13:17:59.886 INFO GermlineCNVCaller - ------------------------------------------------------------
13:17:59.886 INFO GermlineCNVCaller - ------------------------------------------------------------
13:17:59.887 INFO GermlineCNVCaller - HTSJDK Version: 2.20.3
13:17:59.887 INFO GermlineCNVCaller - Picard Version: 2.21.1
13:17:59.889 INFO GermlineCNVCaller - HTSJDK Defaults.BUFFER_SIZE : 131072
13:17:59.889 INFO GermlineCNVCaller - HTSJDK Defaults.COMPRESSION_LEVEL : 2
13:17:59.889 INFO GermlineCNVCaller - HTSJDK Defaults.CREATE_INDEX : false
13:17:59.889 INFO GermlineCNVCaller - HTSJDK Defaults.CREATE_MD5 : false
13:17:59.889 INFO GermlineCNVCaller - HTSJDK Defaults.CUSTOM_READER_FACTORY :
13:17:59.889 INFO GermlineCNVCaller - HTSJDK Defaults.DISABLE_SNAPPY_COMPRESSOR : false
13:17:59.889 INFO GermlineCNVCaller - HTSJDK Defaults.EBI_REFERENCE_SERVICE_URL_MASK : https://www.ebi.ac.uk/ena/cram/md5/%s
13:17:59.890 INFO GermlineCNVCaller - HTSJDK Defaults.NON_ZERO_BUFFER_SIZE : 131072
13:17:59.890 INFO GermlineCNVCaller - HTSJDK Defaults.REFERENCE_FASTA : null
13:17:59.890 INFO GermlineCNVCaller - HTSJDK Defaults.SAM_FLAG_FIELD_FORMAT : DECIMAL
13:17:59.890 INFO GermlineCNVCaller - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
13:17:59.890 INFO GermlineCNVCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
13:17:59.890 INFO GermlineCNVCaller - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
13:17:59.890 INFO GermlineCNVCaller - HTSJDK Defaults.USE_CRAM_REF_DOWNLOAD : false
13:17:59.890 DEBUG ConfigFactory - Configuration file values:
13:17:59.896 DEBUG ConfigFactory - gcsMaxRetries = 20
13:17:59.897 DEBUG ConfigFactory - gcsProjectForRequesterPays =
13:17:59.897 DEBUG ConfigFactory - gatk_stacktrace_on_user_exception = false
13:17:59.897 DEBUG ConfigFactory - samjdk.use_async_io_read_samtools = false
13:17:59.897 DEBUG ConfigFactory - samjdk.use_async_io_write_samtools = true
13:17:59.897 DEBUG ConfigFactory - samjdk.use_async_io_write_tribble = false
13:17:59.897 DEBUG ConfigFactory - samjdk.compression_level = 2
13:17:59.897 DEBUG ConfigFactory - spark.kryoserializer.buffer.max = 512m
13:17:59.897 DEBUG ConfigFactory - spark.driver.maxResultSize = 0
13:17:59.897 DEBUG ConfigFactory - spark.driver.userClassPathFirst = true
13:17:59.897 DEBUG ConfigFactory - spark.io.compression.codec = lzf
13:17:59.897 DEBUG ConfigFactory - spark.executor.memoryOverhead = 600
13:17:59.897 DEBUG ConfigFactory - spark.driver.extraJavaOptions =
13:17:59.897 DEBUG ConfigFactory - spark.executor.extraJavaOptions =
13:17:59.897 DEBUG ConfigFactory - codec_packages = [htsjdk.variant, htsjdk.tribble, org.broadinstitute.hellbender.utils.codecs]
13:17:59.898 DEBUG ConfigFactory - read_filter_packages = [org.broadinstitute.hellbender.engine.filters]
13:17:59.898 DEBUG ConfigFactory - annotation_packages = [org.broadinstitute.hellbender.tools.walkers.annotator]
13:17:59.898 DEBUG ConfigFactory - cloudPrefetchBuffer = 40
13:17:59.898 DEBUG ConfigFactory - cloudIndexPrefetchBuffer = -1
13:17:59.898 DEBUG ConfigFactory - createOutputBamIndex = true
13:17:59.898 INFO GermlineCNVCaller - Deflater: IntelDeflater
13:17:59.898 INFO GermlineCNVCaller - Inflater: IntelInflater
13:17:59.898 INFO GermlineCNVCaller - GCS max retries/reopens: 20
13:17:59.898 INFO GermlineCNVCaller - Requester pays: disabled
13:17:59.898 INFO GermlineCNVCaller - Initializing engine
13:17:59.903 DEBUG ScriptExecutor - Executing:
13:17:59.903 DEBUG ScriptExecutor - python
13:17:59.903 DEBUG ScriptExecutor - -c
13:17:59.903 DEBUG ScriptExecutor - import gcnvkernel

/miniconda3/envs/gatk/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters
13:18:11.641 DEBUG ScriptExecutor - Result: 0
13:18:11.641 INFO GermlineCNVCaller - Done initializing engine
13:18:19.503 INFO GermlineCNVCaller - Running the tool in CASE mode...
13:18:19.504 INFO GermlineCNVCaller - Validating and aggregating data from input read-count files...
13:18:20.588 INFO GermlineCNVCaller - Aggregating read-count file myBam2.hdf5 (1 / 1)
log4j:WARN No appenders could be found for logger (org.broadinstitute.hdf5.HDF5Library).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
13:18:30.759 DEBUG ScriptExecutor - Executing:
13:18:30.759 DEBUG ScriptExecutor - python
13:18:30.759 DEBUG ScriptExecutor - /tmp/case_denoising_calling.6711074121962911645.py
13:18:30.759 DEBUG ScriptExecutor - --ploidy_calls_path=ploidy-case-calls
13:18:30.759 DEBUG ScriptExecutor - --output_calls_path=/case_output/case_calls-calls
13:18:30.759 DEBUG ScriptExecutor - --output_tracking_path=/case_output/case_calls-tracking
13:18:30.759 DEBUG ScriptExecutor - --input_model_path=/ploidy-model
13:18:30.759 DEBUG ScriptExecutor - --read_count_tsv_files
13:18:30.759 DEBUG ScriptExecutor - /tmp/sample-01045833658862224981.tsv
13:18:30.759 DEBUG ScriptExecutor - --psi_s_scale=1.000000e-04
13:18:30.759 DEBUG ScriptExecutor - --mapping_error_rate=1.000000e-02
13:18:30.765 DEBUG ScriptExecutor - --depth_correction_tau=1.000000e+04
13:18:30.765 DEBUG ScriptExecutor - --q_c_expectation_mode=hybrid
13:18:30.765 DEBUG ScriptExecutor - --p_alt=1.000000e-06
13:18:30.765 DEBUG ScriptExecutor - --cnv_coherence_length=1.000000e+04
13:18:30.765 DEBUG ScriptExecutor - --max_copy_number=5
13:18:30.765 DEBUG ScriptExecutor - --learning_rate=1.000000e-02
13:18:30.765 DEBUG ScriptExecutor - --adamax_beta1=9.000000e-01
13:18:30.765 DEBUG ScriptExecutor - --adamax_beta2=9.900000e-01
13:18:30.765 DEBUG ScriptExecutor - --log_emission_samples_per_round=50
13:18:30.765 DEBUG ScriptExecutor - --log_emission_sampling_rounds=10
13:18:30.765 DEBUG ScriptExecutor - --log_emission_sampling_median_rel_error=5.000000e-03
13:18:30.765 DEBUG ScriptExecutor - --max_advi_iter_first_epoch=5000
13:18:30.765 DEBUG ScriptExecutor - --max_advi_iter_subsequent_epochs=200
13:18:30.765 DEBUG ScriptExecutor - --min_training_epochs=10
13:18:30.765 DEBUG ScriptExecutor - --max_training_epochs=50
13:18:30.765 DEBUG ScriptExecutor - --initial_temperature=1.500000e+00
13:18:30.765 DEBUG ScriptExecutor - --num_thermal_advi_iters=2500
13:18:30.765 DEBUG ScriptExecutor - --convergence_snr_averaging_window=500
13:18:30.765 DEBUG ScriptExecutor - --convergence_snr_trigger_threshold=1.000000e-01
13:18:30.765 DEBUG ScriptExecutor - --convergence_snr_countdown_window=10
13:18:30.765 DEBUG ScriptExecutor - --max_calling_iters=10
13:18:30.765 DEBUG ScriptExecutor - --caller_update_convergence_threshold=1.000000e-03
13:18:30.765 DEBUG ScriptExecutor - --caller_internal_admixing_rate=7.500000e-01
13:18:30.765 DEBUG ScriptExecutor - --caller_external_admixing_rate=1.000000e+00
13:18:30.765 DEBUG ScriptExecutor - --disable_caller=false
13:18:30.765 DEBUG ScriptExecutor - --disable_sampler=false
13:18:30.765 DEBUG ScriptExecutor - --disable_annealing=false
13:18:40.525 INFO root - Loading modeling interval list from the provided model...
13:19:13.378 INFO root - The model contains 5631393 intervals and 24 contig(s)
13:19:13.379 INFO root - Loading 1 read counts file(s)...
13:20:06.039 INFO gcnvkernel.io.io_metadata - Loading germline contig ploidy and global read depth metadata...
13:20:06.080 INFO root - Loading denoising model configuration from the provided model...
miniconda3/envs/gatk/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters
Traceback (most recent call last):
  File "/tmp/case_denoising_calling.6711074121962911645.py", line 176, in <module>
    update_args_dict_from_saved_model(args.input_model_path, args_dict)
  File "/tmp/case_denoising_calling.6711074121962911645.py", line 106, in update_args_dict_from_saved_model
    with open(os.path.join(input_model_path, "denoising_config.json"), 'r') as fp:
FileNotFoundError: [Errno 2] No such file or directory: '/ploidy-model/denoising_config.json'
13:20:13.878 DEBUG ScriptExecutor - Result: 1
13:20:13.880 INFO GermlineCNVCaller - Shutting down engine
[November 26, 2019 1:20:13 PM PST] org.broadinstitute.hellbender.tools.copynumber.GermlineCNVCaller done. Elapsed time: 2.24 minutes.
Runtime.totalMemory()=5160566784
org.broadinstitute.hellbender.utils.python.PythonScriptExecutorException:
python exited with 1
......

In the supplied example data there is a denoising_config.json file in each sample's model folder, but I don't get one in the folders created by the commands above. Is that the source of my error?

Answers

  • slee (Member, Broadie, Dev ✭✭✭)

    Looks like you might be trying to run GermlineCNVCaller with the model generated by DetermineGermlineContigPloidy, which is incorrect. Is it actually the last command above (DetermineGermlineContigPloidy in CASE mode) which is causing issues, or is it GermlineCNVCaller?

  • rcorbett (BC Cancer Agency, Member)

    Thanks Slee,
    It is the GermlineCNVCaller command that is giving me the error. I didn't list it in my text, but it is the command from which all of the output above came.

    I am following the tutorial at this location:
    https://gatkforums.broadinstitute.org/gatk/discussion/11684/how-to-call-common-and-rare-germline-copy-number-variants
    In which it says "The ploidy-model directory contains aggregated model data for the cohort. This is the model to provide to a case-mode DetermineGermlineContigPloidy analysis and to GermlineCNVCaller. The tutorial ploidy-model directory contains the eight files as follows."

  • slee (Member, Broadie, Dev ✭✭✭)

    @rcorbett that looks like a typo, thanks for bringing it to our attention. You do not need to pass the ploidy model to GermlineCNVCaller, but you do need to pass the ploidy calls. The command lines in the tutorial should be correct, though.
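
    For reference, a corrected CASE-mode invocation might look like the sketch below. The directory name cnv-cohort-model is a placeholder for the model directory produced by a prior GermlineCNVCaller COHORT run, not a path from this thread.

```
# Sketch only: --contig-ploidy-calls takes the ploidy *calls* directory,
# while --model takes a model trained by GermlineCNVCaller itself
# (not the DetermineGermlineContigPloidy model).
gatk GermlineCNVCaller \
    --run-mode CASE \
    -I myBam.hdf5 \
    --contig-ploidy-calls ploidy-case-calls \
    --model cnv-cohort-model \
    --output case_output \
    --output-prefix case_calls
```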

  • rcorbett (BC Cancer Agency, Member)

    Hi @slee,

    Thanks for picking this up; however, I'm still trying to get this running. Using the commands in the description, everything runs without errors up to the following step. But when I run this command:

    /home/rcorbett/bin/gatk-4.1.4.0/gatk GermlineCNVCaller --run-mode CASE -I P01447_1_lane_dupsFlagged.bam.hdf5 --contig-ploidy-calls polidy-case-calls --model ploidy-model --output case_output --output-prefix 1a --verbosity DEBUG

    It throws an error. I posted the entire trace in the description, but the only specific error I see is:
    FileNotFoundError: [Errno 2] No such file or directory: '/projects/rcorbettprj2/clinicalGermline/cnv/ploidy-model/denoising_config.json'

    I am basing my GermlineCNVCaller command on the tutorial, and if I don't supply the --model parameter I get the following error:
    "java.lang.IllegalArgumentException: An input denoising-model directory must be provided in CASE mode."

    When I download the tutorial data I see that the ploidy model folders include a "denoising_config.json" file, but it doesn't get created with any commands that I run.

  • slee (Member, Broadie, Dev ✭✭✭)

    Hi @rcorbett,

    I think you are still running your GermlineCNVCaller command incorrectly. You include the option --model ploidy-model and pass the directory containing the model generated by running DetermineGermlineContigPloidy in COHORT mode (instead of that generated by running GermlineCNVCaller in COHORT mode). The GermlineCNVCaller model directory indeed contains the denoising_config.json file that the error message references, but the DetermineGermlineContigPloidy model directory does not.

    The idea is that both tools can be run in COHORT mode to generate models that are subsequently used in CASE mode. However, the tools generate different models, and each model needs to be used with the corresponding tool that generated it.
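
    A quick way to tell the two model directories apart is to check for the denoising_config.json file named in the traceback: a GermlineCNVCaller model directory contains it, while a DetermineGermlineContigPloidy model directory does not. A minimal sketch, using hypothetical directory names (demo/ploidy-model and demo/cnv-model are made up for illustration):

```shell
# Hypothetical layout: 'demo/ploidy-model' stands in for a
# DetermineGermlineContigPloidy model directory, 'demo/cnv-model' for a
# GermlineCNVCaller model directory (which contains denoising_config.json).
mkdir -p demo/ploidy-model demo/cnv-model
touch demo/cnv-model/denoising_config.json

# Report which directories are safe to pass to GermlineCNVCaller --model.
for d in demo/ploidy-model demo/cnv-model; do
  if [ -f "$d/denoising_config.json" ]; then
    echo "$d: looks like a GermlineCNVCaller model (OK for --model)"
  else
    echo "$d: no denoising_config.json (likely a ploidy model)"
  fi
done
```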

  • rcorbett (BC Cancer Agency, Member)

    I see. Just to be sure: you recommend running DetermineGermlineContigPloidy in CASE mode, followed by GermlineCNVCaller in CASE mode. I'll give that a shot. Thanks!

  • rcorbett (BC Cancer Agency, Member)

    Hmmm. So if I try to run DetermineGermlineContigPloidy and GermlineCNVCaller in CASE mode, I need to provide "a_valid_ploidy_model_dir" at the DetermineGermlineContigPloidy step. How would you recommend I create this? Earlier I thought I needed to make it with the COHORT mode command, but now I'm not sure how to do it.

    I'm reading through the docs at
    https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_copynumber_DetermineGermlineContigPloidy.php

    Where it says I need to supply "previously inferred ploidy model parameters"

    Thanks,
    Richard

    P.S. I am working on a project with hundreds of germline genomes that will trickle in. Would you process each as its own CASE, or is it better to re-run the whole COHORT each time a sample is added?

  • slee (Member, Broadie, Dev ✭✭✭)

    @rcorbett The idea is to first run both tools in COHORT mode on a set of samples that will be representative of the sequencing bias/noise of subsequent samples. This will not only produce ploidy and CNV calls on those samples, but will also train ploidy and denoising models that can be used to call subsequent samples in CASE mode. I'd suggest using at least tens of samples for the initial run in COHORT mode and calling subsequent samples in CASE mode as they trickle in.
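
    The two-phase workflow described here might be sketched as follows. All directory and file names below are placeholders for illustration, not commands verified against this dataset:

```
# Phase 1, COHORT mode: train both models on a representative batch.
gatk DetermineGermlineContigPloidy \
    -L filtered.interval_list -imr OVERLAPPING_ONLY \
    --contig-ploidy-priors ploidy_priors.tsv \
    -I sample1.hdf5 -I sample2.hdf5 \
    --output cohort --output-prefix ploidy    # writes cohort/ploidy-model and cohort/ploidy-calls

gatk GermlineCNVCaller \
    --run-mode COHORT \
    -L filtered.interval_list -imr OVERLAPPING_ONLY \
    --contig-ploidy-calls cohort/ploidy-calls \
    -I sample1.hdf5 -I sample2.hdf5 \
    --output cohort --output-prefix cnv       # writes cohort/cnv-model (with denoising_config.json)

# Phase 2, CASE mode: each new sample uses the model trained by the same tool.
gatk DetermineGermlineContigPloidy \
    --model cohort/ploidy-model -I new_sample.hdf5 \
    --output case --output-prefix ploidy-case

gatk GermlineCNVCaller \
    --run-mode CASE \
    --model cohort/cnv-model \
    --contig-ploidy-calls case/ploidy-case-calls \
    -I new_sample.hdf5 \
    --output case --output-prefix cnv-case
```

    The key pairing: the ploidy model goes only to DetermineGermlineContigPloidy in CASE mode, and the CNV (denoising) model goes only to GermlineCNVCaller in CASE mode.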
