
GermlineCNV tools: Regression or incompatibility between gatk env 4.1.3.0 and gatk?

SkyWarrior · Turkey · Member ✭✭✭
edited August 2019 in Ask the GATK team

Hi

I believe I have found a regression in the GermlineCNV tools in 4.1.3.0.

Here is the error message I get when I run DetermineGermlineContigPloidy in CASE mode for a single sample.

/home/exome/miniconda2/envs/gatk/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Traceback (most recent call last):
  File "/mnt/D/Exomes2/GATK_CNV_Fulgent_Select_20190325/./tmp/case_determine_ploidy_and_depth.14374052578254066674.py", line 105, in <module>
    ploidy_inference_params, ploidy_config, ploidy_workspace, args.input_model_path)
  File "/home/exome/miniconda2/envs/gatk/lib/python3.6/site-packages/gcnvkernel/tasks/task_case_ploidy_determination.py", line 48, in __init__
    PloidyModelReader(self.continuous_model, self.continuous_model_approx, input_model_path)()
  File "/home/exome/miniconda2/envs/gatk/lib/python3.6/site-packages/gcnvkernel/io/io_ploidy.py", line 173, in __call__
    io_commons.read_mean_field_global_params(self.input_path, self.ploidy_model_approx, self.ploidy_model)
  File "/home/exome/miniconda2/envs/gatk/lib/python3.6/site-packages/gcnvkernel/io/io_commons.py", line 459, in read_mean_field_global_params
    var_mu = read_ndarray_from_tsv(var_mu_input_file)
  File "/home/exome/miniconda2/envs/gatk/lib/python3.6/site-packages/gcnvkernel/io/io_commons.py", line 265, in read_ndarray_from_tsv
    "\"{0}\"".format(input_file)
AssertionError: Shape and dtype information could not be found in the header of "/mnt/D/Exomes2/GATK_CNV_Fulgent_Select_20190325/DCPoutput/GATK_CNV_Fulgent_Select_20190325_CNV-model/mu_psi_j_log__.tsv"
10:47:14.059 DEBUG ScriptExecutor - Result: 1
10:47:14.060 INFO  DetermineGermlineContigPloidy - Shutting down engine
[August 17, 2019 at 10:47:14 AM EET] org.broadinstitute.hellbender.tools.copynumber.DetermineGermlineContigPloidy done. Elapsed time: 2.75 minutes.
Runtime.totalMemory()=2109734912
org.broadinstitute.hellbender.utils.python.PythonScriptExecutorException:
python exited with 1
Command Line: python /mnt/D/Exomes2/GATK_CNV_Fulgent_Select_20190325/./tmp/case_determine_ploidy_and_depth.14677844990518816111.py --sample_coverage_metadata=/mnt/D/Exomes2/GATK_CNV_Fulgent_Select_20190325/tmp/samples-by-coverage-per-contig11621799494417272471.tsv --output_calls_path=/mnt/D/Exomes2/GATK_CNV_Fulgent_Select_20190325/Wes40/Wes40_CASEMode-calls --mapping_error_rate=1.000000e-02 --psi_s_scale=1.000000e-04 --learning_rate=5.000000e-02 --adamax_beta1=9.000000e-01 --adamax_beta2=9.990000e-01 --log_emission_samples_per_round=2000 --log_emission_sampling_rounds=100 --log_emission_sampling_median_rel_error=5.000000e-04 --max_advi_iter_first_epoch=1000 --max_advi_iter_subsequent_epochs=1000 --min_training_epochs=20 --max_training_epochs=100 --initial_temperature=2.000000e+00 --num_thermal_advi_iters=5000 --convergence_snr_averaging_window=5000 --convergence_snr_trigger_threshold=1.000000e-01 --convergence_snr_countdown_window=10 --max_calling_iters=1 --caller_update_convergence_threshold=1.000000e-03 --caller_internal_admixing_rate=7.500000e-01 --caller_external_admixing_rate=7.500000e-01 --disable_caller=false --disable_sampler=false --disable_annealing=false --input_model_path=/mnt/D/Exomes2/GATK_CNV_Fulgent_Select_20190325/DCPoutput/GATK_CNV_Fulgent_Select_20190325_CNV-model
        at org.broadinstitute.hellbender.utils.python.PythonExecutorBase.getScriptException(PythonExecutorBase.java:75)
        at org.broadinstitute.hellbender.utils.runtime.ScriptExecutor.executeCuratedArgs(ScriptExecutor.java:126)
        at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeArgs(PythonScriptExecutor.java:170)
        at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:151)
        at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:121)
        at org.broadinstitute.hellbender.tools.copynumber.DetermineGermlineContigPloidy.executeDeterminePloidyAndDepthPythonScript(DetermineGermlineContigPloidy.java:411)
        at org.broadinstitute.hellbender.tools.copynumber.DetermineGermlineContigPloidy.doWork(DetermineGermlineContigPloidy.java:288)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
        at org.broadinstitute.hellbender.Main.main(Main.java:291)

Here are the contents of the file mu_psi_j_log__.tsv that the script is checking

@shape=(24,)
@dtype=float64
-5.2550480494676766
-4.6356812840033434
-4.4509602741720098
-4.5404819440281745
-10.49589731942288
-10.085813548471549
-10.654300137176685
-9.6848733447421846
-9.8115553137553242
-4.531010295058743
-6.4881217190363136
-10.602344773231806
-8.4645717417385491
-6.988764081362203
-9.9435963731775381
-7.273032596243362
-7.6633282797262252
-9.1839374252546371
-3.3130505628496385
-7.598191699094234
-9.4367052087489434
-3.7485703225549494
-3.0245060575300013
-1.9448486749012255

This error occurs only when the gatk conda environment is installed from the 4.1.3.0 package.

Below is a comparison of the different scenarios I tested.

GATK ENV 4.1.2.0 + GATK 4.1.2.0  - No Error

GATK ENV 4.1.3.0 + GATK 4.1.3.0  - Error

GATK ENV 4.1.3.0 + GATK 4.1.2.0  - Error

GATK ENV 4.1.2.0 + GATK 4.1.3.0  - No Error

GATK docker image 4.1.2.0 - No Error

GATK docker image 4.1.3.0  - Error

It looks like the regression is in the gatk Python conda environment rather than in the GATK jar itself.

Regards.

Post edited by SkyWarrior.


Answers

  • SkyWarrior · Turkey · Member ✭✭✭

    Is it possible that the format of the file changed between versions 4.1.2.0 and 4.1.3.0, or that the changes introduced in io_commons.py are what is actually causing the regression?

    Old io_commons.py read_ndarray_from_tsv:

    def read_ndarray_from_tsv(input_file: str,
                              comment=io_consts.default_comment_char,
                              delimiter=io_consts.default_delimiter_char) -> np.ndarray:
        """Reads a vector or matrix ndarray from .tsv file.
    
        Args:
            input_file: input .tsv file
            comment: comment character
            delimiter: delimiter character
    
        Returns:
            ndarray
        """
        dtype = None
        shape = None
        rows: List[np.ndarray] = []
    
        def _get_value(key: str, _line: str):
            key_loc = _line.find(key)
            if key_loc >= 0:
                val_loc = _line.find('=')
                return _line[val_loc + 1:].strip()
            else:
                return None
    
        with open(input_file, 'r') as f:
            for line in f:
                stripped_line = line.strip()
                if len(stripped_line) == 0:
                    continue
                elif stripped_line[0] == comment:
                    if dtype is None:
                        dtype = _get_value('dtype', stripped_line)
                    if shape is None:
                        shape = _get_value('shape', stripped_line)
                else:
                    assert dtype is not None and shape is not None, \
                        "Shape and dtype information could not be found in the header of " \
                        "\"{0}\"".format(input_file)
                    row = np.asarray(stripped_line.split(delimiter), dtype=dtype)
                    rows.append(row)
    
        return np.vstack(rows).reshape(make_tuple(shape))
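
    As a quick sanity check, here is a minimal standalone run of the _get_value logic quoted above against the old-style header lines (the helper is copied from the old code, the header lines from my old-format mu_psi_j_log__.tsv). It confirms that the old parser extracts shape and dtype from @key=value comments:

    def _get_value(key: str, _line: str):
        # Same logic as the old io_commons.py helper: locate the key, then
        # take everything after the first '=' as the value.
        key_loc = _line.find(key)
        if key_loc >= 0:
            val_loc = _line.find('=')
            return _line[val_loc + 1:].strip()
        return None

    print(_get_value('shape', '@shape=(24,)'))    # -> (24,)
    print(_get_value('dtype', '@dtype=float64'))  # -> float64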
    

    New io_commons.py read_ndarray_from_tsv:

    def parse_sam_comment(comment_line: str) -> Tuple:
        """Parse a SAM style comment
    
        Args:
            comment_line: a comment string
    
        Returns:
            Key-value pair represented by a SAM style comment
        """
        match = re.search(io_consts.sam_comment_key_value_regexp, comment_line, re.M)
        if match is None or len(match.groups()) != 2:
            return None, None
        result = match.groups()
        return result[0], result[1]
    
    
    def read_ndarray_from_tsv(input_file: str,
                              comment=io_consts.default_comment_char,
                              delimiter=io_consts.default_delimiter_char) -> np.ndarray:
        """Reads a vector or matrix ndarray from .tsv file.
    
        Args:
            input_file: input .tsv file
            comment: comment character
            delimiter: delimiter character
    
        Returns:
            ndarray
        """
        dtype = None
        shape = None
    
        with open(input_file, 'r') as f:
            for line in f:
                stripped_line = line.strip()
                if len(stripped_line) == 0:
                    continue
                elif stripped_line[0] == comment:
                    key, value = parse_sam_comment(stripped_line)
                    if key == io_consts.type_key_value:
                        assert dtype is None, "Multiple dtype lines are present in the header of " \
                                              "\"{0}\"".format(input_file)
                        dtype = value
                    if key == io_consts.shape_key_value:
                        assert shape is None, "Multiple shape lines are present in the header of " \
                                              "\"{0}\"".format(input_file)
                        shape = make_tuple(value)
    
        assert dtype is not None and shape is not None, \
            "Shape and dtype information could not be found in the header of " \
            "\"{0}\"".format(input_file)
    
        df = pd.read_csv(filepath_or_buffer=input_file, sep=delimiter, dtype=dtype, comment=comment)
        return df.values.reshape(shape)
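
    The actual io_consts.sam_comment_key_value_regexp is not shown here, so the pattern below is only my guess based on the new file format (@CO key:value), but it illustrates why the new parser fails on an old-format header: a SAM-style regex never matches @key=value lines, dtype and shape stay None, and the assert above fires.

    import re

    # Assumed stand-in for io_consts.sam_comment_key_value_regexp; the real
    # gcnvkernel pattern may differ.
    SAM_COMMENT_RE = r'^@CO[\t ]+(\S+):(\S+)$'

    def parse_sam_comment_sketch(comment_line):
        match = re.search(SAM_COMMENT_RE, comment_line, re.M)
        if match is None or len(match.groups()) != 2:
            return None, None
        return match.groups()

    print(parse_sam_comment_sketch('@CO shape:(24,)'))  # -> ('shape', '(24,)')  new format parses
    print(parse_sam_comment_sketch('@shape=(24,)'))     # -> (None, None)        old format does not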
    
  • SkyWarrior · Turkey · Member ✭✭✭

    Sorry for the third update, but it seems backwards compatibility is gone due to the file format change.
    Here is the new file format for mu_psi_j_log__.tsv:

    @CO shape:(24,)
    @CO dtype:float64
    VALUE_0
    -12.033745264060789
    -9.363038430437655
    -10.621853016475988
    -5.191713278429956
    -5.259692306782298
    -10.873537893248216
    -11.34961254203809
    -10.418938503572942
    -11.620784059269889
    -10.624832129890702
    -11.384187120999064
    -11.566853295185231
    -4.269494387089599
    -11.02023941153198
    -10.810747373844645
    -8.25315397826546
    -9.309224944062938
    -8.234540636698572
    -5.37244243329481
    -9.317286738792761
    -11.738864119490641
    -7.942278933659564
    -3.663729981596678
    -1.967807972499892
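
    In case it helps anyone stuck with old models, below is a rough, unofficial conversion sketch that rewrites the old @key=value header into the new SAM-style header and adds the VALUE_0 column line. It only handles single-column vector files like mu_psi_j_log__.tsv, the column name is copied from the example above rather than from the gcnvkernel source, and any converted model should be checked carefully before use.

    import sys

    def convert_old_model_tsv(in_path: str, out_path: str) -> None:
        """Rewrite an old-format gCNV model .tsv header into the new style.

        Assumes a single-column vector file; a multi-column file would need
        one column name per column instead of just VALUE_0.
        """
        with open(in_path) as src, open(out_path, 'w') as dst:
            wrote_column_header = False
            for line in src:
                stripped = line.strip()
                if stripped.startswith('@') and '=' in stripped:
                    # e.g. '@shape=(24,)' -> '@CO shape:(24,)'
                    key, value = stripped[1:].split('=', 1)
                    dst.write('@CO {0}:{1}\n'.format(key, value))
                elif stripped:
                    if not wrote_column_header:
                        dst.write('VALUE_0\n')
                        wrote_column_header = True
                    dst.write(stripped + '\n')

    if __name__ == '__main__':
        convert_old_model_tsv(sys.argv[1], sys.argv[2])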
    

    Can this also be included in the release notes, to warn users that models generated by older GATK versions are not compatible with 4.1.3.0?

    Regards.

  • SkyWarrior · Turkey · Member ✭✭✭

    Those denoised copy ratios are great, by the way. They make it much easier to plot the results in a report.

    Thanks.
