
GermlineCNV tools: Regression or incompatibility between gatk env 4.1.3.0 and gatk?

SkyWarrior Turkey Member ✭✭✭
edited August 17 in Ask the GATK team

Hi

I noticed what appears to be a regression in the GermlineCNV tools in 4.1.3.0.

Here is the error message that I get when I try the CASE mode for a single sample.

/home/exome/miniconda2/envs/gatk/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Traceback (most recent call last):
  File "/mnt/D/Exomes2/GATK_CNV_Fulgent_Select_20190325/./tmp/case_determine_ploidy_and_depth.14374052578254066674.py", line 105, in <module>
    ploidy_inference_params, ploidy_config, ploidy_workspace, args.input_model_path)
  File "/home/exome/miniconda2/envs/gatk/lib/python3.6/site-packages/gcnvkernel/tasks/task_case_ploidy_determination.py", line 48, in __init__
    PloidyModelReader(self.continuous_model, self.continuous_model_approx, input_model_path)()
  File "/home/exome/miniconda2/envs/gatk/lib/python3.6/site-packages/gcnvkernel/io/io_ploidy.py", line 173, in __call__
    io_commons.read_mean_field_global_params(self.input_path, self.ploidy_model_approx, self.ploidy_model)
  File "/home/exome/miniconda2/envs/gatk/lib/python3.6/site-packages/gcnvkernel/io/io_commons.py", line 459, in read_mean_field_global_params
    var_mu = read_ndarray_from_tsv(var_mu_input_file)
  File "/home/exome/miniconda2/envs/gatk/lib/python3.6/site-packages/gcnvkernel/io/io_commons.py", line 265, in read_ndarray_from_tsv
    "\"{0}\"".format(input_file)
AssertionError: Shape and dtype information could not be found in the header of "/mnt/D/Exomes2/GATK_CNV_Fulgent_Select_20190325/DCPoutput/GATK_CNV_Fulgent_Select_20190325_CNV-model/mu_psi_j_log__.tsv"
10:47:14.059 DEBUG ScriptExecutor - Result: 1
10:47:14.060 INFO  DetermineGermlineContigPloidy - Shutting down engine
[August 17, 2019 at 10:47:14 AM EET] org.broadinstitute.hellbender.tools.copynumber.DetermineGermlineContigPloidy done. Elapsed time: 2.75 minutes.
Runtime.totalMemory()=2109734912
org.broadinstitute.hellbender.utils.python.PythonScriptExecutorException:
python exited with 1
Command Line: python /mnt/D/Exomes2/GATK_CNV_Fulgent_Select_20190325/./tmp/case_determine_ploidy_and_depth.14677844990518816111.py --sample_coverage_metadata=/mnt/D/Exomes2/GATK_CNV_Fulgent_Select_20190325/tmp/samples-by-coverage-per-contig11621799494417272471.tsv --output_calls_path=/mnt/D/Exomes2/GATK_CNV_Fulgent_Select_20190325/Wes40/Wes40_CASEMode-calls --mapping_error_rate=1.000000e-02 --psi_s_scale=1.000000e-04 --learning_rate=5.000000e-02 --adamax_beta1=9.000000e-01 --adamax_beta2=9.990000e-01 --log_emission_samples_per_round=2000 --log_emission_sampling_rounds=100 --log_emission_sampling_median_rel_error=5.000000e-04 --max_advi_iter_first_epoch=1000 --max_advi_iter_subsequent_epochs=1000 --min_training_epochs=20 --max_training_epochs=100 --initial_temperature=2.000000e+00 --num_thermal_advi_iters=5000 --convergence_snr_averaging_window=5000 --convergence_snr_trigger_threshold=1.000000e-01 --convergence_snr_countdown_window=10 --max_calling_iters=1 --caller_update_convergence_threshold=1.000000e-03 --caller_internal_admixing_rate=7.500000e-01 --caller_external_admixing_rate=7.500000e-01 --disable_caller=false --disable_sampler=false --disable_annealing=false --input_model_path=/mnt/D/Exomes2/GATK_CNV_Fulgent_Select_20190325/DCPoutput/GATK_CNV_Fulgent_Select_20190325_CNV-model
        at org.broadinstitute.hellbender.utils.python.PythonExecutorBase.getScriptException(PythonExecutorBase.java:75)
        at org.broadinstitute.hellbender.utils.runtime.ScriptExecutor.executeCuratedArgs(ScriptExecutor.java:126)
        at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeArgs(PythonScriptExecutor.java:170)
        at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:151)
        at org.broadinstitute.hellbender.utils.python.PythonScriptExecutor.executeScript(PythonScriptExecutor.java:121)
        at org.broadinstitute.hellbender.tools.copynumber.DetermineGermlineContigPloidy.executeDeterminePloidyAndDepthPythonScript(DetermineGermlineContigPloidy.java:411)
        at org.broadinstitute.hellbender.tools.copynumber.DetermineGermlineContigPloidy.doWork(DetermineGermlineContigPloidy.java:288)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
        at org.broadinstitute.hellbender.Main.main(Main.java:291)

Here are the contents of the mu_psi_j_log__.tsv file that the script is checking:

@shape=(24,)
@dtype=float64
-5.2550480494676766
-4.6356812840033434
-4.4509602741720098
-4.5404819440281745
-10.49589731942288
-10.085813548471549
-10.654300137176685
-9.6848733447421846
-9.8115553137553242
-4.531010295058743
-6.4881217190363136
-10.602344773231806
-8.4645717417385491
-6.988764081362203
-9.9435963731775381
-7.273032596243362
-7.6633282797262252
-9.1839374252546371
-3.3130505628496385
-7.598191699094234
-9.4367052087489434
-3.7485703225549494
-3.0245060575300013
-1.9448486749012255

This error occurs only when the gatk conda environment is installed from the 4.1.3.0 package.

Below is a comparison of the different scenarios:

GATK ENV 4.1.2.0 + GATK 4.1.2.0  - No Error

GATK ENV 4.1.3.0 + GATK 4.1.3.0  - Error

GATK ENV 4.1.3.0 + GATK 4.1.2.0  - Error

GATK ENV 4.1.2.0 + GATK 4.1.3.0  - No Error

GATK docker image 4.1.2.0 - No Error

GATK docker image 4.1.3.0  - Error

Looks like there is a regression in the gatk python environment.

Regards.


Answers

  • SkyWarrior Turkey Member ✭✭✭

    Is it possible that the file format changed between versions 4.1.2.0 and 4.1.3.0, or are the changes introduced in io_commons.py actually causing the regression?

    Old io_commons.py read_ndarray_from_tsv:

    def read_ndarray_from_tsv(input_file: str,
                              comment=io_consts.default_comment_char,
                              delimiter=io_consts.default_delimiter_char) -> np.ndarray:
        """Reads a vector or matrix ndarray from .tsv file.
    
        Args:
            input_file: input .tsv file
            comment: comment character
            delimiter: delimiter character
    
        Returns:
            ndarray
        """
        dtype = None
        shape = None
        rows: List[np.ndarray] = []
    
        def _get_value(key: str, _line: str):
            key_loc = _line.find(key)
            if key_loc >= 0:
                val_loc = _line.find('=')
                return _line[val_loc + 1:].strip()
            else:
                return None
    
        with open(input_file, 'r') as f:
            for line in f:
                stripped_line = line.strip()
                if len(stripped_line) == 0:
                    continue
                elif stripped_line[0] == comment:
                    if dtype is None:
                        dtype = _get_value('dtype', stripped_line)
                    if shape is None:
                        shape = _get_value('shape', stripped_line)
                else:
                    assert dtype is not None and shape is not None, \
                        "Shape and dtype information could not be found in the header of " \
                        "\"{0}\"".format(input_file)
                    row = np.asarray(stripped_line.split(delimiter), dtype=dtype)
                    rows.append(row)
    
        return np.vstack(rows).reshape(make_tuple(shape))
    

    New io_commons.py read_ndarray_from_tsv:

    def parse_sam_comment(comment_line: str) -> Tuple:
        """Parse a SAM style comment
    
        Args:
            comment_line: a comment string
    
        Returns:
            Key-value pair represented by a SAM style comment
        """
        match = re.search(io_consts.sam_comment_key_value_regexp, comment_line, re.M)
        if match is None or len(match.groups()) != 2:
            return None, None
        result = match.groups()
        return result[0], result[1]
    
    
    def read_ndarray_from_tsv(input_file: str,
                              comment=io_consts.default_comment_char,
                              delimiter=io_consts.default_delimiter_char) -> np.ndarray:
        """Reads a vector or matrix ndarray from .tsv file.
    
        Args:
            input_file: input .tsv file
            comment: comment character
            delimiter: delimiter character
    
        Returns:
            ndarray
        """
        dtype = None
        shape = None
    
        with open(input_file, 'r') as f:
            for line in f:
                stripped_line = line.strip()
                if len(stripped_line) == 0:
                    continue
                elif stripped_line[0] == comment:
                    key, value = parse_sam_comment(stripped_line)
                    if key == io_consts.type_key_value:
                        assert dtype is None, "Multiple dtype lines are present in the header of " \
                                              "\"{0}\"".format(input_file)
                        dtype = value
                    if key == io_consts.shape_key_value:
                        assert shape is None, "Multiple shape lines are present in the header of " \
                                              "\"{0}\"".format(input_file)
                        shape = make_tuple(value)
    
        assert dtype is not None and shape is not None, \
            "Shape and dtype information could not be found in the header of " \
            "\"{0}\"".format(input_file)
    
        df = pd.read_csv(filepath_or_buffer=input_file, sep=delimiter, dtype=dtype, comment=comment)
        return df.values.reshape(shape)
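
    For comparison, here is why the new reader rejects old-style headers: parse_sam_comment only matches SAM-style "@CO key:value" comments. The regex below is my stand-in guess for io_consts.sam_comment_key_value_regexp (I have not checked the actual constant), but it demonstrates the behavioral difference:

```python
import re

# Hypothetical stand-in for io_consts.sam_comment_key_value_regexp
SAM_COMMENT_RE = r"^@CO[ \t]+(\S+):(\S+)"

def parse_sam_comment(comment_line):
    # Same shape as the new parse_sam_comment above: (key, value) or (None, None)
    match = re.search(SAM_COMMENT_RE, comment_line, re.M)
    if match is None or len(match.groups()) != 2:
        return None, None
    return match.groups()

# A new-style header line parses:
print(parse_sam_comment("@CO shape:(24,)"))  # ('shape', '(24,)')
# An old-style header line does not, so dtype/shape stay None
# and the "Shape and dtype information could not be found" assert fires:
print(parse_sam_comment("@shape=(24,)"))     # (None, None)
```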
    
  • SkyWarrior Turkey Member ✭✭✭

    Sorry for the third update, but it seems backwards compatibility is broken due to the file format change.
    Here is the new file format for mu_psi_j_log__.tsv:

    @CO shape:(24,)
    @CO dtype:float64
    VALUE_0
    -12.033745264060789
    -9.363038430437655
    -10.621853016475988
    -5.191713278429956
    -5.259692306782298
    -10.873537893248216
    -11.34961254203809
    -10.418938503572942
    -11.620784059269889
    -10.624832129890702
    -11.384187120999064
    -11.566853295185231
    -4.269494387089599
    -11.02023941153198
    -10.810747373844645
    -8.25315397826546
    -9.309224944062938
    -8.234540636698572
    -5.37244243329481
    -9.317286738792761
    -11.738864119490641
    -7.942278933659564
    -3.663729981596678
    -1.967807972499892
    

    Could this also be included in the release notes, to warn users that models generated by older versions are not compatible with 4.1.3.0?

    Regards.

  • SkyWarrior Turkey Member ✭✭✭

    Those denoised copy ratios are great, by the way. They make it much easier to plot the results in a report.

    Thanks.
