Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Usage of DenoiseReadCounts in PerformSegmentation

One of our collaborators wanted to compare the differences between the outputs of ModelSegments (in GATK 4.1) and PerformSegmentation (in GATK 4.0b6). The tooldocs, however, was not particularly clear whether the output of DenoiseReadCounts(in GATK 4.1) can be used in PerformSegmentation, as the archived NormalizeSomaticReadCounts doc wasn't very specific about how it performs normalization.

Can the GATK team please confirm the following:
1. Can the output of DenoiseReadCounts be used as the -TN input in PerformSegmentation?
2. Even if (1) is positive, are the outputs of DenoiseReadCounts mathematically equivalent to that of NormalizeSomaticReadCounts, provided the same PON is used?

Thanks a lot for your help!

Answers

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @johnma

    Can the output of DenoiseReadCounts be used as the -TN input in PerformSegmentation?

    I think this would create errors because the versions are so far apart the formats of files might create some inconsistencies. I would recommend against it.

  • sleeslee Member, Broadie, Dev ✭✭✭
    edited June 19

    @johnma, the output denoised log2 copy-ratio values are essentially identical to those output by NormalizeSomaticReadCounts (down to levels of ~1E-16) if appropriate parameters are selected during the coverage collection step (recall that the --transform option of CalculateTargetCoverage allows for either RAW or PCOV output; however, this choice can affect the results of filtering/denoising during PoN creation---see https://github.com/broadinstitute/gatk/issues/2858#issuecomment-314529956. I think you will want to choose PCOV to get identical results).

    However, the format of the output differs superficially, as there is a sequence dictionary in the header of the DenoiseReadCounts output and the column headers also differ. However, it probably wouldn't be too much work to write a script to convert to the format output by NormalizeSomaticReadCounts if you'd like to generate the CBS result using PerformSegmentation. You might be interested in other comments in that linked issue to get an intuition for what differences might arise between the two segmentation methods.

    We done some evaluations of PerformSegmentation against ModelSegments using TCGA SNP arrays as ground truth. We find that by adjusting the 'ModelSegments parameters that control sensitivity, we can always at least match CBS performance, and typically outperform it by a decent margin. Some of these results can be seen in the GATK workshop slides for the somatic CNV pipeline.

  • johnmajohnma Member
    edited June 24

    Just by removing the sequence dictionary doesn't make the denoised coverage file work with PerformSegmentation. After reviewing version 4.0b6's source code, I noticed, in ReadCountCollectionUtils.java, that the input for PerformSegmentation would be different from DenoiseReadCount's output enough (it requires the target names as a separate field) that I decided to not pursue converting the output to a format that can be used by PerformSegmentation.

    Instead, I made small edits to the CBS.R file from the 4.0b6 source code so that it can accept DenoiseReadCount output files and run DNACopy the same way as PerformSegmentation. I can share this script, although I doubt the GATK forums is the best place to promote things like that.

  • sleeslee Member, Broadie, Dev ✭✭✭

    Thanks for reporting back, @johnma. It's been some time since I looked at the old beta code, but I don't believe that the target names were ever used (although there is probably a check that they are all unique). So inserting dummy target names for that column might be another option.

Sign In or Register to comment.