Holiday Notice:
The Frontline Support team will be slow to respond December 17-18 due to an institute-wide retreat and offline December 22- January 1, while the institute is closed. Thank you for your patience during these next few weeks. Happy Holidays!

GATK4 CNV ModelSegments hets output

FPBarthelFPBarthel HoustonMember

Hi GATK team!

I'm having trouble precisely understanding the ModelSegments hets output when ran on a tumor sample provided both a tumor and normal AllelicCounts are given.

The documentation reads:

If the matched normal is available, its allelic counts will be used to genotype the sites, and we will simply assume these genotypes are the same in the case sample. (This can be critical, for example, for determining sites with loss of heterozygosity in high purity case samples; such sites will be genotyped as homozygous if the matched-normal sample is not available.)

If this were truly the case then why:
1. Is a different number of variants (and not 1:1 exactly overlapping) output to hets.tsv and hets.normal.tsv
2. If I roughly quantify variant allele fractions in the hets.normal.tsv file, a large portion of them are far away from 0.5

Both these observations seem to contradict what the documentation states. Can someone explain the difference and similarities between the hets.tsv and hets.normal.tsv output file in a way other than stated in the documentation because I'm not understanding this explanation.

Tagged:

Answers

  • shleeshlee CambridgeMember, Broadie, Moderator admin
    edited December 5

    Hello @FPBarthel,

    Bhanu asked that I look into your question. I believe you are quoting from the tool documentation. Tutorial#11683 briefly differentiates the two files in question, the hets.tsv and hets.normal.tsv. It states:

    Allelic counts for sites in the control that are heterozygous are written to hets.normal.tsv. For the same sites in the case, allelic counts are written to hets.tsv.

    Here the control refers to your normal and the case your tumor.

    I'm actually in Taiwan this week giving a four-day GATK workshop. Today is Day 3 of the workshop and I happened to present the CNV hands-on tutorial this afternoon. So the workflow is fresh on my mind. In the hands-on tutorial worksheet I state similarly:

    Of note, we have ​two​ files with ​.hets.​ in the extension, ​.hets.normal.tsv​ and .hets.tsv​. The former contains the normal control's heterozygous sites. The latter contains the tumor's allele counts for the normal's heterozygous sites.

    This being said, I think you may be less concerned with how each of these files are made and more concerned about (i) a discrepency in counts of sites and (ii) an imbalance in the allele fractions for your normal.

    Both these observations seem to contradict what the documentation states. Can someone explain the difference and similarities between the hets.tsv and hets.normal.tsv output file in a way other than stated in the documentation because I'm not understanding this explanation.

    Can you be more specific about the discrepencies you are observing? Also, it would be helpful for us to have the GATK version and commands you ran in case we need to trace a bug.

    If the mismatch in counts is what I think it is, then this is expected. Let me explain. All of the sites you observe in .hets.tsv should also be present in .hets.normal.tsv. However, .hets.normal.tsv can contain sites that are absent in .hets.tsv. Remember that there is some filtering the tool performs to remove data points that do not pass muster. The stdout of ModelSegments will inform you as to all that is going on. For example, ModelSegments informs you that it uses a 30 minimum read depth for allelic counts. This is actually a question I ask in the tutorial worksheet:



    You can confirm for yourself that this is the case for your data with something like the follow command:

    diff <(cat hets.tsv | grep -v '@' | cut -f1,2) <(cat hets.normal.tsv | grep -v '@' | cut -f1,2) | grep '>'
    

    The command will pull out the sites that differ and you can then use these loci to check in IGV your alignments.

    As for your second question, it would be helpful to know the outcome of your segmentation. Can you please post your plots and also give more detail about the skew? Thanks.

  • bhanuGandhambhanuGandham Member, Administrator, Broadie, Moderator admin

    HI @FPBarthel

    We have not heard from you in more than 2 business days. I hope the reply from @shlee helped you out. Please get back to us if you have any more questions.

  • FPBarthelFPBarthel HoustonMember
    edited December 10

    Apologies for the delay in getting back to you @shlee and @bhanuGandham.

    This being said, I think you may be less concerned with how each of these files are made and more concerned about (i) a discrepency in counts of sites and (ii) an imbalance in the allele fractions for your normal.

    Looking a bit more into the allele fractions (ii) this doesn't seem to be a big problem in hindsight. Looking at the distribution in several samples shows that it's clearly centered around 0.5, albeit slightly long tails reaching 0.2 and 0.8, but in hindsight I guess that's to be expected. For (i) count sites see below.

    Can you be more specific about the discrepencies you are observing? Also, it would be helpful for us to have the GATK version and commands you ran in case we need to trace a bug.

    I'm using gatk-4.0.10.1

    If the mismatch in counts is what I think it is, then this is expected. Let me explain. All of the sites you observe in .hets.tsv should also be present in .hets.normal.tsv. However, .hets.normal.tsv can contain sites that are absent in .hets.tsv. Remember that there is some filtering the tool performs to remove data points that do not pass muster.

    Thanks for this explanation! This is likely what happened to the count sites (i). And while this does answer my initial question, it also raises a new one, especially in terms of detecting homozygous deletions:

    The stdout of ModelSegments will inform you as to all that is going on. For example, ModelSegments informs you that it uses a 30 minimum read depth for allelic counts.

    Is this filter appropriate to apply to tumor samples? I understand why this filter is important to use for the matching normals, but I see problems in using it on the tumor samples. Imagine tumor sample with a homozygous deletion in CDKN2A (a common deletion in many cancer types). The matching normal sample could have a dozen highly covered heterozygous SNPs in this region, that in the tumor sample may only have a handful of reads (due to tumor impurity, because the tumor cells will have exactly zero reads here due to the homozyous deletion). ModelSegments would then discard these sites and downstream copy number callers would fail to pick up on the deletion?

    EDIT: just checked the ModelSegments documentation and it seems that although it is possible to change the --minimum-total-allele-count from the default value of 30, it's not possible to have a different threshold for tumor and normal, eg. a 30x normal threshold and 0x tumor threshold (which would make sense to me assuming tumor and normal were generated at the same time using the same experimental pipeline).

  • SChaluvadiSChaluvadi Member, Broadie, Moderator admin

    @FPBarthel
    Thanks for your update! I am following up for bhanu today and wanted to make sure that following shlee's reply that all your questions were answered?

  • FPBarthelFPBarthel HoustonMember

    Hi @SChaluvadi not all questions as a related question came up, see the second part of my update.

  • SChaluvadiSChaluvadi Member, Broadie, Moderator admin

    @FPBarthel Ah yes I had misread that! I will look into it and get back to you!

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    Hi @FPBarthel,

    For your followup question:

    Is this filter appropriate to apply to tumor samples? I understand why this filter is important to use for the matching normals, but I see problems in using it on the tumor samples. Imagine tumor sample with a homozygous deletion in CDKN2A (a common deletion in many cancer types). The matching normal sample could have a dozen highly covered heterozygous SNPs in this region, that in the tumor sample may only have a handful of reads (due to tumor impurity, because the tumor cells will have exactly zero reads here due to the homozyous deletion). ModelSegments would then discard these sites and downstream copy number callers would fail to pick up on the deletion?

    The ModelSegments workflow can use either coverage data OR allelic data OR both types of data jointly in a multidimensional analysis. So for the scenario with CDKN2A, you should be able to detect deletion events with the coverage analysis. Depending on the event size, you may need to tune default parameters.

    As far as I'm aware, the allelic analysis aims to detect loss of heterozygosity (LOH) events and not deletion events. On a side note, it is my understanding (and please correct me if things have changed) tumor biopsy samples are typically impure. That is, we expect a typical tumor sample to present with some fractional purity where there is a fraction of normal tissue contaminating the tumor sample. It is my understanding impure tumor samples perform better in the ModelSegments CNV workflow's allelic analysis and allow for tumor-only allelic analyses.

    I'm rather under the weather at the moment so if what I say is not making sense, please do ask clarifying questions and I will relay these to our developers.

Sign In or Register to comment.