Dear GATK and mutSig developers,

mutSigCV coverage file: are the 6 mutation categories mutually exclusive?

I have recently been reading the mutSigCV paper to understand the underlying statistical model and its assumptions. However, I am confused about how the coverage file is obtained and how the 7 mutation categories relate to one another with respect to coverage. Specifically, mutSigCV defines the following 7 categories of mutations:

  1. transition mutations at CpG dinucleotides
  2. transversion mutations at CpG dinucleotides
  3. transition mutations at C:G base pairs not in CpG dinucleotides
  4. transversion mutations at C:G base pairs not in CpG dinucleotides
  5. transition mutations at A:T base pairs
  6. transversion mutations at A:T base pairs
  7. null+indel mutations, including nonsense, splice-site, and indel mutations

The coverage file gives the sequence coverage achieved for each gene and patient in each of these 7 categories, broken down by zone (silent, nonsilent, flank). My understanding is that the coverage file counts the covered bases that have the potential to be mutated in each mutation category. Since every base is subject to indel mutation, should all sequenced bases count toward the 7th category? If so, how is the coverage for the other 6 categories counted? For each gene, should the coverage of the 7th category equal the sum of the first 6? Are the first 6 categories mutually exclusive? In other words, if a base is counted in one category, is it excluded from all the others?
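To make the question about exclusivity concrete, here is a minimal Python sketch that labels each of the three possible substitutions at a base with one of the first 6 categories listed above. The function names and the flanking-base arguments are my own illustration, and this only applies the category definitions; it is not a description of how mutSigCV actually tallies coverage.

```python
PURINES = {"A", "G"}

def is_transition(ref, alt):
    """A<->G and C<->T are transitions; any other substitution is a transversion."""
    return (ref in PURINES) == (alt in PURINES)

def category(prev_base, ref, next_base, alt):
    """Return the category number (1-6) for one point substitution.

    prev_base and next_base are the 5' and 3' neighbours on the same strand
    (hypothetical arguments, used only to detect a CpG context).
    """
    if ref in ("C", "G"):
        # A C followed by G, or a G preceded by C, lies in a CpG dinucleotide.
        in_cpg = (ref == "C" and next_base == "G") or (ref == "G" and prev_base == "C")
        first = 1 if in_cpg else 3
    else:
        first = 5  # A:T base pair
    # Within each context, the transition comes first and the transversion second.
    return first if is_transition(ref, alt) else first + 1

def categories_for_base(prev_base, ref, next_base):
    """Map each alternative base to the category of the resulting substitution."""
    return {alt: category(prev_base, ref, next_base, alt)
            for alt in "ACGT" if alt != ref}

if __name__ == "__main__":
    print(categories_for_base("A", "C", "G"))  # C inside a CpG
    print(categories_for_base("A", "A", "T"))  # A:T base pair
```

Under these definitions, every base has one possible transition and two possible transversions, so a single covered base is relevant to two of the 6 categories (for example, categories 1 and 2 for a C in a CpG). That is the reason I am asking whether a base is counted in only one category.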

Additionally, the supplementary document for the published mutSigCV paper says: "covered bases will typically contribute fractionally to more than one zone depending on the consequences of mutating to each of the three different possibly alternative bases." However, when I looked at the coverage file for the TCGA LUSC data set (as provided on the mutSig website), the coverage values for all gene/category/zone combinations are positive integers.

Could someone help me out with this? If my questions are not clear enough (which is likely), I am happy to rephrase them. I would really appreciate any help or comments.


  • nzhao, Seattle, WA, Member

    Dear Geraldine_VdAuwera,

    Thank you very much for your reply. This helps a lot in understanding the model.

    Please allow me to explain my second question:
    There are three zones in calculating the coverage: silent, nonsilent, and flanking. Consider a particular covered base C: mutating this C to A or G changes the amino acid or even creates a stop codon, while mutating it to T does not change the amino acid. It would then count 2/3 toward the nonsilent zone and 1/3 toward the silent zone. So when you sum all the covered bases for each zone, I would expect some fractional numbers. But in the real LUSC data, the coverage values for all category + zone combinations are integers. I am confused about why this is so. Or is there a flaw in my logic?
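    The arithmetic I have in mind can be sketched in Python as follows. This is only my reading of the sentence quoted from the supplement, not mutSigCV's actual code: the function `zone_fractions` is hypothetical, the flank zone is ignored, and stop-codon changes are simply counted as nonsilent. The codon table is the standard genetic code.

```python
from fractions import Fraction
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def zone_fractions(codon, pos):
    """Split one covered base (codon[pos]) between the silent and nonsilent zones.

    Each of the three alternative bases contributes 1/3 to the zone matching
    the consequence of that substitution (assumption: stop counts as nonsilent).
    """
    ref_aa = CODON_TABLE[codon]
    silent = 0
    for alt in BASES:
        if alt == codon[pos]:
            continue
        mutated = codon[:pos] + alt + codon[pos + 1:]
        if CODON_TABLE[mutated] == ref_aa:
            silent += 1
    return {"silent": Fraction(silent, 3), "nonsilent": Fraction(3 - silent, 3)}

if __name__ == "__main__":
    # Third base of TAC (Tyr): C->T is silent, C->A and C->G create stop codons.
    print(zone_fractions("TAC", 2))
    # Summing over all three bases of the codon gives non-integer zone totals.
    total_silent = sum(zone_fractions("TAC", p)["silent"] for p in range(3))
    print(total_silent)
```

    With this kind of counting, the per-zone totals for a gene would usually not be whole numbers, which is why the integer values in the LUSC file surprise me.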

    By the way, the website says that we can construct our own coverage file from BAM files, using a WIG file as an intermediate step. However, I cannot find any documentation on how this is done. Is there a document that explains this step?

    Thanks again for any explanation.

  • Geraldine_VdAuwera, Cambridge, MA, Member, Administrator, Broadie admin

    Hi @nzhao,

    Thanks for the clarification. I don't know the answer to that question so I'll need to ask the developer to explain this.

  • nzhao, Seattle, WA, Member

    Thanks a lot.


