mutSigCV coverage file: are the 7 categories exclusive?
Dear GATK and mutSig developers,
I have recently been reading the mutSigCV paper to understand the underlying statistical model and its assumptions. However, I am confused about how the coverage file was obtained and how the 7 mutation categories relate to one another with respect to coverage. Specifically, mutSigCV defines the following 7 categories of mutations:
- transition mutations at CpG dinucleotides
- transversion mutations at CpG dinucleotides
- transition mutations at C:G base pairs not in CpG dinucleotides
- transversion mutations at C:G base pairs not in CpG dinucleotides
- transition mutations at A:T base pairs
- transversion mutations at A:T base pairs
- null+indel mutations, including nonsense, splice-site, and indel mutations
The coverage file gives the sequence coverage achieved for each gene and patient in each of these 7 categories, broken down by zone (silent, nonsilent, flank). My understanding is that the coverage file counts the total number of covered bases that have the potential to be mutated in each mutation category. Since every base is subject to indel mutation, should all sequenced bases count toward the 7th category? If so, how is the coverage for the other 6 categories counted? For each gene, should the 7th-category coverage equal the sum of the first 6 categories? Are the first 6 categories mutually exclusive? In other words, if a base is counted in one category, is it excluded from all the others?
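To make the question concrete, here is a minimal sketch of the check I have in mind: for each gene and zone, compare the 7th-category coverage with the sum of the first 6. The column layout (`gene`, `effect`, `categ`, then one column per patient), the patient name `P1`, and the function name are my assumptions, and the numbers are made up purely to exercise the check. They are not taken from any real coverage file.

```python
import csv
import io
from collections import defaultdict

# Toy coverage table with invented values (NOT real TCGA data).
# Assumed layout: gene, effect (zone), categ (1-7), one column per patient.
_ROWS = [("gene", "effect", "categ", "P1")]
_ROWS += [("GENE1", "nonsilent", str(c), str(v))
          for c, v in zip(range(1, 8), [5, 5, 20, 20, 25, 25, 100])]
_ROWS += [("GENE2", "nonsilent", str(c), str(v))
          for c, v in zip(range(1, 8), [3, 3, 10, 10, 12, 12, 80])]
TOY_COVERAGE = "\n".join("\t".join(row) for row in _ROWS) + "\n"


def compare_seventh_to_sum(tsv_text, patient):
    """For each (gene, zone), return (sum of categories 1-6,
    category 7 coverage, whether the two are equal)."""
    totals = defaultdict(lambda: {"first_six": 0, "seventh": 0})
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        key = (row["gene"], row["effect"])
        coverage = int(row[patient])
        if int(row["categ"]) == 7:
            totals[key]["seventh"] += coverage
        else:
            totals[key]["first_six"] += coverage
    return {
        key: (t["first_six"], t["seventh"], t["first_six"] == t["seventh"])
        for key, t in totals.items()
    }


if __name__ == "__main__":
    for key, result in compare_seventh_to_sum(TOY_COVERAGE, "P1").items():
        print(key, result)
```

If the first 6 categories are mutually exclusive and every covered base counts toward the 7th, I would expect the two totals to match for every gene and zone in a real file.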
Additionally, the supplementary document for the published mutSigCV paper says: "covered bases will typically contribute fractionally to more than one zone depending on the consequences of mutating to each of the three different possibly alternative bases." However, when I looked at the coverage file for the TCGA LUSC data set (as provided on the mutSig website), the coverage values for every gene/category/zone are positive integers, with no fractional values.
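Here is how I read the quoted sentence, as a rough sketch rather than mutSigCV's actual implementation. Each covered coding base is mutated to its three alternative bases, and each alternative contributes 1/3 to the silent or nonsilent zone depending on whether the amino acid changes. The function name is hypothetical, the flank zone is ignored, and changes to or from a stop codon are simply treated as nonsilent here, which is a simplification on my part.

```python
from fractions import Fraction

# Standard genetic code, codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: _AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}


def zone_fractions(codon, pos):
    """Split one covered base (position `pos`, 0-based, within `codon`)
    between the silent and nonsilent zones, 1/3 per alternative base."""
    original = CODON_TABLE[codon]
    fractions = {"silent": Fraction(0), "nonsilent": Fraction(0)}
    for alt in _BASES:
        if alt == codon[pos]:
            continue  # not a mutation
        mutated = codon[:pos] + alt + codon[pos + 1:]
        zone = "silent" if CODON_TABLE[mutated] == original else "nonsilent"
        fractions[zone] += Fraction(1, 3)
    return fractions


if __name__ == "__main__":
    # Leucine codon CTG: first base is partly silent, third base fully silent.
    for pos in range(3):
        print(pos, zone_fractions("CTG", pos))
```

Under this reading, the silent coverage of the single codon CTG would be 1/3 + 0 + 1 = 4/3, which is why I expected non-integer values in the coverage file.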
Could someone help me out with this? If my questions are not clear enough (they probably are not), I am happy to rephrase them. I really appreciate any help or comments.