We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

A point mutation was missed by Mutect2 when read length redueced from 150bp to 75bp


I have been analyzing the NGS data from clinical cancer samples for diagnosis.
The NGS data are paired end reads with 150bp each.
Most of the paired end reads were overlapped because the DNA extracted from FFPE samples are likely to be degraded into short pieces less than 200bp.
So, I am thinking about reducing the read to 75bp if there are no differences in mutation detection.
Two types of read length (150bp and 75bp) were tested with one sample which was techincally repeated 6 times.
The 75bp reads were generated by cutting the original reads in half in fastq files.
In the analysis, one mutation (SNP) was missed with 75bp reads all the time but detected with 150bp reads.
The depth of the mutation was about 300bp, and its VAF was about 50% (heterozygous SNP).
Why Mutect2 could not detect the mutation with 75bp-reads?


  • amjaddamjadd FinlandMember ✭✭

    I don't have an answer, but I would be also interested to know what effect does the overlapping reads-mates have on detection by Mutect2, and whether it is better to do something about it.

    By the way, it sounds arbitrary to split the reads in half. You can clip the overlapping parts by some of the available tools such as fgbio:


  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @ensei Could you describe how you generated artificial 75bp data in more detail? I get that you generated two reads from each original read by splitting the bases in half, but how did you handle pairing, in particular read names?

    @amjadd Overlapping read pairs are not a problem. Mutect2 reduces base qualities where they overlap in order to account for possible PCR error (ie errors in overlapping paired reads may not be independent -- this behavior can be switched off).

  • enselensel Member


    Artificial 75bp data were generated by taking 75bp from the 5'end of each 150bp read.
    The paired fastq files were processed in the same way.
    It is like running a sequencer 75 cycles for each paired-ends.
    I'd like to go with 75bp reads if there are no differences in mutation detection. It's faster, and costs lower.
    Almost all the mutations (about 300 SNPs in normal sample) were detected by the artificial 75bp reads.
    Only one SNP was missed in the 6 samples (one sample, 6 repeats).
    The SNP was not missed with freebayes though.
    If you need the data, let me know.
    Thank you!


  • amjaddamjadd FinlandMember ✭✭

    @davidben Thank you for your answer. I suppose the behavior for PCR errors is controlled by the --pcr-snv-qual and --pcr-indel-qual parameters, but I am still not sure how to modify their default values.
    Another question, does Mutect2 count the overlapping paired-end reads twice when a mutations is within the overlapping segment?

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @ensel Is the mutation in question completely missed (ie not emitted at all) with the 75-bp reads, or is it emitted but filtered by FilterMutectCalls? You may as well also share your command lines for Mutect2, FilterMutectCalls, and GetPileupSummaries, CalculateContamination, and LearnReadOrientationModel if you are using those.

  • davidbendavidben BostonMember, Broadie, Dev ✭✭✭

    @amjadd When paired reads overlap and both support the mutation, the total contribution to the log-likelihood is the lower of (i) the sum of the base qualities and (ii) the PCR qual. That is, Mutect2 considers them as independent evidence as far as sequencing error is concerned but correlated as far as PCR error is concerned, and uses whichever yields a higher error probability.

    For example, suppose PCR has an error rate of 1/1000 and sequencing has an error rate of 1/100. Then the likelihood of overlapping reads both having a sequencing error is 1/10,000. However, the likelihood of a PCR error accounting for a variant in both reads is 1/1000, and so we use the lower qual (Q30) associated with PCR error. Let's say, however, that we had a PCR error rate of 1/100,000. In that case we would use the qual (Q40) associated with the sum of sequencing quals.

    If you are absolutely sure that no PR error exists, then you could set --pcr-indel-qual and --pcr-snv-qual to something very high like 100. In general, if the PCR error rate is 1/N, you set the qual to log_10(N)/10. That is, these are phred-scaled quals.

Sign In or Register to comment.