Service notice: Several of our team members are on vacation so service will be slow through at least July 13th, possibly longer depending on how much backlog accumulates during that time. This means that for a while it may take us more time than usual to answer your questions. Thank you for your patience.

Detection of somatic mutations with RNAseq - Mutect2

mcharlesmcharles Jouy en JosasMember

We are working on a swine model of melanoma, where tumors trigger an efficient immune response, most likely by producing neoantigens, ie proteins carrying somatic mutations and thereby recognized as tumor antigens by the immune system.

We previously used Mutect2 from GATK for variant calling on tumor exome (from another project) and tried to use it again, this time to detect somatic mutation in tumor with RNAseq (16 matched pairs “normal tissue-tumor tissue”) as we don't have the correponding DNAseq. We got the following error :

ERROR MESSAGE: Unsupported CIGAR operator N in read ST-J00115:28:H5HTYBBXX:2:1214:24637:30538 at 1:12383. Perhaps you are trying to use RNA-Seq data? While we are currently actively working to support this data type unfortunately the GATK cannot be used with this data in its current form. You have the option of either filtering out all reads with operator N in their CIGAR string (please add --filter_reads_with_N_cigar to your command line) or assume the risk of processing those reads as they are including the pertinent unsafe flag (please add -U ALLOW_N_CIGAR_READS to your command line). Notice however that if you were to choose the latter, an unspecified subset of the analytical outputs of an unspecified subset of the tools will become unpredictable. Consequently the GATK team might well not be able to provide you with the usual support with any issue regarding any output
ERROR ------------------------------------------------------------------------------------------

Also, we could read in several forums dedicated to variant calling that this tool was not adapted, due notably to splicing and read depth depending on expression levels.

  • Shall we try anyway to run a few samples and check the output? We do not want an exhaustive list of variants, just a global idea of the relevance of our approach on neoantigens
  • Is it totally hopeless since it would just give us false positive results?
  • Is there an alternative way to call variants on these data, integrating matched pairs, and without reference genome for the samples?

Best Answer


  • mcharlesmcharles Jouy en JosasMember

    Thanks for your answer. We'll follows the recommendations.

  • dcampodcampo Los AngelesMember


    On the line of the above thread, I am doing somatic calling using RNAseq data, with matching tumor and normal samples. I have followed the Best practices recommendations for RNAseq pre-processing, all the way to re-calibrated bam files. Then, using GATK4, I created a PON with all my normal samples, and ran Mutect2, following the recommendations in
    The only option that I included that is not in the tutorial is the "--dont-use-soft-clipped-bases", which is specific for RNAseq (according to another post from you, Geraldine).
    After that, I estimated contamination, and filtered using FilterMutectCalls. I did not run the optional additional filtering step though (FilterByOrientationBias). But like I said, everything else was done following good practices and tutorials.
    Now, when I count the number of true positives (i.e. those with the filter PASS) in the final VCF, of the 51 tumor-normal pairs, half of them have more than 1,000 somatic calls, and some of those have between 20,000 and 30,000. The most extreme case is one with 250,000 somatic calls.
    I've never done somatic calling before, but I've been told that the number of true positives retained after filtering is typically in the order of a few hundred. If that's correct, it probably means that I am doing something wrong. Yet, I have gone through the entire pipeline several times, but cannot find any issues.
    I loaded in IGV the extreme sample with 250,000 somatic calls, randomly checked a few, and they seem to be alright. I have not checked all of them though, obviously.

    Since the above thread is from a year ago, I am wondering if there is now any recommendation for somatic calling with RNAseq data? Perhaps some of the filters applied in any of the steps are different than the recommendations in the tutorials (which are for exome-seq)?
    Any help will be much appreciated, cause I've been googling like crazy, but there's not much about somatic calling with RNA.


  • SheilaSheila Broad InstituteMember, Broadie, Moderator


    We don't currently have any Best Practice guidelines for RNA seq in Mutect2. But, let me get back to you on reasons you may be getting the results you are getting


  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @dcampo,

    For RNA-Seq data, the high number of somatic calls you see could be reasonable. Just to be clear, we do not support your use-case. But we'd like to explain why your results are unsurprising.

    What we need to account for in RNA-Seq data that is not a factor in DNA-Seq data is differential coverage due to (i) differences in transcription, e.g. in the normal gene A is expressed but in the tumor gene A is not expressed, and (ii) differential isoform expression, e.g. in the normal gene B transcript extends to the distal 3'UTR but in the tumor gene B transcript expresses only up to the proximal 3'UTR. I enumerate on detailed causes of differential coverage here.

    Here are some approaches to consider.

    • Restrict somatic short mutation calling to regions where both the tumor and normal have some minimal coverage. This can be done to your current callsets by subsetting the calls using the -L option to regions you have determined have coverage.
    • Collect coverage for each of your samples. GATK4 has a variety of tools towards this and the tool docs list them under Coverage Analysis. GATK3 has additional tools, e.g. DepthOfCoverage and DiagnoseTargets.

    Remember that for RNA-Seq, somatic analyses should additionally account for differential gene expression. This is an analysis type that researchers traditionally perform with outside programs, e.g. CuffDiff. Although the applicability is questionable and completely untested, the closest GATK4 comes to analyzing differential expression is with the somatic CNV workflow, which is currently in beta release. A rudimentary outline of the workflow can be found in a link on the February 14 comment in this thread.

    Good luck.

Sign In or Register to comment.