The GATK Best Practices for variant calling on RNAseq, in full detail

Geraldine_VdAuwera Posts: 10,557 Administrator, Dev admin

We’re excited to introduce our Best Practices recommendations for calling variants on RNAseq data. These recommendations are based on our classic DNA-focused Best Practices, with some key differences in the early data processing steps, as well as in the calling step.

Best Practices workflow for RNAseq


This workflow is intended to be run per-sample; joint calling on RNAseq is not supported yet, though that is on our roadmap.

Please see the new document here for full details about how to run this workflow in practice.

In brief, the key modifications made to the DNAseq Best Practices focus on handling splice junctions correctly, which involves specific mapping and pre-processing procedures, as well as some new functionality in the HaplotypeCaller.

Now, before you try to run this on your data, there are a few important caveats that you need to keep in mind.

Please keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have only been working with RNAseq for a few months, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing.

For one thing, these recommendations are based on high-quality RNA-seq data (30 million 75 bp paired-end reads produced on Illumina HiSeq). Other types of data might need slightly different processing. In addition, so far we have worked only with data from one tissue from one individual. Once we’ve had the opportunity to get more experience with different types (and larger amounts) of data, we will update these recommendations to be more comprehensive.

Finally, we know that the current recommended pipeline produces both false positive errors (wrong variant calls) and false negative errors (missed variants). While some of those errors are inevitable in any pipeline, others are errors that we can and will address in future versions of the pipeline. A few examples of such errors are given in this article, along with our ideas for fixing them in the future.

We will be improving these recommendations progressively as we go, and we hope that the research community will help us by providing feedback on their experiences applying our recommendations to their data. We look forward to hearing your thoughts and observations!

Geraldine Van der Auwera, PhD


  • map2085 ny Posts: 3 Member
    edited October 17

    Are the authors of GATK BaseRecalibrator concerned about Post-transcriptional Modifications in RNA-seq?

    Reverse transcriptases have difficulty correctly reading modified nucleotides. The reverse transcriptase may produce an error at a modified nucleotide when making the cDNA. Illumina will then read the resulting cDNA correctly (and assign it a high quality score). Thus, even though the Illumina reads correctly report the base in the cDNA (with high quality scores), that base will be "wrong" compared to the reference, and the site will not be masked by dbSNP, since it is only a post-transcriptional modification. This will severely reduce the resulting empirical quality scores calculated by BaseRecalibrator.

    For DNA-seq, BaseRecalibrator masks "--knownSites" of polymorphism when calculating empirical Quality scores. The "--knownSites" is usually a VCF from e.g. dbSNP.

    In RNA-seq, do the authors of GATK recommend any kind of VCF with known Post-transcriptional modifications in mRNA?

  • Geraldine_VdAuwera Posts: 10,557 Administrator, Dev admin

    I'm afraid we don't have any recommendations for this -- in our hands BQSR performed normally on RNAseq data, but we haven't tested for this specifically. The size of any potential effect depends on how random the errors are and on the rate at which they occur. The more random the errors and the lower their rate, the less noticeable any potential effect.

    Geraldine Van der Auwera, PhD
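
To make the exchange above concrete, here is a minimal Python sketch of the arithmetic involved. It is not GATK's actual BQSR estimator: it ignores covariates, and the add-one smoothing, the function names (`empirical_quality`, `expected_quality`) and the toy numbers are all assumptions chosen for illustration. The only established fact it relies on is the Phred scale, Q = -10 * log10(error probability). The first part counts mismatches with and without masking a set of sites; the second part shows how the expected empirical quality drops as the fraction of bases at modified sites grows.

```python
import math


def phred(error_rate):
    """Convert an error probability to a Phred-scaled quality score."""
    return -10.0 * math.log10(error_rate)


def empirical_quality(observations, masked_sites=frozenset()):
    """Empirical quality from (position, is_mismatch) observations.

    Positions in masked_sites are skipped, mirroring the idea of masking
    known sites. Add-one smoothing (an assumption of this sketch) avoids
    taking the log of zero when no mismatches are seen.
    """
    total = 0
    mismatches = 0
    for position, is_mismatch in observations:
        if position in masked_sites:
            continue
        total += 1
        mismatches += int(is_mismatch)
    return phred((mismatches + 1) / (total + 2))


def expected_quality(base_error_rate, modified_fraction, rt_miscall_rate):
    """Expected empirical quality when a fraction of bases sit at modified sites.

    A base mismatches the reference if the sequencer errs, or if it lies at a
    modified site and the reverse transcriptase miscalled it:
    observed = e + f * m * (1 - e).
    """
    observed = base_error_rate + modified_fraction * rt_miscall_rate * (
        1.0 - base_error_rate
    )
    return phred(observed)


# Toy data (hypothetical): 10,000 positions, each observed once.
n_positions = 10_000
sequencer_errors = set(range(0, n_positions, 1000))  # 10 true sequencing errors
modified_sites = set(range(250, n_positions, 500))   # 20 sites miscalled by RT

observations = [
    (p, p in sequencer_errors or p in modified_sites) for p in range(n_positions)
]

q_unmasked = empirical_quality(observations)
q_masked = empirical_quality(observations, masked_sites=modified_sites)
print(f"unmasked: Q{q_unmasked:.1f}, modified sites masked: Q{q_masked:.1f}")

# Sweep the fraction of bases at modified sites for a nominal Q30 error rate.
for fraction in (0.0, 0.0001, 0.001, 0.01):
    q = expected_quality(0.001, fraction, rt_miscall_rate=1.0)
    print(f"modified fraction {fraction:g}: expected Q{q:.1f}")
```

In this toy setup, leaving the modified sites unmasked lowers the empirical quality, and the sweep illustrates the point made in the reply: the lower the rate of such sites, the smaller the shift. Whether the effect matters on real RNAseq data depends on the true rate, which this sketch does not attempt to estimate.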
