We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

The GATK Best Practices for variant calling on RNAseq, in full detail

Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

We’re excited to introduce our Best Practices recommendations for calling variants on RNAseq data. These recommendations are based on our classic DNA-focused Best Practices, with some key differences in the early data processing steps, as well as in the calling step.

Best Practices workflow for RNAseq


This workflow is intended to be run per-sample; joint calling on RNAseq is not supported yet, though that is on our roadmap.

Please see the new document here for full details about how to run this workflow in practice.

In brief, the key modifications made to the DNAseq Best Practices focus on handling splice junctions correctly, which involves specific mapping and pre-processing procedures, as well as some new functionality in the HaplotypeCaller.

Now, before you try to run this on your data, there are a few important caveats that you need to keep in mind.

Please keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have only been working with RNAseq for a few months, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing.

For one thing, these recommendations are based on high quality RNA-seq data (30 million 75bp paired-end reads produced on Illumina HiSeq). Other types of data might need slightly different processing. In addition, we have currently worked only on data from one tissue from one individual. Once we’ve had the opportunity to get more experience with different types (and larger amounts) of data, we will update these recommendations to be more comprehensive.

Finally, we know that the current recommended pipeline is producing both false positives (wrong variant calls) and false negatives (missed variants) errors. While some of those errors are inevitable in any pipeline, others are errors that we can and will address in future versions of the pipeline. A few examples of such errors are given in this article as well as our ideas for fixing them in the future.

We will be improving these recommendations progressively as we go, and we hope that the research community will help us by providing feedback of their experiences applying our recommendations to their data. We look forward to hearing your thoughts and observations!


  • map2085map2085 nyMember
    edited October 2016

    Are the authors of GATK BaseRecalibrator concerned about Post-transcriptional Modifications in RNA-seq?

    Reverse Transcriptases have difficulty correctly reading modified nucleotides. The Reverse Transcriptase may produce an error at a modified nucleotide when making the cDNA. Illumina will then read the resulting cDNA correctly (and give high quality score). Thus, even though the Illumina reads are correctly reporting the base in the cDNA (with high quality scores), it will be "wrong" compared to the reference, and not masked by dbSNP since it is only a Post-transcriptional modification. This will severely reduce the resulting empirical quality scores calculated by BaseRecalibrator.

    For DNA-seq, BaseRecalibrator masks "--knownSites" of polymorphism when calculating empirical Quality scores. The "--knownSites" is usually a VCF from e.g. dbSNP.

    In RNA-seq, do the authors of GATK recommend any kind of VCF with known Post-transcriptional modifications in mRNA?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    I'm afraid we don't have any recommendations for this -- in our hands BQSR performed normally on RNAseq data, but we haven't tested for this specifically. The size of potential effect is linked to how random vs. not random the errors might be, and what is the rate at which they occur. The more random and the lower the rate, the less noticeable any potential effect.

  • sbombinsbombin TuscaloosaMember

    Does expression level effects how many variants be found?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    @sbombin Potentially, yes. If you have different expression of two alleles, and one is very low, you may not be able to call them correctly -- e.g. you may call the wrong genotype, or miss the call altogether (if it looks hom-ref). Also, anything that is not expressed at all will typically not be callable.
  • dnousomednousome Member

    Hi Geraldine and others,
    I'm trying to run this RNA-seq variant calling pipeline on human tumor data (we have quite a few), and it seems like variant calling using Haplotypecaller is our rate limiting step. While alignment and preprocessing are quick, Haplotypecaller runs a bit slower. Is there any way to speed up the process? Could we potentially exclude indels (not of interest at the moment) to speed up the runs?

  • everestial007everestial007 GreensboroMember ✭✭
    edited November 2016

    HaplotypeCaller runs slower since it is a haplotype based calling method. HCaller is loosing time when doing denovo assembly of haplotypes in the active region, to find the most likely haplotype. Based on the number of samples and size of the genome and the computing power you have, it will take considerable time. On my desktop with 4 cores, 12 gb RAM, 205 mb reference genome and 12 samples, HaplotypeCaller takes about 2-3 days. If you use UnifiedGenotyper it will work way faster, like couple of hours for the same parameters-sample combination, I discussed earlier.

    I am not sure how much indel exclusion helps with the run of the speed. But, see this: http://gatkforums.broadinstitute.org/wdl/discussion/1975/how-can-i-use-parallelism-to-make-gatk-tools-run-faster
    HCaller only support -nct but isn't quite stable. But, on the positive note: I am used server with large RAM capacity (several gbs) and in this case the HCaller perfoms way much faster. In the report files I see that quite a high amount of memory was allocated to HCaller. I am not sure why the use of large RAM isn't quite discussed to speed up HCaller.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hey all, it's not possible to exclude indels from HaplotypeCaller, and we don't recommend using UnifiedGenotyper because when you don't model for indels, you will make errors in SNPs that are near indels. We recommend parallelizing execution of HC using data scatter gather.

  • satoshisatoshi kyotoMember

    Dear Geraldine

    How does this pipeline is accurate?
    Would you show me any paper or data, comparing whole exome/genome sequence and this pipeline?


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    @satoshi, we have not published any benchmarking or comparisons on this pipeline ourselves. Some of our collaborators have published a paper that I think may be related, which may include comparative results. We've linked to it previously; you may find it and related papers by looking for Ami Levy-Moonshine, Brian Haas and Timothy Tickle as authors.
  • satoshisatoshi kyotoMember

    Thank you for your reply.
    I will check it out!

  • WimSWimS Member ✭✭

    Hi @Geraldine_VdAuwera. Is this RNA-seq variant calling best practice document still valid for GATK4? I don't see any RNA-seq specific arguments in the haplotypecaller in GATK4 beta1. Thank you.

  • WimSWimS Member ✭✭

    Nevermind. Found the correct document http://www.broadinstitute.org/gatk/guide/article?id=3891 and GATK4 beta1 jar does seem to support SplitNCigarReads an ddontUseSoftClippedBases.

  • picard_gatk_mjpicard_gatk_mj Unconfirmed

    so gatk3 support joint calling for rna_seq, gatk4 does not ,why

  • SheilaSheila Broad InstituteMember, Broadie ✭✭✭✭✭


    No, we have not validated the GVCF workflow in both GATK3 and in GATK4. But, there are some new developments happening on that front, so this may happen in the near future.


  • aushaush Member

    can this workflow be adapted for mouse RNA-seq data?

  • SheilaSheila Broad InstituteMember, Broadie ✭✭✭✭✭


    Yes, you can use this workflow on mouse data. You just need to find the appropriate resource and reference files.


  • Hi,

    I would like to know what changes must be done in the RNA-seq variant calling best practice document to be valid for GATK4.

    Thank you,
  • Hi @Geraldine_VdAuwera ,

    In one of your earlier responses in Oct 2016, you mentioned that:

    "If you have different expression of two alleles, and one is very low, you may not be able to call them correctly -- e.g. you may call the wrong genotype, or miss the call altogether (if it looks hom-ref)."

    Since the hom-ref (i.e. 0/0) would affect the heterozygosity calculation a lot, is it possible to recover them? If not, how should I properly deal with loci with 0/0 so that the heterozygosity will not be biased.

    Also, missing genotypes are represented in two different ways, "./." and "." in vcf files from variant calling with RNAseq data. Although some documentations explain that they represent diploid and haploid missing genotypes, it is still unclear how they differ. Could you briefly explain what may generate "./." and ".". Is "." actually "0/0"?

    Thanks a lot!

  • AmmuAmmu Member
    I would like to determine variants from RNA-seq data that was generated in different lanes. I've combined it during alignment step using HISAT2 and the uniquely mapped reads to the ref. genome is 80%. Can I use this bam file from the step 2 (ie. Add read groups, sort, mark duplicates, and create index) onwards in GATK Best Practices workflow for SNP and indel calling on RNAseq data?
  • Hello, how should I properly cite the pipeline of Calling variants in RNAseq in a manuscript?

Sign In or Register to comment.