Calling variants in RNAseq

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin
edited October 15 in Methods and Workflows

Overview

This document describes the details of the GATK Best Practices workflow for SNP and indel calling on RNAseq data.

Please note that any command lines are only given as example of how the tools can be run. You should always make sure you understand what is being done at each step and whether the values are appropriate for your data. To that effect, you can find more guidance here.

image

In brief, the key modifications made to the DNAseq Best Practices focus on handling splice junctions correctly, which involves specific mapping and pre-processing procedures, as well as some new functionality in the HaplotypeCaller.

Caveats

Please keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have only been working with RNAseq for a few months, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing.

For one thing, these recommendations are based on high quality RNA-seq data (30 million 75bp paired-end reads produced on Illumina HiSeq). Other types of data might need slightly different processing. In addition, we have currently worked only on data from one tissue from one individual. Once we’ve had the opportunity to get more experience with different types (and larger amounts) of data, we will update these recommendations to be more comprehensive.

Finally, we know that the current recommended pipeline is producing both false positives (wrong variant calls) and false negatives (missed variants) errors. While some of those errors are inevitable in any pipeline, others are errors that we can and will address in future versions of the pipeline. A few examples of such errors are given in this article as well as our ideas for fixing them in the future.

We will be improving these recommendations progressively as we go, and we hope that the research community will help us by providing feedback of their experiences applying our recommendations to their data.


The workflow

1. Mapping to the reference

The first major difference relative to the DNAseq Best Practices is the mapping step. For DNA-seq, we recommend BWA. For RNA-seq, we evaluated all the major software packages that are specialized in RNAseq alignment, and we found that we were able to achieve the highest sensitivity to both SNPs and, importantly, indels, using STAR aligner. Specifically, we use the STAR 2-pass method which was described in a recent publication (see page 43 of the Supplemental text of the Pär G Engström et al. paper referenced below for full protocol details -- we used the suggested protocol with the default parameters). In brief, in the STAR 2-pass approach, splice junctions detected in a first alignment run are used to guide the final alignment.

Here is a walkthrough of the STAR 2-pass alignment steps:

1) STAR uses genome index files that must be saved in unique directories. The human genome index was built from the FASTA file hg19.fa as follows:

genomeDir=/path/to/hg19
mkdir $genomeDir
STAR --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles hg19.fa\  --runThreadN <n>

2) Alignment jobs were executed as follows:

runDir=/path/to/1pass
mkdir $runDir
cd $runDir
STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>

3) For the 2-pass STAR, a new index is then created using splice junction information contained in the file SJ.out.tab from the first pass:

genomeDir=/path/to/hg19_2pass
mkdir $genomeDir
STAR --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles hg19.fa \
    --sjdbFileChrStartEnd /path/to/1pass/SJ.out.tab --sjdbOverhang 75 --runThreadN <n>

4) The resulting index is then used to produce the final alignments as follows:

runDir=/path/to/2pass
mkdir $runDir
cd $runDir
STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>

2. Add read groups, sort, mark duplicates, and create index

The above step produces a SAM file, which we then put through the usual Picard processing steps: adding read group information, sorting, marking duplicates and indexing.

java -jar AddOrReplaceReadGroups I=star_output.sam O=rg_added_sorted.bam SO=coordinate RGID=id RGLB=library RGPL=platform RGPU=machine RGSM=sample 

java -jar MarkDuplicates I=rg_added_sorted.bam O=dedupped.bam  CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT M=output.metrics 

3. Split'N'Trim and reassign mapping qualities

Next, we use a new GATK tool called SplitNCigarReads developed specially for RNAseq, which splits reads into exon segments (getting rid of Ns but maintaining grouping information) and hard-clip any sequences overhanging into the intronic regions.

image

In the future we plan to integrate this into the GATK engine so that it will be done automatically where appropriate, but for now it needs to be run as a separate step.

At this step we also add one important tweak: we need to reassign mapping qualities, because STAR assigns good alignments a MAPQ of 255 (which technically means “unknown” and is therefore meaningless to GATK). So we use the GATK’s ReassignOneMappingQuality read filter to reassign all good alignments to the default value of 60. This is not ideal, and we hope that in the future RNAseq mappers will emit meaningful quality scores, but in the meantime this is the best we can do. In practice we do this by adding the ReassignOneMappingQuality read filter to the splitter command.

Please note that we recently (6/11/14) edited this to fix a documentation error regarding the filter to use. See this announcement for details.

Finally, be sure to specify that reads with N cigars should be allowed. This is currently still classified as an "unsafe" option, but this classification will change to reflect the fact that this is now a supported option for RNAseq processing.

java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

4. Indel Realignment (optional)

After the splitting step, we resume our regularly scheduled programming... to some extent. We have found that performing realignment around indels can help rescue a few indels that would otherwise be missed, but to be honest the effect is marginal. So while it can’t hurt to do it, we only recommend performing the realignment step if you have compute and time to spare (or if it’s important not to miss any potential indels).

5. Base Recalibration

We do recommend running base recalibration (BQSR). Even though the effect is also marginal when applied to good quality data, it can absolutely save your butt in cases where the qualities have systematic error modes.

Both steps 4 and 5 are run as described for DNAseq (with the same known sites resource files), without any special arguments. Finally, please note that you should NOT run ReduceReads on your RNAseq data. The ReduceReads tool will no longer be available in GATK 3.0.

6. Variant calling

Finally, we have arrived at the variant calling step! Here, we recommend using HaplotypeCaller because it is performing much better in our hands than UnifiedGenotyper (our tests show that UG was able to call less than 50% of the true positive indels that HC calls). We have added some functionality to the variant calling code which will intelligently take into account the information about intron-exon split regions that is embedded in the BAM file by SplitNCigarReads. In brief, the new code will perform “dangling head merging” operations and avoid using soft-clipped bases (this is a temporary solution) as necessary to minimize false positive and false negative calls. To invoke this new functionality, just add -recoverDanglingHeads -dontUseSoftClippedBases to your regular HC command line. Also, we found that we get better results if we lower the minimum phred-scaled confidence threshold for calling variants on RNAseq data, so we use a default of 20 (instead of 30 in DNA-seq data).

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I input.bam -recoverDanglingHeads -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 -o output.vcf

7. Variant filtering

To filter the resulting callset, you will need to apply hard filters, as we do not yet have the RNAseq training/truth resources that would be needed to run variant recalibration (VQSR).

We recommend that you filter clusters of at least 3 SNPs that are within a window of 35 bases between them by adding -window 35 -cluster 3 to your command. This filter recommendation is specific for RNA-seq data.

As in DNA-seq, we recommend filtering based on Fisher Strand values (FS > 30.0) and Qual By Depth values (QD < 2.0).

java -jar GenomeAnalysisTK.jar -T VariantFiltration -R hg_19.fasta -V input.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o output.vcf 

Please note that we selected these hard filtering values in attempting to optimize both high sensitivity and specificity together. By applying the hard filters, some real sites will get filtered. This is a tradeoff that each analyst should consider based on his/her own project. If you care more about sensitivity and are willing to tolerate more false positives calls, you can choose not to filter at all (or to use less restrictive thresholds).

An example of filtered (SNPs cluster filter) and unfiltered false variant calls:

image

An example of true variants that were filtered (false negatives). As explained in text, there is a tradeoff that comes with applying filters:

image


Known issues

There are a few known issues; one is that the allelic ratio is problematic. In many heterozygous sites, even if we can see in the RNAseq data both alleles that are present in the DNA, the ratio between the number of reads with the different alleles is far from 0.5, and thus the HaplotypeCaller (or any caller that expects a diploid genome) will miss that call. A DNA-aware mode of the caller might be able to fix such cases (which may be candidates also for downstream analysis of allele specific expression).

Although our new tool (splitNCigarReads) cleans many false positive calls that are caused by splicing inaccuracies by the aligners, we still call some false variants for that same reason, as can be seen in the example below. Some of those errors might be fixed in future versions of the pipeline with more sophisticated filters, with another realignment step in those regions, or by making the caller aware of splice positions.

image

image

As stated previously, we will continue to improve the tools and process over time. We have plans to improve the splitting/clipping functionalities, improve true positive and minimize false positive rates, as well as developing statistical filtering (i.e. variant recalibration) recommendations.

We also plan to add functionality to process DNAseq and RNAseq data from the same samples simultaneously, in order to facilitate analyses of post-transcriptional processes. Future extensions to the HaplotypeCaller will provide this functionality, which will require both DNAseq and RNAseq in order to produce the best results. Finally, we are also looking at solutions for measuring differential expression of alleles.


[1] Pär G Engström et al. “Systematic evaluation of spliced alignment programs for RNA-seq data”. Nature Methods, 2013


NOTE: Questions about this document that were posted before June 2014 have been moved to this archival thread: http://gatkforums.broadinstitute.org/discussion/4709/questions-about-the-rnaseq-variant-discovery-workflow

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Comments

  • NicolasRobineNicolasRobine New YorkPosts: 2Member

    @mmterpstra: I believe the GATK pipeline should only used for calling variants and not for anything else. Therefore I think you should process your reads in parallel and apply CollectRnaSeqMetrics to the resulting BAM. I believe that things like MarkDuplicate will modify the results significantly...

  • amiami Posts: 36GATK Developer mod
    edited July 28

    Hi @sirian‌ "I actually don't quite understand how splitNcigarReads can remove "intronic overhang" if it is before the N cigar. How does it know it is intronic, without any annotation information" You are right, the tool does not try to "guess" the splice positions, but use the ones that already discovered. In some cases, although STAR (although true for other aligners that we tried) find the correct splice position for some of the reads, but not for others. In this cases you will see in the IGV (see the screenshot we provided in the doc) that some of the reads were correctly split, while others have intronic overhang, and lead to FP in the calling step (unlike your case, where there are no correctly split reads). Those are the cases that are fixed by splitNcigarReads tool. We are aware on that some of the FP that we still have are due to similar errors as in you case, we have some ideas and we hope to fix those in the future.

    btw - In your case it might be fixed by adding the known annotations (genecode or something like it) as input. We chose not to use them since we prefer to have a pipeline that can work based on the actual data. However, this is our recommendation and you can try other things and let us know if it work better for you. We would like to see such cases in order to improve our development.

    Hope it makes more sense now, let me know if not.

    Post edited by ami on
  • siriansirian USPosts: 11Member

    @ami said: Hi sirian‌ "I actually don't quite understand how splitNcigarReads can remove "intronic overhang" if it is before the N cigar. How does it know it is intronic, without any annotation information" You are right, the tool does not try to "guess" the splice positions, but use the ones that already discovered. In some cases, although STAR (although true for other aligners that we tried) find the correct splice position for some of the reads, but not for others. In this cases you will see in the IGV (see the screenshot we provided in the doc) that some of the reads were correctly split, while others have intronic overhang, and lead to FP in the calling step (unlike your case, where there are no correctly split reads). Those are the cases that are fixed by splitNcigarReads tool. We are aware on that some of the FP that we still have are due to similar errors as in you case, we have some ideas and we hope to fix those in the future.

    btw - In your case it might be fixed by adding the known annotations (genecode or something like it) as input. We chose not to use them since we prefer to have a pipeline that can work based on the actual data. However, this is our recommendation and you can try other things and let us know if it work better for you. We would like to see such cases in order to improve our development.

    Hope it makes more sense now, let me know if not.

    Thanks! Did you mean using reference annotations in STAR alignment? SplitNcigarReads doesn't seem to have such an option. I'm considering removing any variants that fall outside of exons based on RefSeq annotation, because theoretically RNA-seq reads should be from exons only (if not consider immature mRNA).

  • amiami Posts: 36GATK Developer mod

    @sirin - yes, I meant using reference annotations in STAR alignment (although it might be useful to have an option for that input in SplitNcigarReads, thanks for the idea). The problem is that you will limit yourself to the known annotations, which based on our discussions with other people in the field (though never checked it myself) cover only about half of the real exoms, and thus you will throw all the covered exoms and any novel splice junction that your data might have.

  • siriansirian USPosts: 11Member

    @ami said: sirin - yes, I meant using reference annotations in STAR alignment (although it might be useful to have an option for that input in SplitNcigarReads, thanks for the idea). The problem is that you will limit yourself to the known annotations, which based on our discussions with other people in the field (though never checked it myself) cover only about half of the real exoms, and thus you will throw all the covered exoms and any novel splice junction that your data might have.

    I actually tried redoing the alignment with Ensembl annotation, but that variant still exists. That position is in one exon by Ensembl, but not by RefSeq. It will be more stringent to use RefSeq but I'm worried it may be too stringent. Anyway, glad to know you will improve GATK for this issue and I'm looking forward to it. Thanks.

  • humaasifhumaasif University of SaoPauloPosts: 17Member
  • s6junchengs6juncheng Cologne, GermanyPosts: 5Member
    edited August 8

    Hi, I'm wondering does GATK support or plan to support calling variance betwenn two bam files, for example, compare one RNAseq data with pared DNAseq data to identify RNAediting events? Ideally would also support biological replicates. Thanks for the nice tool again.

    Post edited by s6juncheng on
  • amiami Posts: 36GATK Developer mod

    @s6juncheng‌ We do plan to provide such option/tool. We already doing something like that with allele specific expression (i'm currently working on that) and collaborate with few group to do the same with RNA editing. When we will have tools that are ready to be used by the community we will be glad to share them, as we always do.

  • jdcampjdcamp Broad InstitutePosts: 1Member

    Hello GATK team. I was trying to run SplitNCigarReads on some RNA-seq data aligned using the PRADA pipeline (BWA against genome+transcriptome reference, then all reads converted back genome coordinates). I tried using the "--refactor_NDN_cigar_string" option from the recent "vnightly-2014-09-12-gb0687a8" build (due to the presence of "1N1D1N" like reads), but received this error before the walker began transversal of any reads:

    ##### ERROR MESSAGE: Code exception (see stack trace for error itself)

    Here is the stack trace:

    
    ##### ERROR stack trace
    java.lang.UnsupportedOperationException
    at java.util.AbstractList.add(AbstractList.java:148)
    at java.util.AbstractList.add(AbstractList.java:108)
    at org.broadinstitute.gatk.tools.walkers.rnaseq.SplitNCigarReads.initialize(SplitNCigarReads.java:150)
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:319)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
    at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)

    Here is the command that I used:

    
    java -Xmx8g -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R Homo_sapiens_assembly19.fasta -U ALLOW_N_CIGAR_READS  --refactor_NDN_cigar_string -I PCGA-01-0002-050-23843-0279R-SLCC279.dup.marked.bam -o PCGA-01-0002-050-23843-0279R-SLCC279.Nsplit.bam

    Any thoughts? Thanks!

  • Patrick_DeelenPatrick_Deelen University Medical Center Groningen, the NetherlandsPosts: 3Member

    Thank you for this clear tutorial. In this paper (http://www.sciencedirect.com/science/article/pii/S0002929713003832) several filters are used to remove among others known RNA-editing sites and variant calls at splice sites. Apparently this helps to lower the false positive rate. This leads me to the following questions:

    1. Is the haplotype caller also sensitive to these potential false positive calls?
    2. If the haplotype caller is sensitive to for instance RNA editing is it possible to exclude these variants upfront so they are not used in the haplotype assembly?
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Hi @Patrick_Deelen

    Let me bring @ami‌, our main RNAseq developer, into the conversation. He's currently focusing on RNA editing events so he will be able to tell you more about HC's behavior in that regard.

    Geraldine Van der Auwera, PhD

  • amiami Posts: 36GATK Developer mod
    edited October 1

    Hi @Patrick_Deelen‌ ,

    We do not consider the known RNA editing sites as FP so we do not filter them out. In fact one implementation of our pipeline is to find the RNA editing events. One can filter out (using -XL for example or other ways that @Geraldine_VdAuwera‌ can suggest) the known RNA-sites if you have them as interval file or VCF file. However I can't think of a way to prevent the haplotype assembly to use those site in the current pipeline.

    For the variant calls at splice sites, we believe that the previous studies that use older tools had many FP in those location due to the limitation of the aligners and tools. Currently we are not filtering those type of calls since the HC and the aligning pipeline are able to do much better job in those regions and for some implementation it is important to be able to call such sites. Also in this case there is no easy way to ignore these sites in the haplotype assembly step, however if it turn out to improve the output of the pipeline, we might include such option in the future version of the tools. If you see cases of FP calls that are due to this reason, please share them with us so we can make our pipeline even better

    Also, can you tell me a little bit about the purpose of RNAseq variant calling in your project? I try to learn more about the user cases of our pipeline so we can have them in mind while developing new tools and recommendations.

    Let me know if you have more questions or comments

    Post edited by ami on
  • shangzhong0619shangzhong0619 La JollaPosts: 3Member
    edited October 1

    Hello @ami, Thanks for the fascinating pipeline. I have some questions about the pipeline.

    1. In DNAseq pipeline, there is a step "joint genotyping" which merge all the gvcf files from all samples. In RNAseq it seems we should run each sample separately? And it also comes to a next question:

    2. How do we deal with the lanes and samples in RNAseq. Same with DNAseq? ( get recalibrated files separately --> merge lanes of the same sample --> realign). Or should we run through each pair of fastq files separately?

    3. And for the organism without dbSNP, in the step of base quality recalibration, should we do the same with DNAseq (hard filter high quality results as dbSNP ---> run BQSR).

    Thank you very much.

    Post edited by shangzhong0619 on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Hi @shangzhong0619,

    1. At this time the RNAseq pipeline is only meant for single-sample processing, yes. We have not finished testing how the tools perform on RNAseq for multiple samples.

    2. Lanes and samples should be treated the same way as for DNAseq.

    3. The same recommendation applies, yes.

    Geraldine Van der Auwera, PhD

  • Patrick_DeelenPatrick_Deelen University Medical Center Groningen, the NetherlandsPosts: 3Member

    Hi @Ami,

    Thank you for your prompt and extensive answer. Let me first start by briefly explaining our project. We recently performed eQTL and ASE using only genotypes derived from public RNA-seq samples (http://biorxiv.org/lookup/doi/10.1101/007633). We are here only interested in the genomic variation since these contribute to eQTL and ASE effects. Since we started that project the number of available samples has doubled and in the near future we want to also include these samples. In contrast to our current work we would then also like to also discover novel variants instead of only calling known variants. For this reason I also asked if the GVCF mode will be made available for the RNA-seq as well, Geraldine told me that work on this is in progress.

    Great that you now have higher accuracy around the splice sites, I agree that these are biological very interesting to call. I’m still a bit worried about the RNA editing sites though. In essence you are dealing with a non-diploid region at the RNA level, the 2 real copies from the DNA plus 1 or 2 haplotypes where the editing event took place. As long as this does not adversely affect the genotyping for nearby sites I’m happy, then I can simple exclude the editing sites at a later time. It would be a pity though if the nearby accuracy drops due to an editing event, is this something that you assessed?

    We would by the way be happy to beta test the GVCF method in combination with RNA-seq genotyping when the software is ready for this. We are using the Geuvadis RNA-seq data as a benchmarking set since these samples are also part of the 1000G project.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    @ami may respond in more detail, but FYI it is already technically possible to emit GVCFs for RNAseq data, if you're interested in testing this. You just combine the specific arguments of both modes. We'd love to hear whether this produces meaningful results for you and what are the chief error modes you observe. By the way, if you like living on the cutting edge, you can try the in-development versions which are made available every night; see nightly builds in the downloads section.

    Geraldine Van der Auwera, PhD

  • dnaasednaase USC Epigenome CenterPosts: 2Member

    I get the similar error as @egeulgen. few warning line for each sample "Interpreter - ![0,2]: 'QD < 2.0;' undefined variable QD". I just copy and paste the command line in this tutorial, only change the file name to my own VCF file. The bam file is aligned by MapSplice2, instead of output quality score > 20, i output all of the variant regions with qual > 0

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    @dnaase, I don't understand what you are saying. Can you please clarify exactly what command you ran and what was the complete error message?

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.