The current GATK version is 3.3-0

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

# Calling variants in RNAseq

edited October 2014

### Overview

This document describes the details of the GATK Best Practices workflow for SNP and indel calling on RNAseq data.

Please note that any command lines are only given as example of how the tools can be run. You should always make sure you understand what is being done at each step and whether the values are appropriate for your data. To that effect, you can find more guidance here.

In brief, the key modifications made to the DNAseq Best Practices focus on handling splice junctions correctly, which involves specific mapping and pre-processing procedures, as well as some new functionality in the HaplotypeCaller. Here is a detailed overview:

### Caveats

Please keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have been working with RNAseq for a somewhat shorter time, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing.

We know that the current recommended pipeline is producing both false positives (wrong variant calls) and false negatives (missed variants) errors. While some of those errors are inevitable in any pipeline, others are errors that we can and will address in future versions of the pipeline. A few examples of such errors are given in this article as well as our ideas for fixing them in the future.

We will be improving these recommendations progressively as we go, and we hope that the research community will help us by providing feedback of their experiences applying our recommendations to their data.

### The workflow

#### 1. Mapping to the reference

The first major difference relative to the DNAseq Best Practices is the mapping step. For DNA-seq, we recommend BWA. For RNA-seq, we evaluated all the major software packages that are specialized in RNAseq alignment, and we found that we were able to achieve the highest sensitivity to both SNPs and, importantly, indels, using STAR aligner. Specifically, we use the STAR 2-pass method which was described in a recent publication (see page 43 of the Supplemental text of the Pär G Engström et al. paper referenced below for full protocol details -- we used the suggested protocol with the default parameters). In brief, in the STAR 2-pass approach, splice junctions detected in a first alignment run are used to guide the final alignment.

Here is a walkthrough of the STAR 2-pass alignment steps:

1) STAR uses genome index files that must be saved in unique directories. The human genome index was built from the FASTA file hg19.fa as follows:

genomeDir=/path/to/hg19
mkdir $genomeDir STAR --runMode genomeGenerate --genomeDir$genomeDir --genomeFastaFiles hg19.fa\  --runThreadN <n>


2) Alignment jobs were executed as follows:

runDir=/path/to/1pass
mkdir $runDir cd$runDir
STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>  3) For the 2-pass STAR, a new index is then created using splice junction information contained in the file SJ.out.tab from the first pass: genomeDir=/path/to/hg19_2pass mkdir$genomeDir
STAR --runMode genomeGenerate --genomeDir genomeDir --genomeFastaFiles hg19.fa \ --sjdbFileChrStartEnd /path/to/1pass/SJ.out.tab --sjdbOverhang 75 --runThreadN <n>  4) The resulting index is then used to produce the final alignments as follows: runDir=/path/to/2pass mkdirrunDir
cd $runDir STAR --genomeDir$genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>


The above step produces a SAM file, which we then put through the usual Picard processing steps: adding read group information, sorting, marking duplicates and indexing.

java -jar AddOrReplaceReadGroups I=star_output.sam O=rg_added_sorted.bam SO=coordinate RGID=id RGLB=library RGPL=platform RGPU=machine RGSM=sample

java -jar MarkDuplicates I=rg_added_sorted.bam O=dedupped.bam  CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT M=output.metrics


#### 3. Split'N'Trim and reassign mapping qualities

Next, we use a new GATK tool called SplitNCigarReads developed specially for RNAseq, which splits reads into exon segments (getting rid of Ns but maintaining grouping information) and hard-clip any sequences overhanging into the intronic regions.

In the future we plan to integrate this into the GATK engine so that it will be done automatically where appropriate, but for now it needs to be run as a separate step.

At this step we also add one important tweak: we need to reassign mapping qualities, because STAR assigns good alignments a MAPQ of 255 (which technically means “unknown” and is therefore meaningless to GATK). So we use the GATK’s ReassignOneMappingQuality read filter to reassign all good alignments to the default value of 60. This is not ideal, and we hope that in the future RNAseq mappers will emit meaningful quality scores, but in the meantime this is the best we can do. In practice we do this by adding the ReassignOneMappingQuality read filter to the splitter command.

Please note that we recently (6/11/14) edited this to fix a documentation error regarding the filter to use. See this announcement for details.

Finally, be sure to specify that reads with N cigars should be allowed. This is currently still classified as an "unsafe" option, but this classification will change to reflect the fact that this is now a supported option for RNAseq processing.

java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS


#### 4. Indel Realignment (optional)

After the splitting step, we resume our regularly scheduled programming... to some extent. We have found that performing realignment around indels can help rescue a few indels that would otherwise be missed, but to be honest the effect is marginal. So while it can’t hurt to do it, we only recommend performing the realignment step if you have compute and time to spare (or if it’s important not to miss any potential indels).

#### 5. Base Recalibration

We do recommend running base recalibration (BQSR). Even though the effect is also marginal when applied to good quality data, it can absolutely save your butt in cases where the qualities have systematic error modes.

Both steps 4 and 5 are run as described for DNAseq (with the same known sites resource files), without any special arguments. Finally, please note that you should NOT run ReduceReads on your RNAseq data. The ReduceReads tool will no longer be available in GATK 3.0.

#### 6. Variant calling

Finally, we have arrived at the variant calling step! Here, we recommend using HaplotypeCaller because it is performing much better in our hands than UnifiedGenotyper (our tests show that UG was able to call less than 50% of the true positive indels that HC calls). We have added some functionality to the variant calling code which will intelligently take into account the information about intron-exon split regions that is embedded in the BAM file by SplitNCigarReads. In brief, the new code will perform “dangling head merging” operations and avoid using soft-clipped bases (this is a temporary solution) as necessary to minimize false positive and false negative calls. To invoke this new functionality, just add -dontUseSoftClippedBases to your regular HC command line. Note that the -recoverDanglingHeads argument which was previously required is no longer necessary as that behavior is now enabled by default in HaplotypeCaller. Also, we found that we get better results if we lower the minimum phred-scaled confidence threshold for calling variants on RNAseq data, so we use a default of 20 (instead of 30 in DNA-seq data).

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I input.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 -o output.vcf


#### 7. Variant filtering

To filter the resulting callset, you will need to apply hard filters, as we do not yet have the RNAseq training/truth resources that would be needed to run variant recalibration (VQSR).

We recommend that you filter clusters of at least 3 SNPs that are within a window of 35 bases between them by adding -window 35 -cluster 3 to your command. This filter recommendation is specific for RNA-seq data.

As in DNA-seq, we recommend filtering based on Fisher Strand values (FS > 30.0) and Qual By Depth values (QD < 2.0).

java -jar GenomeAnalysisTK.jar -T VariantFiltration -R hg_19.fasta -V input.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o output.vcf


Please note that we selected these hard filtering values in attempting to optimize both high sensitivity and specificity together. By applying the hard filters, some real sites will get filtered. This is a tradeoff that each analyst should consider based on his/her own project. If you care more about sensitivity and are willing to tolerate more false positives calls, you can choose not to filter at all (or to use less restrictive thresholds).

An example of filtered (SNPs cluster filter) and unfiltered false variant calls:

An example of true variants that were filtered (false negatives). As explained in text, there is a tradeoff that comes with applying filters:

### Known issues

There are a few known issues; one is that the allelic ratio is problematic. In many heterozygous sites, even if we can see in the RNAseq data both alleles that are present in the DNA, the ratio between the number of reads with the different alleles is far from 0.5, and thus the HaplotypeCaller (or any caller that expects a diploid genome) will miss that call. A DNA-aware mode of the caller might be able to fix such cases (which may be candidates also for downstream analysis of allele specific expression).

Although our new tool (splitNCigarReads) cleans many false positive calls that are caused by splicing inaccuracies by the aligners, we still call some false variants for that same reason, as can be seen in the example below. Some of those errors might be fixed in future versions of the pipeline with more sophisticated filters, with another realignment step in those regions, or by making the caller aware of splice positions.

As stated previously, we will continue to improve the tools and process over time. We have plans to improve the splitting/clipping functionalities, improve true positive and minimize false positive rates, as well as developing statistical filtering (i.e. variant recalibration) recommendations.

We also plan to add functionality to process DNAseq and RNAseq data from the same samples simultaneously, in order to facilitate analyses of post-transcriptional processes. Future extensions to the HaplotypeCaller will provide this functionality, which will require both DNAseq and RNAseq in order to produce the best results. Finally, we are also looking at solutions for measuring differential expression of alleles.

[1] Pär G Engström et al. “Systematic evaluation of spliced alignment programs for RNA-seq data”. Nature Methods, 2013

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Tagged:

• New YorkPosts: 2Member

@mmterpstra: I believe the GATK pipeline should only used for calling variants and not for anything else. Therefore I think you should process your reads in parallel and apply CollectRnaSeqMetrics to the resulting BAM. I believe that things like MarkDuplicate will modify the results significantly...

• Posts: 40GATK Developer mod
edited July 2014

Hi @sirian
"I actually don't quite understand how splitNcigarReads can remove "intronic overhang" if it is before the N cigar. How does it know it is intronic, without any annotation information"
You are right, the tool does not try to "guess" the splice positions, but use the ones that already discovered. In some cases, although STAR (although true for other aligners that we tried) find the correct splice position for some of the reads, but not for others. In this cases you will see in the IGV (see the screenshot we provided in the doc) that some of the reads were correctly split, while others have intronic overhang, and lead to FP in the calling step (unlike your case, where there are no correctly split reads). Those are the cases that are fixed by splitNcigarReads tool. We are aware on that some of the FP that we still have are due to similar errors as in you case, we have some ideas and we hope to fix those in the future.

btw - In your case it might be fixed by adding the known annotations (genecode or something like it) as input. We chose not to use them since we prefer to have a pipeline that can work based on the actual data. However, this is our recommendation and you can try other things and let us know if it work better for you. We would like to see such cases in order to improve our development.

Hope it makes more sense now, let me know if not.

Post edited by ami on
• USPosts: 11Member

@ami said:
Hi sirian‌
"I actually don't quite understand how splitNcigarReads can remove "intronic overhang" if it is before the N cigar. How does it know it is intronic, without any annotation information"
You are right, the tool does not try to "guess" the splice positions, but use the ones that already discovered. In some cases, although STAR (although true for other aligners that we tried) find the correct splice position for some of the reads, but not for others. In this cases you will see in the IGV (see the screenshot we provided in the doc) that some of the reads were correctly split, while others have intronic overhang, and lead to FP in the calling step (unlike your case, where there are no correctly split reads). Those are the cases that are fixed by splitNcigarReads tool. We are aware on that some of the FP that we still have are due to similar errors as in you case, we have some ideas and we hope to fix those in the future.

btw - In your case it might be fixed by adding the known annotations (genecode or something like it) as input. We chose not to use them since we prefer to have a pipeline that can work based on the actual data. However, this is our recommendation and you can try other things and let us know if it work better for you. We would like to see such cases in order to improve our development.

Hope it makes more sense now, let me know if not.

Thanks!
Did you mean using reference annotations in STAR alignment? SplitNcigarReads doesn't seem to have such an option.
I'm considering removing any variants that fall outside of exons based on RefSeq annotation, because theoretically RNA-seq reads should be from exons only (if not consider immature mRNA).

• Posts: 40GATK Developer mod

@sirin - yes, I meant using reference annotations in STAR alignment (although it might be useful to have an option for that input in SplitNcigarReads, thanks for the idea).
The problem is that you will limit yourself to the known annotations, which based on our discussions with other people in the field (though never checked it myself) cover only about half of the real exoms, and thus you will throw all the covered exoms and any novel splice junction that your data might have.

• USPosts: 11Member

@ami said:
sirin - yes, I meant using reference annotations in STAR alignment (although it might be useful to have an option for that input in SplitNcigarReads, thanks for the idea).
The problem is that you will limit yourself to the known annotations, which based on our discussions with other people in the field (though never checked it myself) cover only about half of the real exoms, and thus you will throw all the covered exoms and any novel splice junction that your data might have.

I actually tried redoing the alignment with Ensembl annotation, but that variant still exists. That position is in one exon by Ensembl, but not by RefSeq. It will be more stringent to use RefSeq but I'm worried it may be too stringent.
Anyway, glad to know you will improve GATK for this issue and I'm looking forward to it.
Thanks.

• University of SaoPauloPosts: 17Member

@ami thank you so much

• Cologne, GermanyPosts: 5Member
edited August 2014

Hi,
I'm wondering does GATK support or plan to support calling variance betwenn two bam files, for example, compare one RNAseq data with pared DNAseq data to identify RNAediting events? Ideally would also support biological replicates.
Thanks for the nice tool again.

Post edited by s6juncheng on
• Posts: 40GATK Developer mod

@s6juncheng‌ We do plan to provide such option/tool. We already doing something like that with allele specific expression (i'm currently working on that) and collaborate with few group to do the same with RNA editing. When we will have tools that are ready to be used by the community we will be glad to share them, as we always do.

Hello GATK team. I was trying to run SplitNCigarReads on some RNA-seq data aligned using the PRADA pipeline (BWA against genome+transcriptome reference, then all reads converted back genome coordinates). I tried using the "--refactor_NDN_cigar_string" option from the recent "vnightly-2014-09-12-gb0687a8" build (due to the presence of "1N1D1N" like reads), but received this error before the walker began transversal of any reads:

##### ERROR MESSAGE: Code exception (see stack trace for error itself)

Here is the stack trace:


##### ERROR stack trace
java.lang.UnsupportedOperationException
at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:107)

Here is the command that I used:


java -Xmx8g -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R Homo_sapiens_assembly19.fasta -U ALLOW_N_CIGAR_READS  --refactor_NDN_cigar_string -I PCGA-01-0002-050-23843-0279R-SLCC279.dup.marked.bam -o PCGA-01-0002-050-23843-0279R-SLCC279.Nsplit.bam

Any thoughts? Thanks!

• University Medical Center Groningen, the NetherlandsPosts: 3Member

Thank you for this clear tutorial. In this paper (http://www.sciencedirect.com/science/article/pii/S0002929713003832) several filters are used to remove among others known RNA-editing sites and variant calls at splice sites. Apparently this helps to lower the false positive rate. This leads me to the following questions:

1. Is the haplotype caller also sensitive to these potential false positive calls?
2. If the haplotype caller is sensitive to for instance RNA editing is it possible to exclude these variants upfront so they are not used in the haplotype assembly?

Let me bring @ami‌, our main RNAseq developer, into the conversation. He's currently focusing on RNA editing events so he will be able to tell you more about HC's behavior in that regard.

Geraldine Van der Auwera, PhD

• Posts: 40GATK Developer mod
edited October 2014

Hi @Patrick_Deelen‌ ,

We do not consider the known RNA editing sites as FP so we do not filter them out. In fact one implementation of our pipeline is to find the RNA editing events. One can filter out (using -XL for example or other ways that @Geraldine_VdAuwera‌ can suggest) the known RNA-sites if you have them as interval file or VCF file. However I can't think of a way to prevent the haplotype assembly to use those site in the current pipeline.

For the variant calls at splice sites, we believe that the previous studies that use older tools had many FP in those location due to the limitation of the aligners and tools. Currently we are not filtering those type of calls since the HC and the aligning pipeline are able to do much better job in those regions and for some implementation it is important to be able to call such sites. Also in this case there is no easy way to ignore these sites in the haplotype assembly step, however if it turn out to improve the output of the pipeline, we might include such option in the future version of the tools. If you see cases of FP calls that are due to this reason, please share them with us so we can make our pipeline even better

Also, can you tell me a little bit about the purpose of RNAseq variant calling in your project? I try to learn more about the user cases of our pipeline so we can have them in mind while developing new tools and recommendations.

Let me know if you have more questions or comments

Post edited by ami on
• La JollaPosts: 4Member
edited October 2014

Hello @ami,
Thanks for the fascinating pipeline. I have some questions about the pipeline.

1. In DNAseq pipeline, there is a step "joint genotyping" which merge all the gvcf files from all samples. In RNAseq it seems we should run each sample separately? And it also comes to a next question:

2. How do we deal with the lanes and samples in RNAseq. Same with DNAseq? ( get recalibrated files separately --> merge lanes of the same sample --> realign). Or should we run through each pair of fastq files separately?

3. And for the organism without dbSNP, in the step of base quality recalibration, should we do the same with DNAseq (hard filter high quality results as dbSNP ---> run BQSR).

Thank you very much.

Post edited by shangzhong0619 on
1. At this time the RNAseq pipeline is only meant for single-sample processing, yes. We have not finished testing how the tools perform on RNAseq for multiple samples.

2. Lanes and samples should be treated the same way as for DNAseq.

3. The same recommendation applies, yes.

Geraldine Van der Auwera, PhD

• University Medical Center Groningen, the NetherlandsPosts: 3Member

Hi @Ami,

Thank you for your prompt and extensive answer. Let me first start by briefly explaining our project. We recently performed eQTL and ASE using only genotypes derived from public RNA-seq samples (http://biorxiv.org/lookup/doi/10.1101/007633). We are here only interested in the genomic variation since these contribute to eQTL and ASE effects. Since we started that project the number of available samples has doubled and in the near future we want to also include these samples. In contrast to our current work we would then also like to also discover novel variants instead of only calling known variants. For this reason I also asked if the GVCF mode will be made available for the RNA-seq as well, Geraldine told me that work on this is in progress.

Great that you now have higher accuracy around the splice sites, I agree that these are biological very interesting to call. I’m still a bit worried about the RNA editing sites though. In essence you are dealing with a non-diploid region at the RNA level, the 2 real copies from the DNA plus 1 or 2 haplotypes where the editing event took place. As long as this does not adversely affect the genotyping for nearby sites I’m happy, then I can simple exclude the editing sites at a later time. It would be a pity though if the nearby accuracy drops due to an editing event, is this something that you assessed?

We would by the way be happy to beta test the GVCF method in combination with RNA-seq genotyping when the software is ready for this. We are using the Geuvadis RNA-seq data as a benchmarking set since these samples are also part of the 1000G project.

@ami may respond in more detail, but FYI it is already technically possible to emit GVCFs for RNAseq data, if you're interested in testing this. You just combine the specific arguments of both modes. We'd love to hear whether this produces meaningful results for you and what are the chief error modes you observe. By the way, if you like living on the cutting edge, you can try the in-development versions which are made available every night; see nightly builds in the downloads section.

Geraldine Van der Auwera, PhD

• USC Epigenome CenterPosts: 2Member

I get the similar error as @egeulgen. few warning line for each sample "Interpreter - ![0,2]: 'QD < 2.0;' undefined variable QD". I just copy and paste the command line in this tutorial, only change the file name to my own VCF file. The bam file is aligned by MapSplice2, instead of output quality score > 20, i output all of the variant regions with qual > 0

@dnaase, I don't understand what you are saying. Can you please clarify exactly what command you ran and what was the complete error message?

Geraldine Van der Auwera, PhD

• Chongqing, ChinaPosts: 3Member

@ami, HaplotypeCaller supported for multiple samples calling on RNA-Seq?

• Posts: 40GATK Developer mod

@GooderPanQi‌ , we never tried that ourself, but there is no reason why it should not work. You should probably follow the DNA pipeline, using gvcfs, and using the specific parameters for RNA. We was just discussing it with our friends in Mount Sanai and they will also try to do it.
Just to be clear - we do not have experience with that, so that the reason that it is not in the recommendation, but we will be happy to get any feedback from you and other users that are willing to try that, and hopefully, based on that, we can update the recommendation at some point.

Please let us know how it goes.

• Chongqing, ChinaPosts: 3Member

@ami, we run the step "variant calling", but there're something wrong: "Invalid command line: The parameter recoverDanglingHeads is deprecated.  This argument is deprecated since version 3.3". GATK version 3.3 don't supported the parameter "recoverDanglingHeads"??

Hi @GooderPanQi

That functionality is enabled by default now; this is described in the 3.3 version notes. I'll update the RNAseq docs accordingly.

Geraldine Van der Auwera, PhD

• germanyPosts: 7Member

Hi @ami and @Geraldine,
First of all, thanks for this excellent workflow! Easy to follow and very descriptive!
I do have, however, some remarks concerning read groups and mapping quality. You wrote in the workflow, that these need to be adjusted, after running star. Are you aware that this can be done already within star? With the parameter "--outSAMmapqUnique" you can change the mapping quality assigned for unique mappings from the default 255 to 60. With the "--outSAMattrRGline" parameter you can add a read group line to the sam header. I further think it's useful to mention that star can directly output mappings in a coordinate sorted BAM file using "--outSAMtype BAM SortedByCoordinate". If you prefer picard sorting, you could at least start with an unsorted BAM using "--outSAMtype BAM Unsorted".
Just to let you know.
Best,
c

@corlagon Thanks! Are these capabilities in the latest version of STAR? We know Alex has been making a lot of upgrades, and we haven't yet updated this doc to fit, so I wouldn't be surprised if that was the reason for these redundancies.

Geraldine Van der Auwera, PhD

• Posts: 40GATK Developer mod

@corlagon‌ - thanks, we do aware of all those changes since we do touch bases with Alex. I assume that in the next versions of the pipeline, some of those changes or all of them will be included (in fact Alex included some of those options after our joint discussions).
Currently, since we focus on using the existing pipeline for several applications, and we plan to revisit the pipeline (and improve it farther more) after we will get more experience with several of its main use cases.

Please continue to give us feedback and comments, to allow us make sure we provide the best recommendations.

• Chongqing, ChinaPosts: 3Member

Hi @ami and @Geraldine, thanks for this excellent workflow. Since we used 'Star' to align RNA raw read, we find some problems that the number of aligned bam ( 19291348, 'samtools flagstat' was used ) with 'STAR' is more than RNA sequencing raw reads ( 18698820, wc -l fq ). Then, how we should compute mapped rate on RNAseq with 'STAR' ??

@GooderPanQi The developer of STAR, Alex Dobin, would be better qualified to answer that question. I believe he has a mailing list for support.

Geraldine Van der Auwera, PhD

• TorontoPosts: 6Member

Hi there,

New member here. I was wondering whether you would consider adding your "dangling-head" functionality to the Unified Genotyper as well. I've decided to use UG for RNAseq variant calling because my samples are heterogeneous primary tumours which likely contain a variable and unpredictable number of haplotypes in a given region, and so I didn't think HC was appropriate. I realize GATK does not support variant calling for tumour samples, but other tools (eg. MuTect) do not support RNAseq, so I tried to make the best of it.

It turns out that this RNAseq pipeline performs quite well for my samples, in that I get strong concordance between calls made from RNAseq and whole exome for the same individual (~90%), given adequate coverage of the variant. From manual inspection of my alignments, it seems that many of my false positive calls from RNAseq are caused by these "dangling heads" misaligned to introns. Something like "--recoverDanglingHeads" in UG would clean a lot of this up, I think. Just a suggestion.

Cheers!

• Posts: 40GATK Developer mod

@santayana‌ can you upload an IGV screenshot of such an example, I don't think you are thinking about "dangling heads" in the same way that we are, but I want to be sure.
[I will let @Geraldine_VdAuwera‌ to replay about any future development of UG]

• TorontoPosts: 6Member

Thanks for the quick reply Ami. The top window is whole exome and the bottom window is RNAseq from the same individual. It seems that the variant is only present in reads that only extend a few bases into the intron, suggesting to me that this a misalignment.

• TorontoPosts: 6Member

Hi @santayana‌,

Unfortunately it is impossible to add this functionality to the UnifiedGenotyper, since it does not perform any reassembly or realignment of the reads. In any case, the UnifiedGenotyper is now permanently deprecated, which means that we will no longer make any fixes or improvements to it. Going forward, we are only putting development effort into HaplotypeCaller.

For the record, the HaplotypeCaller is neither more nor less appropriate for tumor samples than UnifiedGenotyper, but at least it does handle RNAseq properly, so I would recommend you use HC rather than UG. In future we may be able to provide support for tumor RNAseq, but that's still a long way down the road.

Geraldine Van der Auwera, PhD

• TorontoPosts: 6Member

Hi Geraldine,

Thank you for your reply. It may be a moot point now, but please let me follow up your comment about HC vs UG for tumour samples, since this is a point of confusion for my colleagues and I. UG seems to me to be more suitable than HC for genetically heterogeneous tumour samples for the following reason: Say a tumour has two subclones: one contains a heterozygous mutation A, and one contains a heterozygous mutation B. These mutation modify adjacent bases, but they are mutually exclusive: no cell (and no haplotype) contains both A and B mutations. As I understand it, HC calls the two likeliest haplotypes, which in this case would be 1) wild type for both A and B, and 2) mutant for one of A or B (whichever is more abundant). In other words, it is bound to miss one of the mutations, because it cannot handle more than two haplotypes. UG on the other hand, considers each mutation independently, so, as long as the heterozygous mutant genotype has higher likelihood than wild type, then the mutation will be called, even if the mutant allele frequency deviates strongly away from 0.5. Hence, A and B are likely to be called by UG. In my tests, heterozygous mutants can be called with minor allele frequencies as low as 0.1, presumably because a heterozygous genotype is still more likely than a wild type genotype when sufficient depth is present.

Have I misunderstood something?

Hmm, I guess that's a fair point -- but you could solve this in HC by using the -ploidy argument to allow the caller to consider additional haplotypes.

Geraldine Van der Auwera, PhD

• Posts: 40GATK Developer mod

Hi @santayana
1) since it is an alignment issue, I just wanted to make sure you are using 2 pass STAR and use the SJ from the first round in the second round. Do you?
2) That type of problem are suppose to be fixed by SplitNCigarString and not by "dangling heads", which fix the missed true variants at the begging of exons. If you do use SplitNCigarString in you pipeline, it might still not fixed that specific error, since the default mismatches to clip the overhang are 2. You might change it to one and see how many "real" event you will remove, and how many FP you will remove, and decide based on that tradeoff if to use it or not.
3) I don't think you are write regarding your interpretation of how the HC will handle your suggested case. If there is no read to support the A+B haplotype, it won't be called as one event although it is a consecutive bases. Let me bering to the discussion @valentin who is actively developing the HC.
4) As Geraldine suggested the multi ploid option of HC can solve many of the more complex cases that might give advantage to a single locus caller, and we are testing it currently with RNA editing calling, and hopefully will publish even better recommendation for such cases in the near future.

Best,
Ami

• TorontoPosts: 6Member

Hi Ami and Geraldine,

1) Yes, I am following the pipeline that you have set out here, with the only exception of substituting HC for UG.
2) Ah OK, I understand now this is SplitNCigarString issue. I'll try changing the default option - thanks!
3) That's my point, that one of A or B won't be called in the sample even though they are both present.
4) I suppose I could experiment with the -ploidy option, although I imagine the intent of this option is to handle non-diploid organisms, and not the issue of clonal heterogeneity of tumours discussed here (although broad and focal copy number changes are common in cancer, which makes calculating the number of haplotypes even more complicated). The extent of tumour heterogeneity is an active area of research, but single cell sequencing data suggests it is very extensive. If I set -ploidy > 10, wouldn't that slow things down a lot?