Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

Calling variants in RNAseq

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,858Administrator, GATK Developer admin
edited June 11 in Methods and Workflows

Overview

This document describes the details of the GATK Best Practices workflow for SNP and indel calling on RNAseq data.

Please note that any command lines are only given as example of how the tools can be run. You should always make sure you understand what is being done at each step and whether the values are appropriate for your data. To that effect, you can find more guidance here.

image

In brief, the key modifications made to the DNAseq Best Practices focus on handling splice junctions correctly, which involves specific mapping and pre-processing procedures, as well as some new functionality in the HaplotypeCaller.

Caveats

Please keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have only been working with RNAseq for a few months, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing.

For one thing, these recommendations are based on high quality RNA-seq data (30 million 75bp paired-end reads produced on Illumina HiSeq). Other types of data might need slightly different processing. In addition, we have currently worked only on data from one tissue from one individual. Once we’ve had the opportunity to get more experience with different types (and larger amounts) of data, we will update these recommendations to be more comprehensive.

Finally, we know that the current recommended pipeline is producing both false positives (wrong variant calls) and false negatives (missed variants) errors. While some of those errors are inevitable in any pipeline, others are errors that we can and will address in future versions of the pipeline. A few examples of such errors are given in this article as well as our ideas for fixing them in the future.

We will be improving these recommendations progressively as we go, and we hope that the research community will help us by providing feedback of their experiences applying our recommendations to their data.


The workflow

1. Mapping to the reference

The first major difference relative to the DNAseq Best Practices is the mapping step. For DNA-seq, we recommend BWA. For RNA-seq, we evaluated all the major software packages that are specialized in RNAseq alignment, and we found that we were able to achieve the highest sensitivity to both SNPs and, importantly, indels, using STAR aligner. Specifically, we use the STAR 2-pass method which was described in a recent publication (see page 43 of the Supplemental text of the Pär G Engström et al. paper referenced below for full protocol details -- we used the suggested protocol with the default parameters). In brief, in the STAR 2-pass approach, splice junctions detected in a first alignment run are used to guide the final alignment.

Here is a walkthrough of the STAR 2-pass alignment steps:

1) STAR uses genome index files that must be saved in unique directories. The human genome index was built from the FASTA file hg19.fa as follows:

genomeDir=/path/to/hg19
mkdir $genomeDir
STAR --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles hg19.fa\  --runThreadN <n>

2) Alignment jobs were executed as follows:

runDir=/path/to/1pass
mkdir $runDir
cd $runDir
STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>

3) For the 2-pass STAR, a new index is then created using splice junction information contained in the file SJ.out.tab from the first pass:

genomeDir=/path/to/hg19_2pass
mkdir $genomeDir
STAR --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles hg19.fa \
    --sjdbFileChrStartEnd /path/to/1pass/SJ.out.tab --sjdbOverhang 75 --runThreadN <n>

4) The resulting index is then used to produce the final alignments as follows:

runDir=/path/to/2pass
mkdir $runDir
cd $runDir
STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>

2. Add read groups, sort, mark duplicates, and create index

The above step produces a SAM file, which we then put through the usual Picard processing steps: adding read group information, sorting, marking duplicates and indexing.

java -jar AddOrReplaceReadGroups I=star_output.sam O=rg_added_sorted.bam SO=coordinate RGID=id RGLB=library RGPL=platform RGPU=machine RGSM=sample 

java -jar MarkDuplicates I=rg_added_sorted.bam O=dedupped.bam  CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT M=output.metrics 

3. Split'N'Trim and reassign mapping qualities

Next, we use a new GATK tool called SplitNCigarReads developed specially for RNAseq, which splits reads into exon segments (getting rid of Ns but maintaining grouping information) and hard-clip any sequences overhanging into the intronic regions.

image

In the future we plan to integrate this into the GATK engine so that it will be done automatically where appropriate, but for now it needs to be run as a separate step.

At this step we also add one important tweak: we need to reassign mapping qualities, because STAR assigns good alignments a MAPQ of 255 (which technically means “unknown” and is therefore meaningless to GATK). So we use the GATK’s ReassignOneMappingQuality read filter to reassign all good alignments to the default value of 60. This is not ideal, and we hope that in the future RNAseq mappers will emit meaningful quality scores, but in the meantime this is the best we can do. In practice we do this by adding the ReassignOneMappingQuality read filter to the splitter command.

Please note that we recently (6/11/14) edited this to fix a documentation error regarding the filter to use. See this announcement for details.

Finally, be sure to specify that reads with N cigars should be allowed. This is currently still classified as an "unsafe" option, but this classification will change to reflect the fact that this is now a supported option for RNAseq processing.

java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

4. Indel Realignment (optional)

After the splitting step, we resume our regularly scheduled programming... to some extent. We have found that performing realignment around indels can help rescue a few indels that would otherwise be missed, but to be honest the effect is marginal. So while it can’t hurt to do it, we only recommend performing the realignment step if you have compute and time to spare (or if it’s important not to miss any potential indels).

5. Base Recalibration

We do recommend running base recalibration (BQSR). Even though the effect is also marginal when applied to good quality data, it can absolutely save your butt in cases where the qualities have systematic error modes.

Both steps 4 and 5 are run as described for DNAseq (with the same known sites resource files), without any special arguments. Finally, please note that you should NOT run ReduceReads on your RNAseq data. The ReduceReads tool will no longer be available in GATK 3.0.

6. Variant calling

Finally, we have arrived at the variant calling step! Here, we recommend using HaplotypeCaller because it is performing much better in our hands than UnifiedGenotyper (our tests show that UG was able to call less than 50% of the true positive indels that HC calls). We have added some functionality to the variant calling code which will intelligently take into account the information about intron-exon split regions that is embedded in the BAM file by SplitNCigarReads. In brief, the new code will perform “dangling head merging” operations and avoid using soft-clipped bases (this is a temporary solution) as necessary to minimize false positive and false negative calls. To invoke this new functionality, just add -recoverDanglingHeads -dontUseSoftClippedBases to your regular HC command line. Also, we found that we get better results if we lower the minimum phred-scaled confidence threshold for calling variants on RNAseq data, so we use a default of 20 (instead of 30 in DNA-seq data).

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I input.bam -recoverDanglingHeads -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 -o output.vcf

7. Variant filtering

To filter the resulting callset, you will need to apply hard filters, as we do not yet have the RNAseq training/truth resources that would be needed to run variant recalibration (VQSR).

We recommend that you filter clusters of at least 3 SNPs that are within a window of 35 bases between them by adding -window 35 -cluster 3 to your command. This filter recommendation is specific for RNA-seq data.

As in DNA-seq, we recommend filtering based on Fisher Strand values (FS > 30.0) and Qual By Depth values (QD < 2.0).

java -jar GenomeAnalysisTK.jar -T VariantFiltration -R hg_19.fasta -V input.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o output.vcf 

Please note that we selected these hard filtering values in attempting to optimize both high sensitivity and specificity together. By applying the hard filters, some real sites will get filtered. This is a tradeoff that each analyst should consider based on his/her own project. If you care more about sensitivity and are willing to tolerate more false positives calls, you can choose not to filter at all (or to use less restrictive thresholds).

An example of filtered (SNPs cluster filter) and unfiltered false variant calls:

image

An example of true variants that were filtered (false negatives). As explained in text, there is a tradeoff that comes with applying filters:

image


Known issues

There are a few known issues; one is that the allelic ratio is problematic. In many heterozygous sites, even if we can see in the RNAseq data both alleles that are present in the DNA, the ratio between the number of reads with the different alleles is far from 0.5, and thus the HaplotypeCaller (or any caller that expects a diploid genome) will miss that call. A DNA-aware mode of the caller might be able to fix such cases (which may be candidates also for downstream analysis of allele specific expression).

Although our new tool (splitNCigarReads) cleans many false positive calls that are caused by splicing inaccuracies by the aligners, we still call some false variants for that same reason, as can be seen in the example below. Some of those errors might be fixed in future versions of the pipeline with more sophisticated filters, with another realignment step in those regions, or by making the caller aware of splice positions.

image

image

As stated previously, we will continue to improve the tools and process over time. We have plans to improve the splitting/clipping functionalities, improve true positive and minimize false positive rates, as well as developing statistical filtering (i.e. variant recalibration) recommendations.

We also plan to add functionality to process DNAseq and RNAseq data from the same samples simultaneously, in order to facilitate analyses of post-transcriptional processes. Future extensions to the HaplotypeCaller will provide this functionality, which will require both DNAseq and RNAseq in order to produce the best results. Finally, we are also looking at solutions for measuring differential expression of alleles.


[1] Pär G Engström et al. “Systematic evaluation of spliced alignment programs for RNA-seq data”. Nature Methods, 2013

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Comments

  • NicolasRobineNicolasRobine New YorkPosts: 2Member

    Hi,

    First, thanks a lot for this document. I am testing it and will probably report on it soon. In the meantime, I have a question about the alignment part. In STAR, it is possible to build the index with a set of canonical splicing junctions, in addition to the reference sequence (as shown in the second pass of the alignment). Has it been tried to do the first-pass with a known annotation (RefSeq or Gencode, for instance). Does it decrease the performances? Also, a typo in the command for the generation of the index, in 2ndPass-STAR: the parameter genomeFastaFiles need a double-dash: --genomeFastaFiles

    Thanks. Nicolas Robine

  • amiami Posts: 31GATK Developer mod
    edited March 25

    Hi @NicolasRobine‌, thanks for pointing out the error in the STAR command. We did not use a known annotation file as you suggest since we would like to have a pipeline that is not dependent on such annotations (that are constantly being updated and yet are probably still missing many splice junctions). One can use it and I I would assume it won't decrease the performance. If you do it, it will be great to get a feedback about it.

    Post edited by ami on
  • pachewychomppachewychomp OregonPosts: 2Member

    Hi there,

    Thanks for this resource! I have been trying to duplicate this protocol and am coming across error messages similar to those related to old bugs in the PG management system. Here is the stack trace:

    ERROR stack trace

    net.sf.picard.PicardException: 3 program groups weren't processed. Do their PP ids point to existing PGs? @PG ID:MarkDuplicates.2 PN:MarkDuplicates PP:T @PG ID:MarkDuplicates.3 PN:MarkDuplicates PP:A @PG ID:MarkDuplicates.4 PN:MarkDuplicates PP:M

        at net.sf.picard.sam.SamFileHeaderMerger.mergeProgramGroups(SamFileHeaderMerger.java:321)
        at net.sf.picard.sam.SamFileHeaderMerger.<init>(SamFileHeaderMerger.java:170)
        at org.broadinstitute.sting.gatk.datasources.reads.SAMDataSource$SAMReaders.<init>(SAMDataSource.java:885)
        at org.broadinstitute.sting.gatk.datasources.reads.SAMDataSource$SAMResourcePool.createNewResource(SAMDataSource.java:795)
        at org.broadinstitute.sting.gatk.datasources.reads.SAMDataSource$SAMResourcePool.getAvailableReaders(SAMDataSource.java:766)
        at org.broadinstitute.sting.gatk.datasources.reads.SAMDataSource.<init>(SAMDataSource.java:277)
        at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.createReadsDataSource(GenomeAnalysisEngine.java:872)
        at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.initializeDataSources(GenomeAnalysisEngine.java:761)
        at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:284)
        at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:121)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
        at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
        at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:107)
    
    ERROR ------------------------------------------------------------------------------------------
    ERROR A GATK RUNTIME ERROR has occurred (version 3.1-1-g07a4bf8):

    I do have the most recent version of Java installed.

    Thanks much!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,858Administrator, GATK Developer admin

    Hi @pachewychomp,

    I think the latest version of Picard no longer produces this issue (or there's an option to disable writing PG tags). To work with these files, you'll need to reheader them in order to get GATK to accept them as input.

    Geraldine Van der Auwera, PhD

  • pachewychomppachewychomp OregonPosts: 2Member

    Thanks @Geraldine_VdAuwera‌. For anyone else running in to this issue, set PG=null in Picard's (v1.110) MarkDuplicates.

  • mjafinmjafin UKPosts: 2Member

    I had a comment regarding SplitNCigarReads and how it assigns the same QNAME for the split reads. In the SAM spec it says that "In a SAM file, a read may occupy multiple alignment lines, when its alignment is chimeric or when multiple mappings are given." However, SplitNCigarReads doesn't change anything else in the reads except the CIGAR and start position. In other words, the FLAG isn't changed in any way to reflect the split reads. I wonder if this is intentional?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,858Administrator, GATK Developer admin

    Hi @mjafin,

    The BAM file produced by SplitNCigarReads is only intended to be used for variant calling by the HaplotypeCaller. So that information is not meant to be usable by other programs, if that's what you are concerned about.

    Geraldine Van der Auwera, PhD

  • mjafinmjafin UKPosts: 2Member

    Hi @Geraldine_VdAuwera, That's what I was trying to get at, yes - thanks :)

    So basically the BAM files shouldn't be used downstream for anything else I guess

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,858Administrator, GATK Developer admin

    Yep, that's right. Eventually we're hoping to integrate that tool's functionality into the HaplotypeCaller to avoid producing yet another intermediate bam file.

    Geraldine Van der Auwera, PhD

  • zzqzzq ChinaPosts: 9Member

    Hi ,all : Can I use tophat2 to instead of STAR ?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,858Administrator, GATK Developer admin

    You can if you want. We found that we got best results overall (including indel calls) with the STAR 2-pass method. But you can choose to use something different if you prefer. It's your data, your analysis :)

    Geraldine Van der Auwera, PhD

  • yasinkaymazyasinkaymaz Posts: 8Member

    Hi Geraldine, I am currently running GATK for my RNAseq data set (after running TopHat with bowtie2). It is great that you have integrated a tool that takes care of reads with Ncigar. I used to filter out all spliced reads and run GATK. I was wondering if it would be possible to keep rest of the read (somehow, may be in separate intermediate file) and make use of them as well, rather than throwing them away? I know I want too much but just an idea. My second question; would it hurt if we use known variation databases for VQSR in case of RNAseq (human)? I am testing both VQSR and hard filtering but wondering the side effects?

  • amiami Posts: 31GATK Developer mod
    edited April 9

    @zzq : Just to give more info to @Geraldine_VdAuwera‌ answer - if you care only about SNPs, Tophat will perform only slightly worse than star (** based on our data ** ). Probably most of the differences will of variants that are close to the splicing positions and/or close to Indels. If you do care about Indels as well, star will perform much better (when using the default parameters of both aligners). As we try to get more familiar with the data that our users have - can you please also mention what type of data you have?

    Post edited by ami on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,858Administrator, GATK Developer admin

    Hi @yasinkaymaz,

    I was wondering if it would be possible to keep rest of the read (somehow, may be in separate intermediate file) and make use of them as well, rather than throwing them away?

    I'm not sure exactly what you mean, can you please clarify? With the new tool nothing really gets thrown away; the reads with Ns are split but all non-N bases are used, for the most part.

    would it hurt if we use known variation databases for VQSR in case of RNAseq (human)?

    This is something we're currently testing; @ami may be able to tell you more about the very latest findings, but my understanding is that VQSR should work. There may just be some tweaks required to get the best results.

    Geraldine Van der Auwera, PhD

  • amiami Posts: 31GATK Developer mod

    Hi @yasinkaymaz‌,

    Since the pipeline is currently recommended for single sample (as we did not test it yet with many samples), we do not have recommendations or enough experience with VQSR for that pipeline (you need to see enough data in order to use VQSR, which is not the currently the case with the RNA-seq pipeline). We are planning to use VQSR in the future versions of the best practices pipelines.

    As we try to get more familiar with the data that our users have - can you please also mention what type of data you have? Do you have data from many samples (enough for VQSR)?

  • yasinkaymazyasinkaymaz Posts: 8Member

    @Geraldine_VdAuwera‌ What I meant by Ncigar was the reads that are represented with an "N" notation in cigar in their bam/sam files, which basically shows that they are spliced reads (specific to RNAseq). I did not mean the reads with Ns (ambiguous bases). As far as I understand from the explanations of the tool above ("3. Split'N'Trim and reassign mapping qualities"), the reads that carry N in their cigar stings are trimmed off of their intronic site. This way tool gets rid of possible miss-matches. Please correct me if I am wrong. What I suggested to keep was the part of the read that is trimmed off.

    @ami‌ My data is strand specific total RNAseq and I have ~40 samples. I ran GATK on individual samples but also planning to run it on all at the same time even though I don't know up to which level it will effect the calls.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,858Administrator, GATK Developer admin

    Ah, I see -- you're referring to "dangling" heads and tails. When they occur later, within the HaplotypeCaller (due to the reads being moved during the assembly step), they are "rescued" and placed in the appropriate location if possible. When they occur at the Split'N'Trim step we currently can't rescue them, but I believe the plan is to move that functionality into HaplotypeCaller, so they will ultimately also be rescued wherever possible. @ami, is this the right way to put it?

    Geraldine Van der Auwera, PhD

  • amiami Posts: 31GATK Developer mod

    @yasinkaymaz‌ - can you give an example of what do you mean? I'm not sure I understand what you mean by "This way tool gets rid of possible miss-matches"

  • yasinkaymazyasinkaymaz Posts: 8Member

    @ami If you try to align RNAseq reads using directly bowtie/bwa, you will not get any spliced reads (without using any soft-clipping option etc.). Then, when you visualize the alignment on IGV, you would see many miss-matches especially in intron spanning reads. These would introduce many false positives when you call variations. However; a splice aligner splices these intron spanning reads and flags the cigar string with 'N'. At this point GATK needs to handle cigar string properly. So, my understanding is that SplitNCigarReads tool is for this purpose, right? I might have been a little unclear, sorry. Thanks for your time and effort.

  • amiami Posts: 31GATK Developer mod

    Everything you describe is right. The splice aligner will add N's to the cigar only to gap between 2 exons, and this is the part that we remove with the SplitNCigarReads, so you don't (implicitly) lose information and there is no part of the read that I can think that we wold like to keep as in your suggestion "What I suggested to keep was the part of the read that is trimmed off" - the part that is trimmed is only N's. We do trim some "dangling" heads and tails that have many mismatches since they usually should be spliced by the aligner by somehow were missed. Does it make sense now?

  • yasinkaymazyasinkaymaz Posts: 8Member
    edited April 9

    Ok, now it makes more sense. You don't trim the reads from the spliced position. You only trim small dangling parts. Thanks for the explanation.

    Post edited by yasinkaymaz on
  • yg1yg1 Posts: 37Member

    My question is why reduced reads option is not recommended for? Why reduce the minimum phred scale to 20?

  • AicenAicen Posts: 1Member

    HI all, i got the problem that mein coordinate sorted bam file is unsorted after using picard removeduplicates.jar and therefore i got an error when i try to use java -jar GenomeAnalysisTK.jar -T SplitNCigarReads my commands: java -jar ${picard}MarkDuplicates.jar I=$bam O=$bam_out CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT M="${bam_out[0]}.${bam_out[1]}.metrics"

    You maybe need to add this step to the workflow or i did sth wrong. Thx, Alex.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,858Administrator, GATK Developer admin

    @yg1 ReduceReads would strip out important information about splice junctions. Anyway, all GATK 3.x tools will refuse to work with reduced bams.

    Geraldine Van der Auwera, PhD

  • zzqzzq ChinaPosts: 9Member

    Hi all: I have failed when i use SplitNCigarReads(mapping by tophat2). It give me an error like this : Cannot split this read (might be an empty section between Ns, for example 1N1D1N): 10M628N2D203N90M Should I filter these reads mapped by this style ? What is more , which tools can do this? (does samtools can finish it ?) thanks!

  • amiami Posts: 31GATK Developer mod

    @zzg‌ which aligner are you using? such a Cigar string does not make any sense, so I would suspect such results... You can write a GATK walker to filter such reads (I don't think that any of the tools can do it right now). Even if you do filter these reads out, I would be suspicious with the other reads...

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,858Administrator, GATK Developer admin

    @zzq Would you be able to share a snippet of data that reproduces this error? We want to understand why the mapper is producing this CIGAR in order to decide how we should handle it.

    Geraldine Van der Auwera, PhD

  • egeulgenegeulgen USPosts: 3Member

    hey, First of thanks really very much for this! I've already run the HaplotypeCaller and also run VariantFiltration on the resulting VCFs. However, for each read I get 1-3 warning lines like this: "Interpreter - ![0,2]: 'QD < 2.0;' undefined variable QD" I thought this is because some lines in the VCF files don't contain QD. Is that possible? I don't think it should bother me that much because I only get the error for 1 to 3 lines per sample. Should I check what lines don't contain QD? I'm using the latest GATK version(v3.1-1-g07a4bf8) and the line for variant filtration is exactly the same you suggest using, above. The process doesn't abort and I have the resulting filtered vcfs.

  • zzqzzq ChinaPosts: 9Member

    @ami@Geraldine_VdAuwera‌ I mapped my reads by tophat2. I have filtered these reads by a simple perl script :perl -e 'while(<>){chomp;@tmp=split/\t/;if($tmp[0]=~/^\@/ or $tmp[5] !~ /(\d+)N(\d+)D(\d+)N/){print "$_\n"}}' input.sam > filtered.bam .Until now , it works all right.

    the filtered reads like follows: HWI-7001446:268:C38BAACXX:6:2313:6195:49269 153 18 18781061 50 10M628N2D203N90M * 0 0 GAAGATCCAGGTGGGCCAGCACGGGCACTCGTCCGTAGAAGACGCCATGACCGCCATGGAGCTCTACCGGCTGGTGGAGGTGCAGTGGGAGCAGCAGGCG EDDDDBDCBDDDDBBDDDDDDDDDDB?DDDDDBDDDDDDDDBDFFFFFEHJJJJJJJJJJIJJJJHJJJJJJJJJJJIJJJJJJJJJHHHHHFFFFFCCC MD:Z:10^AT90 PG:Z:MarkDuplicates.5 RG:Z:6673 XG:i:2 NH:i:1 NM:i:2 XM:i:0 XN:i:0 XO:i:1 AS:i:-7 XS:A:+ YT:Z:UU HWI-7001446:268:C38BAACXX:6:1212:18965:69011 83 18 18781066 50 5M628N2D203N95M`` = 18780931 -1068 TCCAGGTGGGCCAGCACGGGCACTCGTCCGTAGAAGACGCCATGACCGCCATGGAGCTCTACCGGCTGGTGGAGGTGCAGTGGGAGCAGCAGGCGGCCAG DDDDDDDDDDDDDDDBDDDDDBDBBDDDDDDDDDDDDDDDDDDB>DHHIGJJIJJJJJIIGIJJJJJJJJJJJJJJJIJJJJJJIHFHHFHHFFFFFCCC MD:Z:5^AT95 PG:Z:MarkDuplicates RG:Z:6658 XG:i:2 NH:i:1 NM:i:2 XM:i:0 XN:i:0 XO:i:1 AS:i:-7 XS:A:+ YT:Z:UU

    I hope you can help me ! Thanks !

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,858Administrator, GATK Developer admin

    @zzq‌ I've discussed this with the TopHat developers; it seems to be a side effect of the use of a transcriptome resource in the alignment process. What's happening here is probably that the read maps to a transcript that includes only two of three exons, with the middle one getting skipped at transcription time. We will need to add some functionality to handle this properly. In the meantime, I would suggest filtering out these reads by name (since your script identified them) using the ReadName filter.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,858Administrator, GATK Developer admin

    Hi @egeulgen, sorry I missed your question yesterday. Can you look up which lines are affected and post them here?

    Geraldine Van der Auwera, PhD

  • davidpzdavidpz HMSPosts: 2Member

    Hello,

    first of all, thank you so much for this tutorial and this software. I have a question. I have RNA-seq data generated from F1 mice (B6 X NOD), therefore het at every loci. B6 is the reference genome, NOD the alternate. I applied exactly your command line, with the exception of starting with bam files aligned with Tophat2. I used SNPs and INDELs reference for NOD from Sanger Institute. The vcf files before and after variant filtering somehow only show genotypes from het and homo-alt (GT = 0/1 and 1/1). Am I missing something so that I do not see GT=0/0 ?

    Thank you,

    David

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,858Administrator, GATK Developer admin

    Hi David,

    By default HaplotypeCaller only outputs variant sites. To also output non-variant sites, you need to use the -ERC GVCF mode (for compressed/banded GVCF output) or -ERC BP_RESOLUTION (for per-site output) (see the FAQ doc on GVCFs for details).

    Geraldine Van der Auwera, PhD

  • amiami Posts: 31GATK Developer mod

    @davidpz‌ Can you explain a little bit more what are you trying to do? Do you just want to check the genotypes of the new pipeline given the known alleles from the Sanger Institute (as you wrote: " I used SNPs and INDELs reference for NOD from Sanger Institute.") or do you want to compare the calls of the new pipeline and the calls that you have from the Sanger Institute (on those known sites? or on all genome?). There are different tools for those 2 different directions, and we will be able to give you better suggestions. As we try to get feedback on the new pipeline, I would be happy to get some feedback after you will do you comparisons... especially for INDELs (since you are using tophat2) Also, out if curiosity, why do you use tophat2?

  • davidpzdavidpz HMSPosts: 2Member

    Thank you for your feedback Geraldine and Ami. @Geraldine_VdAuwera‌ : thank you, I am trying these options. @ami‌ : 1) My specific question is to study allelic expression in single cell RNA-seq data: the NOD allele and B6 allele. I am not interested in finding new variants, just to call the one that are in the SNP and INDEL reference files from Sanger. I am using 3'digital gene expression so that one read corresponds to one transcript in the original cell. So in theory, counting the reads with an NOD variant, and reads with a B6 variant, should provide me the solution. I was thinking of using the DP info from the VCF file generated by GATK. 2) I have no specific reason to use Tophat2 instead of STAR, yet. I already had the bam files aligned with tophat2, and wanted to test out the pipeline first. I do not know STAR yet. For single cell, I just had a better mapping than bwa. I am in Boston (HMS), I would be very happy to discuss in person, if you are interested, and I would less likely make analytic errors.

  • sboylesboyle Posts: 22Member

    Thanks so much for this wonderful pipeline!
    I am currently on the indel realignment step (-T IndelRealigner) and I am getting a large number of results saying "Not attempting realignment in interval SomeChr:Location because there are too many reads." I'm guessing this is because RNA-seq will have far deeper coverage at certain points than WGS or WES. I see I can increase the cutoff for analysis using -maxReads.
    1) Do you suggest this for RNA-seq?
    2) If you do suggest it, is there an upper limit that I should not try beyond?

  • amiami Posts: 31GATK Developer mod
    edited April 17

    @sboyle‌ how many reads do you have? When I tested it on 250M reads, I got that several times (~170 locations), so it is expected, but I would not be worry about it, it should not have impact on the variant calling results. If you really care about the specific regions that are ignored, since the gene there is something that is your main study, or if you see a variants (inconsistent indels, cluster of SNPs) that look as they might be as a results of ignoring the realignment, you can defiantly increase the threshold, but we didn't do that for our data.
    Also - what aligner are you using? and how good is the quality of your data? my suggestion assume for 2-pass star and high quality (2x76, > 30M reads) data

    Post edited by ami on
  • sboylesboyle Posts: 22Member
    edited April 17

    Thank you for your comment @ami. I have around 120M reads. I just tried raising the -maxReads value to 100000 and there are still several regions claiming too many reads. I am just hoping this does not cause a large number of incorrect indels to sneak through. Has anyone else looked at this situation?
    I'm using Star 1-pass, high quality data, 2X101 bp. Does the 2-pass help significantly? I read the alignment comparison paper by Par Engstrom in Pau Bertones lab "Systematic evaluation of spliced alignment programs for RNA-seq data" and it did not seem that the 2-pass was that much better, however, maybe for variant calling it is?

    Update: actually, increasing to 1000000 actually is helping a lot. I only have 3 positions that aren't realigning so far.

    Post edited by sboyle on
  • amiami Posts: 31GATK Developer mod

    Based on our results it was helping in removing many FP calls close to splicing positions. I would recommend using it. If you do and have your conclusions based on the calls with and without it, please share them with us! If you are planning on using HC, I wouldn't be so worry about the indel realigment issue, since it is manly important for the BQSR step, but the HC won't need it.

  • yasinkaymazyasinkaymaz Posts: 8Member
    edited April 18

    I was wondering if any of you guys have heard of this approach for calling SNVs. Mapping reads to an index including splice junction sequences. Please check this out and tell me what you think. lilab.stanford.edu/SNPiR/readme

    Post edited by yasinkaymaz on
  • amiami Posts: 31GATK Developer mod

    Hi @yasinkaymaz‌ , we didn't evaluate it but we collaborate with that team in order to compare and improve both pipelines together. I think that the main differences will be in the regions that are close to the spice positions, since they are filtering those calls and our approach is to fix those regions. I would expect that their pipeline will have better specificity (since they have more filters and since they map to to an index including splice junction sequences) and worse sensitivity. We probably call more TP but also more FP. The other 2 main big differences are the ability of our pipeline to also INDELs and the up to date tools that we are using (newer version of BWA and HC instead of UG - a different and improve type of caller, you can read about the differences between the tools on line). Again, this is my guess based on the different steps and tools in the pipeline, but we didn't compare the results yet. If you do it before us, please update us.
    Are you interesting only in SNPs?

  • yasinkaymazyasinkaymaz Posts: 8Member

    Hi @ami‌ , thanks for the comment. I have run my samples through your pipeline and got some pretty encouraging results. Only issue got my attention was that splice site regions with in exon boundaries (2-3bp inside right before intron) were hot spot for indels and SNVs. I was not sure if those calls were true or alignment artifact. Right now I am trying to set this SNPiR pipeline and run for a parallel comparison. One caveat is, I think, the splice junction sequences that you concatenate to original genome are probably based on known annotated junctions. I don't think they truly account for novel genes/splice junctions, which will introduce noise. Yes, I am also interested in indels as well as SNVs.

  • amiami Posts: 31GATK Developer mod

    Hi @yasinkaymaz, We do not use any known annotated junctions, to avoid issues like you mentioned ("I don't think they truly account for novel genes/splice junctions, which will introduce noise") but instead we use the first pass of the alignment to identify the potential spice junctions (that are specific to the given data and thus there is no problem if they are novel).

    Did you use the 2-pass star protocol as we suggested? If so, it will be great if we can get a snapshot of regions that are hot spot for indels and SNVs, so we can understand the problem and see if it is something that we fix. If you are willing to provide us such data, @Geraldine_VdAuwera‌ can instruct you what should be done.

  • yasinkaymazyasinkaymaz Posts: 8Member

    @ami I was talking about SNPiR pipeline when I mentioned known junctions because authors provide a set of junction sequences to be used as alignment targets. To me this is a prior info so you need to know where those junctions are to be able to generate such sequences. I used tophat as a splice aligner instead of star for my initial pipeline. I would be happy to have your comments on my results once I summarize them.

  • amiami Posts: 31GATK Developer mod

    Using Tophat2 might create the "hot spots", but it will be good to verify it when you will have some data for us to check. You can also try the 2-pass star and see if you still see that problem. We appreciate your feedback, thanks.

  • mmterpstrammterpstra NetherlandsPosts: 29Member
    edited April 22

    These are a few comments reading this tread.

    I have ran STAR a few times and do not forget to set sjdbOverhang to readlength-1! (read like Spliced Junction Data-Base Overhang. Overhang is the amount of sequence a read extends into the exon from another exon. At most this can be readlength-1 or else this read will not have any anchor in the other exon).

    Also the parameters alignSJoverhangMin = 5 and alignSJDBoverhangMin = 3 might explain the 'hot spots' that 'yasinkaymaz' mentioned. Depending on the usage of annotation one of the two will be used. The reads not passing the alignSJoverhangMin/alignSJDBoverhangMin are possibly written as unspliced reads with alignment errors. Maybe for variant discovery the best setting is alignSJDBoverhangMin = 1 or so. Good luck on perfecting improving RNA-seq calling.

    Post edited by mmterpstra on
  • agoutagout Posts: 3Member

    Hi All,

    Does anyone here trim raw RNA-seq reads prior to alignment etc. for variant calling? The protocol above would suggest not...

    Am very interested to hear of anyone's experience comparing the lilab.stanford.edu/SNPiR/readme pipeline with the 'GATK' best practices protocol above. I'm about to run my RNA-seq data through each to compare...

    Thanks

  • amiami Posts: 31GATK Developer mod
    edited April 24

    @mmterpstra, thanks for the suggestions, we haven't test all the STAR parameters extensively, but we do discuss that with Alex Dobin (the developer of star) and we appreciate any suggestions based on our users experience, so thanks again. btw - I assume that @yasinkaymaz 's hot spots were due to different reasons since he is using tophat and not star. "The reads not passing the alignSJoverhangMin/alignSJDBoverhangMin are possibly written as unspliced reads with alignment errors" - in this case, if the SJ is found in other reads, the SplitNCigarReads tool will trim the part with the alignment errors

    Post edited by ami on
  • amiami Posts: 31GATK Developer mod
    edited April 24

    @agout‌ we will be glad to get your impressions after you will try and compare the pipelines. I did pointed few differences that I expect to see in my comment to @yasinkaymaz‌ on April 18.

    Post edited by ami on
  • yasinkaymazyasinkaymaz Posts: 8Member
    edited April 24

    Hi @mmterpstra‌ , @ami is right. I used tophat as a splice aligner not star. I personally think reads that span introns or overlap splice acceptor/donor sites (hot spots in my results) should be realigned more sensitively (blat is a good option). That's why I wanted to try SNPiR pipeline. I am still working on it in my free time and I will definitely let you guys know about the results.

    Post edited by yasinkaymaz on
  • amiami Posts: 31GATK Developer mod

    @yasinkaymaz‌ - thanks again for your comments. When you have time, can you please generate a snapshot of those hot spots... we mentioned them few times, and I'm curious to see how they actually look like. Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,858Administrator, GATK Developer admin

    @agout, If you mean trim for low quality ends, that's not necessary because the HaplotypeCaller will do this internally for you.

    Let us know what you find in your comparison!

    Geraldine Van der Auwera, PhD

  • Keifa_1983Keifa_1983 Posts: 1Member

    Dear all,

    I used BWA to map my RNA-Seq reads and I am trying to apply your workflow. Do I have to perform the "reassign mapping qualities" step?

    I really appreciate any help you will provide.

    Luciano

  • yg1yg1 Posts: 37Member

    I have two more questions 1) Should I still follow RNA seq best prectice if I use dthe transcriptome assembly as reference? 2) How should I annotate my SNPs from the pipeline, if I don't have an annotated genome file?

  • SheilaSheila Broad InstitutePosts: 280Member, GATK Developer, Broadie, Moderator admin

    @Keifa_1983

    Hi,

    Unfortunately, if you are using BWA for RNA-seq, that is not part of our best practices workflow and we cannot help you out.

    -Sheila

  • amiami Posts: 31GATK Developer mod

    @Keifa_1983‌ why do you use BWA? If you are using the bwa-mem it can work, but you should know that it was not meant to be run on RNA-seq data. I expect that you will get many false calls due to that reason (although BWA-mem split reads, it does not remove the overhang parts of the reads). If you still want to try it (I won't recommend it unless you have a really good reason to do it) You should not use reassign mapping qualities and the splitNCigarRead tool.

  • yg1yg1 Posts: 37Member

    I think using BWA to map the RNA seq data to transcriptome assembly should work and will NOT introduce fasle positives, right?

  • amiami Posts: 31GATK Developer mod
  • nbahlisnbahlis Posts: 2Member

    can this workflow be used with RNAseq from cancer samples (where ploidy may be an issue) or should HC be replaced by mutect in this workflow? does anyone have experience using this workflow with cancer samples?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,858Administrator, GATK Developer admin

    @nbahlis‌ HC is not able to appropriately handle the cancer-specific issues such as uncertain ploidy. But MuTect will have issues with the RNA-specific issues. So, there is not currently one tool that is completely appropriate for handling cancer RNAseq data. In time this will change but for now you may want to try both and see what gives you the most meaningful results. Good luck!

    Geraldine Van der Auwera, PhD

  • amiami Posts: 31GATK Developer mod

    @nbahlis‌ I will lust add to @Geraldine_VdAuwera‌ and say that we are working with the Broad's cancer group to combine our best practices pipeline with their tools.

  • MikebesanskiMikebesanski Rennes, FrancePosts: 3Member

    Hi there !

    Thank you so much for this wonderful pipeline !

    Is anybody having suggestion to process RNA-seq data with the aim to perform allele specific expression analyses ? I'm currently using HC to call variants, and planned to use DP field from the VCF file generated by GATK. I was wondering if I should use read downsampling ? I'm working on 2 x 100 60M reads aligned with STAR 2-passes and processed as suggested in GATK Best Practices for RNA-seq.

    Many thanks !

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,858Administrator, GATK Developer admin

    Hi @Mikebesanski‌,

    Glad to hear it's working for you! We're looking at methods to do allele specific analysis, but we haven't got anything ready for public use yet. In the meantime what you can do is use the AD values to estimate allele ratios. I wouldn't necessarily recommend downsampling unless you find you have problems with excessive depth. @ami may comment on how downsampling would affect the analysis, I'm not sure.

    Geraldine Van der Auwera, PhD

  • vsmarwahvsmarwah ItalyPosts: 1Member
    edited May 14

    Hi,

    I guess this is more of a question regarding STAR alignment program.

    We trim the sequence reads for adapter and low quality before alignment to the reference but as it has been mentioned in the previous posts STAR should be executed with option 'sjdbOverhang' set to readlength-1. Does that mean that it needs all the reads to be of same length, meaning no trimmed reads when aligning with STAR?

    Veer

    Post edited by vsmarwah on
  • MikebesanskiMikebesanski Rennes, FrancePosts: 3Member

    Thanks a lot @Geraldine_VdAuwera‌ for you answer. Regarding this pipeline, I have another question.

    For RNA-seq analysis in cohort, could I use HC in a two step procedure as recommended for DNA (HC + GenotypeGVCFs) ? In the "Best Pratices for RNA-seq" it appears we should use HC to process each sample individually for calling and genotyping. But there is no explanation for doing thing in a different way than with cohort DNA-seq analysis. I'm a curious guy and I really want to know why !

    Many thanks

  • amiami Posts: 31GATK Developer mod

    Hi @Mikebesanski‌, I'm also not sure what is the effect of downsampling on the allele specific expression analysis, although I think you won't see big differences if you use it or not (since the random downsampling should not change the ratio between the alleles too much). The best thing to do is to run the HC (and any other downstream tools) with and without the downsampling on a small region and compare the results. In general this is the best way to check those kind of things, since it might also depend on your specific data. Such a comparison should not take too much time or compute resources. We (and I'm sure that other users as well) will highly(!) appreciate if you can report your conclusions and thoughts after you do it.

  • amiami Posts: 31GATK Developer mod

    Hi (again) @Mikebesanski‌, The only (critical) reason that we don't have such recommendations is that we didn't evaluate such project (with more then one sample simultaneously) so we can't share any conclusions. Again, the best approach is probably to roll up the sleeves and compare the two or more reasonable options on part of the data. And of course not to forget to share also those results with our community. It can help others and you may also get comments/suggestions from us and form our other very smart (and curious like you) users. In the future, we would probably have some recommendations for such projects and we will work with more and more data.

  • amiami Posts: 31GATK Developer mod

    Hi @vsmarwah, It is a very specific question about STAR options and I wouldn't want to answer without being 100% sure about it. I think you should ask the developer of the STAT tool.

    [However, your question gave me an idea, and I would suggest to Alex Dobin, which will give a talk at the Broad in few weeks, to find a way or a platform where he can also answer questions regarding his part of the pipeline].

  • MikebesanskiMikebesanski Rennes, FrancePosts: 3Member

    Thanks a lot @ami‌. I'm currently running the pipeline on RNA-seq data in a "two step manner" (HC + GenotypeGVCFs) and without any downsampling. As soon as I perform some comparisons (with the "one step manner" with HC only and/or with downsampling), it will be a pleasure to let you know.

  • mmterpstrammterpstra NetherlandsPosts: 29Member
    edited May 30

    This might not be neccecary but the gatk bundle is still missing some formats (For the star/picard tools genes.gtf/genes.refflat/rrna.interval_list) that should be logical to include or to describe the setup with the prerequisites pages.

    genes.refflat:

    #UCSC commandline tools from http://hgdownload.soe.ucsc.edu/admin/exe/
    /tools/UCSC/gtfToGenePred -genePredExt Homo_sapiens.GRCh37.75.gtf Homo_sapiens.GRCh37.75.genePredExt 
    awk '{print $12"\t"$1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10}' Homo_sapiens.GRCh37.75.genePredExt > Homo_sapiens.GRCh37.75.refflat
    

    rrna.interval_list:

    (cat human_g1k_v37.dict;grep rRNA Homo_sapiens.GRCh37.75.gtf| awk '{if($3 == "gene"){gsub("\"|;","",$10);print $1"\t"$4"\t"$5"\t"$7"\t"$10}}')> Homo_sapiens.GRCh37.75.rrna.interval_list
    #picard tool IntervalListTools
    java -jar IntervalListTools.jar I=Homo_sapiens.GRCh37.75.rrna.interval_list O=Homo_sapiens.GRCh37.75.rrna.sorted.interval_list SORT=true
    

    PS: maybe you'll have to remove some additional contig annotations from the input 'genes.gtf' first before converting it.

    Post edited by mmterpstra on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,858Administrator, GATK Developer admin

    Hi @mmterpstra,

    Unless I'm mistaken, those files are not used in the basic workflow we describe, so that's why we wouldn't include them outright in the bundle. However, if you can tell me what is the utility of these files and if that is of general interest to users, we can talk about including them in the next version.

    Geraldine Van der Auwera, PhD

  • mmterpstrammterpstra NetherlandsPosts: 29Member

    CollectRnaSeqMetrics will need those files, although i know it is not software from the gatk group. It might save users a headache. Just like the idea of providing the .fa.fai/.dict files.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,858Administrator, GATK Developer admin

    That's a fair point. We're generally hesitant to take on the task of maintaining additional files because it adds overhead, so I can't guarantee we'll actually start bundling these, but I'll bring this up in our next group meeting. I'll let you know what comes of it.

    Geraldine Van der Auwera, PhD

  • NicolasRobineNicolasRobine New YorkPosts: 2Member

    @mmterpstra: I believe the GATK pipeline should only used for calling variants and not for anything else. Therefore I think you should process your reads in parallel and apply CollectRnaSeqMetrics to the resulting BAM. I believe that things like MarkDuplicate will modify the results significantly...

  • pjvthofpjvthof NLPosts: 5Member

    Just a small question. If you have a RNA-seq samples that has multiple runs/lanes where do you need to do the SplitNCigarReads part? So for each lane separate or at the merged bam? I asume the merged bam but just want to be sure.

  • SheilaSheila Broad InstitutePosts: 280Member, GATK Developer, Broadie, Moderator admin

    @pjvthof

    Hi,

    It really does not matter when you run it, however running on the merged file is more efficient.

    -Sheila

  • siriansirian USPosts: 7Member
    edited June 25

    I ran this pipeline (from alignment to haplotypecaller) on 70 RNA-seq samples and used GenotypeGVCFs to join them into a single VCF file which has 1.5million variants. After filtration (following this pipeline) and annotation by Snpeff, I found that 70% of the variants are intronic and 14% are intergenic. This looks weird because RNA-seq reads should be mostly exonic (if we suppose the library contained mostly matured mRNA). My question is: 1. Should I include transcriptome information in read alignment? 2. Or should I specify a target file when running haplotypecaller? 3. Or you think it's fine to have so many introinc/intergenic variants. (Either because of novel transcripts, or unmatured RNA, from 70 samples?) Thanks.

    Post edited by sirian on
  • SheilaSheila Broad InstitutePosts: 280Member, GATK Developer, Broadie, Moderator admin

    @‌sirian

    Hi,

    What exactly are the commands you ran? Also, you should check what data is mapping to intronic or intergenic regions.

    -Sheila

  • siriansirian USPosts: 7Member
    edited June 26

    @Sheila said: @‌sirian

    Hi,

    What exactly are the commands you ran? Also, you should check what data is mapping to intronic or intergenic regions.

    -Sheila

    I used exactly the same commands in the pipeline on this page, except that I let haplotypecaller generate gvcf, then combine them into vcf and then filter:

    #### haplotypecaller on individual samples (Just take sample_40 as an example)
    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ucsc.hg19.fasta -I 40_recal_sorted.bam -recoverDanglingHeads -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -o 40_rawcall.vcf
    
    #### Joint calling by GenotypeGVCFs
    java -jarGenomeAnalysisTK.jar -R ucsc.hg19.fasta -T GenotypeGVCFs --annotation InbreedingCoeff --annotation FisherStrand --annotation QualByDepth --annotation ChromosomeCounts --dbsnp dbsnp_138.hg19.vcf --read_filter BadCigar --variant 1_rawcall.vcf ... 70_rawcall.vcf -o combined70samples_variants.vcf
    
    #### Filtering
    java -jar GenomeAnalysisTK.jar -T VariantFiltration -R ucsc.hg19.fasta -V combined70samples_variants.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o combined70samples_variants_filtered.vcf
    

    So I did the filtering on the combined vcf from 70 samples instead of individual sample. Should I increase the threshold?

    Post edited by sirian on
  • siriansirian USPosts: 7Member

    By the way, the annotation says there are just ~61,000 coding variants. Considering it's a combination from 70 samples, the number is much lower than what we regularly get from exome-seq. I know variants may not be called in the genes with low expression, but I just wonder what's the range other people get.

  • siriansirian USPosts: 7Member
    edited June 27

    I'm going back to redo the variant calling in each sample individually. I'm not sure which one GATK would recommend, individual calling or joint calling?

    Post edited by sirian on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,858Administrator, GATK Developer admin

    @sirian‌ We haven't yet fully investigated the interaction between the GVCF and the RNAseq modes, so for now only single-sample calling is supported for RNAseq. We know some users have started to try this and seem to be getting reasonable results, but we can't guarantee that it works properly until we've investigated it ourselves in more detail.

    Geraldine Van der Auwera, PhD

  • siriansirian USPosts: 7Member
    edited June 30

    @Geraldine_VdAuwera said: sirian‌ We haven't yet fully investigated the interaction between the GVCF and the RNAseq modes, so for now only single-sample calling is supported for RNAseq. We know some users have started to try this and seem to be getting reasonable results, but we can't guarantee that it works properly until we've investigated it ourselves in more detail.

    Thanks.

    Post edited by sirian on
  • s6junchengs6juncheng Cologne, GermanyPosts: 4Member

    Hi, I'm working with Cancer RNAseq data aiming for finding RNA editing sites. I worked through this pipeline. As discussed in this forum, it seams no tool is good at calling variance (DNA, RNA difference in my case) from cancer RNAseq sample. My question is the vcf file I got seams do not have dbSNP identifier information for the SNPs. Id the program does not provide this information automatically and thus I have to add the annotation information by some other approaches? Best, Jun

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,858Administrator, GATK Developer admin

    Hi @s6juncheng‌,

    You have to provide the dbsnp information to the variant caller using the --dbsnp argument, or you can annotate it using VariantAnnotator if you already have a callset generated.

    Geraldine Van der Auwera, PhD

  • byb121byb121 UKPosts: 21Member

    Hi,

    Thanks for making it work with RNA-seq data. My concern is the issue, as you mentioned, that HC will miss heterozygous when reads supporting the alternative allele are extremely fewer comparing to the reference supporting reads. So I have a few questions related to the issue:

    1) Did you check how many of the variants missed due to this issue? We intend to study RNA-editing events and it could happen in a very low frequency resulting in exact the same scenario as the issue, only it is not a heterozygous on the DNA level. Well, we don't have DNA-seq data, so hope the pipeline to detect variants from RNA-seq data is very sensitive. 2) I am not sure what do you mean by DNA-aware mode of the HC. Did you mean to run HC again without -recoverDanglingHeads -dontUseSoftClippedBases and merge results? Or you mean to run HC again on DNAseq data of the same sample? I have to say this is not available for us. 3) With the command to call variants suggested here, if I change it to output GVCF, will those sites that does not satisfy stand_call_conf 20.0 stand_emit_conf 20.0 but does have reads of alternative alleles be properly recorded? I mean if GT,AD and GQ fields of those sites will reflect the existing of alternative alleles.

    Thanks,

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,858Administrator, GATK Developer admin

    Hi @byb121,

    That is why we are working on functionality to jointly analyze both RNAseq and the corresponding DNAseq, so the caller can compare what is the "real" genotype of the sample (per the DNA) and how the RNAseq-based call matches or differs from that (indicating either differential allele expression or RNA editing, depending on the case). But that is not yet ready.

    1) I don't have any numbers on hand, sorry. Perhaps @ami can comment on that. In any case though, if you don't have the corresponding DNAseq to compare against, I'm not sure I see how you are going to identify RNA-editing events...?

    2) This is what I refer to at the start of this comment, but like I said it's not ready. In the meantime you can produce the DNA-based calls yourself, and set up an analysis that compares the calls from both in a pair-wise manner.

    3) If you set the HC to output a GVCF, the thresholds will be ignored, and all alternate alleles will be recorded accordingly. Note however that we have not yet validated the RNAseq and GVCF functionalities in combination.

    Geraldine Van der Auwera, PhD

  • byb121byb121 UKPosts: 21Member

    @Geraldine_VdAuwera,

    Thanks for your prompt reply. It is possible to identify RNA-editing events from RNA-seq data alone, there are already several publications on it, such as

    http://www.nature.com/nmeth/journal/v10/n2/full/nmeth.2330.html and http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3246201/

    , although we have to deal with lots of false positives and carefully justify what we find. Surely the DNAseq of the same sample will help, but not sure if the cost is worth. For now I think the sensitivity of a RNAseq variant caller is more important than its specificity.

  • s6junchengs6juncheng Cologne, GermanyPosts: 4Member
    edited July 18

    @byb121‌ , I'm also trying to identify RNA-editing sites by RNAseq data only. By the way, the nature method paper used UC instead of HC.

    Post edited by s6juncheng on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,858Administrator, GATK Developer admin

    Hmm, I see, fair enough. You'll still need to get gDNA and cDNA sequenced to validate your candidate sites, though. You could save a lot of time by just sequencing the genome (or exome) up front, and eliminate a lot of candidates that won't pan out based on the DNAseq. But that's a question of managing the time/money cost tradeoffs (we don't judge).

    I think the HC in GVCF mode is your best tradeoff, as it will yield the highest sensitivity. Normally we don't advise analysing the raw gVCF file directly (instead, you put it through GenotypeGVCFs will all the other samples in your cohort) but in your case you'll probably want to look at the raw allele counts.

    Let us know how it goes!

    Geraldine Van der Auwera, PhD

  • s6junchengs6juncheng Cologne, GermanyPosts: 4Member

    I tried exactly as described in this tutorial. The result I got have some some like this: RefReadCount AltReadCount 1 88

    This kind of SNPs usually present in dbsnps. Actually, this is not so uncommon in the result. I think the 1 reference read count most likely because of technical error. I'm not sure why they are not filtered, because they are listed in dbsnps thus increased the confidence? In this case, is it a good choice to add a filter criteria that read counts less or equal to 2 are not considered?

    Best, Jun

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,858Administrator, GATK Developer admin

    Hi @s6juncheng‌,

    I'm not sure I understand what the problem is. Can you please post the complete variant record (the line from the VCF file)?

    Also, did you do any filtering or is this the raw output of the HaplotypeCaller?

    The presence of a SNP in dbsnp is not used for the calculation of the confidence score.

    Geraldine Van der Auwera, PhD

  • s6junchengs6juncheng Cologne, GermanyPosts: 4Member

    Thanks, the line may like this:

    16 70287177 rs4081753 A G 4734.77 PASS AC=2;AF=1.00;AN=2;BaseQRankSum=0.698;ClippingRankSum=0.168;DB;DP=148;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;MQRankSum=1.347;QD=31.99;ReadPosRankSum=-1.467;ANN="AARS";STR=- GT:AD:DP:GQ:PL 1/1:1,143:144:99:4763,359,0

    I intended to use AD to calculate allele specific expression as you mentioned in this forum before (in this case 1,143. The '1' obviously because of technical error. What I surprised before is all this cases, the 'variances' are present in dbsnp). Now I know I should also consider GT which in this case is 1/1, means homozygous.

    Thanks again.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,858Administrator, GATK Developer admin

    @s6juncheng That's right, this is a homozygous variant, so the numbers look ok.

    Geraldine Van der Auwera, PhD

  • siriansirian USPosts: 7Member
    edited July 25

    I did run splitNCigarReads, but I still got some variants that resulted from wrong splicing. In the following example, the alignment at the top (above RefSeq track) are the alignments from STAR mapping, before splitting. The alignments at the bottom (below RefSeq track) are after running splitNcigarReads, indel realignment, and base recalibration. As you can see, reads were incorrectly spliced by STAR and the overhang mismatches were not trimmed off by splitNcigarReads. The three mismatches at the end of the subreads on the left actually match the beginning of the exon on the right. Could you comment on this? Thanks.

    Post edited by sirian on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,858Administrator, GATK Developer admin

    @sirian, I'll ask @ami to have a look on Monday.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.