We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

RNA-seq variant calling and merging of sample replicates

Hello, and thanks for making all the GATK tools! I have recently started to try my hand at variant calling of my RNA-seq data, following the GATK Best Practices more or less verbatim, only excluding indel alignment (because I am only interested in SNPs at this point) and the BQSR (partly because I have very high quality data, but mostly because I couldn't get it to work in the workflow).

I have three replicates for each of my samples, and my question is where, if at all, I should merge the data from them. I am not sure if I can (or even should!) merge the FASTQ files before the alignment step, or merge the aligned BAM files, or something else. I read that for aligners such as BWA the options are (more or less) equivalent, but seeing as the RNA-seq Best Practice workflow using STAR... What would be the "correct" way to do it, if at all? How would merging (at some level) affect the speed of the workflow, and can I optimise that somehow?

If it's a bad idea to do merging, how would I determine the "true" variant from my three resulting VCF-files at the end, for cases where they differ?

Best Answers


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi there,

    Allow me to correct a couple of misunderstandings first. Indel realignment affects SNPs as well as indels, so being only interested in SNPs shouldn't be grounds for skipping this step. As for BQSR, it corrects for quality issues that you cannot detect up front, even if your data is ostensibly of good quality, so I wouldn't recommend skipping it either. If you are having trouble getting it to work, we can help you figure out what's wrong.

    The question of replicates is a complicated one because it comes down to your experimental design: what is the intended purpose of sequencing replicates? Are you looking to control for any particular aspect of data generation or analysis? If so you may need to keep them separate. However if they were produced simply to provide more depth, then you can merge them after pre-processing. But the read group information must be correctly annotated.

  • erikfaserikfas Member

    Thanks for your reply, Geraldine!

    According to the Best Practices, the indel realignment is listed as optional, and the text says that you should only perform it if you have computing time to spare or if it's important not to miss any potential indels. So, is the guide in error here, and I should do the indel realignment anyway?

    I will give the BQSR another go, then, and ask for help if/when I get stuck.

    The replicates are biological ones (or as close to biological replicates you can get when working with cell lines), in order to be able to do all other RNA-seq analyses properly (expression estimation, differential expression analysis, etc.). So the reason I have replicates is not directly related to the variant calling, it's just something I have due to the nature of my other experimental interests. I have around 30 million reads per replicate, and three replicates per sample. How would I merge in my case (if at all), and how should the read groups be made?

  • SheilaSheila Broad InstituteMember, Broadie ✭✭✭✭✭


    IndelRealignment helps with SNPs because it helps to get rid of false positive SNPs. We still recommend doing IndelRealignment because it does help with BQSR. Have a look at this video for more information.

    For merging your biological replicates, I think this thread will answer your question. Also, this article might help.


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    To be clear, the recommendation for RNAseq is slightly different and does indeed identify indel realignment as optional. My comment was mostly meant to indicate that your reason for not using indel realignment was based on a misunderstanding of what it does.

  • erikfaserikfas Member

    Ok, then I will definitely get to working on getting both indel realignment and BQSR to work!

    Those threads were quite useful, thanks for that! I just want to check that I've understood and applied them correctly to my specific case; please tell me if I am in error somewhere.

    1) First, I'd do the STAR alignment as previously (i.e. separately for each replicate).
    2) I would then add a single RGSM but different RGID to all the replicates (from the same sample), i.e. treating them as the same sample but with different libraries (read groups).
    3) MarkDuplicates, create index and Split'N'Trim as previously (separately for each replicate)
    4) Indel realignment, either separately or giving all the bam's to one command (which one is quickest? It says that the realignment is slower with higher depth, but is the speed[all bam's at once] slower than speed[one bam at a time] x 3?)
    5) BQSR, either on one file (if merged at the indel alignment step) or all bam's together (if not).
    6) Variant calling on the single bam file from BQSR
    7) Variant filtering

    How does that look? Would the resulting final VCF file contain a single variant call column of the aggregated replicates, or am I missing something?

  • erikfaserikfas Member

    Oh, and what RGLB should the replicates have, unique or the same?

  • erikfaserikfas Member

    Ok, I've made a pipeline that seems to work, but I'm unsure of some of the output. I'm currently running BQSR (which is taking a long time), but I've already got the reclibration plots. The plots themselves look nice as far as I can tell, but I'm unsure if how the read groups are listed in there are correct. Looking at the read groups in my replicates VCFs:

    @RG ID:repl_a.aligned.out.sam   LB:repl_a.aligned.out.sam.lib      PL:Illumina  SM:sample1  PU:HiSeq2500
    @RG ID:repl_b.aligned.out.sam   LB:repl_b.aligned.out.sam.lib      PL:Illumina  SM:sample1  PU:HiSeq2500
    @RG ID:repl_c.aligned.out.sam   LB:repl_c.aligned.out.sam.lib      PL:Illumina  SM:sample1  PU:HiSeq2500

    And looking at the tables in the recalibration_plots.pdf (attached images), it seems to be as if something in the read groups have gone wrong. I take it that this is because BQSR prioritizes RGPU (I am, by the way, not sure if "HiSeq2500" is a "correct" RGPU to give) over all other RG fields, but has it affected the analysis itself? I mean, going by PU or SM should give the same results here, correct?

    I cannot run AddOrReplaceReadGroups without giving an RGPU, but if I add them from STAR (like Geraldine proposed) does that then mean that BQSR should take RGSM instead of RGPU?

  • erikfaserikfas Member

    Adding another question; sorry for double-posting =/

    HaplotypeCaller worked fine with the output from above, and I used the exact same parameters as the Best Practices, but I added "-out_mode EMIT_ALL_CONFIDENT_SITES", because I'm also interested in the confidently called 0/0's. First, is this a correct way to do it? Second, I cannot run VariantFiltration on the resulting output, I get an

    Bad input: The clustered SNPs filter does not work in the presence of non-variant records

    error message. Is there a way to get it working, somehow, or run HaplotypeCaller differently but still giving confident reference calls? Or, working around the problem, can I just skip VariantFiltration altogether? I do downstream hard filtration of my own in Python (along with some statistics and my final output), so what would I lose if I don't do the VariantFiltration?

  • erikfaserikfas Member

    Thank you very much, Shiela, that does it! I'm just gonna set the Read Group fields at the STAR alignment step and skip PU altogether, since it's not needed by GATK. Separating non-variants / variants with SelectVariants, running VariantFiltration on just the variants and then using CatVariants works like a charm! Thanks to both you and Geraldine for all the help :smiley:

Sign In or Register to comment.