We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

Processing raw SNP files


I have use Unified Genotyper and Haplotype Caller from GATK to do the SNP calling from my RNA-seq data. Now, I have the .vcf files generated by these two tools and I need to process them. I have read documentations on Select Variant and Variant filteration, also the vcf tools but I am lost, I don't know what I should do first. I know there are lots of information out there on net but it would be great if you could give me some general outlines.

Thanks a lot in advance.

Best Answer


  • vyellapavyellapa Member

    Hi Homa , Im curious about the aligner you are using for aligning your RNA-Seq reads.

  • HomaHoma Member

    Hi vyellapa,

    I am aligning my RNA-seq reads against a de novo transcriptome assembly, for that, I have used BWA. However, I have received some advice from a bioinformatician that to align the RNA-seq against a genomic reference, you better not to use BWA, here are some recommendations:

    There are a few packages specifically for alignment of RNA reads that permit spliced alignments (which bwa can't do)
    bowtie is quite commonly used. Another that is found very accurate is GSNAP. It is, admittedly, quite a lot slower than bowtie.

    Here is a paper discussing different methods of RNA-seq alignment:

    Good luck!

  • HomaHoma Member

    Regarding alignment, I have a question too, as follows:

    I have individual-based RNA-seq data for 6 female birds. So, previously, a researcher in the lab, had used pooled data of 3 individuals. She had used the pooled data to do the SNP calling. The re-aligner and the SNP calling algorithm that she had used were different. For re-alignment, she used samtools-calmd (I am going to confirm it soon) and for SNP calling, she used FreeBayes.

    Now, I have the de novo transcriptome assembly that she made and I am using BWA to align my reads against this assembly and then using Unified Genotyper, samtools and Haplotype caller for SNP calling. The three methods seem to have detected similar sets of SNPs.

    But the important problem is that when I compare my SNPs to hers, for some sets of genes, I can find the exact set of SNPs as she found which is very comforting. However, for some other sets, all the SNPs within a gene have shifted by a certain number of bases, for example for gene 12345, all the SNPs in my data set have shifted 10 base pairs compared to hers and for gene 54321, all SNPs have shifted 34 bases. Indeed, for some sets of genes, I get a completely different sets of SNPs.

    I have been thinking what could be the cause, I do not understand the details of these alignment algorithms but I thought maybe the shift in bases is coming from the fact that we have used different realigners.

    If you have any comment on that, please let me know, I'll be extremely happy.

  • vyellapavyellapa Member

    Sure, GSNAP I feel is a good tool and so is STAR but I get an error with these tool s when taking it through the GATK pipeline. The recurrent one having something to with reporting multiple primary alignments by these tools.

    As for your issue, I would make sure I am using the same reference file throughout my steps. Are you using the reference build when comparing your SNPs? Are mapping to the transcriptome reference but calling variants against the genome reference? There could be various issues which could be off-topic on this thread. Feel free to email me if you have any questions.

  • CarneiroCarneiro Charlestown, MAMember admin

    can you show us some examples of these sites where the calls are shifted? Maybe an IGV screenshot? This could be very informative for us as we start to think about how to support RNA seq in the GATK.

  • HomaHoma Member

    Sure, I will soon provide you with some examples, thank you for your replies.

  • amiami Member

    @Homa said:
    Sure, I will soon provide you with some examples, thank you for your replies.

    Did you had the chance to do it? We are currently looking on RNA data and it might be good to look on those examples.


Sign In or Register to comment.