# Processing raw SNP files

Posts: 25

Hello,

I have used UnifiedGenotyper and HaplotypeCaller from GATK to call SNPs from my RNA-seq data. Now I have the .vcf files generated by these two tools and I need to process them. I have read the documentation on SelectVariants and VariantFiltration, and also on VCFtools, but I am lost: I don't know what I should do first. I know there is a lot of information out there on the net, but it would be great if you could give me a general outline.

• Posts: 33

• Posts: 25

Hi vyellapa,

I am aligning my RNA-seq reads against a de novo transcriptome assembly, and for that I have used BWA. However, a bioinformatician advised me that BWA is not a good choice for aligning RNA-seq reads against a genomic reference. Here are the recommendations I received:

There are a few packages specifically for alignment of RNA reads that permit spliced alignments (which BWA can't do).
Bowtie is quite commonly used (for spliced alignment it is run through TopHat, which is built on Bowtie). Another that has been found to be very accurate is GSNAP. It is, admittedly, quite a lot slower than Bowtie.
http://research-pub.gene.com/gmap/
http://bioinformatics.oxfordjournals.org/content/26/7/873.full

Here is a paper discussing different methods of RNA-seq alignment:
http://bioinformatics.oxfordjournals.org/content/27/18/2518.full

Good luck!
Homa

• Posts: 25

Regarding alignment, I have a question too, as follows:

I have individual-based RNA-seq data for 6 female birds. Previously, a researcher in the lab used pooled data from 3 individuals to do the SNP calling. The realigner and the SNP calling algorithm that she used were different from mine: for realignment she used samtools calmd (I am going to confirm this soon), and for SNP calling she used FreeBayes.

Now I have the de novo transcriptome assembly that she made. I am using BWA to align my reads against this assembly, and then UnifiedGenotyper, samtools and HaplotypeCaller for SNP calling. The three methods seem to have detected similar sets of SNPs.

But the important problem is this: when I compare my SNPs to hers, for some sets of genes I find exactly the same SNPs as she found, which is very comforting. However, for other sets, all the SNPs within a gene are shifted by a fixed number of bases. For example, for gene 12345 all the SNPs in my data set are shifted by 10 base pairs compared to hers, and for gene 54321 all SNPs are shifted by 34 bases. And for some sets of genes, I get a completely different set of SNPs.
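One way to make the shift described above concrete is to compute, for each gene, the most common difference between the two sets of SNP positions. The following is only a minimal sketch: the gene names and positions are made up for illustration, and it assumes the positions for each call set have already been extracted from the VCF files into plain Python dictionaries.

```python
from collections import Counter


def dominant_offset(mine, hers):
    """Return (offset, support) for one gene/contig, or None if either list is empty.

    offset  = the most common value of (my_position - her_position)
              over all pairs of positions
    support = how many pairs share that offset
    """
    diffs = Counter(m - h for m in mine for h in hers)
    if not diffs:
        return None
    return diffs.most_common(1)[0]


def offsets_by_contig(mine, hers):
    """Apply dominant_offset to every gene/contig present in both call sets."""
    result = {}
    for contig in sorted(set(mine) & set(hers)):
        best = dominant_offset(mine[contig], hers[contig])
        if best is not None:
            result[contig] = best
    return result


# Hypothetical positions, for illustration only.
my_snps = {
    "gene_12345": [110, 260, 441],
    "gene_54321": [84, 234, 611],
    "gene_00001": [15, 90, 402],
}
her_snps = {
    "gene_12345": [100, 250, 431],
    "gene_54321": [50, 200, 577],
    "gene_00001": [15, 90, 402],
}

for contig, (offset, support) in offsets_by_contig(my_snps, her_snps).items():
    print(contig, "offset:", offset, "supported by", support, "SNP pairs")
```

An offset of 0 with high support means the two call sets agree on that gene. A non-zero offset supported by most of the SNPs suggests that the coordinates of that gene differ between the two analyses, rather than that the callers disagree.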

I have been thinking about what the cause could be. I do not understand the details of these alignment algorithms, but I thought the shift in bases might come from the fact that we used different realigners.

If you have any comment on that, please let me know, I'll be extremely happy.

• Posts: 33

Sure, I feel GSNAP is a good tool, and so is STAR, but I get an error with these tools when taking their output through the GATK pipeline. The recurrent one has something to do with these tools reporting multiple primary alignments.
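To see whether an aligner's output really contains more than one primary alignment for the same read, the SAM records can be counted by read name. This is a rough sketch, assuming the error refers to several records for one read (or one mate of a pair) that carry neither the secondary (0x100) nor the supplementary (0x800) flag. It works on SAM text lines, for example the output of `samtools view`; the function name is made up for this example.

```python
from collections import Counter

# Bit values from the SAM specification's FLAG field.
UNMAPPED = 0x4
FIRST_IN_PAIR = 0x40
SECOND_IN_PAIR = 0x80
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def multiple_primary_reads(sam_lines):
    """Return {(read_name, mate): count} for reads with more than one primary record.

    mate is 1 or 2 for paired reads and 0 for single-end reads.
    """
    counts = Counter()
    for line in sam_lines:
        # Skip header lines and blank lines.
        if line.startswith("@") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        # Only mapped records that are neither secondary nor supplementary count.
        if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
            continue
        if flag & FIRST_IN_PAIR:
            mate = 1
        elif flag & SECOND_IN_PAIR:
            mate = 2
        else:
            mate = 0
        counts[(name, mate)] += 1
    return {key: n for key, n in counts.items() if n > 1}
```

With a file on disk this could be called as `multiple_primary_reads(open("aligned.sam"))`, where `aligned.sam` is a placeholder name. An empty result means every read has at most one primary record.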

As for your issue, I would make sure I am using the same reference file throughout all my steps. Are you using the same reference build when comparing your SNPs? Are you mapping to the transcriptome reference but calling variants against the genome reference? There could be various issues, some of which could be off-topic for this thread. Feel free to email me if you have any questions.
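A quick way to check the "same reference" point is to compare the two FASTA files sequence by sequence. Below is a minimal sketch, assuming both analyses have their reference available as a plain (uncompressed) FASTA file. The function names are made up for this example, and the comparison ignores upper/lower case.

```python
import hashlib


def fasta_digests(path):
    """Map each sequence name to (length, MD5 of the upper-cased sequence)."""
    digests = {}
    name, length, md5 = None, 0, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Store the previous record before starting a new one.
                if name is not None:
                    digests[name] = (length, md5.hexdigest())
                parts = line[1:].split()
                name = parts[0] if parts else ""
                length, md5 = 0, hashlib.md5()
            elif name is not None:
                seq = line.upper()
                length += len(seq)
                md5.update(seq.encode("ascii"))
    if name is not None:
        digests[name] = (length, md5.hexdigest())
    return digests


def compare_references(path_a, path_b):
    """Report sequences that are missing from one file or differ between the two."""
    a, b = fasta_digests(path_a), fasta_digests(path_b)
    report = {}
    for name in sorted(set(a) | set(b)):
        if name not in a:
            report[name] = "only in B"
        elif name not in b:
            report[name] = "only in A"
        elif a[name] != b[name]:
            report[name] = "differs (length %d vs %d)" % (a[name][0], b[name][0])
    return report
```

An empty report means both files contain the same sequences under the same names. A contig that shows up as "differs" with a length difference equal to the observed SNP shift would be a strong hint that the two analyses used different versions of the assembly.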

Can you show us some examples of these sites where the calls are shifted? Maybe an IGV screenshot? This could be very informative for us as we start to think about how to support RNA-seq in the GATK.

• Posts: 25

Sure, I will soon provide you with some examples, thank you for your replies.

• Posts: 50

@Homa said:
Sure, I will soon provide you with some examples, thank you for your replies.

Did you have the chance to do it? We are currently looking at RNA data, and it would be good to look at those examples.

Thanks,
Ami