I'm working on a thesis comparing CRC and liver metastatic tumours using RNA-seq data. I've done my best to follow the RNA-seq best practices (which, from what I've read, are still in a sort of beta form), primarily using GATK3. At the end of the pipeline, I've used SnpEff and SnpSift to annotate variants and predict their effects, though my supervisor has voiced concerns about filtration, which is something I'm struggling to find documentation on.
I have used the RNA-seq best practices filters (which consider FS, QD, and clusters of variants within a window), but I'm wondering if there are other filters I should consider. The data I'm working with may also not be terribly well suited to the kind of work I'm trying to do with it, as so far it's mostly been an academic exercise to get more familiar with bioinformatics. There are no normal samples to make comparisons with (though I was thinking of using the CRC samples as a kind of "normal" comparison to find novel mutations specific to the LM tumours), and my supervisor has often mentioned that the sequencing depth might be insufficient. I have stumbled across some other filters in a PowerPoint presentation that references GATK (using values for MQ, MQRankSum, ReadPosRankSum, InbreedingCoeff, HaplotypeScore); however, those filters return no calls, which may suggest how poorly suited the data is to this kind of use!
Could anyone point me towards any recommended RNA-seq filters that may be appropriate? And if anyone requires any more information that might help them suggest such filters, please let me know!
@TomWillDo I have some experience with this type of task, and I was wondering if you could provide some more information. For example, how many samples do you have? What is your sequencing depth?
A few points:
1.) Low sequencing depth can lead to an overestimation of variants, or to an inability to screen out noise using the regular filter settings. Very stringent filtration would reduce the number of false positives, but it would also decrease the number of real variants captured. This is okay if you are just looking for targets that are easily detectable. If you are looking for rare variants, your advisor is correct that deep sequencing may be required.
2.) Not having a "normal" may lead to an overestimation of variants. It is probably possible to use the CRC samples as a "normal", matched as closely as possible to the LM samples in sequencing and sample-size characteristics. This will at least help isolate the novel variants in your sample set, but please refer back to point 1: you may have to play around with the filtration settings to reduce the low-confidence variants in your set.
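To make the two points above concrete, here is a minimal Python sketch, not a definitive pipeline. It applies hard filters to LM calls and then drops anything also called in the CRC samples. The FS and QD cut-offs (fail if FS > 30.0 or QD < 2.0) are the ones I recall from the GATK RNA-seq hard-filtering recommendations, so please check them against the documentation. The DP cut-off of 10, the function names, and the assumption that FS, QD and DP are all present in the INFO column are mine, purely for illustration. The cluster-window filter is not reproduced here.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

VariantKey = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def parse_info(info_field: str) -> Dict[str, str]:
    """Turn a VCF INFO column ("FS=1.2;QD=12.0") into a dict; flags are ignored."""
    info: Dict[str, str] = {}
    for item in info_field.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            info[key] = value
    return info


def passes_hard_filters(info: Dict[str, str], max_fs: float = 30.0,
                        min_qd: float = 2.0, min_dp: int = 10) -> bool:
    """Keep a record only if FS, QD and DP are present and within the thresholds.

    min_dp is a hypothetical depth cut-off; tune it to your own coverage.
    """
    try:
        fs = float(info["FS"])
        qd = float(info["QD"])
        dp = int(info["DP"])
    except (KeyError, ValueError):
        return False  # conservative choice: a missing annotation fails the record
    return fs <= max_fs and qd >= min_qd and dp >= min_dp


def records(vcf_lines: Iterable[str]) -> Iterator[Tuple[VariantKey, Dict[str, str]]]:
    """Yield one (key, info) pair per ALT allele, skipping header and blank lines."""
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        info = parse_info(fields[7])
        for alt in alts.split(","):
            yield (chrom, pos, ref, alt), info


def lm_specific_variants(lm_lines: Iterable[str], crc_lines: Iterable[str],
                         **thresholds) -> List[VariantKey]:
    """LM calls that pass the hard filters and are absent from the CRC calls."""
    # The CRC calls are left unfiltered so that even weak evidence of a shared
    # variant is enough to remove it from the LM-specific list.
    crc_keys = {key for key, _ in records(crc_lines)}
    return [key for key, info in records(lm_lines)
            if key not in crc_keys and passes_hard_filters(info, **thresholds)]
```

With low depth, loosening or tightening `min_dp` and comparing how many calls survive is a quick way to see how sensitive your result is to the filter settings.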
Also, which GATK protocols did you use? Did you try Mutect2 in tumor-only mode in GATK4? Mutect2 has changed significantly between GATK3 and GATK4 and may give you some more information about your samples. Here is the link