I'm working with data from 12 samples of RNA-seq (in sheep) with a mean of 74 million raw reads each sample. I'm using your pipeline for RNA-seq data to find SNP and INDELS, but i have decided to try at the variant calling step the GVCF mode of the HaplotypeCaller. After combining the different GVCFs and use the GenotypeGVCFs function, i have only take two groups of variants (biallelic SNPs and indels) in different files with the SelectVariants tool (--exclude-non-variants --max-nocall-number 4 options activated in both cases). After that I have been trying to select hard filtering thresholds for my data, where I have found some problems or doubts.
For the SNPs and INDELs I have attached the distribution of some of the annotations usually used for filtering, as I can't post links because I have recently signed in the forum.
First, I find strange the MQ and MQRankSum distribution, having all the values in 60 and 0, respectively. So I have tried to find if there was any problem in the execution or any parameter bad used, but I don't have seen anything strange at first sight, so I wonder if you know if its normal or it might be an error. On the other hand, the QD distribution not shows a similar distribution as in your manual (with two peaks) and I don't know if its something to be worried about. In the others I don't find anything strange, thinking to use ReadPosRankSum<-2.0 filter (because I don't have values <-8.0), while the other filters are as in your manual.
In this case, the INDELs and SNPs have a similar distribution for FS, QD, SOR and ReadPsRankSum, so I was thinking to use the same filtering criteria that I'm going to use for the SNPs (although in your manual you used for example a filter for FS>200, that I think that not makes sense in my data because there isn't any value greater than 200). I don't know if the InbreedingCoeef distribution is normal, as I couldn't find any example in the page.
Thank you for your time and Ideas.
Dear @Akidne - Please look at this document for an explanation about what might be happing in regards to mapping quality.
After the reads are split into exonic segments, "we also add one important tweak: we need to reassign mapping qualities, because STAR assigns good alignments a MAPQ of 255 (which technically means “unknown” and is therefore meaningless to GATK). So we use the GATK’s ReassignOneMappingQuality read filter to reassign all good alignments to the default value of 60."
Also, I see that in the Variant Filtration step, a value of FS > 30 is recommended.
"As in DNA-seq, we recommend filtering based on Fisher Strand values (FS > 30.0) and Qual By Depth values (QD < 2.0)."
For the inbreeding coefficient, here are some caveats that might be helpful, such as how many founder individuals you have in your analysis. 12 samples may not be a good representation of the population, and pedigree could be affecting the per-site estimates.
Please post follow-up questions here.