Hi all,
I'm currently analysing non-human mammalian whole-genome data (>30x coverage). No previous variant databases are available.
I'm currently at the VariantFiltration step. I came across the following command, which was used for human data, and I'm wondering whether it is suitable for non-human data:
java -Xmx10g -jar GenomeAnalysisTK.jar \
-R [reference.fasta] \
-T VariantFiltration \
--variant [input.recalibrated.vcf] \
-o [recalibrated.filtered.vcf] \
--clusterWindowSize 10 \
--filterExpression "MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1)" \
--filterName "HARD_TO_VALIDATE" \
--filterExpression "DP < 5 " \
--filterName "LowCoverage" \
--filterExpression "QUAL < 30.0 " \
--filterName "VeryLowQual" \
--filterExpression "QUAL >= 30.0 && QUAL < 50.0 " \
--filterName "LowQual" \
--filterExpression "QD < 1.5 " \
--filterName "LowQD" \
--filterExpression "SB > -10.0 " \
--filterName "StrandBias"
I would appreciate your thoughts on this matter.
Thank you very much!
Sagi
Geraldine_VdAuwera
Posts: 2,239 admin
Whether the data is human or not shouldn't matter too much; what matters more is the size and quality of your dataset. If you're using a command lifted from someone else's study, try to evaluate how similar your dataset is to theirs. It can also be very helpful to look at some variants from each of the filter categories in a genome viewer such as IGV and evaluate how "real" they look in their genomic context. That can help you adjust your thresholds as needed.
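As a rough illustration of that kind of spot check, here is a minimal Python sketch that groups the sites in a filtered VCF by the names in the FILTER column and picks a few random loci per category to paste into IGV. It only assumes the standard tab-separated VCF layout, with FILTER as the seventh column and multiple filter names joined by `;`. The function names `sites_by_filter` and `sample_for_igv` and the example records are made up for illustration; they are not part of GATK.

```python
import random
from collections import defaultdict


def sites_by_filter(vcf_lines):
    """Group variant sites by each filter name found in the FILTER column."""
    groups = defaultdict(list)
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, filt = fields[0], int(fields[1]), fields[6]
        # A site can fail several filters; VCF joins their names with ';'.
        for name in filt.split(";"):
            groups[name].append((chrom, pos))
    return dict(groups)


def sample_for_igv(groups, per_filter=20, seed=0):
    """Pick up to per_filter random sites per category as 'chrom:pos' loci."""
    rng = random.Random(seed)  # fixed seed so the same sites are picked each run
    picked = {}
    for name, sites in groups.items():
        chosen = rng.sample(sites, min(per_filter, len(sites)))
        picked[name] = [f"{chrom}:{pos}" for chrom, pos in sorted(chosen)]
    return picked


# Made-up records for illustration only; in practice, pass an open file handle
# for the VCF written by the VariantFiltration step.
example = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t45.0\tLowQual;LowQD\tDP=12\n",
    "chr1\t250\t.\tC\tT\t900.0\tPASS\tDP=35\n",
    "chr2\t75\t.\tG\tA\t20.0\tVeryLowQual\tDP=3\n",
]
groups = sites_by_filter(example)
for name, sites in sorted(groups.items()):
    print(f"{name}\t{len(sites)}")
for name, loci in sorted(sample_for_igv(groups, per_filter=5).items()):
    print(f"{name}: {', '.join(loci)}")
```

The per-category counts give a quick sense of whether a threshold is removing a suspiciously large share of calls, and the sampled loci can be typed into IGV's locus box one at a time. Sampling at random rather than taking the first few records avoids looking only at the start of the first chromosome.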