best practices for calling variants in RNAseq

sandym (Colorado, Member)

I have been trying to follow your recommendations for processing RNA-seq data. For the most part the recommendations are easy to follow and implement (thank you!), BUT I've hit a snafu:
I am attempting to run BaseRecalibrator (GATK v3.5) on a non-model organism for which there is no set of known variants. The docs for this tool specifically state that --knownSites is optional, with a default of NA:
Optional Inputs
--knownSites NA A database of known polymorphic sites,
yet when I try to run without specifying this option I get this error:
ERROR MESSAGE: Invalid command line: This calculation is critically dependent on being able to mask out known variant sites. Please provide a VCF file containing known sites of genetic variation.

suggesting that a VCF file of known sites is required. I would think there should be a way to recalibrate for machine artifacts despite not knowing variants in advance. Why is this supposedly optional VCF input apparently not optional, and is there a way to recalibrate these BAM files without giving the program known variants?
Thank you!


  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Short answer: no. Although the knownSites argument is technically optional (for development reasons), in practice it is required by the algorithm.

    Long answer: The model that we use depends on having known variants because we use them to mask out sites that are likely to include real variants. That way we can count any mismatch at the remaining sites as an error. Of course this is an approximation; some novel SNPs will inevitably be counted as errors. But overall we find that this provides a reliable empirical estimate of quality. This in turn allows us to measure how quality covaries with various sequencing parameters such as cycle. If we can't mask out real variants, we can't make the approximation that allows us to measure empirical quality.

    Workaround: You can bootstrap a set of known variants by doing a first round of calling and filtering. See the documentation (esp. the recent workshop presentation slides) for details.
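One way the bootstrapping workaround could be scripted is sketched below. It only builds the GATK 3.x command lines for a single round: optionally recalibrate against the previous round's calls, then call, filter, and keep the passing variants as the next known-sites file. The jar path, file names, and `prefix` scheme are hypothetical, and the filter thresholds (FS > 30.0, QD < 2.0, clusters of 3 SNPs in a 35-base window) are assumed from the RNA-seq hard-filtering recommendations rather than taken from this thread, so adjust them for your data.

```python
import subprocess

# Assumed location of the GATK 3.x jar; change to match your installation.
GATK = ["java", "-jar", "GenomeAnalysisTK.jar"]


def bootstrap_round_commands(ref, bam, prefix, known_sites=None):
    """Build the GATK 3.x commands for one bootstrap round.

    known_sites=None  -> round 0: call variants on the unrecalibrated BAM.
    known_sites=<vcf> -> recalibrate against that VCF first, then call again.
    Returns (list of commands, path of the filtered VCF this round produces).
    """
    cmds = []
    call_bam = bam
    if known_sites is not None:
        table = f"{prefix}.recal.table"
        call_bam = f"{prefix}.recal.bam"
        # Model base-quality errors, masking the bootstrapped variant sites.
        cmds.append(GATK + ["-T", "BaseRecalibrator", "-R", ref, "-I", bam,
                            "-knownSites", known_sites, "-o", table])
        # Write a BAM with the recalibrated qualities applied.
        cmds.append(GATK + ["-T", "PrintReads", "-R", ref, "-I", bam,
                            "-BQSR", table, "-o", call_bam])

    raw_vcf = f"{prefix}.raw.vcf"
    flagged_vcf = f"{prefix}.flagged.vcf"
    passed_vcf = f"{prefix}.passed.vcf"
    # Call variants with RNA-seq style settings (assumed, see lead-in).
    cmds.append(GATK + ["-T", "HaplotypeCaller", "-R", ref, "-I", call_bam,
                        "-dontUseSoftClippedBases", "-stand_call_conf", "20.0",
                        "-o", raw_vcf])
    # Flag low-confidence calls; thresholds are assumptions, not from this thread.
    cmds.append(GATK + ["-T", "VariantFiltration", "-R", ref, "-V", raw_vcf,
                        "-window", "35", "-cluster", "3",
                        "-filterName", "FS", "-filter", "FS > 30.0",
                        "-filterName", "QD", "-filter", "QD < 2.0",
                        "-o", flagged_vcf])
    # Keep only passing calls; these become the next round's known sites.
    cmds.append(GATK + ["-T", "SelectVariants", "-R", ref, "-V", flagged_vcf,
                        "--excludeFiltered", "-o", passed_vcf])
    return cmds, passed_vcf


def run(cmds):
    """Execute the commands in order, stopping at the first failure."""
    for cmd in cmds:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print (rather than run) two rounds for hypothetical input files.
    round0, known = bootstrap_round_commands("ref.fasta", "sample.bam", "round0")
    round1, known = bootstrap_round_commands("ref.fasta", "sample.bam", "round1", known)
    for cmd in round0 + round1:
        print(" ".join(cmd))
```

Each round always recalibrates the original BAM, using the filtered calls from the previous round as the known sites. Rounds are typically repeated until the recalibration results stop changing noticeably; how many rounds that takes depends on the data.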

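The masking argument in the long answer can be illustrated with a toy calculation. This is not GATK's implementation, only a minimal sketch of the idea: skip known variant sites, count every remaining mismatch as an error, and convert the error rate to a Phred-scaled quality. The function name, the data layout, and the +1/+2 smoothing are assumptions made for the example.

```python
import math


def empirical_quality(observations, known_sites):
    """Phred-scaled empirical quality from (position, read_base, ref_base) tuples.

    Positions in known_sites are masked (skipped); every mismatch at the
    remaining positions is counted as a sequencing error.
    """
    n_obs = 0
    n_err = 0
    for pos, read_base, ref_base in observations:
        if pos in known_sites:
            continue  # likely a real variant, so it tells us nothing about errors
        n_obs += 1
        if read_base != ref_base:
            n_err += 1
    # Simple smoothing (an assumption of this sketch) to avoid log10(0).
    error_rate = (n_err + 1) / (n_obs + 2)
    return -10 * math.log10(error_rate)


# Toy data: 1000 positions, 10 reads each, reference base always "A".
true_variants = {250, 750}   # every read shows "G" at these sites
error_positions = {10, 510}  # one read shows a "C" sequencing error here
observations = []
for pos in range(1000):
    for read in range(10):
        if pos in true_variants:
            base = "G"
        elif pos in error_positions and read == 0:
            base = "C"
        else:
            base = "A"
        observations.append((pos, base, "A"))

masked = empirical_quality(observations, known_sites=true_variants)
unmasked = empirical_quality(observations, known_sites=set())
print(f"masked: Q{masked:.1f}  unmasked: Q{unmasked:.1f}")
```

In this toy data set, masking the two real variant sites gives an estimate of about Q35, while leaving them unmasked drags it down to about Q26, because the real variants are miscounted as sequencing errors.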