How do I run BaseRecalibrator without known SNPs/indels?

CharlesDavid (New Zealand, Member)

I am trying to run BaseRecalibrator on plant data for which I have no known SNP or indel data. The documentation clearly states that providing a set of known variants is OPTIONAL, but the program exits with an error. What is going on, and how do I run the program when no known SNPs are available? I have followed the pipeline given in the Best Practices and want to see it through.

Documentation for the tool:

https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_bqsr_BaseRecalibrator.php

States the following:

"Optional Inputs"
"--knownSites NA A database of known polymorphic sites"

However, I get this message:

ERROR MESSAGE: Invalid command line: This calculation is critically dependent on being able to mask out known variant sites. Please provide a VCF file containing known sites of genetic variation.

Should I simply skip this step and run HaplotypeCaller directly? Will that yield equally good results?

Thanks!

Answers

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Please do a search for "bootstrapping BQSR" on the forum. We have covered this a number of times.

  • CharlesDavid (New Zealand, Member)

    Thanks for your reply and suggestion. While the forum does cover this issue, it would be helpful if the documentation stated the requirement correctly, so that drilling down into the forum would not be necessary. We are also concerned that this bootstrapping process could inject a strong bias into the results. Because we are working with an organism that has few established variants, we would want to benchmark the calls before relying on them.

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @CharlesDavid
    Hi,

    The tool documentation that you referenced above does say that the required inputs include the -knownSites file. However, I will see what we can do about making it more obvious that users should absolutely input the known sites file :smile:

    As for your concerns, you should have nothing to worry about. The known sites file simply masks out sites that may have real variation. If there are real error trends in your data, BQSR will be able to pick them up. You can always try running your pipeline with and without BQSR and compare the two runs.

    Let us know what you find!

    -Sheila
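
The comparison suggested above (one run with BQSR, one without) can be started with a simple site-level concordance check between the two resulting VCF files. The following is a minimal Python sketch, not a GATK tool: the function names `read_sites` and `compare_call_sets` are hypothetical, and it assumes plain or gzipped VCF files in which passing records have `PASS` or `.` in the FILTER column. The same check could, in principle, be used to see whether successive bootstrapping rounds have stopped changing the call set.

```python
import gzip


def read_sites(vcf_path):
    """Collect (chrom, pos, ref, alt) for every unfiltered record in a VCF.

    Assumption: records with FILTER equal to "PASS" or "." count as passing.
    Multi-allelic records are split into one entry per ALT allele.
    """
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    sites = set()
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and header lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 7:
                continue  # skip malformed or empty lines
            chrom, pos, _id, ref, alt, _qual, flt = fields[:7]
            if flt not in ("PASS", "."):
                continue  # skip records that failed a filter
            for allele in alt.split(","):
                sites.add((chrom, int(pos), ref, allele))
    return sites


def compare_call_sets(vcf_a, vcf_b):
    """Summarize the overlap between two call sets as counts and a Jaccard index."""
    a = read_sites(vcf_a)
    b = read_sites(vcf_b)
    shared = a & b
    union = a | b
    return {
        "only_a": len(a - b),
        "only_b": len(b - a),
        "shared": len(shared),
        # Two empty call sets are treated as identical.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }
```

A call such as `compare_call_sets("no_bqsr.vcf", "with_bqsr.vcf")` (hypothetical file names) returns the counts of sites unique to each run and shared by both. Concordance alone does not show which run is more accurate; it only shows how much the recalibration step changes the calls, so the discordant sites are the ones worth inspecting by hand.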
