
Base recalibration and variant calling (with joint genotyping) for a non-model organism

nkobmoo (Paris, Member)
edited March 2016 in Ask the GATK team


If no database of known sites exists for a non-model organism, what is the best strategy for base recalibration and variant calling with joint genotyping across several samples?

According to the GATK Best Practices, when no database of known sites is available, the recommendation is to run HaplotypeCaller, hard-filter the SNPs, and then use the resulting SNPs as known sites for base recalibration, repeating until convergence. As I have many samples (31), I would have to do this for every sample. And since I want to do the final SNP calling with joint genotyping after the base recalibration, I think this approach is going to be really time-consuming...

So I'm wondering whether I can run an initial SNP calling (with HaplotypeCaller) on each sample, then joint-genotype the samples to obtain a global SNP set that I will filter manually. This joint, filtered SNP set could then be used as the known sites for the base recalibration, followed by a second round of joint genotyping. I think it would be faster this way. However, I'm not sure whether this approach has caveats, and I would appreciate your advice.
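To make the proposed order of steps concrete, here is a minimal sketch that only builds (it does not run) the command lines for one bootstrap round. It assumes GATK 3.x-style invocations; the jar path, all output file names, the filter name `hard_filter`, and the function name `bootstrap_round` are hypothetical, and the filter expression is a placeholder to be replaced by whatever manual filtering is chosen.

```python
from typing import List

# Assumed invocation of a GATK 3.x jar; adjust the path and Java options as needed.
GATK = ["java", "-jar", "GenomeAnalysisTK.jar"]


def bootstrap_round(ref: str, bams: List[str], filter_expr: str) -> List[List[str]]:
    """Build (not run) the commands for one bootstrap round of recalibration."""
    cmds: List[List[str]] = []
    gvcfs: List[str] = []

    # 1. Per-sample calling in GVCF mode.
    for bam in bams:
        stem = bam[:-4] if bam.endswith(".bam") else bam
        gvcf = stem + ".g.vcf"
        gvcfs.append(gvcf)
        cmds.append(GATK + ["-T", "HaplotypeCaller", "-R", ref, "-I", bam,
                            "--emitRefConfidence", "GVCF", "-o", gvcf])

    # 2. Joint genotyping across all per-sample GVCFs.
    joint = GATK + ["-T", "GenotypeGVCFs", "-R", ref, "-o", "joint.vcf"]
    for gvcf in gvcfs:
        joint += ["--variant", gvcf]
    cmds.append(joint)

    # 3. Keep SNPs only, flag sites failing the filter, then drop flagged sites.
    cmds.append(GATK + ["-T", "SelectVariants", "-R", ref, "-V", "joint.vcf",
                        "-selectType", "SNP", "-o", "snps.vcf"])
    cmds.append(GATK + ["-T", "VariantFiltration", "-R", ref, "-V", "snps.vcf",
                        "--filterExpression", filter_expr,
                        "--filterName", "hard_filter",  # hypothetical name
                        "-o", "snps.flagged.vcf"])
    cmds.append(GATK + ["-T", "SelectVariants", "-R", ref, "-V", "snps.flagged.vcf",
                        "--excludeFiltered", "-o", "known_snps.vcf"])

    # 4. Per-sample recalibration tables using the filtered SNPs as known sites.
    for bam in bams:
        stem = bam[:-4] if bam.endswith(".bam") else bam
        cmds.append(GATK + ["-T", "BaseRecalibrator", "-R", ref, "-I", bam,
                            "-knownSites", "known_snps.vcf",
                            "-o", stem + ".recal.table"])
    return cmds
```

Calling `bootstrap_round("ref.fasta", ["s1.bam", "s2.bam"], "QD < 2.0 || FS > 60.0")` returns eight argument lists that can be printed with `" ".join(cmd)` for inspection. Applying the recalibration tables and repeating the round until convergence are left out of the sketch.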

Thank you very much in advance.


Best Answer


  • Sheila (Broad Institute; Member, Broadie)

    Hi again Noppol,

    I have a few follow-up comments. We have not tested whether HaplotypeCaller in normal mode or HaplotypeCaller in GVCF mode plus GenotypeGVCFs is faster at your scale (around 30 samples). What I can tell you is that, in normal mode, the compute requirements for HaplotypeCaller grow pretty much exponentially as you add samples, so there is a possibility that it will be slower than running the individual samples in GVCF mode.

    You can try running HaplotypeCaller in normal mode on smaller subsets (maybe 5-6 samples each) and then merging the VCFs. That might be a reasonable approach. Again, we have never tested this, so it is hard for us to say exactly what the best approach is.
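As a small illustration of the subset idea, the sketch below splits a sample list into near-equal batches of at most a given size. The function name `split_into_batches` and the sample names are hypothetical, and how the per-batch VCFs are merged afterwards is not covered here.

```python
import math
from typing import List


def split_into_batches(samples: List[str], max_size: int) -> List[List[str]]:
    """Split samples into the fewest batches of at most max_size, sizes balanced."""
    if not samples:
        return []
    n_batches = math.ceil(len(samples) / max_size)
    base, extra = divmod(len(samples), n_batches)
    batches: List[List[str]] = []
    start = 0
    for i in range(n_batches):
        # The first `extra` batches take one additional sample.
        size = base + (1 if i < extra else 0)
        batches.append(samples[start:start + size])
        start += size
    return batches
```

With 31 samples and `max_size=6`, this gives six batches: one of 6 samples and five of 5.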

    The reason we recommend using the GVCF workflow is for scalability and ease of adding samples later on. Have a look at this document for more information.


    P.S. Do let us know how things work out! :smile:

  • nkobmoo (Paris, Member)

    Hi Sheila,

    Thank you for your answers. In the end, I decided to first run HaplotypeCaller in GVCF mode on all my samples, then do the joint genotyping, the base recalibration, and so on. I haven't finished yet, but it does not seem to take that much time.

