How to generate a known-SNP database for a non-model species?

To generate a VCF file containing multiple samples, it seems that I need a database of known SNPs or variants.
I want to generate a VCF file containing multiple bird samples so that I can analyze population genomic data with plink,
so I think I can use UnifiedGenotyper to generate the VCF.
From the UG tool docs, I understand that I may need a database.vcf file to tell GATK where the known substitution sites are in the bird genome, but no known SNP or INDEL sites exist yet for my study system.

I already have variant calls for 40 birds, which have been through two rounds of recalibration.
So, can I generate the known sites from my samples' VCF files and use them to call variants across multiple samples?

Best Answers


  • shlee Cambridge Member, Broadie ✭✭✭✭✭
    Accepted Answer

    Hi again @tytolin,

    The forum contains a number of discussions on bootstrapping your own known variants resource. You can start by reading here and here.

    Having a known variants resource and using it during BQSR and VQSR is part of the recommended best practices. However, you should know that the tools do not strictly require one, and it is possible for you to call variants on your 40-sample cohort directly from your alignments. This is how you start the bootstrapping process of generating a quality known variants resource.

    Please check out the features of HaplotypeCaller towards joint calling on your cohort. For reference, here is the HaplotypeCaller tool document: https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_haplotypecaller_HaplotypeCaller.php#--alleles.
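
One bootstrapping round of the kind described above could be sketched roughly as follows, assuming GATK 3.x-style command lines like the one posted later in this thread. The file names, the `bootstrap_round` and `run` helpers, and the filter thresholds are illustrative assumptions, not values from the thread, and a single filter expression for SNPs and indels together is a simplification. The script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch of one bootstrapping round (GATK 3.x-style syntax).
# Paths and thresholds below are placeholders (assumptions), not recommendations.
GATK_JAR="${GATK_JAR:-GenomeAnalysisTK.jar}"
REF="${REF:-reference.fa}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands only, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
# Note: the printed form does not show the quoting of multi-word arguments.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# bootstrap_round <input.bam> <prefix>  (hypothetical helper)
bootstrap_round() {
    bam="$1"
    prefix="$2"

    # 1. Call raw variants from the alignments, with no known-sites file.
    run java -jar "$GATK_JAR" -T HaplotypeCaller -R "$REF" \
        -I "$bam" -o "$prefix.raw.vcf"

    # 2. Mark low-confidence calls with a hard filter (placeholder thresholds).
    run java -jar "$GATK_JAR" -T VariantFiltration -R "$REF" \
        -V "$prefix.raw.vcf" \
        --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0" \
        --filterName "bootstrap_hard_filter" \
        -o "$prefix.marked.vcf"

    # 3. Keep only the calls that passed the filter: the provisional known sites.
    run java -jar "$GATK_JAR" -T SelectVariants -R "$REF" \
        -V "$prefix.marked.vcf" --excludeFiltered -o "$prefix.known.vcf"

    # 4. Recalibrate base qualities against the provisional known sites.
    run java -jar "$GATK_JAR" -T BaseRecalibrator -R "$REF" \
        -I "$bam" -knownSites "$prefix.known.vcf" -o "$prefix.recal.table"
    run java -jar "$GATK_JAR" -T PrintReads -R "$REF" \
        -I "$bam" -BQSR "$prefix.recal.table" -o "$prefix.recal.bam"
}

# Example: print the commands for one sample and one round.
bootstrap_round babbler01.bam round1
```

The recalibrated BAM from one round would be the input to the next, and the rounds would typically be repeated until the recalibration results stop changing noticeably.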

  • tytolin Member

    Hello @shlee
    I am using GATK 3.6 now, so is it still recommended to use HaplotypeCaller in GATK 3.6 to call variants?

  • tytolin Member

    Yes, I am calling short variants (SNPs and indels) now. I'll try GATK 4 and compare the output with GATK 3.6.
    From the HC tool docs, I learned that there might be some problems when running HC in -nct multithreading mode. So, if I need to analyze SNPs or indels, is the best choice to use Queue or WDL to parallelize HC?

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin
    Accepted Answer

    In GATK4 there is no more nt or nct mode. You are correct that we recommend using WDL to build a parallelized pipeline; you can find the scripts we share in the Best Practices section.

    We are working on a version of HC that uses Spark for multithreading but it’s not ready for general use yet.
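
As a rough illustration of the scatter-gather idea behind a parallelized pipeline, here is a plain-shell sketch (not WDL, and not the Best Practices scripts) that runs GATK4 HaplotypeCaller once per sample in GVCF mode and then genotypes the samples jointly. The file names and the `scatter_call`, `joint_genotype`, and `run` helpers are assumptions made for illustration. The script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch: scatter HaplotypeCaller per sample, then gather for joint genotyping
# (GATK4-style syntax). Paths are placeholders (assumptions).
REF="${REF:-reference.fa}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands only, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# scatter_call <a.bam> <b.bam> ...  (hypothetical helper)
# Starts one background job per BAM. With many samples you would want to cap
# the number of simultaneous jobs; this sketch does not.
scatter_call() {
    for bam in "$@"; do
        sample=$(basename "$bam" .bam)
        run gatk HaplotypeCaller -R "$REF" -I "$bam" \
            -O "$sample.g.vcf.gz" -ERC GVCF &
    done
    wait   # block until every per-sample job has finished
}

# joint_genotype <cohort-prefix> <a.bam> <b.bam> ...  (hypothetical helper)
joint_genotype() {
    cohort="$1"
    shift
    vargs=""
    for bam in "$@"; do
        vargs="$vargs -V $(basename "$bam" .bam).g.vcf.gz"
    done
    # $vargs is left unquoted on purpose so that it splits into separate
    # arguments; this assumes file names without spaces.
    run gatk CombineGVCFs -R "$REF" $vargs -O "$cohort.g.vcf.gz"
    run gatk GenotypeGVCFs -R "$REF" -V "$cohort.g.vcf.gz" -O "$cohort.vcf.gz"
}

# Example: print the commands for two samples.
scatter_call babbler01.bam babbler02.bam
joint_genotype babbler_01_02 babbler01.bam babbler02.bam
```

A workflow engine such as WDL with Cromwell handles the same scatter and gather steps with job scheduling, retries, and resource limits, which this sketch leaves out.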

  • tytolin Member

    Thanks a lot.
    I'll try WDL to build a parallelized pipeline.

  • tytolin Member

    Can I call short variants on multiple samples with HC in GATK 4.0, as I do with HC in GATK 3.6?
    I mean, can I use the same arguments for HC in GATK 4.0 that I am using for HC in GATK 3.6 to call SNPs and indels on multiple samples?

    Here's my command line in GATK 3.6.
    nohup java -jar /opt/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar -T HaplotypeCaller -R ~/new_babbler_fa/platanus_trimmed_wNmp_kraken_gapClosed_1000.fa -I babbler01_recal02_reheader.bam -I babbler02_recal02_reheader.bam -o babbler_01_02.raw.vcf & 

    I also ran this command with -nct 15, and it took 19 hours to generate the VCF file.
    The single-threaded run has taken 31 hours so far and still has not finished.
    Does this mean that HC in GATK 3.6 is really that slow?
