We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

What are the standard resources for non-human genomes?

Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
edited January 2013 in Ask the GATK team

We're trying to put together some recommendations for folks who want to use GATK tools on non-human genomes. But we really don't have much experience with non-human genomes, so we're hoping that those of you in the GATK community who do will chime in and help your fellow scientists find the answers for a few common problems.

The most common problem seems to be finding sets of known sites for organisms like Drosophila, dogs, and various plants. If you know of such resources, please share your knowledge by commenting in this thread. You could earn upvotes and warm fuzzy feelings!

Post edited by Geraldine_VdAuwera on


  • gsonggsong Member

    I'm trying to run GATK tools for Yeast. I found that SNPs data for Yeast in http://gbrowse.princeton.edu/cgi-bin/gbrowse/yeast_strains_snps/, but the actual sequence of the predicted SNPs in the other yeast strains is not known; only the location of the SNP is predicted. Does GATK require the actual sequence of the predicted SNPs? Is it fine with the only information of the SNP location?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Giltae,
    For most usages (like indel realignment, base and variant recalibration) the GATK tools only use the locus information, not the alternate alleles. As long as the SNPs are in a valid VCF file, it should be okay.

  • FrancoisGuillaumeFrancoisGuillaume Jouy en josasMember

    Hi, I am working on cattle sequence, and to obtain known sites, I used a gvf file found on ensemble ftp site :

    The size of the gvf files vary between species but it's at least a starting point.
    I had to process a little bit the file (with two or three shell/awk command).

    1. I discarded SNV with two possible ALT allele (e.g. ALT=A,G)
    2. For insertion, I had to search in fasta file the ref allele to add it ahead of the insertion (e.g. convert REF=., ALT=ACCC to REF=C,ALT=CACCC)
    3. I discard SNP mapped on several chromosome (same rsid, several position)
    4. QUAL and INFO fields were all set to "."
    5. FILTER field was set to PASS

    All this transformations were done in order to be read by IGV (maybe GATK is more lenient), sweeping any complex cases. Now my questions, is there any informations that i should not have discarded ? Should the known sites be as numerous as possible or should they be as reliable as possible (and what criteria could help to assess one SNP quality...when no quality measure is available).

    Furthermore, I went on Illumina and Affimetrix website to download the map file associated to their highest density SNP chip, and based on these, create two additionnal vcf.

    I put all the QUALITY field to ".", but would it be worthwhile to give an arbitrary value (possibly higher than the one of the GVF file SNV)

    I hope this can help.

  • ebanksebanks Broad InstituteMember, Broadie, Dev ✭✭✭✭

    These are good questions.
    The last one is the easiest: as far as being a resource for known variation, we do not look at the QUAL field so setting that value to "." is okay.
    As for your first question, it really depends on what you are trying to do. In general, it's more important for your truth set to be as accurate as possible - even if it means missing some real sites because of it. The only thing I would consider doing differently is keeping the multi-allelic SNP sites (since they aren't inherently errorful provided they show up at the right frequency).

  • dkcrossmdkcrossm Member

    Hello! I have sequenced the whole genome of 19 C. elegans and would like to know where I can obtain a variant file for this species. Does one exist?

    I have tried running the BaseRecalibrator in GATK ver2 (already have run bwa->MarkDuplicates->Local realignment around indels) without the --knownSites option (which specifies where the variant file is located) but it comes back with an error and won't create the .grp file that is needed for the PrintReads -BQSR option. The error message is: "Invalid command line: This calculation is critically dependent on being able to skip over known variant sites. Please provide a VCF file containing known sites of genetic variation."

    Any suggestions? Is there a way to turn off the --knownSites option if a variant file for C. elegans doesn't exist? Or is there another way? I'm open to any suggestions. Thanks!

  • ebanksebanks Broad InstituteMember, Broadie, Dev ✭✭✭✭

    There is a section of the documentation that offers advice for those users processing organisms without a database of known variation. I'd recommend reading it.

  • PeteHaitchPeteHaitch Member

    I'm currently dealing with this issue in mouse. I'm using mm9/MGSCv37 because there seem to be more resources available for this reference than the latest mm10 build. There are a few databases of variation that I am trying to choose between:

    1. dbSNP128.txt.gz from UCSC (http://hgdownload.cse.ucsc.edu/goldenPath/mm9/database/snp128.txt.gz). It seems I should be able to convert this to VCF using either GATK or SAMtools but I can't figure out how to do it. Instead I convert it using my own less-than-ideal script (https://github.com/PeteHaitch/dbSNP2VCF). This script can only convert simple SNPs to VCF; multi-allelic sites and indels are ignored.
    2. The VCF created by Keane, T. M. et al. Mouse genomic variation and its effect on phenotypes and gene regulation. Nature 477, 289–294 (2011) that is available from ftp://ftp-mouse.sanger.ac.uk/current_snps/. This includes variants from 17 different strains of mouse/ I'm not sure whether to restrict this file to sites that are SNPs in my strain or in any of the 17 sequenced strains. Advice on this would be most welcome. There is also an indel VCF from this paper, available at ftp://ftp-mouse.sanger.ac.uk/current_indels/.
    3. The Ensembl variants. For mm10 these are "remapped from NCBIM37" but I don't know how this is done. For mm9 these variants are from dbSNP but I'm not sure which version of dbSNP or how to get a VCF of these (or download something I can convert to VCF).

    I'd welcome input from anyone who is using GATK for analysis mouse data or general advice on using databases of variation in mice, such as how to use strain-specific variants.

  • pdexheimerpdexheimer Member ✭✭✭✭
    edited August 2012

    @PeteHaitch: That Keane paper looks really useful, especially since the UCSC dbSNP track is now so old. I will probably leave the sites from all 17 strains in. I can only think of three places in the pipeline that variant info is used:

    1. BQSR - Having "too many" variants might cause a slight overestimate in quality since real mismatches may be discarded, but I can't imagine the effect being severe
    2. Indel Realignment - Runtime would be higher, but results shouldn't be changed
    3. VQSR - Again, results shouldn't change because only sites that are called variant in the cohort are examined

    For converting the UCSC file, there is indeed a very useful walker/Tribble codec in the GATK for this very purpose. You'll want to run something like java -jar GenomeAnalysisTK.jar -R mm9.fa -T VariantsToVCF --variant:OLDDBSNP dbSNP128.txt -o dbsnp128.vcf
    It looks like the OLDDBSNP codec isn't documented anymore, but I'm pretty sure it's still there

  • pdexheimerpdexheimer Member ✭✭✭✭

    Actually, now that I've read the paper a little more thoroughly, the Mus spretus sites should definitely go. I could see going either way on the other three "wild-derived" strains, and I would still leave in the 13 lab strains

  • PeteHaitchPeteHaitch Member
    edited August 2012

    @pdexheimer Thanks! That's just the sort of advice I was looking for regarding which strains of the Keane paper to retain. It's certainly looking the easiest option currently. A couple of minor things that you may need to be adjust before this can be used with GATK:

    1. Check chrom field matches your reference. For example, mm9 uses the "chr1, chr2, ..."-style of chromosome names whereas the Keane VCFs use the "1, 2, ..."-style. I think you'll need to alter the VCF so these match in order to use these VCFs with GATK.
    2. Conversion of the VCF to v4 or v4.1 (it is v3.3). java -jar GenomeAnalysisTK.jar -R mm9.fa -T VariantsToVCF --variant:VCF3 Keane.vcf -o up_versioned_Keane.vcf should do the trick but I haven't tested this.

    I also looked at VariantsToVCF yesterday. Can anyone from the GATK team tell us whether the OLDDBSNP codec is deprecated or is it just a case of the documentation going missing?

    I found a GVF file of the Ensembl68 variants (mapped to GRCm38) on the Ensembl website (http://asia.ensembl.org/info/data/ftp/index.html). I need to find a way to convert this to VCF - this tool (http://code.google.com/p/gvf2vcf/) claims to do so but I haven't tested it.

  • Mark_DePristoMark_DePristo Broad InstituteMember admin

    I believe oldDBSNP is still there. I see that the codec docs are missing. I'll ask for that to be fixed.

  • SophiaSophia Member

    We have used the Apple (Malus x domesticus) genome assembly from here:


    They also have some other resources like gene predictions and functional annotations.

  • kmdailykmdaily Member
    edited February 2013

    What about for custom made libraries based on the human genome? Agilent allows extra baits to be added, for which we designed for a virus. I get errors at the step of RealignerTargetCreator which may be due to this.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @kmdaily, assuming this means you have reads that align to the genome of the virus you are looking for, you need to add the virus contigs to the human genome reference. Once you've done that and formatted it properly, the rest of your analyses should proceed as normal. Unless you're not aligning to the human reference at all?

  • kmdailykmdaily Member

    Thanks for the fast response @Geraldine_VdAuwera! I added the virus contig to the reference and built all necessary indexes with this, performed alignment, etc. I get a "Input files known2 and reference have incompatible contigs: Order of contigs differences, which is unsafe." error when using the 1000G phase1 indels for hg19; maybe I incorrectly assumed that it was due to the virus being there. I'm using that vcf in conjunction with the Mills and 1000G gold standard file.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Sounds like the order of the contigs in your custom reference is different from the order of the canonical reference and the known sites vcf. The contigs have to be in the same order as explained here: http://www.broadinstitute.org/gatk/guide/article?id=1204

  • kmdailykmdaily Member

    I did read that article, thanks. What was strange is that it worked fine with the Mills and 1000G gold standard fine alone, and only fails when using the 1000G phase1 indels file (either alone or along with the Mills), which is why I didn't think there was a problem with the ordering. I will reorder the BAM file and re-try. Thanks again!

  • kmdailykmdaily Member

    I was using an old version of the bundle provided by our IT team (hg19 1.5). I downloaded the files directly from the GATK ftp site for 2.3, and after re-ordering the bam file it is working correctly with both files. Thank you for your help, @Geraldine_VdAuwera!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Glad to hear your problem is solved!

  • DavidRiesDavidRies Member ✭✭


    does anyone know where to obtain VariationRecalibration training resources for Arabidopsis Thaliana, or how to create them by yourself by using, for instance, the data from the http://1001genomes.org/? It should be possible to call reliable Variations from close to 1000 resequenced strains, no?

    I found dbSNP for it, but since it is no longer used in the recalibration step ...



  • fjrossellofjrossello Member


    I am calling variants in a mouse KO experiment where C57BL6NJ strain was used. As @PeterHaitch suggested in an earlier post, one could use the MGP indels and SNPs (ftp://ftp-mouse.sanger.ac.uk/current_snps/). @pdexheimer advised to "go either way on the other three "wild-derived" strains, and I would still leave in the 13 lab strains"

    Is there any reason why to include indels/SNPs sites of all strain plus those will-derived instead of restricting sites just to the strain one is using?

    Thanks in advance.



  • pdexheimerpdexheimer Member ✭✭✭✭

    Hi Fernando -

    I kind of view the question the other way around - "Is there any harm in leaving in sites from other strains?". Ultimately, what you're looking for is a catalog of reliably true variations. If there are variants in your catalog that aren't in your data, then there's no harm (at least for VQSR). And there's a chance that variants in other strains could exist in your strain. So I view this as a situation that might help and won't hurt.

    For BQSR, an excess of "truth" sites can be harmful as it would cause sites to be wrongly discarded from the statistical model. If you're dealing with whole genome, I can't see the number of sites in the Keane dataset causing problems. Actually, even for exomes I don't think it would hurt - but you could always leave them out if you're concerned.

  • fjrossellofjrossello Member

    Thanks @pdexheimer‌! I totally agree that it won't hurt using all strains for VQSR. My concerned was/is related to local indel realignments and BQSR. The current no. of reported indels for all strains (retrieved from the MGP ftp) is around 10 million compared to 40 thousand specific to my strain. A similar case can be observed for SNVs, around 60 million compared to 20 thousand variants reported to CL57BL6NJ.

    I am dealing with whole genome. Again, and sorry to insist, Do you think that this substantial differences in no. of sites could affect the analysis?

    Thanks in advance.



  • pdexheimerpdexheimer Member ✭✭✭✭

    Indel realignment, no. At worst it will slow you down, but it won't cause false indels to be inserted or anything like that.

    For BQSR, this is simply a semi-educated guess - I really have no idea if I'm right or not. The correct answer would be to run it both ways and compare the results of BQSR. But with that caveat, I would say that since you're using BL6 mice and therefore expect a very small amount of variation, including all of the cross-strain info would probably do more harm than good. If you were looking at, say, A/J mice and expected a more reasonable number of variants, it might not matter so much.

    But again, this is just a guess. Try it out!

  • fjrossellofjrossello Member

    Thanks again @pdexheimer‌! I am semi-educated in this topic as well and first time calling variants in mouse whole genome samples (always worked using human samples before). As you said, I will need to try it out.



  • AlexanderVAlexanderV BerlinMember

    Hi people,

    I have a set of SNPs that I used as input for Reclibration.
    Now it seems, that the recalibration lowers the BaseQuality too much, so SNPs that seem to be real SNPs are not being called.
    I find the SNPs in many individuals, but in some the region doesn't get triggered as an active region.
    One example (one site and two individuals, recalibrated data):
    The BaseQualities in the individual where it doesn't trigger are 12-15 (133 reads)
    in the individual where the SNP gets called there are a lot of Qualities up to 19 (74 reads)
    In the unrecalibrated I can find BQs up to 41.
    In both individuals there is not a single read that doesn't have the SNP.
    The mapping qualities of the reads are all 60.

    My SNP set is very small (69011 SNPs). It is microarray data, aligned to the reference.
    The area of the genome in question is not covered by the SNPs.

    I am wondering now, if it is a bad idea to use this set for recalibration and
    what I should do in this case.
    Maybe creating my own set by first not using recalibration, like suggested in best practices?
    Or going the other way and using my very restrictive set I have now and use this
    for recalibration and see how new SNPs evolve? Maybe in multiple cycles?

    Thanks for your input on this!

  • SheilaSheila Broad InstituteMember, Broadie ✭✭✭✭✭


    The BQSR model treats all mismatches to the reference as errors. The sites in the input knownSites file are masked because they are "known" variation, and we do not want to penalize them as errors. Because your knownSites file does not cover some regions in the genome, those regions will be penalized unfairly. That is why the quality scores are lower in your region of interest.

    In your case, I think it is a good idea to proceed with our suggestions in the Best Practices and try variant calling and recalibration until convergence. However, because you do have a small set of known variants, you will at least have a starting place for what filters to apply to your callset during hard filtering. You can plot the annotations of the known variants and use the most confident annotation thresholds as a standard.

    I hope this helps.


  • lovehardwarelovehardware AustraliaMember

    I got tomato SNP database between MicroTom and Heinz 1706, from Japanese TOMATOMICS databases



  • SheilaSheila Broad InstituteMember, Broadie ✭✭✭✭✭
  • ortizemortizem Pomona, CAMember


    I am working with a wild & domesticated flower plant genome. I would like to generate SNPs but there is no reference for these species. Can I use a close related species such as Arabidopsis for the dbSNP for recalibration? Please let me know.

    Thank you

  • SheilaSheila Broad InstituteMember, Broadie ✭✭✭✭✭


    Have a look at this article under "I'm working on a genome that doesn't really have a good SNP database yet. "


  • GuillefriisGuillefriis SpainMember
    edited July 2016


    I've been trying to do exactely that, using highly confindent SNPs from uncalibrated data to recalibrate it in the next round, always with the same outcome: I end by having no variants at all in the subsequent gvcfs. I wonder if I'm using too restrictive thresholds to select the SNPs from the first round, has anyone the same problem? Thanks a lot!


  • SheilaSheila Broad InstituteMember, Broadie ✭✭✭✭✭
    edited July 2016

    Hi Guillermo,

    What is the criteria you are using for choosing high confidence SNPs?

    Have a look at our basic hard filtering recommendations and ways to improve them here.


  • emartinezemartinez HoustonMember


    I am trying to run indel realignment and base recalibration for some public dog whole exome samples. I am using the vcf files from Ensembl CanFam3.1 (Canis_familiaris.vcf and Canis_familiaris_structural_variations.vcf). However, the process of indel realignment keeps requiring more memory. I want to ask: if we set the FILTER field to PASS in the vcf files (as suggested by other user), do this can increase memory efficiency? Or there is no effect?


  • SumuduSumudu Sri LankaMember


    Anybody working with Leishmania genome? Please share if you know of any high quality SNPs or indels that can be used as known sites.


  • ymwymw Member

    For those non-model species that almost have no chance to obtain Known sites file, how can we still do local alignment for indels for their genomes?

    For doing BaseRecalibration, at least we can use a bootstrap approach to do recalibration for several rounds to reach convergence. Can we also apply such bootstrap approach for local alignment (i.e, RealignerTargetCreator and IndelRealigner)? If so, how?

    I remembered that I used a old version of GATK to do local alignment without providing a known-sites file. Could it still work in that way for the new version?



  • SheilaSheila Broad InstituteMember, Broadie ✭✭✭✭✭
    edited September 2016

    Hi Chih-Ming,

    You can either use the bootstrapped VCF for both BQSR and Indel Realignment, or you can simply run Indel Realignment without the known sites file. RealignerTargetCreator will use the reads in your BAM file to determine regions that should be realigned.

    Also, we no longer recommend Indel Realignment if you are using HaplotypeCaller. Have a look at this post for more information.


  • BegaliBegali GermanyMember ✭✭

    hallo ,
    I work on RADSeq in NGS process generated for 5 samples of Cardamine hirsuta and 197 samples Arabidopsis thaliana .. now I am in filtering step which as I understood hard filtering in my case as there is no available resources for VQSR steps neither BQSR .. I have read a lot of comments which r related to same issue however how can I fix this problem for my dataset for plant ..also for plot analysis is there any command to apply it after I obtain my file.vcf of hard filtering step ..I am little confused or there is way to apply VQSR and BQSR steps for my problem .. kindly if any one has some help or hints drop here please ..

    Thanks in advance

  • everestial007everestial007 GreensboroMember ✭✭
    edited June 2017

    Since the data from RADSeq are very widely dispersed, I am not sure how much good it would be to apply BQSR and VQSR. I think BQSR relies on how much qualities of a base are correlated between nearby bases (or called variants) from same vs. other samples. But, you have to try and compare how much gain you had doing vs. no doing BQSR and VQSR

    • Since, RADseq is designed to capture regions of the genome in a more random way which may or may not overlap between samples. I think vcf generate from RADseq data will have similarities with vcf generated from RNAseq data in some way - I mean there will be lots of calledsites that are close to each other and also wide distance between several called sites, but check that. If you applied GVCF method and joint calling, I think you can proceed to hard filtering as suggested for RNAseq pipeline (i.e remove the SNPs that are clustered together) as @Sheila suggested https://software.broadinstitute.org/gatk/documentation/article?id=3891
      and then use those hard filtered (passed) variants to run BQSR or VQSR.

    • On VQSR step I would rely less on Ti/Tv ratio, because the Ti/Tv for plants are different than humans, but care more about how much False positives vs. True negatives are computed. But, since those values are just some computation I think there is more to digging to do about which are true vs. false variants. I am sure there are lots of True Variants reported for A. thaliana.

  • BegaliBegali GermanyMember ✭✭


    Thanks for your information .. I will try it ..

  • everestial007everestial007 GreensboroMember ✭✭

    Here is the method to plot the distribution of coverage of the alignment (bam/sam) files.

  • BegaliBegali GermanyMember ✭✭


    Thanks for your comments.. :smiley:

  • efratkefratk Member


    I want to use GATK for rat. Did someone use GATK for rat and know about known sites for it?


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    We've had questions about this in the past but we haven't kept track of responses -- hopefully someone from the rat research community will be able to help you. Sometimes it can be useful to look up published papers from your subfield and look up what those authors did.

Sign In or Register to comment.