The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

#### ☞ Did you remember to?

1. Search using the upper-right search box, e.g. using the error message.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

#### ☞ Formatting tip!

Surround blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks (  ) each to make a code block.
GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

# What are the standard resources for non-human genomes?

edited January 2013

We're trying to put together some recommendations for folks who want to use GATK tools on non-human genomes. But we really don't have much experience with non-human genomes, so we're hoping that those of you in the GATK community who do will chime in and help your fellow scientists find the answers for a few common problems.

The most common problem seems to be finding sets of known sites for organisms like Drosophila, dogs, and various plants. If you know of such resources, please share your knowledge by commenting in this thread. You could earn upvotes and warm fuzzy feelings!

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Tagged:

• Member Posts: 3

I'm trying to run GATK tools for Yeast. I found that SNPs data for Yeast in http://gbrowse.princeton.edu/cgi-bin/gbrowse/yeast_strains_snps/, but the actual sequence of the predicted SNPs in the other yeast strains is not known; only the location of the SNP is predicted. Does GATK require the actual sequence of the predicted SNPs? Is it fine with the only information of the SNP location?

Giltae Song, Ph.D.
Postdoctoral researcher
Department of Genetics
School of Medicine
Stanford

Hi Giltae,
For most usages (like indel realignment, base and variant recalibration) the GATK tools only use the locus information, not the alternate alleles. As long as the SNPs are in a valid VCF file, it should be okay.

Geraldine Van der Auwera, PhD

• Jouy en josasMember Posts: 4

Hi, I am working on cattle sequence, and to obtain known sites, I used a gvf file found on ensemble ftp site :
ftp://ftp.ensembl.org/pub/release-68/variation/gvf/

The size of the gvf files vary between species but it's at least a starting point.
I had to process a little bit the file (with two or three shell/awk command).

1. I discarded SNV with two possible ALT allele (e.g. ALT=A,G)
2. For insertion, I had to search in fasta file the ref allele to add it ahead of the insertion (e.g. convert REF=., ALT=ACCC to REF=C,ALT=CACCC)
3. I discard SNP mapped on several chromosome (same rsid, several position)
4. QUAL and INFO fields were all set to "."
5. FILTER field was set to PASS

All this transformations were done in order to be read by IGV (maybe GATK is more lenient), sweeping any complex cases. Now my questions, is there any informations that i should not have discarded ? Should the known sites be as numerous as possible or should they be as reliable as possible (and what criteria could help to assess one SNP quality...when no quality measure is available).

Furthermore, I went on Illumina and Affimetrix website to download the map file associated to their highest density SNP chip, and based on these, create two additionnal vcf.

I put all the QUALITY field to ".", but would it be worthwhile to give an arbitrary value (possibly higher than the one of the GVF file SNV)

I hope this can help.

These are good questions.
The last one is the easiest: as far as being a resource for known variation, we do not look at the QUAL field so setting that value to "." is okay.
As for your first question, it really depends on what you are trying to do. In general, it's more important for your truth set to be as accurate as possible - even if it means missing some real sites because of it. The only thing I would consider doing differently is keeping the multi-allelic SNP sites (since they aren't inherently errorful provided they show up at the right frequency).

Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT

• Member Posts: 3

Hello! I have sequenced the whole genome of 19 C. elegans and would like to know where I can obtain a variant file for this species. Does one exist?

I have tried running the BaseRecalibrator in GATK ver2 (already have run bwa->MarkDuplicates->Local realignment around indels) without the --knownSites option (which specifies where the variant file is located) but it comes back with an error and won't create the .grp file that is needed for the PrintReads -BQSR option. The error message is: "Invalid command line: This calculation is critically dependent on being able to skip over known variant sites. Please provide a VCF file containing known sites of genetic variation."

Any suggestions? Is there a way to turn off the --knownSites option if a variant file for C. elegans doesn't exist? Or is there another way? I'm open to any suggestions. Thanks!

There is a section of the documentation that offers advice for those users processing organisms without a database of known variation. I'd recommend reading it.

Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT

• Member Posts: 19

I'm currently dealing with this issue in mouse. I'm using mm9/MGSCv37 because there seem to be more resources available for this reference than the latest mm10 build. There are a few databases of variation that I am trying to choose between:

1. dbSNP128.txt.gz from UCSC (http://hgdownload.cse.ucsc.edu/goldenPath/mm9/database/snp128.txt.gz). It seems I should be able to convert this to VCF using either GATK or SAMtools but I can't figure out how to do it. Instead I convert it using my own less-than-ideal script (https://github.com/PeteHaitch/dbSNP2VCF). This script can only convert simple SNPs to VCF; multi-allelic sites and indels are ignored.
2. The VCF created by Keane, T. M. et al. Mouse genomic variation and its effect on phenotypes and gene regulation. Nature 477, 289–294 (2011) that is available from ftp://ftp-mouse.sanger.ac.uk/current_snps/. This includes variants from 17 different strains of mouse/ I'm not sure whether to restrict this file to sites that are SNPs in my strain or in any of the 17 sequenced strains. Advice on this would be most welcome. There is also an indel VCF from this paper, available at ftp://ftp-mouse.sanger.ac.uk/current_indels/.
3. The Ensembl variants. For mm10 these are "remapped from NCBIM37" but I don't know how this is done. For mm9 these variants are from dbSNP but I'm not sure which version of dbSNP or how to get a VCF of these (or download something I can convert to VCF).

I'd welcome input from anyone who is using GATK for analysis mouse data or general advice on using databases of variation in mice, such as how to use strain-specific variants.

• Member, Dev Posts: 544 ✭✭✭✭
edited August 2012

@PeteHaitch: That Keane paper looks really useful, especially since the UCSC dbSNP track is now so old. I will probably leave the sites from all 17 strains in. I can only think of three places in the pipeline that variant info is used:

1. BQSR - Having "too many" variants might cause a slight overestimate in quality since real mismatches may be discarded, but I can't imagine the effect being severe
2. Indel Realignment - Runtime would be higher, but results shouldn't be changed
3. VQSR - Again, results shouldn't change because only sites that are called variant in the cohort are examined

For converting the UCSC file, there is indeed a very useful walker/Tribble codec in the GATK for this very purpose. You'll want to run something like java -jar GenomeAnalysisTK.jar -R mm9.fa -T VariantsToVCF --variant:OLDDBSNP dbSNP128.txt -o dbsnp128.vcf
It looks like the OLDDBSNP codec isn't documented anymore, but I'm pretty sure it's still there

• Member, Dev Posts: 544 ✭✭✭✭

Actually, now that I've read the paper a little more thoroughly, the Mus spretus sites should definitely go. I could see going either way on the other three "wild-derived" strains, and I would still leave in the 13 lab strains

• Member Posts: 19
edited August 2012

@pdexheimer Thanks! That's just the sort of advice I was looking for regarding which strains of the Keane paper to retain. It's certainly looking the easiest option currently. A couple of minor things that you may need to be adjust before this can be used with GATK:

1. Check chrom field matches your reference. For example, mm9 uses the "chr1, chr2, ..."-style of chromosome names whereas the Keane VCFs use the "1, 2, ..."-style. I think you'll need to alter the VCF so these match in order to use these VCFs with GATK.
2. Conversion of the VCF to v4 or v4.1 (it is v3.3). java -jar GenomeAnalysisTK.jar -R mm9.fa -T VariantsToVCF --variant:VCF3 Keane.vcf -o up_versioned_Keane.vcf should do the trick but I haven't tested this.

I also looked at VariantsToVCF yesterday. Can anyone from the GATK team tell us whether the OLDDBSNP` codec is deprecated or is it just a case of the documentation going missing?

I found a GVF file of the Ensembl68 variants (mapped to GRCm38) on the Ensembl website (http://asia.ensembl.org/info/data/ftp/index.html). I need to find a way to convert this to VCF - this tool (http://code.google.com/p/gvf2vcf/) claims to do so but I haven't tested it.

I believe oldDBSNP is still there. I see that the codec docs are missing. I'll ask for that to be fixed.

--
Mark A. DePristo, Ph.D.
Co-Director, Medical and Population Genetics
Broad Institute of MIT and Harvard

• Member Posts: 2

These seem to be good resources for many reference genomes.

http://www.ensembl.org/info/data/ftp/index.html

ftp://ftp.ncbi.nih.gov/genomes/

• Member Posts: 50

We have used the Apple (Malus x domesticus) genome assembly from here:

http://www.rosaceae.org/node/475

They also have some other resources like gene predictions and functional annotations.

• Member Posts: 17
edited February 2013

What about for custom made libraries based on the human genome? Agilent allows extra baits to be added, for which we designed for a virus. I get errors at the step of RealignerTargetCreator which may be due to this.

@kmdaily, assuming this means you have reads that align to the genome of the virus you are looking for, you need to add the virus contigs to the human genome reference. Once you've done that and formatted it properly, the rest of your analyses should proceed as normal. Unless you're not aligning to the human reference at all?

Geraldine Van der Auwera, PhD

• Member Posts: 17

Thanks for the fast response @Geraldine_VdAuwera! I added the virus contig to the reference and built all necessary indexes with this, performed alignment, etc. I get a "Input files known2 and reference have incompatible contigs: Order of contigs differences, which is unsafe." error when using the 1000G phase1 indels for hg19; maybe I incorrectly assumed that it was due to the virus being there. I'm using that vcf in conjunction with the Mills and 1000G gold standard file.

Sounds like the order of the contigs in your custom reference is different from the order of the canonical reference and the known sites vcf. The contigs have to be in the same order as explained here: http://www.broadinstitute.org/gatk/guide/article?id=1204

Geraldine Van der Auwera, PhD

• Member Posts: 17

I did read that article, thanks. What was strange is that it worked fine with the Mills and 1000G gold standard fine alone, and only fails when using the 1000G phase1 indels file (either alone or along with the Mills), which is why I didn't think there was a problem with the ordering. I will reorder the BAM file and re-try. Thanks again!

• Member Posts: 17

I was using an old version of the bundle provided by our IT team (hg19 1.5). I downloaded the files directly from the GATK ftp site for 2.3, and after re-ordering the bam file it is working correctly with both files. Thank you for your help, @Geraldine_VdAuwera!

Geraldine Van der Auwera, PhD

• Member Posts: 15

Hey,

does anyone know where to obtain VariationRecalibration training resources for Arabidopsis Thaliana, or how to create them by yourself by using, for instance, the data from the http://1001genomes.org/? It should be possible to call reliable Variations from close to 1000 resequenced strains, no?

I found dbSNP for it, but since it is no longer used in the recalibration step ...

Cheers,

David

• Member Posts: 12

Hi,

I am calling variants in a mouse KO experiment where C57BL6NJ strain was used. As @PeterHaitch suggested in an earlier post, one could use the MGP indels and SNPs (ftp://ftp-mouse.sanger.ac.uk/current_snps/). @pdexheimer advised to "go either way on the other three "wild-derived" strains, and I would still leave in the 13 lab strains"

Is there any reason why to include indels/SNPs sites of all strain plus those will-derived instead of restricting sites just to the strain one is using?

Cheers,

Fernando

• Member, Dev Posts: 544 ✭✭✭✭

Hi Fernando -

I kind of view the question the other way around - "Is there any harm in leaving in sites from other strains?". Ultimately, what you're looking for is a catalog of reliably true variations. If there are variants in your catalog that aren't in your data, then there's no harm (at least for VQSR). And there's a chance that variants in other strains could exist in your strain. So I view this as a situation that might help and won't hurt.

For BQSR, an excess of "truth" sites can be harmful as it would cause sites to be wrongly discarded from the statistical model. If you're dealing with whole genome, I can't see the number of sites in the Keane dataset causing problems. Actually, even for exomes I don't think it would hurt - but you could always leave them out if you're concerned.

• Member Posts: 12

Thanks @pdexheimer‌! I totally agree that it won't hurt using all strains for VQSR. My concerned was/is related to local indel realignments and BQSR. The current no. of reported indels for all strains (retrieved from the MGP ftp) is around 10 million compared to 40 thousand specific to my strain. A similar case can be observed for SNVs, around 60 million compared to 20 thousand variants reported to CL57BL6NJ.

I am dealing with whole genome. Again, and sorry to insist, Do you think that this substantial differences in no. of sites could affect the analysis?

Cheers,

Fernando

• Member, Dev Posts: 544 ✭✭✭✭

Indel realignment, no. At worst it will slow you down, but it won't cause false indels to be inserted or anything like that.

For BQSR, this is simply a semi-educated guess - I really have no idea if I'm right or not. The correct answer would be to run it both ways and compare the results of BQSR. But with that caveat, I would say that since you're using BL6 mice and therefore expect a very small amount of variation, including all of the cross-strain info would probably do more harm than good. If you were looking at, say, A/J mice and expected a more reasonable number of variants, it might not matter so much.

But again, this is just a guess. Try it out!

• Member Posts: 12

Thanks again @pdexheimer‌! I am semi-educated in this topic as well and first time calling variants in mouse whole genome samples (always worked using human samples before). As you said, I will need to try it out.

Cheers,

Fernando

• BerlinMember Posts: 39

Hi people,

I have a set of SNPs that I used as input for Reclibration.
Now it seems, that the recalibration lowers the BaseQuality too much, so SNPs that seem to be real SNPs are not being called.
I find the SNPs in many individuals, but in some the region doesn't get triggered as an active region.
One example (one site and two individuals, recalibrated data):
The BaseQualities in the individual where it doesn't trigger are 12-15 (133 reads)
in the individual where the SNP gets called there are a lot of Qualities up to 19 (74 reads)
In the unrecalibrated I can find BQs up to 41.
In both individuals there is not a single read that doesn't have the SNP.
The mapping qualities of the reads are all 60.

My SNP set is very small (69011 SNPs). It is microarray data, aligned to the reference.
The area of the genome in question is not covered by the SNPs.

I am wondering now, if it is a bad idea to use this set for recalibration and
what I should do in this case.
Maybe creating my own set by first not using recalibration, like suggested in best practices?
Or going the other way and using my very restrictive set I have now and use this
for recalibration and see how new SNPs evolve? Maybe in multiple cycles?

Thanks for your input on this!

@AlexanderV
Hi,

The BQSR model treats all mismatches to the reference as errors. The sites in the input knownSites file are masked because they are "known" variation, and we do not want to penalize them as errors. Because your knownSites file does not cover some regions in the genome, those regions will be penalized unfairly. That is why the quality scores are lower in your region of interest.

In your case, I think it is a good idea to proceed with our suggestions in the Best Practices and try variant calling and recalibration until convergence. However, because you do have a small set of known variants, you will at least have a starting place for what filters to apply to your callset during hard filtering. You can plot the annotations of the known variants and use the most confident annotation thresholds as a standard.

I hope this helps.

-Sheila

• AustraliaMember Posts: 3

I got tomato SNP database between MicroTom and Heinz 1706, from Japanese TOMATOMICS databases

Cheers,

@lovehardware
Thanks for sharing

• Pomona, CAMember Posts: 1

Hi,

I am working with a wild & domesticated flower plant genome. I would like to generate SNPs but there is no reference for these species. Can I use a close related species such as Arabidopsis for the dbSNP for recalibration? Please let me know.

Thank you

@ortizem
Hi,

Have a look at this article under "I'm working on a genome that doesn't really have a good SNP database yet. "

-Sheila

• SpainMember Posts: 16
edited July 2016

Hi,

I've been trying to do exactely that, using highly confindent SNPs from uncalibrated data to recalibrate it in the next round, always with the same outcome: I end by having no variants at all in the subsequent gvcfs. I wonder if I'm using too restrictive thresholds to select the SNPs from the first round, has anyone the same problem? Thanks a lot!

Guillermo

edited July 2016

@Guillefriis
Hi Guillermo,

What is the criteria you are using for choosing high confidence SNPs?

Have a look at our basic hard filtering recommendations and ways to improve them here.

-Sheila

• HoustonMember Posts: 1

Hi,

I am trying to run indel realignment and base recalibration for some public dog whole exome samples. I am using the vcf files from Ensembl CanFam3.1 (Canis_familiaris.vcf and Canis_familiaris_structural_variations.vcf). However, the process of indel realignment keeps requiring more memory. I want to ask: if we set the FILTER field to PASS in the vcf files (as suggested by other user), do this can increase memory efficiency? Or there is no effect?

Thanks,
Emmanuel

• Sri LankaMember Posts: 24

Hi,

Anybody working with Leishmania genome? Please share if you know of any high quality SNPs or indels that can be used as known sites.

Thanks
Sumudu

• Member Posts: 13

For those non-model species that almost have no chance to obtain Known sites file, how can we still do local alignment for indels for their genomes?

For doing BaseRecalibration, at least we can use a bootstrap approach to do recalibration for several rounds to reach convergence. Can we also apply such bootstrap approach for local alignment (i.e, RealignerTargetCreator and IndelRealigner)? If so, how?

I remembered that I used a old version of GATK to do local alignment without providing a known-sites file. Could it still work in that way for the new version?

Thanks,

Chih-Ming

edited September 2016

@ymw
Hi Chih-Ming,

You can either use the bootstrapped VCF for both BQSR and Indel Realignment, or you can simply run Indel Realignment without the known sites file. RealignerTargetCreator will use the reads in your BAM file to determine regions that should be realigned.

Also, we no longer recommend Indel Realignment if you are using HaplotypeCaller. Have a look at this post for more information.

-Sheila

• Member Posts: 15

VQSR in arabidopsis thaliana:

http://www.ncbi.nlm.nih.gov/pubmed/25681258