What are the standard resources for non-human genomes?

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin
edited January 2013 in Ask the GATK team

We're trying to put together some recommendations for folks who want to use GATK tools on non-human genomes. But we really don't have much experience with non-human genomes, so we're hoping that those of you in the GATK community who do will chime in and help your fellow scientists find the answers for a few common problems.

The most common problem seems to be finding sets of known sites for organisms like Drosophila, dogs, and various plants. If you know of such resources, please share your knowledge by commenting in this thread. You could earn upvotes and warm fuzzy feelings!

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Comments

  • gsonggsong Posts: 3Member

    I'm trying to run GATK tools for Yeast. I found that SNPs data for Yeast in http://gbrowse.princeton.edu/cgi-bin/gbrowse/yeast_strains_snps/, but the actual sequence of the predicted SNPs in the other yeast strains is not known; only the location of the SNP is predicted. Does GATK require the actual sequence of the predicted SNPs? Is it fine with the only information of the SNP location?

    Giltae Song, Ph.D. Postdoctoral researcher Department of Genetics School of Medicine Stanford

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Hi Giltae, For most usages (like indel realignment, base and variant recalibration) the GATK tools only use the locus information, not the alternate alleles. As long as the SNPs are in a valid VCF file, it should be okay.

    Geraldine Van der Auwera, PhD

  • FrancoisGuillaumeFrancoisGuillaume Jouy en josasPosts: 4Member

    Hi, I am working on cattle sequence, and to obtain known sites, I used a gvf file found on ensemble ftp site : ftp://ftp.ensembl.org/pub/release-68/variation/gvf/

    The size of the gvf files vary between species but it's at least a starting point. I had to process a little bit the file (with two or three shell/awk command).

    1. I discarded SNV with two possible ALT allele (e.g. ALT=A,G)
    2. For insertion, I had to search in fasta file the ref allele to add it ahead of the insertion (e.g. convert REF=., ALT=ACCC to REF=C,ALT=CACCC)
    3. I discard SNP mapped on several chromosome (same rsid, several position)
    4. QUAL and INFO fields were all set to "."
    5. FILTER field was set to PASS

    All this transformations were done in order to be read by IGV (maybe GATK is more lenient), sweeping any complex cases. Now my questions, is there any informations that i should not have discarded ? Should the known sites be as numerous as possible or should they be as reliable as possible (and what criteria could help to assess one SNP quality...when no quality measure is available).

    Furthermore, I went on Illumina and Affimetrix website to download the map file associated to their highest density SNP chip, and based on these, create two additionnal vcf.

    I put all the QUALITY field to ".", but would it be worthwhile to give an arbitrary value (possibly higher than the one of the GVF file SNV)

    I hope this can help.

  • ebanksebanks Posts: 683GATK Developer mod

    These are good questions. The last one is the easiest: as far as being a resource for known variation, we do not look at the QUAL field so setting that value to "." is okay. As for your first question, it really depends on what you are trying to do. In general, it's more important for your truth set to be as accurate as possible - even if it means missing some real sites because of it. The only thing I would consider doing differently is keeping the multi-allelic SNP sites (since they aren't inherently errorful provided they show up at the right frequency).

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • dkcrossmdkcrossm Posts: 3Member

    Hello! I have sequenced the whole genome of 19 C. elegans and would like to know where I can obtain a variant file for this species. Does one exist?

    I have tried running the BaseRecalibrator in GATK ver2 (already have run bwa->MarkDuplicates->Local realignment around indels) without the --knownSites option (which specifies where the variant file is located) but it comes back with an error and won't create the .grp file that is needed for the PrintReads -BQSR option. The error message is: "Invalid command line: This calculation is critically dependent on being able to skip over known variant sites. Please provide a VCF file containing known sites of genetic variation."

    Any suggestions? Is there a way to turn off the --knownSites option if a variant file for C. elegans doesn't exist? Or is there another way? I'm open to any suggestions. Thanks!

  • ebanksebanks Posts: 683GATK Developer mod

    There is a section of the documentation that offers advice for those users processing organisms without a database of known variation. I'd recommend reading it.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • PeteHaitchPeteHaitch Posts: 19Member

    I'm currently dealing with this issue in mouse. I'm using mm9/MGSCv37 because there seem to be more resources available for this reference than the latest mm10 build. There are a few databases of variation that I am trying to choose between:

    1. dbSNP128.txt.gz from UCSC (http://hgdownload.cse.ucsc.edu/goldenPath/mm9/database/snp128.txt.gz). It seems I should be able to convert this to VCF using either GATK or SAMtools but I can't figure out how to do it. Instead I convert it using my own less-than-ideal script (https://github.com/PeteHaitch/dbSNP2VCF). This script can only convert simple SNPs to VCF; multi-allelic sites and indels are ignored.
    2. The VCF created by Keane, T. M. et al. Mouse genomic variation and its effect on phenotypes and gene regulation. Nature 477, 289–294 (2011) that is available from ftp://ftp-mouse.sanger.ac.uk/current_snps/. This includes variants from 17 different strains of mouse/ I'm not sure whether to restrict this file to sites that are SNPs in my strain or in any of the 17 sequenced strains. Advice on this would be most welcome. There is also an indel VCF from this paper, available at ftp://ftp-mouse.sanger.ac.uk/current_indels/.
    3. The Ensembl variants. For mm10 these are "remapped from NCBIM37" but I don't know how this is done. For mm9 these variants are from dbSNP but I'm not sure which version of dbSNP or how to get a VCF of these (or download something I can convert to VCF).

    I'd welcome input from anyone who is using GATK for analysis mouse data or general advice on using databases of variation in mice, such as how to use strain-specific variants.

  • pdexheimerpdexheimer Posts: 360Member, GSA Collaborator ✭✭✭
    edited August 2012

    @PeteHaitch: That Keane paper looks really useful, especially since the UCSC dbSNP track is now so old. I will probably leave the sites from all 17 strains in. I can only think of three places in the pipeline that variant info is used:

    1. BQSR - Having "too many" variants might cause a slight overestimate in quality since real mismatches may be discarded, but I can't imagine the effect being severe
    2. Indel Realignment - Runtime would be higher, but results shouldn't be changed
    3. VQSR - Again, results shouldn't change because only sites that are called variant in the cohort are examined

    For converting the UCSC file, there is indeed a very useful walker/Tribble codec in the GATK for this very purpose. You'll want to run something like java -jar GenomeAnalysisTK.jar -R mm9.fa -T VariantsToVCF --variant:OLDDBSNP dbSNP128.txt -o dbsnp128.vcf It looks like the OLDDBSNP codec isn't documented anymore, but I'm pretty sure it's still there

    Post edited by pdexheimer on
  • pdexheimerpdexheimer Posts: 360Member, GSA Collaborator ✭✭✭

    Actually, now that I've read the paper a little more thoroughly, the Mus spretus sites should definitely go. I could see going either way on the other three "wild-derived" strains, and I would still leave in the 13 lab strains

  • PeteHaitchPeteHaitch Posts: 19Member
    edited August 2012

    @pdexheimer Thanks! That's just the sort of advice I was looking for regarding which strains of the Keane paper to retain. It's certainly looking the easiest option currently. A couple of minor things that you may need to be adjust before this can be used with GATK:

    1. Check chrom field matches your reference. For example, mm9 uses the "chr1, chr2, ..."-style of chromosome names whereas the Keane VCFs use the "1, 2, ..."-style. I think you'll need to alter the VCF so these match in order to use these VCFs with GATK.
    2. Conversion of the VCF to v4 or v4.1 (it is v3.3). java -jar GenomeAnalysisTK.jar -R mm9.fa -T VariantsToVCF --variant:VCF3 Keane.vcf -o up_versioned_Keane.vcf should do the trick but I haven't tested this.

    I also looked at VariantsToVCF yesterday. Can anyone from the GATK team tell us whether the OLDDBSNP codec is deprecated or is it just a case of the documentation going missing?

    I found a GVF file of the Ensembl68 variants (mapped to GRCm38) on the Ensembl website (http://asia.ensembl.org/info/data/ftp/index.html). I need to find a way to convert this to VCF - this tool (http://code.google.com/p/gvf2vcf/) claims to do so but I haven't tested it.

    Post edited by PeteHaitch on
  • Mark_DePristoMark_DePristo Posts: 153Administrator, GATK Developer admin

    I believe oldDBSNP is still there. I see that the codec docs are missing. I'll ask for that to be fixed.

    -- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard

  • SophiaSophia Posts: 35Member

    We have used the Apple (Malus x domesticus) genome assembly from here:

    http://www.rosaceae.org/node/475

    They also have some other resources like gene predictions and functional annotations.

  • kmdailykmdaily Posts: 17Member
    edited February 2013

    What about for custom made libraries based on the human genome? Agilent allows extra baits to be added, for which we designed for a virus. I get errors at the step of RealignerTargetCreator which may be due to this.

    Post edited by kmdaily on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    @kmdaily, assuming this means you have reads that align to the genome of the virus you are looking for, you need to add the virus contigs to the human genome reference. Once you've done that and formatted it properly, the rest of your analyses should proceed as normal. Unless you're not aligning to the human reference at all?

    Geraldine Van der Auwera, PhD

  • kmdailykmdaily Posts: 17Member

    Thanks for the fast response @Geraldine_VdAuwera! I added the virus contig to the reference and built all necessary indexes with this, performed alignment, etc. I get a "Input files known2 and reference have incompatible contigs: Order of contigs differences, which is unsafe." error when using the 1000G phase1 indels for hg19; maybe I incorrectly assumed that it was due to the virus being there. I'm using that vcf in conjunction with the Mills and 1000G gold standard file.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Sounds like the order of the contigs in your custom reference is different from the order of the canonical reference and the known sites vcf. The contigs have to be in the same order as explained here: http://www.broadinstitute.org/gatk/guide/article?id=1204

    Geraldine Van der Auwera, PhD

  • kmdailykmdaily Posts: 17Member

    I did read that article, thanks. What was strange is that it worked fine with the Mills and 1000G gold standard fine alone, and only fails when using the 1000G phase1 indels file (either alone or along with the Mills), which is why I didn't think there was a problem with the ordering. I will reorder the BAM file and re-try. Thanks again!

  • kmdailykmdaily Posts: 17Member

    I was using an old version of the bundle provided by our IT team (hg19 1.5). I downloaded the files directly from the GATK ftp site for 2.3, and after re-ordering the bam file it is working correctly with both files. Thank you for your help, @Geraldine_VdAuwera!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,423Administrator, GATK Developer admin

    Glad to hear your problem is solved!

    Geraldine Van der Auwera, PhD

  • DavidRiesDavidRies Posts: 11Member

    Hey,

    does anyone know where to obtain VariationRecalibration training resources for Arabidopsis Thaliana, or how to create them by yourself by using, for instance, the data from the http://1001genomes.org/? It should be possible to call reliable Variations from close to 1000 resequenced strains, no?

    I found dbSNP for it, but since it is no longer used in the recalibration step ...

    Cheers,

    David

  • fjrossellofjrossello Posts: 12Member

    Hi,

    I am calling variants in a mouse KO experiment where C57BL6NJ strain was used. As @PeterHaitch suggested in an earlier post, one could use the MGP indels and SNPs (ftp://ftp-mouse.sanger.ac.uk/current_snps/). @pdexheimer advised to "go either way on the other three "wild-derived" strains, and I would still leave in the 13 lab strains"

    Is there any reason why to include indels/SNPs sites of all strain plus those will-derived instead of restricting sites just to the strain one is using?

    Thanks in advance.

    Cheers,

    Fernando

  • pdexheimerpdexheimer Posts: 360Member, GSA Collaborator ✭✭✭

    Hi Fernando -

    I kind of view the question the other way around - "Is there any harm in leaving in sites from other strains?". Ultimately, what you're looking for is a catalog of reliably true variations. If there are variants in your catalog that aren't in your data, then there's no harm (at least for VQSR). And there's a chance that variants in other strains could exist in your strain. So I view this as a situation that might help and won't hurt.

    For BQSR, an excess of "truth" sites can be harmful as it would cause sites to be wrongly discarded from the statistical model. If you're dealing with whole genome, I can't see the number of sites in the Keane dataset causing problems. Actually, even for exomes I don't think it would hurt - but you could always leave them out if you're concerned.

  • fjrossellofjrossello Posts: 12Member

    Thanks @pdexheimer‌! I totally agree that it won't hurt using all strains for VQSR. My concerned was/is related to local indel realignments and BQSR. The current no. of reported indels for all strains (retrieved from the MGP ftp) is around 10 million compared to 40 thousand specific to my strain. A similar case can be observed for SNVs, around 60 million compared to 20 thousand variants reported to CL57BL6NJ.

    I am dealing with whole genome. Again, and sorry to insist, Do you think that this substantial differences in no. of sites could affect the analysis?

    Thanks in advance.

    Cheers,

    Fernando

  • pdexheimerpdexheimer Posts: 360Member, GSA Collaborator ✭✭✭

    Indel realignment, no. At worst it will slow you down, but it won't cause false indels to be inserted or anything like that.

    For BQSR, this is simply a semi-educated guess - I really have no idea if I'm right or not. The correct answer would be to run it both ways and compare the results of BQSR. But with that caveat, I would say that since you're using BL6 mice and therefore expect a very small amount of variation, including all of the cross-strain info would probably do more harm than good. If you were looking at, say, A/J mice and expected a more reasonable number of variants, it might not matter so much.

    But again, this is just a guess. Try it out!

  • fjrossellofjrossello Posts: 12Member

    Thanks again @pdexheimer‌! I am semi-educated in this topic as well and first time calling variants in mouse whole genome samples (always worked using human samples before). As you said, I will need to try it out.

    Cheers,

    Fernando

Sign In or Register to comment.