What are the standard resources for non-human genomes?

Geraldine_VdAuwera | Posts: 2,492 | Administrator, GSA Official Member
edited January 7 in Ask the team

We're trying to put together some recommendations for folks who want to use GATK tools on non-human genomes. But we really don't have much experience with non-human genomes, so we're hoping that those of you in the GATK community who do will chime in and help your fellow scientists find the answers for a few common problems.

The most common problem seems to be finding sets of known sites for organisms like Drosophila, dogs, and various plants. If you know of such resources, please share your knowledge by commenting in this thread. You could earn upvotes and warm fuzzy feelings!

Geraldine Van der Auwera, PhD

Comments

  • gsong | Posts: 3 | Member

    I'm trying to run GATK tools on yeast. I found SNP data for yeast at http://gbrowse.princeton.edu/cgi-bin/gbrowse/yeast_strains_snps/, but the actual sequence of the predicted SNPs in the other yeast strains is not known; only the location of each SNP is predicted. Does GATK require the actual sequence of the predicted SNPs, or is the SNP location alone enough?

    Giltae Song, Ph.D., Postdoctoral researcher, Department of Genetics, School of Medicine, Stanford

  • Geraldine_VdAuwera | Posts: 2,492 | Administrator, GSA Official Member

    Hi Giltae, For most usages (like indel realignment, base and variant recalibration) the GATK tools only use the locus information, not the alternate alleles. As long as the SNPs are in a valid VCF file, it should be okay.

    Geraldine Van der Auwera, PhD

  • FrancoisGuillaume | Posts: 3 | Member

    Hi, I am working on cattle sequence, and to obtain known sites, I used a GVF file found on the Ensembl FTP site: ftp://ftp.ensembl.org/pub/release-68/variation/gvf/

    The size of the GVF files varies between species, but it's at least a starting point. I had to process the file a little (with two or three shell/awk commands).

    1. I discarded SNVs with two possible ALT alleles (e.g. ALT=A,G)
    2. For insertions, I had to look up the ref allele in the FASTA file and add it ahead of the insertion (e.g. convert REF=., ALT=ACCC to REF=C, ALT=CACCC)
    3. I discarded SNPs mapped to several chromosomes (same rsID, several positions)
    4. QUAL and INFO fields were all set to "."
    5. FILTER field was set to PASS
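
    The five steps above could be sketched in Python roughly as follows. This is only an illustration, not the actual shell/awk commands used: the function name clean_known_sites, the (chrom, pos, rsid, ref, alt) record layout, and the in-memory reference dict are all assumptions, as is the convention that an insertion's position is the 1-based position of the reference base just before the inserted sequence. To keep the example short it drops every multi-allelic record, not only SNVs.

```python
def clean_known_sites(records, reference):
    """Turn simple variant records into VCF data lines.

    records:   iterable of (chrom, pos, rsid, ref, alt), pos is 1-based (assumed layout)
    reference: dict mapping chrom -> sequence string (assumed, for illustration)
    """
    records = list(records)

    # Step 3 preparation: collect every position seen for each rsID.
    positions = {}
    for chrom, pos, rsid, ref, alt in records:
        positions.setdefault(rsid, set()).add((chrom, pos))

    lines = []
    for chrom, pos, rsid, ref, alt in records:
        # Step 3: drop rsIDs that map to more than one position.
        if len(positions[rsid]) > 1:
            continue
        # Step 1: drop records with several ALT alleles (e.g. ALT=A,G).
        if "," in alt:
            continue
        # Step 2: insertions given as REF="." get the reference base prepended.
        if ref == ".":
            anchor = reference[chrom][pos - 1]
            ref, alt = anchor, anchor + alt
        # Steps 4 and 5: QUAL and INFO are ".", FILTER is PASS.
        lines.append("\t".join([chrom, str(pos), rsid, ref, alt, ".", "PASS", "."]))
    return lines
```

    With reference = {"1": "GGCTT"}, the record ("1", 3, "rs1", ".", "ACCC") becomes REF=C, ALT=CACCC, matching the example in step 2. The output is neither sorted nor given a header, so both would still need handling before the file is usable.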

    All these transformations were done so that the file could be read by IGV (maybe GATK is more lenient), sweeping aside any complex cases. Now my questions: is there any information that I should not have discarded? Should the known sites be as numerous as possible or as reliable as possible (and what criteria could help to assess a SNP's quality when no quality measure is available)?

    Furthermore, I went to the Illumina and Affymetrix websites to download the map files associated with their highest-density SNP chips and, based on these, created two additional VCFs.

    I set all the QUAL fields to ".", but would it be worthwhile to give them an arbitrary value (possibly higher than that of the GVF file SNVs)?

    I hope this can help.

  • ebanks | Posts: 497 | GSA Official Member | mod

    These are good questions. The last one is the easiest: as far as being a resource for known variation, we do not look at the QUAL field so setting that value to "." is okay. As for your first question, it really depends on what you are trying to do. In general, it's more important for your truth set to be as accurate as possible - even if it means missing some real sites because of it. The only thing I would consider doing differently is keeping the multi-allelic SNP sites (since they aren't inherently errorful provided they show up at the right frequency).

    Eric Banks, PhD -- Group Leader, Methods Development, MPG, Broad Institute of Harvard and MIT

  • dkcrossm | Posts: 1 | Member

    Hello! I have sequenced the whole genome of 19 C. elegans and would like to know where I can obtain a variant file for this species. Does one exist?

    I have tried running the BaseRecalibrator in GATK ver2 (already have run bwa->MarkDuplicates->Local realignment around indels) without the --knownSites option (which specifies where the variant file is located) but it comes back with an error and won't create the .grp file that is needed for the PrintReads -BQSR option. The error message is: "Invalid command line: This calculation is critically dependent on being able to skip over known variant sites. Please provide a VCF file containing known sites of genetic variation."

    Any suggestions? Is there a way to turn off the --knownSites option if a variant file for C. elegans doesn't exist? Or is there another way? I'm open to any suggestions. Thanks!

  • ebanks | Posts: 497 | GSA Official Member | mod

    There is a section of the documentation that offers advice for those users processing organisms without a database of known variation. I'd recommend reading it.

    Eric Banks, PhD -- Group Leader, Methods Development, MPG, Broad Institute of Harvard and MIT

  • PeteHaitch | Posts: 19 | Member

    I'm currently dealing with this issue in mouse. I'm using mm9/MGSCv37 because there seem to be more resources available for this reference than the latest mm10 build. There are a few databases of variation that I am trying to choose between:

    1. dbSNP128.txt.gz from UCSC (http://hgdownload.cse.ucsc.edu/goldenPath/mm9/database/snp128.txt.gz). It seems I should be able to convert this to VCF using either GATK or SAMtools but I can't figure out how to do it. Instead I converted it using my own less-than-ideal script (https://github.com/PeteHaitch/dbSNP2VCF). This script can only convert simple SNPs to VCF; multi-allelic sites and indels are ignored.
    2. The VCF created by Keane, T. M. et al. Mouse genomic variation and its effect on phenotypes and gene regulation. Nature 477, 289–294 (2011) that is available from ftp://ftp-mouse.sanger.ac.uk/current_snps/. This includes variants from 17 different strains of mouse. I'm not sure whether to restrict this file to sites that are SNPs in my strain or in any of the 17 sequenced strains. Advice on this would be most welcome. There is also an indel VCF from this paper, available at ftp://ftp-mouse.sanger.ac.uk/current_indels/.
    3. The Ensembl variants. For mm10 these are "remapped from NCBIM37" but I don't know how this is done. For mm9 these variants are from dbSNP but I'm not sure which version of dbSNP or how to get a VCF of these (or download something I can convert to VCF).

    I'd welcome input from anyone who is using GATK for analysis of mouse data, or general advice on using databases of variation in mice, such as how to use strain-specific variants.

  • pdexheimer | Posts: 96 | Member ✭✭✭
    edited August 2012

    @PeteHaitch: That Keane paper looks really useful, especially since the UCSC dbSNP track is now so old. I will probably leave the sites from all 17 strains in. I can only think of three places in the pipeline that variant info is used:

    1. BQSR - Having "too many" variants might cause a slight overestimate in quality since real mismatches may be discarded, but I can't imagine the effect being severe
    2. Indel Realignment - Runtime would be higher, but results shouldn't be changed
    3. VQSR - Again, results shouldn't change because only sites that are called variant in the cohort are examined

    For converting the UCSC file, there is indeed a very useful walker/Tribble codec in the GATK for this very purpose. You'll want to run something like

        java -jar GenomeAnalysisTK.jar -R mm9.fa -T VariantsToVCF --variant:OLDDBSNP dbSNP128.txt -o dbsnp128.vcf

    It looks like the OLDDBSNP codec isn't documented anymore, but I'm pretty sure it's still there.

  • pdexheimer | Posts: 96 | Member ✭✭✭

    Actually, now that I've read the paper a little more thoroughly, the Mus spretus sites should definitely go. I could see going either way on the other three "wild-derived" strains, and I would still leave in the 13 lab strains

  • PeteHaitch | Posts: 19 | Member
    edited August 2012

    @pdexheimer Thanks! That's just the sort of advice I was looking for regarding which strains of the Keane paper to retain. It's certainly looking like the easiest option currently. A couple of minor things that you may need to adjust before this can be used with GATK:

    1. Check chrom field matches your reference. For example, mm9 uses the "chr1, chr2, ..."-style of chromosome names whereas the Keane VCFs use the "1, 2, ..."-style. I think you'll need to alter the VCF so these match in order to use these VCFs with GATK.
    2. Conversion of the VCF to v4 or v4.1 (it is v3.3). java -jar GenomeAnalysisTK.jar -R mm9.fa -T VariantsToVCF --variant:VCF3 Keane.vcf -o up_versioned_Keane.vcf should do the trick but I haven't tested this.
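
    For point 1, a minimal Python sketch of the renaming is below. The function name add_chr_prefix and the MT to chrM mapping are assumptions made for illustration (mm9 names its mitochondrial contig chrM); check the contig names in your own reference's .fai or .dict before relying on it. Renaming alone does not re-sort the records into the reference's contig order.

```python
import re

# Matches the ID at the start of a ##contig header line, if the file has any.
CONTIG_HEADER = re.compile(r"^(##contig=<ID=)([^,>]+)")


def add_chr_prefix(vcf_lines, special=None):
    """Yield VCF lines with '1, 2, ...' contig names rewritten as 'chr1, chr2, ...'."""
    # Assumption: the only irregular name is the mitochondrial contig.
    special = {"MT": "chrM"} if special is None else special

    def rename(name):
        if name in special:
            return special[name]
        return name if name.startswith("chr") else "chr" + name

    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##contig=<ID="):
            # Keep ##contig header lines consistent with the data lines.
            yield CONTIG_HEADER.sub(lambda m: m.group(1) + rename(m.group(2)), line)
        elif not line or line.startswith("#"):
            yield line  # other header lines pass through unchanged
        else:
            chrom, sep, rest = line.partition("\t")
            yield rename(chrom) + sep + rest
```

    Names that already start with "chr" are left alone, so running it twice on the same file is harmless.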

    I also looked at VariantsToVCF yesterday. Can anyone from the GATK team tell us whether the OLDDBSNP codec is deprecated or is it just a case of the documentation going missing?

    I found a GVF file of the Ensembl68 variants (mapped to GRCm38) on the Ensembl website (http://asia.ensembl.org/info/data/ftp/index.html). I need to find a way to convert this to VCF - this tool (http://code.google.com/p/gvf2vcf/) claims to do so but I haven't tested it.

  • Mark_DePristo | Posts: 140 | Administrator, GSA Official Member

    I believe oldDBSNP is still there. I see that the codec docs are missing. I'll ask for that to be fixed.

    -- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard

  • Sophia | Posts: 30 | Member

    We have used the apple (Malus x domestica) genome assembly from here:

    http://www.rosaceae.org/node/475

    They also have some other resources like gene predictions and functional annotations.

  • kmdaily | Posts: 16 | Member
    edited February 4

    What about custom-made libraries based on the human genome? Agilent allows extra baits to be added, and we designed some for a virus. I get errors at the RealignerTargetCreator step, which may be due to this.

  • Geraldine_VdAuwera | Posts: 2,492 | Administrator, GSA Official Member

    @kmdaily, assuming this means you have reads that align to the genome of the virus you are looking for, you need to add the virus contigs to the human genome reference. Once you've done that and formatted it properly, the rest of your analyses should proceed as normal. Unless you're not aligning to the human reference at all?
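
    Adding the virus contigs can be as simple as appending them to the end of the human FASTA, which leaves the order of the human contigs untouched. A minimal Python sketch, with hypothetical function names (contig_names, append_contigs) and FASTA content held as lists of lines for brevity:

```python
def contig_names(fasta_lines):
    """Return contig names (first word after '>') in file order."""
    return [line[1:].split()[0] for line in fasta_lines if line.startswith(">")]


def append_contigs(reference_lines, extra_lines):
    """Return the reference FASTA lines followed by the extra contigs.

    Raises ValueError if an extra contig reuses a name already in the reference.
    """
    reference_lines = [line.rstrip("\n") for line in reference_lines]
    extra_lines = [line.rstrip("\n") for line in extra_lines]
    clash = set(contig_names(reference_lines)) & set(contig_names(extra_lines))
    if clash:
        raise ValueError("duplicate contig names: " + ", ".join(sorted(clash)))
    # Appending at the end keeps the original contigs in their original order.
    return reference_lines + extra_lines
```

    The combined FASTA then needs a fresh index and sequence dictionary (for example with samtools faidx and Picard CreateSequenceDictionary) before it is used for alignment and with GATK.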

    Geraldine Van der Auwera, PhD

  • kmdaily | Posts: 16 | Member

    Thanks for the fast response @Geraldine_VdAuwera! I added the virus contig to the reference and built all necessary indexes with this, performed alignment, etc. I get an "Input files known2 and reference have incompatible contigs: Order of contigs differences, which is unsafe." error when using the 1000G phase1 indels for hg19; maybe I incorrectly assumed that it was due to the virus being there. I'm using that VCF in conjunction with the Mills and 1000G gold standard file.

  • Geraldine_VdAuwera | Posts: 2,492 | Administrator, GSA Official Member

    Sounds like the order of the contigs in your custom reference is different from the order of the canonical reference and the known sites vcf. The contigs have to be in the same order as explained here: http://www.broadinstitute.org/gatk/guide/article?id=1204
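
    A quick way to spot this kind of mismatch before running GATK is to compare the contig order yourself. The sketch below illustrates the idea only and is not GATK's own validation code; the function names are hypothetical, and it assumes the reference order is taken from the first column of the .fai index and the VCF order from the first appearance of each contig in the data lines.

```python
def fai_contigs(fai_lines):
    """Contig names from a FASTA index (.fai): the first tab-separated column."""
    return [line.split("\t")[0].strip() for line in fai_lines if line.strip()]


def vcf_contigs(vcf_lines):
    """Contig names from VCF data lines, in order of first appearance."""
    seen = []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and header lines
        chrom = line.split("\t", 1)[0]
        if chrom not in seen:
            seen.append(chrom)
    return seen


def contig_order_problems(reference_contigs, other_contigs):
    """List contigs that are missing from, or out of order relative to, the reference."""
    rank = {name: i for i, name in enumerate(reference_contigs)}
    problems = []
    highest = -1  # highest reference rank seen so far
    for name in other_contigs:
        if name not in rank:
            problems.append(name + ": not in reference")
        elif rank[name] < highest:
            problems.append(name + ": out of reference order")
        else:
            highest = rank[name]
    return problems
```

    An empty list means the file's contigs follow the reference order; extra contigs that exist only in the reference, such as an added virus contig, are not reported.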

    Geraldine Van der Auwera, PhD

  • kmdaily | Posts: 16 | Member

    I did read that article, thanks. What was strange is that it worked fine with the Mills and 1000G gold standard file alone, and only fails when using the 1000G phase1 indels file (either alone or along with the Mills), which is why I didn't think there was a problem with the ordering. I will reorder the BAM file and re-try. Thanks again!

  • kmdaily | Posts: 16 | Member

    I was using an old version of the bundle provided by our IT team (hg19 1.5). I downloaded the files directly from the GATK ftp site for 2.3, and after re-ordering the bam file it is working correctly with both files. Thank you for your help, @Geraldine_VdAuwera!

  • Geraldine_VdAuwera | Posts: 2,492 | Administrator, GSA Official Member

    Glad to hear your problem is solved!

    Geraldine Van der Auwera, PhD
