Geraldine_VdAuwera
We're trying to put together some recommendations for folks who want to use GATK tools on non-human genomes. But we really don't have much experience with non-human genomes, so we're hoping that those of you in the GATK community who do will chime in and help your fellow scientists find the answers for a few common problems.
The most common problem seems to be finding sets of known sites for organisms like Drosophila, dogs, and various plants. If you know of such resources, please share your knowledge by commenting in this thread. You could earn upvotes and warm fuzzy feelings!
Geraldine Van der Auwera, PhD
Comments
I'm trying to run GATK tools on yeast. I found SNP data for yeast at http://gbrowse.princeton.edu/cgi-bin/gbrowse/yeast_strains_snps/, but the actual sequence of the predicted SNPs in the other yeast strains is not known; only the location of each SNP is predicted. Does GATK require the actual sequence of the predicted SNPs, or is the SNP location alone sufficient?
Giltae Song, Ph.D. Postdoctoral researcher Department of Genetics School of Medicine Stanford
Hi Giltae, For most usages (like indel realignment, base and variant recalibration) the GATK tools only use the locus information, not the alternate alleles. As long as the SNPs are in a valid VCF file, it should be okay.
Geraldine Van der Auwera, PhD
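Since the reply above says only the locus matters as long as the file is a valid VCF, a position-only list could be turned into a sites-only VCF along these lines. This is a minimal sketch, not a GATK utility: the function name `write_sites_only_vcf`, the placeholder `REF` base `N`, and the `.` in `ALT` are assumptions made for illustration, and whether a particular GATK version accepts placeholder alleles should be checked before relying on it.

```python
import io


def write_sites_only_vcf(sites, contig_order, out, ref_base="N"):
    """Write (chrom, pos) pairs as a minimal sites-only VCF.

    sites        -- iterable of (contig name, 1-based position) tuples
    contig_order -- contig names in the same order as the reference FASTA
    ref_base     -- placeholder REF allele (assumption: true allele unknown)

    Raises KeyError if a site names a contig missing from contig_order.
    """
    rank = {name: i for i, name in enumerate(contig_order)}
    out.write("##fileformat=VCFv4.1\n")
    out.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n")
    # Drop duplicates, then sort by reference contig order and position.
    for chrom, pos in sorted(set(sites), key=lambda s: (rank[s[0]], s[1])):
        out.write(f"{chrom}\t{pos}\t.\t{ref_base}\t.\t.\t.\t.\n")


if __name__ == "__main__":
    buf = io.StringIO()
    write_sites_only_vcf([("chrII", 50), ("chrI", 100)], ["chrI", "chrII"], buf)
    print(buf.getvalue(), end="")
```

Records are sorted by reference contig order and then by position because the tools expect known-sites files to follow the reference ordering.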
Hi, I am working on cattle sequence data. To obtain known sites, I used a GVF file from the Ensembl FTP site: ftp://ftp.ensembl.org/pub/release-68/variation/gvf/
The size of the GVF files varies between species, but it's at least a starting point. I had to process the file a little (with two or three shell/awk commands).
All these transformations were done so that the file could be read by IGV (maybe GATK is more lenient), dropping any complex cases. Now my questions: is there any information that I should not have discarded? Should the known sites be as numerous as possible or as reliable as possible (and what criteria could help assess the quality of a SNP when no quality measure is available)?
Furthermore, I went to the Illumina and Affymetrix websites to download the map files associated with their highest-density SNP chips and, based on these, created two additional VCF files.
I set all the QUAL fields to ".", but would it be worthwhile to give them an arbitrary value (possibly higher than that of the GVF file SNVs)?
I hope this can help.
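The poster's shell/awk commands are not shown, so the following is only a sketch of what a GVF-to-VCF conversion for simple SNVs might look like in Python. It assumes the nine-column GVF layout with `Reference_seq`, `Variant_seq` and `Dbxref` attributes as found in Ensembl's GVF dumps, and it assumes alleles are reported on the forward strand. The function name `gvf_snv_to_vcf_fields` is hypothetical. Multi-allelic SNVs are kept and everything else is skipped.

```python
def gvf_snv_to_vcf_fields(line):
    """Return the eight VCF columns for one GVF SNV line, or None to skip it."""
    if line.startswith("#") or not line.strip():
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9 or cols[2] != "SNV":
        return None  # not a well-formed single-nucleotide record
    # Column 9 holds semicolon-separated key=value attributes.
    attrs = dict(item.split("=", 1) for item in cols[8].split(";") if "=" in item)
    ref = attrs.get("Reference_seq", "")
    # Variant_seq may list the reference allele too; keep only true alternates.
    alts = [a for a in attrs.get("Variant_seq", "").split(",") if a != ref]
    bases = set("ACGT")
    if ref not in bases or not alts or any(a not in bases for a in alts):
        return None  # skip ambiguous or complex cases
    # Assumption: Dbxref looks like "dbSNP_137:rs123"; keep the part after ':'.
    rsid = attrs.get("Dbxref", ".").split(":")[-1]
    # GVF start (column 4) is 1-based, like VCF POS; QUAL/FILTER/INFO left as '.'.
    return [cols[0], cols[3], rsid, ref, ",".join(alts), ".", ".", "."]
```

Joining the returned list with tabs gives one VCF record; a `##fileformat` line and the `#CHROM` header line still need to be written first, and the output must be sorted in reference order.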
These are good questions. The last one is the easiest: as far as being a resource for known variation, we do not look at the QUAL field so setting that value to "." is okay. As for your first question, it really depends on what you are trying to do. In general, it's more important for your truth set to be as accurate as possible - even if it means missing some real sites because of it. The only thing I would consider doing differently is keeping the multi-allelic SNP sites (since they aren't inherently errorful provided they show up at the right frequency).
Eric Banks, PhD -- Group Leader, Methods Development, MPG, Broad Institute of Harvard and MIT
Hello! I have sequenced the whole genome of 19 C. elegans and would like to know where I can obtain a variant file for this species. Does one exist?
I have tried running the BaseRecalibrator in GATK ver2 (already have run bwa->MarkDuplicates->Local realignment around indels) without the --knownSites option (which specifies where the variant file is located) but it comes back with an error and won't create the .grp file that is needed for the PrintReads -BQSR option. The error message is: "Invalid command line: This calculation is critically dependent on being able to skip over known variant sites. Please provide a VCF file containing known sites of genetic variation."
Any suggestions? Is there a way to turn off the --knownSites option if a variant file for C. elegans doesn't exist? Or is there another way? I'm open to any suggestions. Thanks!
There is a section of the documentation that offers advice for those users processing organisms without a database of known variation. I'd recommend reading it.
Eric Banks, PhD -- Group Leader, Methods Development, MPG, Broad Institute of Harvard and MIT
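The documentation's advice for this situation is, as commonly described, a bootstrapping approach: call variants on the unrecalibrated data, keep only the highest-confidence calls, use those as the known sites for BaseRecalibrator, and repeat until the results stabilize. The filtering step might be sketched as below. The function name `high_confidence_sites` and the `min_qual=100.0` cutoff are illustrative assumptions, not values taken from the GATK documentation; an appropriate threshold depends on the data.

```python
def high_confidence_sites(vcf_lines, min_qual=100.0):
    """Yield header lines plus records whose QUAL is at least min_qual.

    Records with a missing QUAL ('.') are dropped, since their confidence
    cannot be assessed.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # keep all header lines unchanged
            continue
        qual = line.split("\t")[5]  # QUAL is the sixth VCF column
        if qual != "." and float(qual) >= min_qual:
            yield line
```

Applied to an open VCF file, for example `out.writelines(high_confidence_sites(open("raw_calls.vcf")))` with a hypothetical file name, this produces a stricter call set that can serve as the `--knownSites` input for the next round.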
I'm currently dealing with this issue in mouse. I'm using mm9/MGSCv37 because there seem to be more resources available for this reference than for the latest mm10 build. There are a few databases of variation that I am trying to choose between:
I'd welcome input from anyone who is using GATK to analyze mouse data, or general advice on using databases of variation in mice, such as how to use strain-specific variants.
@PeteHaitch: That Keane paper looks really useful, especially since the UCSC dbSNP track is now so old. I will probably leave the sites from all 17 strains in. I can only think of three places in the pipeline that variant info is used:
For converting the UCSC file, there is indeed a very useful walker/Tribble codec in the GATK for this very purpose. You'll want to run something like
java -jar GenomeAnalysisTK.jar -R mm9.fa -T VariantsToVCF --variant:OLDDBSNP dbSNP128.txt -o dbsnp128.vcf
It looks like the OLDDBSNP codec isn't documented anymore, but I'm pretty sure it's still there.
Actually, now that I've read the paper a little more thoroughly, the Mus spretus sites should definitely go. I could see going either way on the other three "wild-derived" strains, and I would still leave in the 13 lab strains.
@pdexheimer Thanks! That's just the sort of advice I was looking for regarding which strains from the Keane paper to retain. It certainly looks like the easiest option at the moment. A couple of minor things that you may need to adjust before this can be used with GATK:
java -jar GenomeAnalysisTK.jar -R mm9.fa -T VariantsToVCF --variant:VCF3 Keane.vcf -o up_versioned_Keane.vcf
should do the trick, but I haven't tested this.
I also looked at VariantsToVCF yesterday. Can anyone from the GATK team tell us whether the OLDDBSNP codec is deprecated, or is it just a case of the documentation going missing?
I found a GVF file of the Ensembl 68 variants (mapped to GRCm38) on the Ensembl website (http://asia.ensembl.org/info/data/ftp/index.html). I need to find a way to convert this to VCF. This tool (http://code.google.com/p/gvf2vcf/) claims to do so, but I haven't tested it.
I believe oldDBSNP is still there. I see that the codec docs are missing. I'll ask for that to be fixed.
-- Mark A. DePristo, Ph.D. Co-Director, Medical and Population Genetics Broad Institute of MIT and Harvard
These seem to be good resources for many reference genomes.
http://www.ensembl.org/info/data/ftp/index.html
ftp://ftp.ncbi.nih.gov/genomes/
We have used the apple (Malus x domestica) genome assembly from here:
http://www.rosaceae.org/node/475
They also have some other resources like gene predictions and functional annotations.
What about custom-made libraries based on the human genome? Agilent allows extra baits to be added, which we designed for a virus. I get errors at the RealignerTargetCreator step, which may be due to this.
@kmdaily, assuming this means you have reads that align to the genome of the virus you are looking for, you need to add the virus contigs to the human genome reference. Once you've done that and formatted it properly, the rest of your analyses should proceed as normal. Unless you're not aligning to the human reference at all?
Geraldine Van der Auwera, PhD
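Adding the virus contigs to the reference, as suggested above, could be done roughly as follows. This is a sketch under assumptions: the function name `append_contigs` is hypothetical, and the extra contigs are appended after the existing ones so that the original contig order is left untouched.

```python
import io


def append_contigs(base_fh, extra_fh, out_fh):
    """Copy base FASTA then extra FASTA to out_fh; return contig names in order.

    Raises ValueError if a contig name occurs twice, since duplicate names
    would make the combined reference ambiguous.
    """
    seen = []
    for fh in (base_fh, extra_fh):
        for line in fh:
            if not line.strip():
                continue  # drop blank lines between records
            if line.startswith(">"):
                # The contig name is the header text up to the first whitespace.
                name = line[1:].split()[0]
                if name in seen:
                    raise ValueError(f"duplicate contig name: {name}")
                seen.append(name)
            # Normalize line endings so the files join cleanly.
            out_fh.write(line.rstrip("\n") + "\n")
    return seen


if __name__ == "__main__":
    human = io.StringIO(">chr1\nACGT\n>chr2\nGGCC")
    virus = io.StringIO(">virus\nTTAA\n")
    combined = io.StringIO()
    print(append_contigs(human, virus, combined))
```

After writing the combined FASTA, the index and dictionary files have to be rebuilt for the new reference (for example with `samtools faidx` and Picard's `CreateSequenceDictionary`), and the reads realigned against it.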
Thanks for the fast response, @Geraldine_VdAuwera! I added the virus contig to the reference, built all the necessary indexes from it, performed alignment, etc. I get an "Input files known2 and reference have incompatible contigs: Order of contigs differences, which is unsafe." error when using the 1000G phase1 indels for hg19; maybe I incorrectly assumed that it was due to the virus being there. I'm using that VCF in conjunction with the Mills and 1000G gold standard file.
Sounds like the order of the contigs in your custom reference is different from the order of the canonical reference and the known sites vcf. The contigs have to be in the same order as explained here: http://www.broadinstitute.org/gatk/guide/article?id=1204
Geraldine Van der Auwera, PhD
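Before rerunning a tool, a contig-order mismatch of this kind can be checked with a short script. The sketch below compares the contig order in a reference `.fai` index with the order declared in a VCF's `##contig` header lines, falling back to the order in which contigs first appear in the records. All function names are hypothetical, and this only approximates GATK's own sequence-dictionary check.

```python
import re

CONTIG_RE = re.compile(r"^##contig=<ID=([^,>]+)")


def fai_contigs(fai_lines):
    """Contig names from a FASTA index (.fai): the first tab-separated column."""
    return [line.split("\t")[0] for line in fai_lines if line.strip()]


def vcf_contigs(vcf_lines):
    """Contig order from ##contig headers, else from first appearance in records."""
    header, records = [], []
    for line in vcf_lines:
        match = CONTIG_RE.match(line)
        if match:
            header.append(match.group(1))
        elif line.strip() and not line.startswith("#"):
            chrom = line.split("\t", 1)[0]
            if chrom not in records:
                records.append(chrom)
    return header or records


def order_problems(ref_order, vcf_order):
    """Describe contigs missing from the reference or out of reference order."""
    rank = {name: i for i, name in enumerate(ref_order)}
    problems = [f"{c}: not in reference" for c in vcf_order if c not in rank]
    shared = [c for c in vcf_order if c in rank]
    # Each adjacent pair of shared contigs must be in increasing reference rank.
    for prev, cur in zip(shared, shared[1:]):
        if rank[prev] > rank[cur]:
            problems.append(f"{cur} follows {prev} but precedes it in the reference")
    return problems
```

An empty list from `order_problems` means the VCF's contigs are a correctly ordered subset of the reference contigs; it does not check contig lengths.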
I did read that article, thanks. What was strange is that it worked fine with the Mills and 1000G gold standard file alone, and only fails when using the 1000G phase1 indels file (either alone or along with the Mills), which is why I didn't think there was a problem with the ordering. I will reorder the BAM file and re-try. Thanks again!
I was using an old version of the bundle provided by our IT team (hg19 1.5). I downloaded the files directly from the GATK ftp site for 2.3, and after re-ordering the bam file it is working correctly with both files. Thank you for your help, @Geraldine_VdAuwera!
Glad to hear your problem is solved!
Geraldine Van der Auwera, PhD