Guide to known-variant databases for non-human organisms?

odoyle81 · Colorado · Member
edited August 2013 in Ask the GATK team

I'm trying to run GATK on the rice genome and I'm having trouble finding a list of known variants for rice. Does anyone have a link, or a more general list of known-variant resources for other reference genomes?
Specifically, I found two rice genomes in dbSNP, but I can't find any documentation about which is which, or, once I'm on the FTP site, which file I should be using:
ftp://ftp.ncbi.nih.gov/snp/organisms/rice_4530/
ftp://ftp.ncbi.nih.gov/snp/organisms/rice_4536/
Should I take the VCF files for each chromosome and cat them into one big file? Clearly, I'm a little clueless here :)
ftp://ftp.ncbi.nih.gov/snp/organisms/rice_4530/VCF/
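
A plain `cat` of per-chromosome VCF files would repeat the header block in the middle of the output. The following is a minimal Python sketch of the idea, assuming every file carries the same header and that the files are supplied in the same contig order as the reference. The function name `concat_vcf_lines` and the two tiny in-memory "files" are hypothetical and made up for illustration; dedicated VCF tools are likely to handle real data more robustly.

```python
from typing import Iterable, List


def concat_vcf_lines(files: Iterable[Iterable[str]]) -> List[str]:
    """Merge per-chromosome VCF contents, keeping header lines from the first file only."""
    merged: List[str] = []
    for index, lines in enumerate(files):
        for line in lines:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            if line.startswith("#") and index > 0:
                continue  # header was already taken from the first file
            merged.append(line)
    return merged


# Made-up records standing in for two per-chromosome files.
header = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
]
chr1_lines = header + ["1\t100\t.\tA\tG\t.\t.\t.\n"]
chr2_lines = header + ["2\t250\t.\tC\tT\t.\t.\t.\n"]

merged = concat_vcf_lines([chr1_lines, chr2_lines])
print("\n".join(merged))
```

Open file handles iterate line by line, so they can be passed in place of the lists used here.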

Answers

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Unfortunately we don't know enough about the non-human space to help you, but hopefully someone from the user community will pipe up with some useful advice...

  • MarkEd · Korea · Member

    Hi!

    I know the team is on vacation right now, but I wanted to post this anyway.
    I'm also having a tough time finding data sets of known variants in rice. I tried the VCF files found at ftp://ftp.ncbi.nih.gov/snp/organisms/rice_4530/VCF/ but I don't think they are in the standard VCF format.

    I've also tried the SNP data found at ftp://ftp.plantbiology.msu.edu/pub/data/Oryza_SNP/ and used the SNP location information to create a BED file. But then I got an error about Tribble not being able to create the .idx file. I guess my BED file was malformed, but I couldn't figure out what specifically was wrong with it.

    Hope someone out there can help us.

    Thanks.

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Hi @MarkEd,

    Unfortunately we don't work with rice genomes, so we don't know what the best source of rice reference data is. Perhaps one of our users can contribute that information.

    I'm not sure what you mean when you say:

    "used the info on SNP location to create a bed file"

    For most typical uses of known variants, you need a VCF file, not a BED file. We do have some tools for converting Old DBSNP format (which might be what the txt files at that second link are) into VCF. See the documentation for VariantsToVCF.

  • MarkEd · Korea · Member

    Hi Ma'am Geraldine!

    Hope you had a fine vacation.

    Thanks for the response. To clarify things first, my comment was actually related to base recalibration.

    If I'm not mistaken, I can use a BED file as the input for known variants. So I thought that the walker doesn't really need the specific SNP genotypes, only their locations. Since what I have is a VCF of rice SNPs in a non-standard format, which can't be used directly as input, I took the SNP locations from that VCF and created a BED file.

    At first it didn't work out as I expected, because the walker couldn't create an .idx file. Next, I used IGV to create the .idx file for my BED file. Fortunately, it worked!

    I even have recalibration plots now, though obtaining them wasn't an easy task either.

    If you think I did anything wrong, I hope you can comment on it.

    Thank you very much.
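
The position-to-BED step described above can be sketched roughly as follows. This is only an illustration, not the exact procedure used in the post: it assumes the input is tab-separated with CHROM, POS, ID, and REF in the first four columns, which may not hold for the non-standard rice files. The function name `vcf_to_bed` and the sample records are hypothetical. The main point is the coordinate shift: VCF positions are 1-based, while BED starts are 0-based with an exclusive end.

```python
from typing import Iterable, Iterator, Tuple


def vcf_to_bed(lines: Iterable[str]) -> Iterator[Tuple[str, int, int]]:
    """Yield (chrom, start, end) BED intervals, one per VCF-like record.

    A record at 1-based POS with a REF allele of length n covers the
    0-based, half-open interval [POS - 1, POS - 1 + n).
    """
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        start = pos - 1
        yield chrom, start, start + len(ref)


# Made-up records: one SNP and one three-base REF allele.
records = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "Chr1\t1000\t.\tA\tG",
    "Chr1\t2000\t.\tACG\tA",
]
bed_rows = list(vcf_to_bed(records))
for chrom, start, end in bed_rows:
    print(f"{chrom}\t{start}\t{end}")
```

Writing the rows tab-separated, with no header line and with `start < end`, avoids some common causes of a BED file being rejected as malformed.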

  • odoyle81 · Colorado · Member

    Interesting approach, Mark. I'd be curious to hear how this works out. What are your intentions with your BED file? Are you using SNPs across all 20 varieties? Do you have to correct the locations, since the reference changed between MSU versions 5 and 7?

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Interesting indeed; we always use a VCF but you're right that it is possible to use a BED since all the program needs are the locations. But in line with @odoyle81's comment, be sure to check that the locations in your BED are based on the same reference that you used for mapping reads. And keep in mind that later on you may get stuck without a good way to compare your variant calls with the existing literature, since that requires alleles, not just positions. It may be worth putting in a little extra effort to find or produce a valid VCF file of rice knowns.
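
One rough way to catch gross mismatches between a BED file and the mapping reference is to compare the BED against the reference's `.fai` index, whose first two tab-separated columns are the contig name and its length. The sketch below is only a sanity check under that assumption: it flags unknown contig names and out-of-range intervals, and it cannot prove that positions come from the same assembly version. The function names and the example lengths are hypothetical, not real rice values.

```python
from typing import Dict, Iterable, List


def read_fai_lengths(fai_lines: Iterable[str]) -> Dict[str, int]:
    """Map contig name to length using the first two columns of a .fai index."""
    lengths: Dict[str, int] = {}
    for line in fai_lines:
        if line.strip():
            name, length = line.rstrip("\n").split("\t")[:2]
            lengths[name] = int(length)
    return lengths


def find_bed_problems(bed_lines: Iterable[str], lengths: Dict[str, int]) -> List[str]:
    """Report BED lines whose contig is unknown or whose interval is out of range."""
    problems: List[str] = []
    for number, line in enumerate(bed_lines, start=1):
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank, comment, and track/browser lines
        chrom, start_text, end_text = line.rstrip("\n").split("\t")[:3]
        start, end = int(start_text), int(end_text)
        if chrom not in lengths:
            problems.append(f"line {number}: unknown contig {chrom}")
        elif not 0 <= start < end <= lengths[chrom]:
            problems.append(
                f"line {number}: interval {start}-{end} outside {chrom} "
                f"(length {lengths[chrom]})"
            )
    return problems


# Made-up index and intervals for illustration.
fai = ["Chr1\t5000\t6\t60\t61"]
bed = ["Chr1\t999\t1000", "chr01\t10\t11", "Chr1\t4999\t5001"]
problems = find_bed_problems(bed, read_fai_lengths(fai))
print("\n".join(problems))
```

A stricter check would compare each known REF allele with the reference base at that position, which needs the alleles that a positions-only BED file no longer carries.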

  • MarkEd · Korea · Member
    edited January 2014

    This is a little off topic, but I've already posted several replies and none of them has appeared yet, except for this one.

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Our spam filter must have overreacted to you for some reason. I've verified your account so it shouldn't happen again. If it does, send me a private message and I'll look into it further.