SNP chip database for Base Quality Score Recalibration

shinken (Irapuato, Member)

Hi all,

I want to perform Base Quality Score Recalibration (BQSR) on maize data, and I am deciding how to obtain:

A database of known polymorphic sites to mask out

One of the possibilities I am exploring is to use the positions from a SNP chip, because I know that the chip positions come from high-quality SNPs; there are around 600K of them. Do you think it would be a good idea to use this as my database? I expect to obtain around 20 million SNPs from my calling, so I am wondering whether this database is too small.

Best.

Answers

  • AdelaideR (Unconfirmed, Member, Broadie, Moderator admin)

    Hello @shinken

    Have you tried the Maize Genetics and Genomics Database website?

    I found a link to polymorphic traits at panzea.org using "snp" as the keyword.

  • shinken (Irapuato, Member)

    Thank you very much @AdelaideR,

    Yes, there are SNP positions available, for example in the maize HapMap. However, the maize HapMap is quite diverse and includes over 60 million sites, so I think I could run into the following problems with that database:
    1- I am not sure how accurate those sites are, whereas I am sure that the sites from the chip are accurate. What do you think about using 600K sites for the recalibration?
    2- By using the HapMap sites I could end up masking most of my sites.
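
A minimal sketch of how point 2 could be checked before committing to a database: compute what fraction of your called sites each candidate known-sites set would mask. Note that BQSR skips known sites when counting mismatches, so a high overlap with real variants is not a problem in itself. The function name `masked_fraction` and the toy coordinates are hypothetical; real sites would be read from your VCF files.

```python
from typing import Iterable, Set, Tuple

Site = Tuple[str, int]  # (chromosome, 1-based position)


def masked_fraction(called: Iterable[Site], known: Set[Site]) -> float:
    """Return the fraction of called sites that also appear in the known-sites set."""
    called_set = set(called)
    if not called_set:
        return 0.0
    return len(called_set & known) / len(called_set)


# Toy data only (assumption): stand-ins for your calls, the chip, and the HapMap.
called_sites = [("chr1", 100), ("chr1", 250), ("chr2", 40), ("chr2", 900)]
chip_sites = {("chr1", 100)}
hapmap_sites = {("chr1", 100), ("chr1", 250), ("chr2", 40), ("chr3", 7)}

print(masked_fraction(called_sites, chip_sites))    # 0.25
print(masked_fraction(called_sites, hapmap_sites))  # 0.75
```

A low fraction for the chip would suggest that many real variants stay unmasked and get counted as sequencing errors, which is the concern behind the "too small" question.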

  • AdelaideR (Unconfirmed, Member, Broadie, Moderator admin)

    @shinken

    You can narrow down that list by strain, I believe.

    Is there a particular strain that is closest in characteristics to your set? Also, looking through the literature may yield a better SNP panel than a microchip. The microchip is designed for capturing the most generic sites, so it really depends on the company and the kit. A chip is a compilation, but sometimes has biases towards the most commercially important strains.

    I would recommend reaching out to the manufacturer of the kit to find out how many of their sites are polymorphic in most strains. They probably have a table they can provide, or at least a description of how the chip was developed.

  • shinken (Irapuato, Member)

    Thank you very much @AdelaideR,

    These 600K sites are more or less related to my data. From the HapMap I could also select a subset of individuals more or less related to my data. Would it be better to use a subset of my own SNPs, derived from the data that I want to recalibrate (maybe applying stronger hard filters), than to use any of the other SNP sources?
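
A rough sketch of the "stronger hard filters" idea, assuming the calls are in a VCF that carries the usual GATK annotations QD, FS and MQ. The thresholds below are illustrative assumptions, deliberately strict, not recommended values, and `high_confidence_snps` is a hypothetical helper. It keeps only biallelic SNPs that pass every threshold; the surviving records could then be written out and supplied as the known-sites set.

```python
from typing import Dict, Iterator, List

# Hypothetical strict thresholds (assumptions); tune them for your own data.
MIN_QUAL = 100.0
MIN_QD = 10.0
MAX_FS = 10.0
MIN_MQ = 50.0
BASES = {"A", "C", "G", "T"}


def parse_info(info_field: str) -> Dict[str, str]:
    """Turn a VCF INFO string such as 'QD=12.3;FS=1.2' into a dict."""
    info = {}
    for item in info_field.split(";"):
        key, _, value = item.partition("=")
        info[key] = value
    return info


def high_confidence_snps(vcf_lines: List[str]) -> Iterator[str]:
    """Yield biallelic SNP records that pass all of the strict thresholds."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        ref, alt, qual, info_field = fields[3], fields[4], fields[5], fields[7]
        if ref not in BASES or alt not in BASES:
            continue  # drop indels and multi-allelic records
        info = parse_info(info_field)
        try:
            values = (float(qual), float(info["QD"]),
                      float(info["FS"]), float(info["MQ"]))
        except (KeyError, ValueError):
            continue  # missing or non-numeric value: treat as not confident
        q, qd, fs, mq = values
        if q >= MIN_QUAL and qd >= MIN_QD and fs <= MAX_FS and mq >= MIN_MQ:
            yield line


# Toy records only (assumption), tab-separated as in a real VCF body.
toy_vcf = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t500.0\t.\tQD=25.0;FS=1.0;MQ=60.0",
    "chr1\t200\t.\tC\tT\t40.0\t.\tQD=3.0;FS=30.0;MQ=45.0",
    "chr2\t300\t.\tG\tGA\t800.0\t.\tQD=28.0;FS=0.5;MQ=60.0",
    "chr2\t400\t.\tT\tC\t300.0\t.\tQD=15.0;FS=2.0;MQ=58.0",
]
kept = list(high_confidence_snps(toy_vcf))
print([record.split("\t")[1] for record in kept])  # ['100', '400']
```

Since the calls come from the unrecalibrated data, one would typically recalibrate, call again, and repeat the filtering until the results stop changing noticeably.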

  • AdelaideR (Unconfirmed, Member, Broadie, Moderator admin)

    @shinken

    It is just a guess, but my hunch is that you are correct. This will reduce a large amount of noise that could be introduced into the dataset by trying to incorporate all known variants. The good news is that as you add to your dataset, you can rerun this analysis to incorporate new variants that you discover.
