Finding informative SNPs between groups irrespective of the reference base

Kvpmulder, Washington (Member)
edited October 2013 in Ask the GATK team

Dear GATK community,

I have Illumina sequences for 8 individuals of two different species and want to find SNPs between and within species to develop 10,000 probes for a capture library to be used on a large number of individuals. As no closely related whole genome is available for my species, I assembled the reads into contigs, from which I constructed a reference. I remapped all individuals to this reference with BWA and want to find the SNPs that are most likely to be real and not due to error. I also want them to be informative for distinguishing between the two species as well as valuable for population genetics within species. There are two issues I have encountered:

  1. Is there a way to introduce some kind of grouping of individuals into GATK itself, or into the read group information of the BAM file, so that GATK can use that information to find and score SNPs? An extreme example: 95 reads across one species all with a T and 5 reads across the other species all with a C is still likely a good SNP, but it will not get a high QUAL score, or it will be removed by a filter requiring an overall AF higher than 0.05. I could trick GATK by telling it that all individuals of the same species are different libraries of the same individual, but I am sure there are smarter ways to do this that still retain the individual information. Does anybody have any ideas?

  2. Is there a way to reduce the importance of the reference base when scoring SNPs? Another extreme example: the reference is A and 99% of reads across all samples are G. This will be considered a strong SNP, but from my perspective of finding SNPs between individuals it is very weak.

I am running GATK 2.7.2.

Any information is very welcome!



  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    Hi Kevin,

    I'm not sure I understand your first question. When you call multiple samples, the AF is still calculated per sample. So even if a SNP is only present in one of many samples, if there is a normal amount of evidence for it, it will be called and won't be filtered.

    For your second question, the solution is to design a post-processing analysis in which you identify the SNPs that are informative vs. those that aren't. It doesn't really matter for calling what the reference allele is.

  • Kvpmulder, Washington (Member)

    Hi Geraldine,

    Thanks for the quick reply. For question one, the SNP will indeed be found if the evidence for one individual is good. But if possible I would like to include something similar to what GATK does for the same individual sequenced from different library preps. If I'm not mistaken, it improves the score if the same variant is found in different libraries of the same individual, and reduces the score if it's found in only one of them. For example, 2 reads from each of 3 libraries supporting a certain variant is better than 6 reads in one library and 0 in the other two. Is that correct, or does it just add up the reads? If it does take the different library preps into account, I would like to do the same thing, but for different individuals of the same species versus the other species. I guess that's not possible in GATK itself and I should do it in post-processing, just like in question two.

    Question two makes sense. I can indeed extract the calls per individual and their scores from the VCF file quite easily and then find the ones that are informative. Thank you!


  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    Hi Kevin,

    I'm not sure where you got that impression -- the GATK callers aggregate the reads per sample, and do not distinguish between different libraries. Other tools in GATK do that (such as BaseRecalibrator) but not the callers. What you're describing is done between samples in multisample calling -- a variant seen only in one sample may be given less confidence than one seen in many samples. Does that clarify things?
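
As a rough illustration of the post-processing suggested earlier in this thread, the sketch below groups samples by species and classifies each site by how the called alleles are distributed between groups, ignoring which allele happens to be the reference. It is not part of GATK. The function names `allele_counts_by_group` and `classify_site`, the sample names, and the group labels are all hypothetical, and the code assumes tab-separated records from an uncompressed multisample VCF in which every record has a `GT` field.

```python
from collections import Counter
from itertools import combinations


def allele_counts_by_group(vcf_line, samples, groups):
    """Count called alleles (as VCF allele indices) per group for one record.

    samples: sample names in VCF column order (hypothetical names here).
    groups:  dict mapping sample name -> group label, e.g. species.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    gt_index = fields[8].split(":").index("GT")  # position of GT in FORMAT
    counts = {}
    for name, entry in zip(samples, fields[9:]):
        group = groups.get(name)
        if group is None:
            continue  # sample not assigned to any group
        genotype = entry.split(":")[gt_index]
        for allele in genotype.replace("|", "/").split("/"):
            if allele != ".":  # skip no-calls
                counts.setdefault(group, Counter())[allele] += 1
    return counts


def classify_site(counts):
    """Label a site from group allele counts; reference identity is ignored."""
    observed = set().union(*counts.values())
    if len(observed) < 2:
        # Every called sample carries the same allele, so the site differs
        # (at most) from the reference and not between individuals.
        return "monomorphic"
    allele_sets = [set(c) for c in counts.values()]
    if len(allele_sets) > 1 and all(
        a.isdisjoint(b) for a, b in combinations(allele_sets, 2)
    ):
        return "fixed_between_groups"  # groups share no alleles
    return "polymorphic"


# Hypothetical sample layout: two individuals per species.
samples = ["s1", "s2", "s3", "s4"]
groups = {"s1": "A", "s2": "A", "s3": "B", "s4": "B"}
prefix = "contig1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t"
all_alt = prefix + "1/1:10\t1/1:8\t1/1:12\t1/1:9"
fixed = prefix + "0/0:10\t0/0:8\t1/1:12\t1/1:9"
mixed = prefix + "0/1:10\t0/0:8\t1/1:12\t./."
```

In a real run the sample names would come from the `#CHROM` header line of the VCF (columns ten onward), and the per-group counts could also be turned into allele frequencies to rank sites for within-species use. A site where every sample is homozygous for the non-reference allele is labelled monomorphic here, which matches the point made in question two.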
