Finding informative SNPs between groups irrespective of the reference base.
Dear GATK community,
I have Illumina sequences for 8 individuals of two different species and want to find SNPs between and within species to develop 10,000 probes for a capture library to be used on a large number of individuals. As no closely related whole genome is available for my species, I assembled the reads into contigs, from which I constructed a reference. I remapped all individuals to this reference with BWA and want to find the SNPs that are most likely to be real and not due to error. I also want them to be informative for distinguishing between the two species as well as valuable for population genetics within species. I have run into two issues:
Is there a way to introduce some kind of grouping of individuals into GATK itself, or into the read group information of the BAM file, so that GATK can use that information to find and score SNPs? An extreme example: 95 reads across one species, all with a T, and 5 reads across the other species, all with a C, is still likely a good SNP, but it will not get a high QUAL score, or it will be filtered out by an overall allele frequency (AF) cutoff above 0.05. I could trick GATK by telling it that all individuals of the same species are different libraries of the same individual, but I am sure there are smarter ways to do this that still retain the individual information. Does anybody have any ideas?
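To make the idea concrete, here is a minimal Python sketch of the kind of group-aware comparison I have in mind. It is a post-hoc calculation on genotype calls, not a GATK feature. The sample names, the `SPECIES` mapping and the function names are hypothetical, and I am assuming VCF-style GT strings (such as `0/1`) at a biallelic site.

```python
from collections import Counter

# Hypothetical sample-to-species mapping; names are placeholders.
SPECIES = {
    "s1": "A", "s2": "A", "s3": "A", "s4": "A",
    "s5": "B", "s6": "B", "s7": "B", "s8": "B",
}

def allele_counts(genotypes, group):
    """Count alleles in one group from GT strings; missing alleles ('.') are skipped."""
    counts = Counter()
    for sample, gt in genotypes.items():
        if SPECIES.get(sample) != group:
            continue
        for allele in gt.replace("|", "/").split("/"):
            if allele != ".":
                counts[allele] += 1
    return counts

def alt_frequency(counts):
    """Frequency of allele '1' among called alleles, or None if nothing was called."""
    total = sum(counts.values())
    if total == 0:
        return None
    return counts["1"] / total

def frequency_difference(genotypes, group_a="A", group_b="B"):
    """Absolute difference in ALT allele frequency between two groups (None if a group has no calls)."""
    freq_a = alt_frequency(allele_counts(genotypes, group_a))
    freq_b = alt_frequency(allele_counts(genotypes, group_b))
    if freq_a is None or freq_b is None:
        return None
    return abs(freq_a - freq_b)
```

A site where one species is fixed for one allele and the other species for the other would score 1.0 here, regardless of how few samples carry the minor allele overall. This works on genotypes rather than read counts, so it ignores depth and genotype quality.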
Is there a way to reduce the importance of the reference base when scoring SNPs? Another extreme example: the reference is A and 99% of reads across all samples are G. This will be considered a strong SNP, but from my perspective of finding SNPs between individuals it is very weak.
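As an illustration of what I mean, a minimal Python sketch of a reference-independent check could look like the following. Again, this is post-hoc filtering on genotype calls and not something GATK does; the function name is hypothetical, and I am assuming VCF-style GT strings keyed by sample name.

```python
def is_variable_among_samples(genotypes):
    """Return True if the called samples carry more than one distinct allele.

    A site where every sample is homozygous for the same non-reference allele
    differs from the reference but is not variable among the samples.
    """
    alleles = set()
    for gt in genotypes.values():
        for allele in gt.replace("|", "/").split("/"):
            if allele != ".":  # skip missing calls
                alleles.add(allele)
    return len(alleles) > 1
```

With this check, a site where all samples are `1/1` would be discarded even if its QUAL is high, while a site with a mix of `0/0` and `1/1` would be kept.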
I am running GATK 2.7.2.
Any information is very welcome!