GATK AlternateReferenceMaker in non-model organism & SNP flanking regions
I have a question about using the GATK AlternateReferenceMaker and how it may affect down stream analysis. I apologize for the long post but I tried to include only relevant info. Any thoughts on analysis are helpful!
Here is basically what I did:
- Illumina sequencing of 300 libraries with methods similar to genotype-by-sequencing (like RAD-tag libraries without shearing)
- Created contigs of sequences within individuals (cap3) and compared across 300 individuals
- Selected contigs found in at least 5 individuals and mapped to a draft reference genome of sister species (bwa-mem)
Used GATK AlternateReferenceMaker to insert differences and create a "pseudo"-reference for my species
Then I have did SNP calling from raw sequences using GATK best practices (alignment of each individual to pseudo-reference, realignment, g.vcf, Haplotype Caller)
- I did hard filtering for quality, coverage, and missing data across SNPs
- I have sub-selected SNPs with minor allele frequencies > 0.25 to use in designing a SNP array for parentage (~800 SNPs)
- When I try to select the flanking regions around these SNPs using bedtools, it pulls the flanking regions from the pseudo-reference and many of them have a lot of N's... so my questions are:
How can I tell if these were produced from the process I used to create a "pseudo"-reference for my species? Could they be misalignments from sequence contigs to reference or between species? How would this affect the SNP calling proceess?
What is the best way to ground truth the sequencing region? Should I try to pull the alignments from the region around each SNP for each individual and look at them manually? How can I do this?
Finally, the sister species and my pseudo-reference are not indexed in the same way. I'm guessing I should've sorted them before indexing or specified the indexing to use when using the GATK AlternateReferenceMaker. Can I convert the indexing?
Thanks for your thoughts!