Correct workflow for analyzing multiple populations of a non-model species.
Hello GATK Team,
I have spent the past few weeks working through your site trying to figure out how to best approach my experiment, but I still have some remaining questions regarding how these guidelines fit into my project on a non-model species.
We currently have 30 samples from five groups of fish (surface and cave):
Population – # of individuals
Cave pop1 – 9
Cave pop2 – 9
Surface pop1 – 9
Surface pop2 – 2
Outgroup (different species) – 1
We have a rough reference genome with ~10,000 scaffolds made from a single cave individual. Each wild-collected individual above was individually barcoded and sequenced on 2-4 lanes in combination with other barcoded samples. Observed heterozygosity is 0.39-0.47 for cave, and 0.7-0.8 for surface. Genetic variability is typically higher in surface populations. Our goal is to obtain variant and genotype information for all individuals so that population genetic analyses may be performed.
After reviewing the best practices, guides, tool documentation, and forum discussions, here is the workflow that I think is most appropriate for these data, followed by the questions I currently have. I believe this workflow is right, but I'd appreciate confirmation and any advice.
1. Perform quality-control steps on the 100 bp PE reads we obtained: trim, clip, and filter with Trimmomatic (we have about 8-10x coverage per individual).
2. Align reads to the reference genome using BWA, specifying the @RG headers that GATK requires.
3. Sort the aligned BAM files (merging unpaired and paired reads if present).
4. Use Picard MarkDuplicates to mark PCR duplicates.
5. Index the BAM files after marking duplicates.
6. To establish a list of known variants, pass the samples through HaplotypeCaller and filter the results (steps 7-9).
7. Use HaplotypeCaller to create a gVCF of genotype likelihoods for each individual.
8. Pass all 30 gVCFs through GenotypeGVCFs together to obtain a VCF with all variant locations.
9. Filter the resulting VCF to obtain a set of high-confidence known variants.
10. Use RealignerTargetCreator (RTC) on each individual to identify a list of target intervals for IndelRealigner.
11. Use IndelRealigner to realign each individual.
12. Alternatively, pass all individuals from the same population into RTC together.
13. Perform indel realignment on each individual using the interval list created from its population.
14. Perform BQSR on the samples.
15. Analyze the covariate plots to determine whether the empirical and reported quality scores agree.
16. If not, return to step 7.
17. If so, use HaplotypeCaller and GenotypeGVCFs to create the raw callset.
18. Use VQSR to recalibrate variant calls.
19. Annotate and analyze the variants.
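To make this concrete, here is how I currently picture steps 1-8 for one sample. This is only a dry-run sketch that prints the commands rather than executing them; all file names, jar paths, and parameter values are placeholders, and the syntax follows the GATK 3.x-era tools.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints each command instead of executing it.
# All file names, jar paths, and thresholds are placeholders.
set -u
run() { echo "+ $*"; CMDS="${CMDS:-} $*"; }

REF=reference.fasta   # draft genome, ~10,000 scaffolds
S=cave1_ind01         # one barcoded individual

# 1. QC the 100 bp PE reads with Trimmomatic
run java -jar trimmomatic.jar PE ${S}_R1.fq.gz ${S}_R2.fq.gz \
  ${S}_R1.P.fq.gz ${S}_R1.U.fq.gz ${S}_R2.P.fq.gz ${S}_R2.U.fq.gz \
  ILLUMINACLIP:adapters.fa:2:30:10 SLIDINGWINDOW:4:15 MINLEN:36

# 2. Align with BWA-MEM, attaching the @RG header GATK needs
run bwa mem -R "@RG\tID:lane1\tSM:${S}\tLB:lib1\tPL:ILLUMINA" \
  $REF ${S}_R1.P.fq.gz ${S}_R2.P.fq.gz   # redirect stdout to ${S}.sam

# 3-5. Sort, mark PCR duplicates, index
run java -jar picard.jar SortSam I=${S}.sam O=${S}.sorted.bam SORT_ORDER=coordinate
run java -jar picard.jar MarkDuplicates I=${S}.sorted.bam O=${S}.dedup.bam M=${S}.dup_metrics.txt
run samtools index ${S}.dedup.bam

# 7-8. Per-sample gVCF, then joint genotyping across the 30 gVCFs
run java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R $REF \
  -I ${S}.dedup.bam --emitRefConfidence GVCF -o ${S}.g.vcf
run java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R $REF \
  -V ${S}.g.vcf -o all30.raw.vcf   # repeat -V for each of the 30 gVCFs
```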
Is this the workflow that you would recommend? I do have questions about a variety of the steps involved in this process. I would greatly appreciate your feedback on these:
Step 9: What is an appropriate way to filter the VCF file to obtain an accurate list of known variant sites?
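For reference, what I had in mind for step 9 was something like GATK's generic hard filters. Again a dry-run sketch with placeholder file names; the thresholds below are the starting points from your hard-filtering documentation, not values validated on our data.

```shell
# Dry-run sketch: prints commands only; file names are placeholders (GATK 3.x).
run() { echo "+ $*"; CMDS="${CMDS:-} $*"; }
REF=reference.fasta

# Pull out the SNPs, then flag sites failing the generic hard-filter expression
run java -jar GenomeAnalysisTK.jar -T SelectVariants -R $REF \
  -V all30.raw.vcf -selectType SNP -o raw_snps.vcf
run java -jar GenomeAnalysisTK.jar -T VariantFiltration -R $REF \
  -V raw_snps.vcf \
  --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
  --filterName "basic_snp_filter" -o filtered_snps.vcf
```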
Step 11: Can realignment be performed on single individuals without a list of known indels?
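What I was picturing for step 11 is running both tools without any -known argument, along these lines (dry-run sketch, placeholder file names, GATK 3.x syntax):

```shell
# Dry-run sketch: prints commands only; placeholders throughout (GATK 3.x).
run() { echo "+ $*"; CMDS="${CMDS:-} $*"; }
REF=reference.fasta
S=cave1_ind01

# No -known indels file: realignment targets come from the reads alone
run java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R $REF \
  -I ${S}.dedup.bam -o ${S}.intervals
run java -jar GenomeAnalysisTK.jar -T IndelRealigner -R $REF \
  -I ${S}.dedup.bam -targetIntervals ${S}.intervals \
  --consensusDeterminationModel USE_SW -o ${S}.realigned.bam
```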
Step 14: If the known sites are filtered too stringently, will there be sites that should be masked/protected here but are not? (For example, a true variant location present in only one individual may have a low variant quality score and thus be filtered out. Since it isn't in the list of known sites, it may be given a low base quality score during recalibration, which could prevent it from being called as a variant during the later round of haplotype calling (step 17).)
Step 14: Should BQSR be performed if we have low coverage for certain areas, and a potentially large amount of variation between individuals?
Step 15: Is there a quantitative way to determine whether the process should be redone, or if it is good enough to continue to create a raw callset?
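For step 15, this is how I understood the before/after comparison is generated (dry-run sketch as above; file names are placeholders and highconf.vcf stands in for the step-9 known sites):

```shell
# Dry-run sketch: prints commands only; placeholders throughout (GATK 3.x).
run() { echo "+ $*"; CMDS="${CMDS:-} $*"; }
REF=reference.fasta
S=cave1_ind01

# First pass: model the covariates, masking the step-9 known sites
run java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R $REF \
  -I ${S}.realigned.bam -knownSites highconf.vcf -o ${S}.recal.table
# Second pass on recalibrated data, then plot before vs. after
run java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R $REF \
  -I ${S}.realigned.bam -knownSites highconf.vcf -BQSR ${S}.recal.table \
  -o ${S}.post.table
run java -jar GenomeAnalysisTK.jar -T AnalyzeCovariates -R $REF \
  -before ${S}.recal.table -after ${S}.post.table -plots ${S}.recal_plots.pdf
```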
Step 18: Since we must provide VQSR with a "truth" data set, it seems like a stringent filter would be best. I am unsure what would be best to use as the set of known variants in this type of experiment. It seems like I want a set of knowns that went through a less stringent filter for BQSR, and a more stringently filtered set of calls for VQSR. Since it seems to be necessary to use the same set of known variants throughout all steps of the process, I am unsure what form of filtering is most appropriate.
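In case it helps to see what I mean: I was imagining feeding VQSR our own stringently filtered call set as the resource, roughly as below. This is a dry-run sketch; the resource name, prior, annotations, and tranche level are placeholder guesses on my part, not values I know to be right for this data.

```shell
# Dry-run sketch: prints commands only; placeholders throughout (GATK 3.x).
run() { echo "+ $*"; CMDS="${CMDS:-} $*"; }
REF=reference.fasta

# Our stringently filtered calls standing in as the training/truth resource
run java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R $REF \
  -input all30.raw.vcf \
  -resource:highconf,known=false,training=true,truth=true,prior=10.0 highconf.vcf \
  -an QD -an FS -an MQ -an MQRankSum -an ReadPosRankSum \
  -mode SNP -recalFile snps.recal -tranchesFile snps.tranches
run java -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R $REF \
  -input all30.raw.vcf -mode SNP --ts_filter_level 99.0 \
  -recalFile snps.recal -tranchesFile snps.tranches -o recalibrated.vcf
```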
What happens if sites that were called as known variants after the initial round of joint genotyping and filtering end up being resolved during realignment? Will VQSR or any other step be thrown off if sites listed as known variants actually lack variation?
If indel realignment can be performed without a list of known indels, perhaps using the USE_READS or USE_SW consensus models, should it be done before creating the list of known variant sites with HaplotypeCaller? That way, the variants that are present would be more likely to be true variants rather than artifacts of misalignment. If it were done in that order, though, I am not sure when BQSR should be performed.
Also, will the tools provided through GATK work given that our reference genome consists of ~10,000 scaffolds, or will pseudo-scaffolds need to be created?
Thank you for any help you can give! Apologies for the long list of questions; I am relatively new at this. Hopefully your answers will help others struggling with this too.