Analysis of genome evolution over time with GATK
Dear GATK team,
as a first time user of GATK I have a conceptual question that is related to the analysis of my experiment (I have already searched the forum for many hours now and did not find anything related, however please apologise if I missed something).
My experiment involved a starting culture (Timepoint 1) that has been put under selective pressure for over a year. DNA samples were taken at the beginning (Timepoint 1) after 6 months (Timepoint 2) and 12 months (Timepoint 3) and run 150+150 PE with IlluminaHiSeq 2500.
Since it's a haploid non model organism, I recognise I need to quality filter & bootstrap a set of known sites/indels until convergence is reached. I can feed this vcf's to the BaseRecalibrator before continuing the analysis. However, I can`t wrap my head around two things:
1) I wanted to approach the recalibration like this:
Use the samples (n=3) from Timepoint 1, to generate a set of known sites/indels. Hence, this reference set should include:
- "homozygous" differences between the genomic reference and my reads. I assume that these differences do exist and are common sequence differences between the reference genome and my organisms. Example: Reference Chromosome 1, Position 1000 has "A", whereas all my reads have a "C" at this position.
- "heterozygous" differences, where position 1234 on Chromosome 5 could be 30% A and 70% T.
--> I would use this set to base recalibrate Timepoint 2 samples and Timepoint 3 samples as well? The problem is that these timepoints have "evolved" and now contain new true positive ("real") mutations that are not included in the reference set derived from Timepoint 1 (the "non-mutated" genome). As I understand BaseRecalibrator would count all new mutations in Timepoint 2 and Timepoint 3 as sequencing error. Isn't that going to mess up the covariation model for Timepoint 2 samples and Timepoint 3 samples? Or should I use TWO sets of known sites, one containing all Timepoints that is used for BaseRecalibration for all samples and the other only including known sites from Timepoint 1 that is used for the further analysis and variant discovery?
2) I have found a nice publication "Whole-Genome Resequencing Reveals Extensive Natural Variation in the Model Green Alga Chlamydomonas reinhardtii" (http://www.plantcell.org/content/early/2015/09/21/tpc.15.00492.full.pdf+html?with-ds=yes ) where they describe a step named HET labelling in Chlamydomonas, also a haploid organism.
"Variants with greater than one base strongly represented were presumed due to collapsed imperfect repeats in the reference genome. They were identified by running GATK’s UnifiedGenotyper in diploid mode and selecting variants called as heterozygous for at least one strain (with a genotype quality greater than 98) and with no strains called as a homozygous variant (with a genotype quality greater than 98). These variants were labeled as “HET” in the variant call format (vcf) files and were excluded from counts of SNVs."
I can see the benefit to target only "homozygous" mutations towards a reference genome if you are not looking for changes in the proportion of alleles/new alleles. However, in my case, the allelic composition might also change over time and this change can be relevant! Example: position 1234 on Chromosome 5 is 30% A and 70% T at Timepoint 1 and has changed to 80% A, 15% T and 5% C at Timepoint 3. I have been thinking to "split" my experiment into three different analyses: Analysis 1 would focus on "homozygous" loci, Analysis 2 would deal with changes in "heterozygous" loci, and Analysis 3 would look at InDels. Again, my problem is that I am not comparing Condition A vs Reference and looking for differences, but rather Timepoint 3 that is an evolution of Timepoint2 that is an evolution from Timepoint 1. Would it make more sense to do:
Timepoint 1 vs Reference + bootstrap known sites/indels and BaseRecalibrate on Timepoint 1 samples
Timepoint 2 vs Reference + bootstrap known sites/indels and BaseRecalibrate on Timepoint 2 samples
Timepoint 3 vs Reference + bootstrap known sites/indels and BaseRecalibrate on Timepoint 3 samples
...and finally looking for differences in between three sets? Would an "all vs all" genotyping produce the same results and the various categories can be extracted from the resulting vcf? I have the impression to be missing something important here!
I am sorry for the long post, but your help would be immensely appreciated!
Thank you for providing such a pipeline and the great work!