Analysis of genome evolution over time with GATK

alessandrorossonialessandrorossoni GermanyMember
edited October 2016 in Ask the GATK team

Dear GATK team,
as a first time user of GATK I have a conceptual question that is related to the analysis of my experiment (I have already searched the forum for many hours now and did not find anything related, however please apologise if I missed something).

My experiment involved a starting culture (Timepoint 1) that has been put under selective pressure for over a year. DNA samples were taken at the beginning (Timepoint 1) after 6 months (Timepoint 2) and 12 months (Timepoint 3) and run 150+150 PE with IlluminaHiSeq 2500.

Since it's a haploid non model organism, I recognise I need to quality filter & bootstrap a set of known sites/indels until convergence is reached. I can feed this vcf's to the BaseRecalibrator before continuing the analysis. However, I can`t wrap my head around two things:

1) I wanted to approach the recalibration like this:
Use the samples (n=3) from Timepoint 1, to generate a set of known sites/indels. Hence, this reference set should include:

  • "homozygous" differences between the genomic reference and my reads. I assume that these differences do exist and are common sequence differences between the reference genome and my organisms. Example: Reference Chromosome 1, Position 1000 has "A", whereas all my reads have a "C" at this position.
  • "heterozygous" differences, where position 1234 on Chromosome 5 could be 30% A and 70% T.

--> I would use this set to base recalibrate Timepoint 2 samples and Timepoint 3 samples as well? The problem is that these timepoints have "evolved" and now contain new true positive ("real") mutations that are not included in the reference set derived from Timepoint 1 (the "non-mutated" genome). As I understand BaseRecalibrator would count all new mutations in Timepoint 2 and Timepoint 3 as sequencing error. Isn't that going to mess up the covariation model for Timepoint 2 samples and Timepoint 3 samples? Or should I use TWO sets of known sites, one containing all Timepoints that is used for BaseRecalibration for all samples and the other only including known sites from Timepoint 1 that is used for the further analysis and variant discovery?

2) I have found a nice publication "Whole-Genome Resequencing Reveals Extensive Natural Variation in the Model Green Alga Chlamydomonas reinhardtii" ( ) where they describe a step named HET labelling in Chlamydomonas, also a haploid organism.

"Variants with greater than one base strongly represented were presumed due to collapsed imperfect repeats in the reference genome. They were identified by running GATK’s UnifiedGenotyper in diploid mode and selecting variants called as heterozygous for at least one strain (with a genotype quality greater than 98) and with no strains called as a homozygous variant (with a genotype quality greater than 98). These variants were labeled as “HET” in the variant call format (vcf) files and were excluded from counts of SNVs."

I can see the benefit to target only "homozygous" mutations towards a reference genome if you are not looking for changes in the proportion of alleles/new alleles. However, in my case, the allelic composition might also change over time and this change can be relevant! Example: position 1234 on Chromosome 5 is 30% A and 70% T at Timepoint 1 and has changed to 80% A, 15% T and 5% C at Timepoint 3. I have been thinking to "split" my experiment into three different analyses: Analysis 1 would focus on "homozygous" loci, Analysis 2 would deal with changes in "heterozygous" loci, and Analysis 3 would look at InDels. Again, my problem is that I am not comparing Condition A vs Reference and looking for differences, but rather Timepoint 3 that is an evolution of Timepoint2 that is an evolution from Timepoint 1. Would it make more sense to do:

Timepoint 1 vs Reference + bootstrap known sites/indels and BaseRecalibrate on Timepoint 1 samples
Timepoint 2 vs Reference + bootstrap known sites/indels and BaseRecalibrate on Timepoint 2 samples
Timepoint 3 vs Reference + bootstrap known sites/indels and BaseRecalibrate on Timepoint 3 samples

...and finally looking for differences in between three sets? Would an "all vs all" genotyping produce the same results and the various categories can be extracted from the resulting vcf? I have the impression to be missing something important here!

I am sorry for the long post, but your help would be immensely appreciated! :)
Thank you for providing such a pipeline and the great work!


Best Answers


  • alessandrorossonialessandrorossoni GermanyMember
    edited October 2016

    Dear Geraldine,
    thank you so much for your quick answer! Your suggestions helped a lot already :)
    One part of your answer is still unclear to me and i would very appreciate your opinion on this matter one more time. I also attached a sketch of the experimental setup, so that things are as clear as possible.

    If I got you right, you are assuming that I am working with genetically homogeneous samples. As you can see in the attached experiment sketch, the "Timepoint 1" samples are genetically homogeneous (yes, HC with -ploidy 1), representing 3 replicates of the "unevolved" starting culture. From that point on, all cultures (4 "stress" cultures + 4 "control" cultures) have been evolving independently.

    Now you said something interesting that made me think to change my original plan: ^^
    "If you mean to say that you're actually sequencing evolved colonies and that you're interested in capturing the allelic divergence within a colony, then you should treat each colony as a pool of samples, with a theoretical ploidy set to reflect the smallest allele frequency you hope to capture. For example, if you want to be able to capture alleles present in 10% of the cells in the colony, then you would set your ploidy to 10. If you want to go deeper, e.g. to 5%, you would set it to 20."

    I was planning to run the analysis with HC & -ploidy 4 and to act like the four haploid biological replicates of any timepoint would be a quadruploid organism (or population containing 4 haploid individuals) and track changes as allelic divergence. E.g:

    Timepoint 2 (Mid) - Stress Condition - is made of:
    Stress-Culture 1 (haploid)
    Stress-Culture 2 (haploid)
    Stress-Culture 3 (haploid)
    Stress-Culture 4 (haploid)

    I would treat those as 1 culture with a ploidy of 4.

    Now assuming position 1234 on Chromosome 5 is "A" and has changed to "C" in Stress-Culture 4. As a consequence, I would be able to see 75% "A" and 25% "C" at this position using -ploidy 4 (HET variant). Again, I did not think one step further to increase the ploidy to 8 (or even 16). With -ploidy 8 I would be able to see e.g. if only 50% of Stress-Culture 4 has changed its nucleotide at position 1234 on Chromosome 5 from "A" and to "C". For the overall "Timepoint 2 (Mid) - Stress Condition" that would be 87.5% "A" and 12.5% "C".
    (3) Did I understand what you said correctly given the last example?
    (4) Does the approach to treat 4 haploid biological replicates as a quadruploid organism make much sense in this case, or is it better to set -ploidy 1 and see if there is evidence of alternate alleles? This would be reflected in the variant context annotation metrics and I would be able to decide how to deal with those during downstream analysis. Is there a smarter way to do that?
    (5) I am going to BaseRecalibrate using the 8 End Timepoint 3 samples (4 "stress" + 4 "control") to catch as much variation as possible.

    Thank you very much for your help! Best,

Sign In or Register to comment.