A dog, a ship and an algorithm that measures relatedness
What is beagle.
Beagle is a type of dog known for its even temper and intelligence. It is also the name given to the ship Darwin sailed to the Galapagos (the H.M.S. Beagle), where he developed his theory of natural selection from observing finches. It is also the name of a genomics software package known for phasing and imputing genotypes. Beagle also calls genotypes and detects identity-by-descent (IBD), i.e. it can find segments of identical DNA that indicate two individuals are related.
I will be writing a series of posts where I share with you how I take 23andMe raw data to locate IBD segments using Beagle v4.1 (website; doi:10.1534/genetics.113.150029). For a review of the statistical methods and other theory underlying IBD, see doi:10.1534/genetics.112.148825. To see a skipper’s dog on a ship at sea, watch Irving Johnson’s footage of the Peking barque.
Why a GATK blog series that is mostly not about GATK?
I became interested in genomics after doing my 23andMe in 2013. I am in this field as a newb--some might say upstart. My team is giving me the freedom to write this series as a twenty percent project, as a reward for my hard work.
It appears to me that we are at the start of an era of genomics, as evidenced by the Precision Medicine Initiative. The initiative’s All of Us program will begin to recruit study participants this year (anyone in the U.S. can enroll) and aims to build a genomics resource that reflects the diversity of the U.S. population (doi:10.1038/ncomms7596). I became aware of this project when I attended a Town Hall meeting a week ago, because other teams under the same umbrella as the GATK team work on it.
I think this series of posts is useful for two reasons.
First, GATK and Picard are not the only programs that offer useful tools in genomic analyses. There are a multitude of bioinformatics software tools. Each can be said to occupy some niche in the ecosystem of tools that help us understand genomics. These versioned software programs enable reproducible data transformations. By experimenting on projects like this one, I hope to become more aware of the available tools. By focusing on one specific question, I can see how the tools fit together for a particular use case. Through this exercise, my perspective of the field will widen.
Second, the steps that I will go through to preprocess data will illustrate some considerations when preparing genotype data for analyses. If you haven’t noticed from our forum or experienced first-hand from error messages, the Picard and GATK are sticklers for adherence to SAM and VCF specifications. This is to ensure compatibility of data across the many genomics applications. Our reasons for this upfront cost are justified. We prevent the pain that unconventional formats, even in experienced hands, have a tendency to spawn.
Let’s go over some background.
Genotyping data is not just about calling two alleles for a diploid genomic site, aka locus. We can also capture phasing information, i.e. which set of alleles are on one chromosome molecule, or haplotype, and which are on the other. For example, you might have an
AT genotype for site 1 and a
GC for site 2. If we know that
C are from mom and
G are from dad, then we know the phasing of these two genotypes in relation to each other. I illustrate this concept in a diagram later.
Extend phasing to more loci, and what we get are blocks of sequence with a pattern of variant alleles that travel together and that we refer to as a linkage disequilibrium (LD) block. These blocks can be of variable size because they can be broken up in meiotic recombination 1, which happens on average roughly twice per chromosome. To phase short-read sequence and SNP array data, we impute an individual’s LD blocks using known population LD blocks. The alternative approach is to physically phase variants using expensive technology that sequences long molecules, e.g. from PacBio, or to incorporate long range phasing information, e.g. using 10X Genomics technology or algorithms such as phASER with RNA splice isoforms (doi:10.1101/039529).
Similarity and relatedness are two different concepts. You may have heard each of the following.
- Humans are 98.8% similar to chimpanzees and 84% similar to dogs.
- Humans are 99.9% genetically identical.
- Humans share 99.5% DNA with any other human.
- Siblings share between 83.81% and 87.47% of SNPs.
- Siblings are on average 50% identical in DNA to each other.
The first two instances compare the entirety of syntenic genomic sequence. The third incorporates copy number variants (CNVs). The fourth compares only genomic positions that are commonly variant in humans and ignores nonvariant DNA. The last refers to shared LD blocks.
As we decrease the degrees of relatedness between family members, the expected average percent of identical DNA segments drops precipitously, by halves. My sister Sue and I expect ~50% identical DNA (half-identical on 50% and fully identical on 25%). However, my 23andMe cousin Susan, who lives in Arkansas and who connected with me via 23andMe in 2013, only shares 0.11% of her DNA with me, stemming from a single half identical segment on chromosome 5 2. Beagle and algorithms like it enable us to identify and score such segments of potential shared haplotypes given unphased genotype data and a phased reference population resource. The algorithms impute phases using this resource.
What these algorithms do not capture is how Susan, Sue and I will share many alleles at sites of common variation across our genomes due to shared ethnicity. I want to clarify that despite the similarity of our names (my name is Soo Hee, pronounced So-hee), we are each real and distinct individuals.
The 23andMe data that I will use in this series of posts is from my sister Sue and myself. We are clearly related not just by the happenstance of our births but by 54.7% and 44 segments according to 23andMe. Given this, the results of IBD analysis will be unsurprising and serves as a control in how well we can impute phasing information using a public population resource file that, to be clear, does not represent our Korean ethnic background. This resource is the 1000 Genomes Project phase 3 genotypes (doi:10.1038/nature15393) that I will refer to as 1KGP. Of the resource’s 2504 individuals, who are all forward-thinking and generous volunteers, approximately 500 are East Asian. This supergroup includes ~300 Chinese, ~100 Vietnamese and ~100 Japanese, but no Koreans. Phasing confidence improves when the reference panel is ethnically matched to the individual. So ideally, we would want to match Sue and my data against a panel that includes Koreans. I should mention that the 1000 Genomes Project is continuing to fill gaps in population representation via the International Genome Sample Resource (IGSR).
Our expectation is that Beagle’s Refined IBD approach (doi:10.1534/genetics.113.150029) using the 1KGP resource will be robust to our genetic heterogeneity. I’m told there are better tools out there, e.g. IBDSeq, but that these require a reference panel from the same population as the analyzed samples. I did ask around if such phased data representing my peoples should exist but word is that none will be as carefully phased and reliable as the 1KGP resource.
To illustrate my last statement, let me digress a bit. Sue is getting married tomorrow, on April 8th, to Dave (pictured). So first I’d like to congratulate them. Sue and Dave’s ethnic origins are far apart--9,241 kilometers to be exact or a ~10 hour flight. Besides lending support to HLA-mediated attraction, their marriage has me asking this: what reference resource should we use to impute phasing for samples from Sue and Dave’s future children who would be admixed hapas?
Another reason to take this journey is to flex our file processing muscles. To start, posts will focus on specific pre-processing aims, and I use a variety of tools, both GATK and non-GATK, to achieve these. My selection criteria is that of a satisficer, i.e. I settle on the first tool or approach that achieves what I need. Given this, there may be more efficient ways to process the data, and hopefully those in the know who are reading will share these ways.
I'm curious to find out how our field’s better IBD algorithm performs using a reference resource that underrepresents the ethnic background of the samples. I am a minority in this country whose parents immigrated when I was a baby. Sue was born while we lived in Fargo, North Dakota. Growing up in the Midwest and in Washington state, I have gotten used to the majority of faces not looking like me. What I have learned is that unlike the proportion my ethnicity represents in the population, to enable well-powered genomics studies whose results will be applicable to me, my ethnicity needs to be disproportionately overrepresented in programs like All of Us. I want my genetics represented because I want to benefit from the findings from genomics studies towards a longer and healthier life.
Gleaning equally useful information from studies as those in the ethnic majority means that those of us in the minority, including those of us who are African, Hispanic, mixed and Native Americans, need to over represent. I know there are other efforts, e.g. Genome Asia 100K that plans for 100,000 Asian genomes. However, consider context. The Genome Asia 100K represents diversity within Asia, where populations are relatively homogenous. The resource, and findings from the resource, because of their population context, cannot apply widely to admixed populations such as that of America. As I explained earlier, context matters for IBD imputation in that, for Sue and my ethnically homogenous samples, the IBDSeq algorithm that uses a homogenous reference panel will out perform Beagle with the diverse reference panel.
Why is IBD analysis important to research? For one, validating pedigrees by measuring relatedness is an integral quality control step in population genomics studies involving extended families. Second, it can help genetic linkage mapping. For example, haplotype phase inference can help fill in missing genotypes. Third, it can help decrease bias. We see this application in the routine removal of closely related samples in population reference resources.
We can learn the importance of representing diversity from Darwin’s Dogs. Last I heard, in December 2016, at the Broad company retreat, the project is only interested in sequencing mutts. You have to fill out a 100-question survey and they will let you know if they are interested in sequencing your dog’s genome. One thing they are interested in is associating behavior, e.g. OCD in dogs, to complex variation, e.g. multiple common variants. Sue and Dave have signed up Pepper (pictured above), who is of indeterminate mixed breed.
From where I’m standing, the contribution of genomics towards understanding disease appears so far to have come from (i) rare de novo mutations and (ii) isolated founder effects in extended but small families that control for genetic backdrop. Darwin’s Dogs explains a third approach in their blog. They are enabled to take this third approach in their studies because of findings in a 2016 article titled Complex disease and phenotype mapping in the domestic dog (doi:10.1038/ncomms10460). The key enabling factor is knowing the number of dogs to power an analysis, e.g. either 400 dogs of a single breed (200 with and 200 without the trait) or 1000 dogs of many breeds (500 with and 500 without). I would surmise that the latter scenario would yield qualitatively different results that would be more applicable to dogs in general.
 The average frequency of recombination events across a chromosome is captured in a metric geneticists call the centiMorgan.
 The algorithm 23andMe uses has a 7 cM (centiMorgan) and 700 SNPs threshold for relatedness to ensure high quality matches, so it is highly likely we share common recent ancestors.