Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

Read-backed Phasing

delangeldelangel Posts: 71GATK Developer mod
edited September 2012 in Methods and Workflows

Read-backed Phasing

Example and Command Line Arguments

For a complete, detailed argument reference, refer to the GATK document page here

Introduction

The biological unit of inheritance from each parent in a diploid organism is a set of single chromosomes, so that a diploid organism contains a set of pairs of corresponding chromosomes. The full sequence of each inherited chromosome is also known as a haplotype. It is critical to ascertain which variants are associated with one another in a particular individual. For example, if an individual's DNA possesses two consecutive heterozygous sites in a protein-coding sequence, there are two alternative scenarios of how these variants interact and affect the phenotype of the individual. In one scenario, they are on two different chromosomes, so each one has its own separate effect. On the other hand, if they co-occur on the same chromosome, they are thus expressed in the same protein molecule; moreover, if they are within the same codon, they are highly likely to encode an amino acid that is non-synonymous (relative to the other chromosome). The ReadBackedPhasing program serves to discover these haplotypes based on high-throughput sequencing reads.

The first step in phasing is to call variants ("genotype calling") using a SAM/BAM file of reads aligned to the reference genome -- this results in a VCF file. Using the VCF file and the SAM/BAM reads file, the ReadBackedPhasing tool considers all reads within a Bayesian framework and attempts to find the local haplotype with the highest probability, based on the reads observed.

The local haplotype and its phasing is encoded in the VCF file as a "|" symbol (which indicates that the alleles of the genotype correspond to the same order as the alleles for the genotype at the preceding variant site). For example, the following VCF indicates that SAMP1 is heterozygous at chromosome 20 positions 332341 and 332503, and the reference base at the first position (A) is on the same chromosome of SAMP1 as the alternate base at the latter position on that chromosome (G), and vice versa (G with C):

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMP1   
chr20   332341  rs6076509   A   G   470.60  PASS    AB=0.46;AC=1;AF=0.50;AN=2;DB;DP=52;Dels=0.00;HRun=1;HaplotypeScore=0.98;MQ=59.11;MQ0=0;OQ=627.69;QD=12.07;SB=-145.57    GT:DP:GL:GQ 0/1:46:-79.92,-13.87,-84.22:99
chr20   332503  rs6133033   C   G   726.23  PASS    AB=0.57;AC=1;AF=0.50;AN=2;DB;DP=61;Dels=0.00;HRun=1;HaplotypeScore=0.95;MQ=60.00;MQ0=0;OQ=894.70;QD=14.67;SB=-472.75    GT:DP:GL:GQ:PQ  1|0:60:-110.83,-18.08,-149.73:99:126.93

The per-sample per-genotype PQ field is used to provide a Phred-scaled phasing quality score based on the statistical Bayesian framework employed for phasing. Note that for cases of homozygous sites that lie in between phased heterozygous sites, these homozygous sites will be phased with the same quality as the next heterozygous site.

Limitations:

  • ReadBackedPhasing doesn't currently support insertions, deletions, or multi-nucleotide polymorphisms.
  • Input VCF files should only be for diploid organisms.

More detailed aspects of semantics of phasing in the VCF format

  • The "|" symbol is used for each sample to indicate that each of the alleles of the genotype in question derive from the same haplotype as each of the alleles of the genotype of the same sample in the previous NON-FILTERED variant record. That is, rows without FILTER=PASS are essentially ignored in the read-backed phasing (RBP) algorithm.
  • Note that the first heterozygous genotype record in a pair of haplotypes will necessarily have a "/" - otherwise, they would be the continuation of the preceding haplotypes.
  • A homozygous genotype is always "appended" to the preceding haplotype. For example, any 0/0 or 1/1 record is always converted into 0|0 and 1|1.
  • RBP attempts to phase a heterozygous genotype relative the preceding HETEROZYGOUS genotype for that sample. If there is sufficient read information to deduce the two haplotypes (for that sample), then the current genotype is declared phased ("/" changed to "|") and assigned a PQ that is proportional to the estimated Phred-scaled error rate. All homozygous genotypes for that sample that lie in between the two heterozygous genotypes are also assigned the same PQ value (and remain phased).
  • If RBP cannot phase the heterozygous genotype, then the genotype remains with a "/", and no PQ score is assigned. This site essentially starts a new section of haplotype for this sample.

For example, consider the following records from the VCF file:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMP1   SAMP2
chr1    1   .   A   G   99  PASS    .   GT:GL:GQ    0/1:-100,0,-100:99  0/1:-100,0,-100:99
chr1    2   .   A   G   99  PASS    .   GT:GL:GQ:PQ 1|1:-100,0,-100:99:60   0|1:-100,0,-100:99:50
chr1    3   .   A   G   99  PASS    .   GT:GL:GQ:PQ 0|1:-100,0,-100:99:60   0|0:-100,0,-100:99:60
chr1    4   .   A   G   99  FAIL    .   GT:GL:GQ    0/1:-100,0,-100:99  0/1:-100,0,-100:99
chr1    5   .   A   G   99  PASS    .   GT:GL:GQ:PQ 0|1:-100,0,-100:99:70   1|0:-100,0,-100:99:60
chr1    6   .   A   G   99  PASS    .   GT:GL:GQ:PQ 0/1:-100,0,-100:99  1|1:-100,0,-100:99:70
chr1    7   .   A   G   99  PASS    .   GT:GL:GQ:PQ 0|1:-100,0,-100:99:80   0|1:-100,0,-100:99:70
chr1    8   .   A   G   99  PASS    .   GT:GL:GQ:PQ 0|1:-100,0,-100:99:90   0|1:-100,0,-100:99:80

The proper interpretation of these records is that SAMP1 has the following haplotypes at positions 1-5 of chromosome 1:

  1. AGAAA
  2. GGGAG

And two haplotypes at positions 6-8:

  1. AAA
  2. GGG

And, SAMP2 has the two haplotypes at positions 1-8:

  1. AAAAGGAA
  2. GGAAAGGG
  • Note that we have excluded the non-PASS SNP call (at chr1:4), thus assuming that both samples are homozygous reference at that site.
Post edited by Geraldine_VdAuwera on

Comments

  • shawpashawpa Posts: 1Member

    According to what is written above, the readbackphasing doesn't support insertions or deletions. If my vcf file contains indels, should I remove them before running the readbackphasing command. I did run this command without removing indels from my vcf file and I got phasing information from the indel lines in the vcf. Can I not trust these?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,898Administrator, GATK Developer admin

    I believe the expected behavior for this tool is to ignore indels -- I've asked the author of the tool (@fromer) to jump in here and give a definitive answer. Can you maybe post the lines with the indels that you say were phased?

    Geraldine Van der Auwera, PhD

  • fromerfromer Posts: 13Member, Third-party Developer, GSA Collaborator

    For now, the RBP algorithm does not include indels and considers only SNPs. In fact, it should treat indels and non-PASS (failing filter) SNPs the same way -- it acts as if they're not there in the VCF file.

    Therefore, when looking at the output, you need to remember that RBP tries to phase successive SNPs as if no indel was in between them (so it might look like the indel in between is phased relative to the next SNP).

    We've actually developed a more robust syntax for denoting phase that will cover situation such as these. It's important to remember that currently the "phased relative to" relationship is implied by "|" as being phased relative to the previous PASS-ing biallelic SNP.

  • jorei499jorei499 Posts: 1Member

    Is this really correct?

    "The proper interpretation of these records is that SAMP1 has the following haplotypes at positions 1-5 of chromosome 1:

    "1. AGAAA"

    "2. GGGAG"

    I would interpret that pos 4 since it is 0/1 can not be phased with pos 3. Have I missed something or is it wrong in the example?

  • amyamy SwedenPosts: 3Member

    About the phased output vcf file, is that possible to require other FORMAT like AD and DP? I am interested in getting the allelic counts. Thank you :)

  • HeidiJTPHeidiJTP Posts: 9Member

    I have the same question as jorei499, this example does not make sense to me. I am still struggling to understand the vcf format. I ran readbackedphasing on 100 samples (sequenced from PCR amplicons, each read covers all variable sites). I find in every case that the first heterozygous site is unphased. I can open the fasta alignment and easily see what the phasing should be. Is this normal? How do I interpret this?

    txt
    txt
    205_phased.txt
    2K
    txt
    txt
    300_phased.txt
    2K
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,898Administrator, GATK Developer admin

    @amy I'm sorry but I don't understand your question, can you please clarify? Do you have a VCF file that does not have AD and DP for the samples?

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,898Administrator, GATK Developer admin

    @HeidiJTP‌, have a look at the part of the doc that says:

    Note that the first heterozygous genotype record in a pair of haplotypes will necessarily have a "/" - otherwise, they would be the continuation of the preceding haplotypes.

    Geraldine Van der Auwera, PhD

  • HeidiJTPHeidiJTP Posts: 9Member

    Thanks so much for always responding quickly Geraldine. Does this mean that there is no way to determine the phase of the first heterozygous site using this method?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,898Administrator, GATK Developer admin

    @HeidiJTP‌ The key here is that all the other sites after it are phased relative to that first site -- so by definition, it is phased (arbitrarily). I agree that the notation is ambiguous and confusing, but that's the format...

    Geraldine Van der Auwera, PhD

  • HeidiJTPHeidiJTP Posts: 9Member

    Ok thanks. I wasn't sure because the sample I was looking at had previously been typed differently, but I believe the issue was specific to that sample. Thanks for clearing that up for me!

  • amyamy SwedenPosts: 3Member

    @Geraldine_VdAuwera said: amy I'm sorry but I don't understand your question, can you please clarify? Do you have a VCF file that does not have AD and DP for the samples?

    The phased VCF file is the output form Read-backed phasing which only includes format GT:GL:GQ:(PQ) (just like the example in this page). My question is that can I ask other FORMATs like AD and DP? I would like to know the allelic counts. Thank you Geraldine!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,898Administrator, GATK Developer admin

    Hi @amy,

    I think the latest version should output those fields, if you started out with a VCF produced by GATK. Let me know if you find that's not the case.

    Geraldine Van der Auwera, PhD

  • bvecchiobvecchio Posts: 7Member

    Hi Geraldine,

    I am continuing my analysis of re-sequencing data for a 200kb region in 600 unrelated samples.  I have successfully used the ReadBackedPhasing tool and believe I will be able to take the output and feed it into BEAGLE for additional population based phasing of INDELs (which are not phased via the read based algorithm).  I am wondering if you know of a tool available which will generate a consensus sequence for phased regions, or that will easily allow determination of phase of SNPs in a given region.  For example, I'd like to query the phased VCF output of BEAGLE for a given snp, SNP A, and determine if it was found on the same haplotype has SNP B in a given individual (without delving into the VCF by hand in a text editor).
    

    As a side note, I was also able to use the VariantstoBinaryPed tool on our dataset for bi-allelic sites and wrote a code to convert multi-allelic sites into psudomarkers. However, the tool did have some trouble with the sex chromosomes, so I opted to remove them for the time being.

    Thank you for your time, -Briana

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,898Administrator, GATK Developer admin

    @bvecchio‌ There is no tool in GATK that will do exactly what you want. You can have the HaplotypeCaller output haplotype sequences using the -bamOut argument, but it doesn't actually use the phasing info from VCFs. Starting from a VCF you can use AlternateReferenceMaker to output consensus sequence, but that will output IUPAC ambiguity codes for heterozygous sites, and will not use phasing info either. I have heard of people writing their own scripts to do what you want but I can't recommend any as I haven't used them myself.

    Sex chromosomes tend to be problematic...

    Good luck!

    Geraldine Van der Auwera, PhD

  • amyamy SwedenPosts: 3Member

    Hi Geraldine,

    I started with a VCF file who do have FORMAT AD and DP. I tried the latest version 3.1-1 to do Read-backed phasing and things became a little complicate. First, I used a subset of the data, it works, but still do not have AD and DP. Then I tried all my data, it gave me " A GATK RUNTIME ERROR has occurred (version 3.1-1-g07a4bf8): ..... This might be a bug......MESSAGE: 255". Before, with exactly the same command, it worked in version 2.8.

    Do you think if I can use VariantAnnotator to get AD and DP? (e.g. -T VariantAnnotator -R xxx.fasta -I xxx.bam -o new.with.AD.DP.vcf -A DepthPerAlleleBySample -A Coverage --variant output.from.readbackedphasing.vcf -L interval.bed)

    Thank you very much :)

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,898Administrator, GATK Developer admin

    Hi @amy,

    Yes, I believe that should work.

    Can you post the full text of the error you got when you tried with all the data? Including all the log output please?

    Geraldine Van der Auwera, PhD

  • amyamy SwedenPosts: 3Member

    Hi Geraldine,

    Thank you for the answers:)

    Some more naive questions about AD and DP: I know the difference between them is filtered or not, however, if I sum all AD, should that equal to the total number of reads (passed the general filter) in the .bam file? (In my results, they are different, 25% less in sum AD.) What exactly is the filter who make difference between AD and DP? If the .bam file comes from RNA-seq, do you think AD can represent allelic expression ?

    Full text log output is attached.

    Thank you for the help.

    error.PNG
    1396 x 890 - 110K
  • fernfern Posts: 4Member

    Dear Authors

    Dose this phasing in the tool consider the paired end reads? If the one read of the pair shows variant "A", after some distance, the other read of the pair exhibits variant "G", rather than "T". Then, the haplotype is AG, rather than AT.

    Li

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,898Administrator, GATK Developer admin

    @fern Yes, read pair information is used in the phasing process.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.