Bug Bulletin: we have identified a bug that affects indexing when producing gzipped VCFs. This will be fixed in the upcoming 3.2 release; in the meantime you need to reindex gzipped VCFs using Tabix.

Interface with BEAGLE Software

delangeldelangel Posts: 71GSA Member mod
edited September 2013 in Methods and Workflows

Note: As of version 4, BEAGLE reads and outputs VCF files directly, and can handle multiallelic sites. We have not yet evaluated what this means for the GATK-BEAGLE interface; it is possible that some of the information provided below is no longer applicable as a result.

Introduction

BEAGLE is a state of the art software package for analysis of large-scale genetic data sets with hundreds of thousands of markers genotyped on thousands of samples. BEAGLE can

  • phase genotype data (i.e. infer haplotypes) for unrelated individuals, parent-offspring pairs, and parent-offspring trios.
  • infer sporadic missing genotype data.
  • impute ungenotyped markers that have been genotyped in a reference panel.
  • perform single marker and haplotypic association analysis.
  • detect genetic regions that are homozygous-by-descent in an individual or identical-by-descent in pairs of individuals.

The GATK provides an experimental interface to BEAGLE. Currently, the only use cases supported by this interface are a) inferring missing genotype data from call sets (e.g. for lack of coverage in low-pass data), b) Genotype inference for unrelated individuals.

Workflow

The basic workflow for this interface is as follows:

After variants are called and possibly filtered, the GATK walker ProduceBeagleInput will take the resulting VCF as input, and will produce a likelihood file in BEAGLE format.

  • User needs to run BEAGLE with this likelihood file specified as input.
  • After Beagle runs, user must unzip resulting output files (.gprobs, .phased) containing posterior genotype probabilities and phased haplotypes.
  • User can then run GATK walker BeagleOutputToVCF to produce a new VCF with updated data. The new VCF will contain updated genotypes as well as updated annotations.

Producing Beagle input likelihoods file

Before running BEAGLE, we need to first take an input VCF file with genotype likelihoods and produce the BEAGLE likelihoods file using walker ProduceBealgeInput, as described in detail in its documentation page.

For each variant in inputvcf.vcf, ProduceBeagleInput will extract the genotype likelihoods, convert from log to linear space, and produce a BEAGLE input file in Genotype likelihoods file format (see BEAGLE documentation for more details). Essentially, this file is a text file in tabular format, a snippet of which is pasted below:

marker    alleleA alleleB NA07056 NA07056 NA07056 NA11892 NA11892 NA11892 
20:60251    T        C     10.00   1.26    0.00     9.77   2.45    0.00 
20:60321    G        T     10.00   5.01    0.01    10.00   0.31    0.00 
20:60467    G        C      9.55   2.40    0.00     9.55   1.20    0.00 

Note that BEAGLE only supports biallelic sites. Markers can have an arbitrary label, but they need to be in chromosomal order. Sites that are not genotyped in the input VCF (i.e. which are annotated with a "./." string and have no Genotype Likelihood annotation) are assigned a likelihood value of (0.33, 0.33, 0.33).

IMPORTANT: Due to BEAGLE memory restrictions, it's strongly recommended that BEAGLE be run on a separate chromosome-by-chromosome basis. In the current use case, BEAGLE uses RAM in a manner approximately proportional to the number of input markers. After BEAGLE is run and an output VCF is produced as described below, CombineVariants can be used to combine resulting VCF's, using the "-variantMergeOptions UNION" argument.

Running Beagle

We currently only support a subset of BEAGLE functionality - only unphased, unrelated input likelihood data is supported. To run imputation analysis, run for example

java -Xmx4000m -jar path_to_beagle/beagle.jar like=path_to_beagle_output/beagle_output out=myrun

Extra BEAGLE arguments can be added as required.

About Beagle memory usage

Empirically, Beagle can run up to about ~800,000 markers with 4 GB of RAM. Larger chromosomes require additional memory.

Processing BEAGLE output files

BEAGLE will produce several output files. The following shell commands unzip the output files in preparation for their being processed, and put them all in the same place:

# unzip gzip'd files, force overwrite if existing
gunzip -f path_to_beagle_output/myrun.beagle_output.gprobs.gz
gunzip -f path_to_beagle_output/myrun.beagle_output.phased.gz
#rename also Beagle likelihood file to mantain consistency
mv path_to_beagle_output/beagle_output path_to_beagle_output/myrun.beagle_output.like 

Creating a new VCF from BEAGLE data with BeagleOutputToVCF

Once BEAGLE files are produced, we can update our original VCF with BEAGLE's data. This is done using the BeagleOutputToVCF tool.

The walker looks for the files specified with the -B(type,BEAGLE,file) triplets as above for the output posterior genotype probabilities, the output r^2 values and the output phased genotypes. The order in which these are given in the command line is arbitrary, but all three must be present for correct operation.

The output VCF has the new genotypes that Beagle produced, and several annotations are also updated. By default, the walker will update the per-genotype annotations GQ (Genotype Quality), the genotypes themselves, as well as the per-site annotations AF (Allele Frequency), AC (Allele Count) and AN (Allele Number).

The resulting VCF can now be used for further downstream analysis.

Merging VCFs broken up by chromosome into a single genome-wide file

Assuming you have broken up your calls into Beagle by chromosome (as recommended above), you can use the CombineVariants tool to merge the resulting VCFs into a single callset.

java -jar /path/to/dist/GenomeAnalysisTK.jar \
  -T CombineVariants \
  -R reffile.fasta \
  --out genome_wide_output.vcf \
  -V:input1 beagle_output_chr1.vcf \
  -V:input2 beagle_output_chr2.vcf \
  .
  .
  .
  -V:inputX beagle_output_chrX.vcf \
  -type UNION -priority input1,input2,...,inputX
Post edited by Geraldine_VdAuwera on

Comments

  • ymwymw Posts: 9Member

    Hi, I used Beagle to perform genotype refinement based on a "ApplyRecalibration" vcf output. I am wondering whether I can use the "BaseRecalibrator" and "PrintReads" again to integrate the Beagle-refined VCF file into the original BAM file? The reason I ask is that I want to use the program psmc (by Heng Li) to infer the population history of my study species, and psmc requires fastq input files. I think I need to transfer a BAM file with refined base or variant scores to a fastq file for the psmc analyses. So, is there a best way to perform this job? Hope my question makes some sense.

    Thanks,

    Chih-Ming

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,277Administrator, GSA Member admin

    Hi there,

    Unfortunately I don't understand your question.What do you mean by "integrate the Beagle-refined VCF file into the original BAM file"? BAM and VCF files have very different types of content. Maybe what you want to do is recalibrate the base quality scores of your datatset, then revert the recalibrated BAM file to FASTQ format to use as input for psmc? Is that correct?

    Geraldine Van der Auwera, PhD

  • ebanksebanks Posts: 671GSA Member mod

    I'm happy to jump in on this one as I do understand what you want to do. I agree that your strategy would work just fine.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • ymwymw Posts: 9Member

    Hi Geraldine and Eric,

    I think both of you understand what I need. Although psmc bases on SNP data to infere population history, it requires input as fastq, not vcf, files. And psmc judges which SNPs to use based on their quality scores. According to my understading of GATK, the method I mentioned in the last post is the only way to put the base recalibration and variation recalibration results back to BAM file. Then, I will convert the refined BAM file to fatsq so that psmc can pick up the best (most reliable) SNPs for analysis.

    Please let me know if there are better ways to do this.

    Thanks,

    Chih-Ming

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,277Administrator, GSA Member admin

    That sounds correct. Good luck!

    Geraldine Van der Auwera, PhD

  • delangeldelangel Posts: 71GSA Member mod

    Actually, what you propose won't work: psmc takes a fasta (or fastq) with consensus diploid sequence of a particular individual, NOT a fastq coming from next-gen sequencing reads. You need to convert the vcf of your individual to the consensus sequence - maybe FastaAlternateReferenceMaker will produce what you need, but you need to check the psmc documentation to be sure of the format it requires.

  • blueskypyblueskypy Posts: 183Member

    If I use Beagle, I do not need to use ReadBackedPhasing; is it right?

  • delangeldelangel Posts: 71GSA Member mod

    Both tools actually address different problems: RBP is more accurate for phasing close variants since you use the read structure, and especially for higher depths and longer reads it will be more accurate than Beagle. On the other hand, if your sample comes from a population where you have good genotype data from other individuals (e.g. like in 1000 Genomes), using Beagle will be more effective in "filling the blanks" from your data and inferring LD structure and haplotype blocks from your other samples. So, in theory, you'd be better off running both, although we don't have extensive experience in the relative gains/issues of running one on top of the other.

  • blueskypyblueskypy Posts: 183Member
    edited May 2013

    hi, Delangel, Thanks so much for the detailed explanation! My samples are human samples. I'm new to genotype phasing, and I wonder if you can let me know how to run both. After Variant Quality Recalibration in the Best Practice workflow, should I do the following?

    1. ReadBackedPhasing
    2. ProduceBeagleInput
    3. beagle
    4. unzip beagle output
    5. BeagleOutputToVCF

    A 2nd question, if I have large RAM, can I run the whole exome on beagle instead of chr by chr? I used the following command where the vcf file contains all chrs, and beagle didn't give any error msg.

    # variant phasing
    #a. Performs physical phasing of SNP calls, based on sequencing reads. 
    java -Xmx3g -jar $gatkDir/GenomeAnalysisTK.jar -T ReadBackedPhasing \
     -R $refGenome \
     -I $sampleID.recal.reduced.bam \
     --variant $sampleID.snp.recaled.vcf \
     -L $sampleID.snp.recaled.vcf \
     -o $sampleID.snp.recaled.phased.vcf \
     --phaseQualityThresh 20.0
    
    #b. phasing by imputation
    #Converts the input VCF into a format accepted by the Beagle imputation/analysis program.
    java -Xmx3g -jar $gatkDir/GenomeAnalysisTK.jar -T ProduceBeagleInput \
     -R $refGenome \
     -V $sampleID.snp.recaled.phased.vcf \
     -o $sampleID.snp.recaled.phased.beagle.input
    
    java -Xmx10g -jar $softwareDir/beagle.jar \
    like=$sampleID.snp.recaled.phased.beagle.input \
    out=$sampleID.snp.recaled.phased.beagle.out
    
    Post edited by Geraldine_VdAuwera on
  • blueskypyblueskypy Posts: 183Member

    hi, Delangel, just wonder if I can ask a 3rd question:

    You mentioned that "if your sample comes from a population where you have good genotype data from other individuals (e.g. like in 1000 Genomes), using Beagle will be more effective", so I have two questions: 1. what are the snp and indel files from the broad bundle that I can use with beagle, 1000G_phase1.snps.high_confidence.b37.vcf and Mills_and_1000G_gold_standard.indels.b37.vcf? 2. should those files, e.g. 1000G_phase1.snps.high_confidence.b37.vcf, be called in beagle in the following way?

    java -Xmx10g -jar beagle.jar \
    like=unphased.beagle.input \
    phased=1000G_phase1.snps.high_confidence.b37.vcf \
    out=my.run
    

    Thanks so much for the help!

  • delangeldelangel Posts: 71GSA Member mod

    Hi there - unfortunately, those files won't work as you expect them to because they are sites files, and Beagle wants genotype files. Please review the GATK-Beagle interface documentation (available at http://gatkforums.broadinstitute.org/discussion/43/interface-with-beagle-software) about supported use cases. If you want to use Beagle to infer missing genotypes or to phase haplotypes from a single population you need genotype files from similar populations as your samples

  • blueskypyblueskypy Posts: 183Member

    hi, delangel, thanks for the reply! Would you mind answer my above question about how to run both RBP and Beagle?

  • delangeldelangel Posts: 71GSA Member mod

    Well, I thought I had replied :) - it's not fully true that running Beagle obviates the need for RBP since they do different things. You may want to use both as long as you have good genotype data from populations matching your sample so that Beagle works well.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,277Administrator, GSA Member admin

    Hi @blueskypy,

    For step-by-step info on how to run the tools, check out the slides from the December workshop here, specifically the "Genotype phasing and refinement" presentation.

    Geraldine Van der Auwera, PhD

  • blueskypyblueskypy Posts: 183Member
    edited May 2013

    hi, Geraldine, welcome back! Hope you had a good vacation!

    Yes, I did check the slides. But it doesn't give details on how to run both. I just want to confirm my steps in the above post is correct.

    Also can I run the whole exome using beagle all in once, instead of chr by chr, if I have large RAM?

    Post edited by blueskypy on
  • blueskypyblueskypy Posts: 183Member

    hi, delangel, thanks for the reply! The slides Genotype Phasing and Refinement pointed by Geraldine gives the impression that RBP and beagle should run in sequential order. But I feel you suggest that they can run in parallel; then how do I combine the results from the two?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,277Administrator, GSA Member admin

    Thanks, I did :)

    Yes, the steps you posted earlier are correct. You do RBP first and then use the output of that in the Beagle workflow.

    I can't really comment on proper usage of Beagle since it's not our software, but I would say it can't hurt to try. I'd expect that the worst thing that can happen is your run fails or hangs. If you have any concerns about the results being somehow different, you can always run a single chromosome separately and compare the results you get for that chromosome when it's run alone or as part of the entire exome.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,277Administrator, GSA Member admin

    I guess you could run them in parallel and then use CombineVariants, but I don't know how the tool will deal with potentially conflicting phasing information. @delangel has way more experience with this than me so I'll defer to his comments.

    Geraldine Van der Auwera, PhD

  • delangeldelangel Posts: 71GSA Member mod

    Hi - yes, as Geraldine said, if you are to run both then do first RBP and then Beagle. Depending on how much memory you have you should have no problem fitting whole exome data in memory, as Beagle's memory goes linearly (I think) in the # of variants you have. In the past we had to split runs per chromosome with perhaps a small performance loss but we were doing whole genome experiments.

  • blueskypyblueskypy Posts: 183Member

    Thank you both, Geraldine and Gelangel!

  • LaurentTellierLaurentTellier Posts: 3Member
    edited June 2013

    Hello Geraldine,

    Which program do you typically use to split your VCF by chromosome? Is VCF splitting built into one of your walkers?

    The BEAGLE documentation suggests using the lowmem=true option, which is supposed to double the work time, but remove the necessity to split and then re-combine the sample by chromosome. Have you tried this?

    Post edited by LaurentTellier on
  • LaurentTellierLaurentTellier Posts: 3Member

    You know what, I found for myself what I expect you might use: http://faculty.washington.edu/browning/beagle/divide.sample Thanks anyway, Geraldine.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,277Administrator, GSA Member admin

    Hi @LaurentTellier,

    Actually we tend to use our own tool, SelectVariants, when we want to extract subsets from a VCF.

    I personally don't have much opportunity to use BEAGLE so I can't comment on its options. For something like that I would recommend asking the folks who write BEAGLE -- they are in the best position to give advice on how to use their tool.

    Geraldine Van der Auwera, PhD

  • davewooddavewood University of QueenslandPosts: 2Member

    Hi @blueskypy, I would like to impute and phase using Beagle after performing readBackPhasing using GATK as you have described above, however it appears to me as though the ProduceBeagleInput tool does not maintain the phase of the readBackPhased SNPs in the beagle input file. I wonder if you could comment on whether your method worked? Kind Regards, Dave

  • davewooddavewood University of QueenslandPosts: 2Member

    Hi Geraldine, Thanks I see now - currently only unphased, unrelated Beagle input data is supported.

Sign In or Register to comment.