Service note: Geraldine is on vacation this week; other members of GSA will be responding to questions, but they have a lot of work besides this, so be aware that responses may be a little slower than usual. Thank you for your patience.

Interface with BEAGLE Software

delangeldelangel Posts: 62GSA Official Member mod

Introduction

BEAGLE is a state of the art software package for analysis of large-scale genetic data sets with hundreds of thousands of markers genotyped on thousands of samples. BEAGLE can

  • phase genotype data (i.e. infer haplotypes) for unrelated individuals, parent-offspring pairs, and parent-offspring trios.
  • infer sporadic missing genotype data.
  • impute ungenotyped markers that have been genotyped in a reference panel.
  • perform single marker and haplotypic association analysis.
  • detect genetic regions that are homozygous-by-descent in an individual or identical-by-descent in pairs of individuals.

The GATK provides and experimental interface to BEAGLE. Currently, the only use cases supported by this interface are a) inferring missing genotype data from call sets (e.g. for lack of coverage in low-pass data), b) Genotype inference for unrelated individuals.

The basic workflow for this interface is as follows:

After variants are called and possibly filtered, the GATK walker ProduceBeagleInput will take the resulting VCF as input, and will produce a likelihood file in BEAGLE format.

  • User needs to run BEAGLE with this likelihood file specified as input.
  • After Beagle runs, user must unzip resulting output files (.gprobs, .phased) containing posterior genotype probabilities and phased haplotypes.
  • User can then run GATK walker BeagleOutputToVCF to produce a new VCF with updated data. The new VCF will contain updated genotypes as well as updated annotations.

Producing Beagle input likelihoods file

Before running BEAGLE, we need to first take an input VCF file with genotype likelihoods and produce the BEAGLE likelihoods file using walker ProduceBealgeInput, as described in detail in its documentation page.

For each variant in inputvcf.vcf, ProduceBeagleInput will extract the genotype likelihoods, convert from log to linear space, and produce a BEAGLE input file in Genotype likelihoods file format (see BEAGLE documentation for more details). Essentially, this file is a text file in tabular format, a snippet of which is pasted below:

marker    alleleA alleleB NA07056 NA07056 NA07056 NA11892 NA11892 NA11892 
20:60251    T        C     10.00   1.26    0.00     9.77   2.45    0.00 
20:60321    G        T     10.00   5.01    0.01    10.00   0.31    0.00 
20:60467    G        C      9.55   2.40    0.00     9.55   1.20    0.00 

Note that BEAGLE only supports biallelic sites. Markers can have an arbitrary label, but they need to be in chromosomal order. Sites that are not genotyped in the input VCF (i.e. which are annotated with a "./." string and have no Genotype Likelihood annotation) are assigned a likelihood value of (0.33, 0.33, 0.33).

IMPORTANT: Due to BEAGLE memory restrictions, it's strongly recommended that BEAGLE be run on a separate chromosome-by-chromosome basis. In the current use case, BEAGLE uses RAM in a manner approximately proportional to the number of input markers. After BEAGLE is run and an output VCF is produced as described below, CombineVariants can be used to combine resulting VCF's, using the "-variantMergeOptions UNION" argument.

Running Beagle

We currently only support a subset of BEAGLE functionality - only unphased, unrelated input likelihood data is supported. To run imputation analysis, run for example

java -Xmx4000m -jar path_to_beagle/beagle.jar like=path_to_beagle_output/beagle_output out=myrun

Extra BEAGLE arguments can be added as required.

About Beagle memory usage

Empirically, Beagle can run up to about ~800,000 markers with 4 GB of RAM. Larger chromosomes require additional memory.

Processing BEAGLE output files

BEAGLE will produce several output files. The following shell commands unzip the output files in preparation for their being processed, and put them all in the same place:

# unzip gzip'd files, force overwrite if existing
gunzip -f path_to_beagle_output/myrun.beagle_output.gprobs.gz
gunzip -f path_to_beagle_output/myrun.beagle_output.phased.gz
#rename also Beagle likelihood file to mantain consistency
mv path_to_beagle_output/beagle_output path_to_beagle_output/myrun.beagle_output.like 

Creating a new VCF from BEAGLE data with BeagleOutputToVCF

Once BEAGLE files are produced, we can update our original VCF with BEAGLE's data. This is done using the BeagleOutputToVCF tool.

The walker looks for the files specified with the -B(type,BEAGLE,file) triplets as above for the output posterior genotype probabilities, the output r^2 values and the output phased genotypes. The order in which these are given in the command line is arbitrary, but all three must be present for correct operation.

The output VCF has the new genotypes that Beagle produced, and several annotations are also updated. By default, the walker will update the per-genotype annotations GQ (Genotype Quality), the genotypes themselves, as well as the per-site annotations AF (Allele Frequency), AC (Allele Count) and AN (Allele Number).

The resulting VCF can now be used for further downstream analysis.

Merging VCFs broken up by chromosome into a single genome-wide file

Assuming you have broken up your calls into Beagle by chromosome (as recommended above), you can use the CombineVariants tool to merge the resulting VCFs into a single callset.

java -jar /path/to/dist/GenomeAnalysisTK.jar \
  -T CombineVariants \
  -R reffile.fasta \
  --out genome_wide_output.vcf \
  -V:input1 beagle_output_chr1.vcf \
  -V:input2 beagle_output_chr2.vcf \
  .
  .
  .
  -V:inputX beagle_output_chrX.vcf \
  -type UNION -priority input1,input2,...,inputX
Post edited by Geraldine_VdAuwera on

Comments

  • ymwymw Posts: 9Member

    Hi, I used Beagle to perform genotype refinement based on a "ApplyRecalibration" vcf output. I am wondering whether I can use the "BaseRecalibrator" and "PrintReads" again to integrate the Beagle-refined VCF file into the original BAM file? The reason I ask is that I want to use the program psmc (by Heng Li) to infer the population history of my study species, and psmc requires fastq input files. I think I need to transfer a BAM file with refined base or variant scores to a fastq file for the psmc analyses. So, is there a best way to perform this job? Hope my question makes some sense.

    Thanks,

    Chih-Ming

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,239Administrator, GSA Official Member admin

    Hi there,

    Unfortunately I don't understand your question.What do you mean by "integrate the Beagle-refined VCF file into the original BAM file"? BAM and VCF files have very different types of content. Maybe what you want to do is recalibrate the base quality scores of your datatset, then revert the recalibrated BAM file to FASTQ format to use as input for psmc? Is that correct?

    Geraldine Van der Auwera, PhD

  • ebanksebanks Posts: 483GSA Official Member mod

    I'm happy to jump in on this one as I do understand what you want to do. I agree that your strategy would work just fine.

    Eric Banks, PhD -- Group Leader, Methods Development, MPG, Broad Institute of Harvard and MIT

  • ymwymw Posts: 9Member

    Hi Geraldine and Eric,

    I think both of you understand what I need. Although psmc bases on SNP data to infere population history, it requires input as fastq, not vcf, files. And psmc judges which SNPs to use based on their quality scores. According to my understading of GATK, the method I mentioned in the last post is the only way to put the base recalibration and variation recalibration results back to BAM file. Then, I will convert the refined BAM file to fatsq so that psmc can pick up the best (most reliable) SNPs for analysis.

    Please let me know if there are better ways to do this.

    Thanks,

    Chih-Ming

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,239Administrator, GSA Official Member admin

    That sounds correct. Good luck!

    Geraldine Van der Auwera, PhD

  • delangeldelangel Posts: 62GSA Official Member mod

    Actually, what you propose won't work: psmc takes a fasta (or fastq) with consensus diploid sequence of a particular individual, NOT a fastq coming from next-gen sequencing reads. You need to convert the vcf of your individual to the consensus sequence - maybe FastaAlternateReferenceMaker will produce what you need, but you need to check the psmc documentation to be sure of the format it requires.

  • blueskypyblueskypy Posts: 19Member

    If I use Beagle, I do not need to use ReadBackedPhasing; is it right?

  • delangeldelangel Posts: 62GSA Official Member mod

    Both tools actually address different problems: RBP is more accurate for phasing close variants since you use the read structure, and especially for higher depths and longer reads it will be more accurate than Beagle. On the other hand, if your sample comes from a population where you have good genotype data from other individuals (e.g. like in 1000 Genomes), using Beagle will be more effective in "filling the blanks" from your data and inferring LD structure and haplotype blocks from your other samples. So, in theory, you'd be better off running both, although we don't have extensive experience in the relative gains/issues of running one on top of the other.

  • blueskypyblueskypy Posts: 19Member

    hi, Delangel, Thanks so much for the detailed explanation! My samples are human samples. I'm new to genotype phasing, and I wonder if you can let me know how to run both. After Variant Quality Recalibration in the Best Practice workflow, should I do the following? 1. ReadBackedPhasing 2. ProduceBeagleInput 3. beagle 4. unzip beagle output 5. BeagleOutputToVCF

    A 2nd question, if I have large RAM, can I run the whole exome on beagle instead of chr by chr? I used the following command where the vcf file contains all chrs, and beagle didn't give any error msg.

    # variant phasing
    #a. Performs physical phasing of SNP calls, based on sequencing reads. 
    java -Xmx3g -jar $gatkDir/GenomeAnalysisTK.jar -T ReadBackedPhasing \
     -R $refGenome \
     -I $sampleID.recal.reduced.bam \
     --variant $sampleID.snp.recaled.vcf \
     -L $sampleID.snp.recaled.vcf \
     -o $sampleID.snp.recaled.phased.vcf \
     --phaseQualityThresh 20.0
    
    #b. phasing by imputation
    #Converts the input VCF into a format accepted by the Beagle imputation/analysis program.
    java -Xmx3g -jar $gatkDir/GenomeAnalysisTK.jar -T ProduceBeagleInput \
     -R $refGenome \
     -V $sampleID.snp.recaled.phased.vcf \
     -o $sampleID.snp.recaled.phased.beagle.input
    
    java -Xmx10g -jar $softwareDir/beagle.jar \
    like=$sampleID.snp.recaled.phased.beagle.input \
    out=$sampleID.snp.recaled.phased.beagle.out
    
  • blueskypyblueskypy Posts: 19Member

    hi, Delangel, just wonder if I can ask a 3rd question:

    You mentioned that "if your sample comes from a population where you have good genotype data from other individuals (e.g. like in 1000 Genomes), using Beagle will be more effective", so I have two questions: 1. what are the snp and indel files from the broad bundle that I can use with beagle, 1000G_phase1.snps.high_confidence.b37.vcf and Mills_and_1000G_gold_standard.indels.b37.vcf? 2. should those files, e.g. 1000G_phase1.snps.high_confidence.b37.vcf, be called in beagle in the following way?

    java -Xmx10g -jar beagle.jar \
    like=unphased.beagle.input \
    phased=1000G_phase1.snps.high_confidence.b37.vcf \
    out=my.run
    

    Thanks so much for the help!

Sign In or Register to comment.