It looks like you're new here. If you want to get involved, click one of these buttons!
BEAGLE is a state of the art software package for analysis of large-scale genetic data sets with hundreds of thousands of markers genotyped on thousands of samples. BEAGLE can
The GATK provides and experimental interface to BEAGLE. Currently, the only use cases supported by this interface are a) inferring missing genotype data from call sets (e.g. for lack of coverage in low-pass data), b) Genotype inference for unrelated individuals.
The basic workflow for this interface is as follows:
After variants are called and possibly filtered, the GATK walker ProduceBeagleInput will take the resulting VCF as input, and will produce a likelihood file in BEAGLE format.
Before running BEAGLE, we need to first take an input VCF file with genotype likelihoods and produce the BEAGLE likelihoods file using walker ProduceBealgeInput, as described in detail in its documentation page.
For each variant in inputvcf.vcf, ProduceBeagleInput will extract the genotype likelihoods, convert from log to linear space, and produce a BEAGLE input file in Genotype likelihoods file format (see BEAGLE documentation for more details). Essentially, this file is a text file in tabular format, a snippet of which is pasted below:
marker alleleA alleleB NA07056 NA07056 NA07056 NA11892 NA11892 NA11892
20:60251 T C 10.00 1.26 0.00 9.77 2.45 0.00
20:60321 G T 10.00 5.01 0.01 10.00 0.31 0.00
20:60467 G C 9.55 2.40 0.00 9.55 1.20 0.00
Note that BEAGLE only supports biallelic sites. Markers can have an arbitrary label, but they need to be in chromosomal order. Sites that are not genotyped in the input VCF (i.e. which are annotated with a "./." string and have no Genotype Likelihood annotation) are assigned a likelihood value of (0.33, 0.33, 0.33).
IMPORTANT: Due to BEAGLE memory restrictions, it's strongly recommended that BEAGLE be run on a separate chromosome-by-chromosome basis. In the current use case, BEAGLE uses RAM in a manner approximately proportional to the number of input markers. After BEAGLE is run and an output VCF is produced as described below, CombineVariants can be used to combine resulting VCF's, using the "-variantMergeOptions UNION" argument.
We currently only support a subset of BEAGLE functionality - only unphased, unrelated input likelihood data is supported. To run imputation analysis, run for example
java -Xmx4000m -jar path_to_beagle/beagle.jar like=path_to_beagle_output/beagle_output out=myrun
Extra BEAGLE arguments can be added as required.
Empirically, Beagle can run up to about ~800,000 markers with 4 GB of RAM. Larger chromosomes require additional memory.
BEAGLE will produce several output files. The following shell commands unzip the output files in preparation for their being processed, and put them all in the same place:
# unzip gzip'd files, force overwrite if existing
gunzip -f path_to_beagle_output/myrun.beagle_output.gprobs.gz
gunzip -f path_to_beagle_output/myrun.beagle_output.phased.gz
#rename also Beagle likelihood file to mantain consistency
mv path_to_beagle_output/beagle_output path_to_beagle_output/myrun.beagle_output.like
Once BEAGLE files are produced, we can update our original VCF with BEAGLE's data. This is done using the BeagleOutputToVCF tool.
The walker looks for the files specified with the -B(type,BEAGLE,file) triplets as above for the output posterior genotype probabilities, the output r^2 values and the output phased genotypes. The order in which these are given in the command line is arbitrary, but all three must be present for correct operation.
The output VCF has the new genotypes that Beagle produced, and several annotations are also updated. By default, the walker will update the per-genotype annotations GQ (Genotype Quality), the genotypes themselves, as well as the per-site annotations AF (Allele Frequency), AC (Allele Count) and AN (Allele Number).
The resulting VCF can now be used for further downstream analysis.
Assuming you have broken up your calls into Beagle by chromosome (as recommended above), you can use the CombineVariants tool to merge the resulting VCFs into a single callset.
java -jar /path/to/dist/GenomeAnalysisTK.jar \
-T CombineVariants \
-R reffile.fasta \
--out genome_wide_output.vcf \
-V:input1 beagle_output_chr1.vcf \
-V:input2 beagle_output_chr2.vcf \
.
.
.
-V:inputX beagle_output_chrX.vcf \
-type UNION -priority input1,input2,...,inputX
Comments
Hi, I used Beagle to perform genotype refinement based on a "ApplyRecalibration" vcf output. I am wondering whether I can use the "BaseRecalibrator" and "PrintReads" again to integrate the Beagle-refined VCF file into the original BAM file? The reason I ask is that I want to use the program psmc (by Heng Li) to infer the population history of my study species, and psmc requires fastq input files. I think I need to transfer a BAM file with refined base or variant scores to a fastq file for the psmc analyses. So, is there a best way to perform this job? Hope my question makes some sense.
Thanks,
Chih-Ming
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi there,
Unfortunately I don't understand your question.What do you mean by "integrate the Beagle-refined VCF file into the original BAM file"? BAM and VCF files have very different types of content. Maybe what you want to do is recalibrate the base quality scores of your datatset, then revert the recalibrated BAM file to FASTQ format to use as input for psmc? Is that correct?
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •I'm happy to jump in on this one as I do understand what you want to do. I agree that your strategy would work just fine.
Eric Banks, PhD -- Group Leader, Methods Development, MPG, Broad Institute of Harvard and MIT
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Geraldine and Eric,
I think both of you understand what I need. Although psmc bases on SNP data to infere population history, it requires input as fastq, not vcf, files. And psmc judges which SNPs to use based on their quality scores. According to my understading of GATK, the method I mentioned in the last post is the only way to put the base recalibration and variation recalibration results back to BAM file. Then, I will convert the refined BAM file to fatsq so that psmc can pick up the best (most reliable) SNPs for analysis.
Please let me know if there are better ways to do this.
Thanks,
Chih-Ming
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •That sounds correct. Good luck!
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Actually, what you propose won't work: psmc takes a fasta (or fastq) with consensus diploid sequence of a particular individual, NOT a fastq coming from next-gen sequencing reads. You need to convert the vcf of your individual to the consensus sequence - maybe FastaAlternateReferenceMaker will produce what you need, but you need to check the psmc documentation to be sure of the format it requires.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •If I use Beagle, I do not need to use ReadBackedPhasing; is it right?
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Both tools actually address different problems: RBP is more accurate for phasing close variants since you use the read structure, and especially for higher depths and longer reads it will be more accurate than Beagle. On the other hand, if your sample comes from a population where you have good genotype data from other individuals (e.g. like in 1000 Genomes), using Beagle will be more effective in "filling the blanks" from your data and inferring LD structure and haplotype blocks from your other samples. So, in theory, you'd be better off running both, although we don't have extensive experience in the relative gains/issues of running one on top of the other.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •hi, Delangel, Thanks so much for the detailed explanation! My samples are human samples. I'm new to genotype phasing, and I wonder if you can let me know how to run both. After Variant Quality Recalibration in the Best Practice workflow, should I do the following? 1. ReadBackedPhasing 2. ProduceBeagleInput 3. beagle 4. unzip beagle output 5. BeagleOutputToVCF
A 2nd question, if I have large RAM, can I run the whole exome on beagle instead of chr by chr? I used the following command where the vcf file contains all chrs, and beagle didn't give any error msg.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •hi, Delangel, just wonder if I can ask a 3rd question:
You mentioned that "if your sample comes from a population where you have good genotype data from other individuals (e.g. like in 1000 Genomes), using Beagle will be more effective", so I have two questions: 1. what are the snp and indel files from the broad bundle that I can use with beagle, 1000G_phase1.snps.high_confidence.b37.vcf and Mills_and_1000G_gold_standard.indels.b37.vcf? 2. should those files, e.g. 1000G_phase1.snps.high_confidence.b37.vcf, be called in beagle in the following way?
Thanks so much for the help!
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •