Allele-specific annotation and filtering of germline short variants
The traditional VQSR recalibration paradigm evaluates each position, and passes or filters all alleles at that position, regardless of how many alternate alleles occur there. This has major disadvantages in cases where a real variant allele occurs at the same position as an error that has sufficient evidence to be called as a variant. The goal of the Allele-Specific filtering workflow is to treat each allele separately in the annotation, recalibration and filtering phases.
Multi-allelic sites benefit the most from the Allele-Specific filtering workflow because each allele will be evaluated more accurately than if its data was lumped together with other alleles. Large callsets will benefit more than small callsets because multi-allelics will occur more frequently as the number of samples in a cohort increases. One callset with 42 samples that was used for development contains 3% multi-allelic sites, while the ExAC callset [http://biorxiv.org/content/early/2015/10/30/030338] with approximately 60,000 samples contains nearly 8% multi-allelic sites. Recalibrating each allele separately will also greatly benefit rare disease studies, in which rare alleles may not be shared by other members of the callset, but could still occur at the same positions as common alleles or errors.
No additional resource files are necessary compared to the traditional VQSR workflow, but this must be run starting from the sample BAM files. The relevant annotations cannot be calculated from VCF or GVCF files alone.
After running the Allele-Specific filtering workflow, several new annotations will be added to the INFO field for your variants (see below), and VQSR results will be based on those new annotations, though using SNP and INDEL tranche sensitivity cutoffs equivalent to the non-allele-specific best practices. If after analyzing your recalibrated data, you’re not convinced that this workflow is for you, you can still run the classic VQSR on your genotyped VCF because the standard annotations for VQSR are still included in the genotyped VCF.
To be clear, this workflow cannot be run without the GVCF mode. This is because the way we generate and combine the allele-specific data depends on having raw data for each sample in the GVCF.
Note that this workflow is not yet officially part of the GATK Best Practices for germline variant discovery. Although we are happy with the performance of this workflow, our own production pipelines have not yet been updated to include this, so it should still be considered experimental. However, we do encourage you to try this out on your own data and let us know what you find, as this helps us refine the tools and catch bugs.
Summary of workflow steps
Begin with a BAM file that has been fully pre-processed according to our Best Practices recommendations for each sample. The read data in the BAM are necessary to generate the allele-specific annotations.
Step 1: Generate a GVCF per sample with HaplotypeCaller
Using the locally-realigned reads, HaplotypeCaller will generate GVCFs with all of its usual standard annotations, plus raw data to calculate allele-specific versions of the standard annotations. That means each alternate allele in each VariantContext will get its own data used by downstream tools to generate allele-specific QualByDepth, RMSMappingQuality, FisherStrand and allele-specific versions of the other standard annotations. For example, this will help us sort out good alleles that only occur in a few samples and have a good balance of forward and reverse reads but occur at the same position as another allele that has bad strand bias because it’s probably a mapping error.
-G StandardAnnotation -G AS_StandardAnnotation -G StandardHCAnnotation to request the appropriate annotations at this stage.
Step 2: Consolidate GVCFs (using either CombineGVCFs or ImportGenomicsDB)
Here the allele-specific data for each sample is combined per allele, but is not yet in its final annotation form. Use
-G StandardAnnotation -G AS_StandardAnnotation to request the appropriate annotations at this stage.
Step 3: Joint-call all samples with GenotypeGVCFs
Raw allele-specific data for each sample is used to calculate the finalized annotation values. Use
-G StandardAnnotation -G AS_StandardAnnotationto request the appropriate annotations at this stage.
Step 4: Variant filtering with VQSR
In allele-specific mode (activated using
-AS), the VariantRecalibrator builds the statistical model based on data for each allele, rather than each site. This has the added benefit of being able to recalibrate the SNPs in mixed sites according to the appropriate model, rather than lumping them in with INDELs as had been done previously. It will also provide better results by matching the exact allele in the training and truth data rather than just the position.
When you run the second step of VQSR, ApplyRecalibration, allele-specific filters are calculated and stored in the AS_FilterStatus INFO annotation. A site-level filter is applied to each site based on the most lenient filter across all alleles. For example, if any allele passes, the entire site passes. If no alleles pass, then the filter will be applied corresponding to the allele with the lowest tranche (best VQSLOD).
The two ApplyRecalibration modes should be run in series, as in our usual Best Practices recommendations. If SNP and INDEL ApplyRecalibration modes are run in parallel and combined with CombineVariants (which would work for the standard VQSR arguments), in allele-specific mode any mixed sites will fail to be processed correctly.
Output of the workflow
This workflow adds new allele-specific info-level annotations to the VCFs and produces a final output with allele-specific filters based on the VQSR SNP and INDEL tranches.
The AS_Standard annotation set will produce allele-specific versions of our standard annotations. For AS_MQ, this means that the root-mean-squared mapping quality will be given for all of the reads that support each allele, respectively. For rank sum and strand bias tests, the annotation for each allele will compare that alternative allele’s values against the reference allele.
Recalibration files from allele-specific VariantRecalibrator
Each allele will be described in a separate line in the output recalibration (.recal) files. For the advanced analyst, this is a good way to check which allele has the worst data and is responsible for a NEGATIVE_TRAIN_SITE classification.
After both ApplyRecalibration modes are run, the INFO field will contain an annotation called AS_FilterStatus, which will list the filter corresponding to each alternate allele. Allele-specific culprit and VQSLOD scores will also be added to the final VCF in the AS_culprit and AS_VQSLOD annotations, respectively.
3 195507036 . C G,CCT 6672.42 VQSRTrancheINDEL99.80to99.90 AC=7,2;AF=0.106,0.030;AN=66;AS_BaseQRankSum=-0.144,1.554;AS_FS=127.421,52.461;AS_FilterStatus=VQSRTrancheSNP99.90to100.00,VQSRTrancheINDEL99.80to99.90;AS_MQ=29.70,28.99;AS_MQRankSum=1.094,0.045;AS_ReadPosRankSum=1.120,-7.743;AS_SOR=9.981,7.523;AS_VQSLOD=-48.3935,-7.8306;AS_culprit=AS_FS,AS_FS;BaseQRankSum=0.028;DP=2137;ExcessHet=1.6952;FS=145.982;GQ_MEAN=200.21;GQ_STDDEV=247.32;InbreedingCoeff=0.0744;MLEAC=7,2;MLEAF=0.106,0.030;MQ=29.93;MQRankSum=0.860;NCC=9;NEGATIVE_TRAIN_SITE;QD=10.94;ReadPosRankSum=-7.820e-01;SOR=10.484
3 153842181 . CT TT,CTTTT,CTTTTTTTTTT,C 4392.82 PASS AC=15,1,1,1;AF=0.192,0.013,0.013,0.013;AN=78;AS_BaseQRankSum=-11.667,-3.884,-2.223,0.972;AS_FS=204.035,22.282,16.930,2.406;AS_FilterStatus=VQSRTrancheSNP99.90to100.00,VQSRTrancheINDEL99.50to99.70,VQSRTrancheINDEL99.70to99.80,PASS;AS_MQ=58.44,59.93,54.79,59.72;AS_MQRankSum=2.753,0.123,0.157,0.744;AS_ReadPosRankSum=-9.318,-5.429,-5.578,1.336;AS_SOR=6.924,3.473,5.131,1.399;AS_VQSLOD=-79.9547,-2.0208,-3.4051,0.7975;AS_culprit=AS_FS,AS_ReadPosRankSum,AS_ReadPosRankSum,QD;BaseQRankSum=-2.828e+00;DP=1725;ExcessHet=26.1737;FS=168.440;GQ_MEAN=117.51;GQ_STDDEV=141.53;InbreedingCoeff=-0.1776;MLEAC=16,1,1,1;MLEAF=0.205,0.013,0.013,0.013;MQ=54.35;MQRankSum=0.967;NCC=3;NEGATIVE_TRAIN_SITE;QD=4.42;ReadPosRankSum=-2.515e+00;SOR=4.740
Since GATK3.4, GenotypeGVCFs has had the ability to output a “spanning deletion allele” (now represented with *) to indicate that a position in the VCF is contained within an upstream deletion and may have “missing data” in samples that contain that deletion. While the upstream deletions will continue to be recalibrated and filtered by VQSR similar to the way they always have been, these spanning deletion alleles that occur downstream (and represent the same event) will be skipped.
GVCF size increase
Using the default GVCF bands, the raw allele-specific data causes a minimal size increase, which amounted to less than a 1% increase on the NA12878 exome used for development.
Typical usage errors
Problem: WARN 08:35:26,273 ReferenceConfidenceVariantContextMerger - WARNING: remaining (non-reducible) annotations are assumed to be ints or doubles or booleans, but 0.00|15005.00|14400.00|0.00 doesn't parse and will not be annotated in the final VC.
Solution: Remember to add
-G Standard -G AS_Standard to the GenotypeGVCFs command
Problem: Standard (non-allele-specific) annotations are missing
Solution: HaplotypeCaller and GenotypeGVCFs need -G Standard specified if -G AS_Standard is also specified.