Off-label workflow to simply call differences in two samples

shlee, Cambridge; Member, Broadie, Moderator
edited February 1 in Announcements

After years as a biochemist, when I am given two samples to compare, my first impulse is to ask what the functional differences are, i.e. which proteins are expressed differently between the two samples. I am interested in genomic alterations that ripple down the central dogma to transform a cell.

Please note the workflow that follows is NOT a part of the Best Practices. This is an illustrative, unsupported workflow. For the official Somatic Short Variant Calling Best Practices workflow, see Tutorial#11136.

To call every allele that differs between two samples, I have devised a two-pass workflow that takes advantage of Mutect2 features. The workflow runs Mutect2 in tumor-only mode and appropriates the --germline-resource argument to supply a single-sample VCF with allele fractions instead of population allele frequencies. It assumes that the two case samples being compared originate from the same parental line and that ploidy and mutation rates make it unlikely for any site to accumulate more than one allele change.


First, call on each sample using Mutect2's tumor-only mode.

gatk Mutect2 \
-R ref.fa \
-I A.bam \
-tumor A \
-O A.vcf

gatk Mutect2 \
-R ref.fa \
-I B.bam \
-tumor B \
-O B.vcf

Second, for each single-sample VCF, move the sample-level AF allele-fraction annotation to the INFO field and simplify to a sites-only VCF.

This is a heuristic solution in which we substitute sample-level allele fractions for the expected population germline allele frequencies. Mutect2 is actually designed to use population germline allele frequencies in somatic likelihood calculations, so this substitution allows us to fulfill the requirement for an AF annotation with plausible fractional values. The terminal screenshots highlight the data transpositions.

Before:

image

After:

image
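
For those who prefer the command line, the reformatting in this second step can be sketched with awk. This is only a rough illustration, not the method used to make the files in this post. The function name vcf_af_to_info is hypothetical, and the sketch assumes a single-sample Mutect2 VCF with AF listed in the FORMAT column. It also assumes that blanking QUAL and FILTER and dropping all other annotations is acceptable for a resource file.

```shell
# Hypothetical helper (not part of GATK): copy the sample-level AF into INFO
# and emit a sites-only VCF. Assumes a single-sample Mutect2 VCF whose FORMAT
# column lists AF; QUAL, FILTER and all other annotations are dropped.
vcf_af_to_info() {
  awk 'BEGIN { FS = OFS = "\t" }
    # Header: redeclare AF as an INFO annotation instead of a FORMAT one.
    /^##FORMAT=<ID=AF,/ { sub(/^##FORMAT/, "##INFO"); print; next }
    # Header: discard definitions of annotations that are not carried over.
    /^##(FORMAT|INFO)=/ { next }
    # Header: keep remaining meta lines (fileformat, contigs, and so on).
    /^##/ { print; next }
    # Column header: keep the eight fixed columns only.
    /^#CHROM/ { print $1, $2, $3, $4, $5, $6, $7, $8; next }
    {
      # Locate AF among the FORMAT keys and read the matching sample value.
      n = split($9, keys, ":"); split($10, vals, ":")
      af = ""
      for (i = 1; i <= n; i++) if (keys[i] == "AF") af = vals[i]
      if (af == "") next   # skip records without an allele fraction
      print $1, $2, $3, $4, $5, ".", ".", "AF=" af
    }'
}

# Usage (file names as in the post):
#   vcf_af_to_info < A.vcf > Aaf.vcf
#   vcf_af_to_info < B.vcf > Baf.vcf
```

Whatever route produces Aaf.vcf and Baf.vcf, it is worth checking the result with a VCF validator before passing it to Mutect2.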

Third, call on each sample in a second pass, again in tumor-only mode, with the following additions.

gatk Mutect2 \
-R ref.fa \
-I A.bam \
-tumor A \
--germline-resource Baf.vcf \
--af-of-alleles-not-in-resource 0 \
--max-population-af 0 \
-pon pon_maskAB.vcf \
-O A-B.vcf

gatk Mutect2 \
-R ref.fa \
-I B.bam \
-tumor B \
--germline-resource Aaf.vcf \
--af-of-alleles-not-in-resource 0 \
--max-population-af 0 \
-pon pon_maskAB.vcf \
-O B-A.vcf

  • Provide the reformatted single-sample callset of the other sample with the --germline-resource argument. It serves as the matched control for the case sample.
  • Avoid calling any allele present in the --germline-resource by setting --max-population-af to zero.
  • Maximize the probability of calling any differing allele by setting --af-of-alleles-not-in-resource to zero.
  • Prefilter sites with artifacts and cross-sample contamination with a panel of normals (PoN) from which confident variant sites for both sample A and B have been removed, e.g. with gatk SelectVariants -V pon.vcf -XL AandB_haplotypecaller.vcf -O pon_maskAB.vcf.

Fourth, filter out unlikely calls with FilterMutectCalls.

gatk FilterMutectCalls \
-V A-B.vcf \
-O A-B-filter.vcf

gatk FilterMutectCalls \
-V B-A.vcf \
-O B-A-filter.vcf

FilterMutectCalls provides many filters, e.g. filters that account for low base quality, clustered events, low mapping quality and short-tandem-repeat contractions. Of these, let's consider the multiallelic filter. It filters sites at which multiple variant alleles pass the tumor LOD threshold.

  • We assume case sample variant sites will have a maximum of one allele that is different from the --germline-resource control. A single allele call will pass the multiallelic filter. However, if we emit any shared variant allele alongside the differing allele, e.g. for a heterozygous site without ref alleles, then the call becomes multiallelic and will be filtered, which is not what we want. We previously set Mutect2’s --max-population-af to zero to ensure only the differing allele is called, and so here we can rely on FilterMutectCalls to filter artifactual multiallelic sites.
  • If multiple variant alleles are expected per call, then FilterMutectCalls' multiallelic filtering will be undesirable. For example, if changes in allele fraction for shared alleles were of interest for the two samples derived from the same parental line, and Mutect2 --max-population-af was set to one in the previous step to additionally emit the shared variant alleles, then you would expect multiallelic calls. These will be indistinguishable from artifactual multiallelic sites.
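
To see how much each filter, including the multiallelic one, affects a callset, one option is to tally the values in the FILTER column of the filtered VCF. The following is a minimal sketch. The function name tally_filters is hypothetical, and the filter names reported are simply whatever FilterMutectCalls wrote to the file.

```shell
# Hypothetical helper (not part of GATK): count records per FILTER value.
# A record that carries several filters, separated by ";", is counted once
# under each of them.
tally_filters() {
  awk -F'\t' '
    # Skip header lines; split the FILTER column (7th) on ";".
    !/^#/ { n = split($7, f, ";"); for (i = 1; i <= n; i++) count[f[i]]++ }
    END   { for (k in count) print count[k] "\t" k }
  ' | sort -k1,1nr -k2,2   # most frequent filter first
}

# Usage (file names as in the post):
#   tally_filters < A-B-filter.vcf
#   tally_filters < B-A-filter.vcf
```

A large multiallelic count in this tally would be a cue to revisit the assumption of at most one differing allele per site.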

This workflow produces contrastive variants. If the samples are a tumor and its matched normal, then the calls include sites where heterozygosity was lost.

We know that loss of heterozygosity (LOH) plays a role in tumorigenesis (doi:10.1186/s12920-015-0123-z). This leads us to believe that the heterozygosity of the proteins we express contributes to our health. If so, then somatic studies that catalog the gain of alleles should also catalog their loss. Can we assume that variants play no role in disease processes just because they are germline? How can we account for the combinatorial effects of the diploid nature of our genomes?

Remember that regions of LOH do not necessarily represent a haploid state but can be copy-neutral or even copy-amplified. As one parental chromosome copy is lost, the other may be duplicated to maintain copy number, which presumably compensates for dosage effects, as is the case in uniparental isodisomy.



Comments

  • This article is awesome! I do have a question now that I have read through it a bit. I guess I am just not fully understanding the --germline-resource flag. Based on the commands in this article, where are Baf.vcf and Aaf.vcf generated? I understand the flag's purpose and function, but I just don't know how to accomplish this on the command line.

  • shlee, Cambridge; Member, Broadie, Moderator
    edited February 8

    Hi @JmeAlena,

    Thank you!

    I do some hackery to reformat the data into these files. I'm pretty sure there are better ways so I left these details out.

    What I do involves Excel and is as follows. Please do share any better approaches.

    [1] Convert VCF data to table format using VariantsToTable.

    gatk VariantsToTable \
         -V A.vcf \
         -F CHROM -F POS -F ID -F REF -F ALT -F TLOD \
         -GF GT -GF AF \
         -O A.table
    

    [2] Modify A.table in Excel, e.g. using the CONCATENATE function:

    Don't forget to add the # in front of CHROM. Paste the concatenated column back in place as values, then remove or modify columns so that the body follows the VCF format. Copy the contents into a text editor, e.g. one with preferences set to remove trailing white space, and save as the text file Aaf.table.

    [3] Modify the VCF header.

    Subset out the header then use nano to modify.

    grep '^##' A.vcf > A.header
    nano A.header
    

    Manually (i) change FORMAT to INFO for the AF line and (ii) use CTRL+k to delete unnecessary lines, etc.

    [4] Concatenate header and body.

    cat A.header Aaf.table > Aaf.vcf
    

    [5] Validate file.

    gatk ValidateVariants -V Aaf.vcf
    