
Contamination output in Firecloud best practice pipeline

dannykwells (San Francisco, Member)

Hi everyone (and, in particular, @bshifaw ) - another question about a Firecloud/Broad pipeline:

In the pipeline pre-processing-b37-gatk there is a returned value from the task CheckContamination that is called "contamination".

  1. For normal samples, I am guessing I can interpret this as "tumor-in-normal" contamination.
  2. For tumor samples, I do not know how to interpret this value - is it meaningful at all?


Best Answer


  • bshifaw (Member, Broadie, Moderator, admin)
    edited September 2018

    The CheckContamination task in pre-processing-b37-gatk produces a contamination estimate that HaplotypeCaller uses through the following argument:

    --contamination_fraction_to_filter/-contamination    Fraction of contamination to aggressively remove
    If this fraction is greater than zero, the caller will aggressively attempt to remove contamination through biased down-sampling of reads (for all samples). Basically, it will ignore the contamination fraction of reads for each alternate allele. So if the pileup contains N total bases, then we will try to remove (N * contamination fraction) bases for each alternate allele.
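
As a rough illustration of the arithmetic in that description, here is a minimal Python sketch. It is not GATK's implementation: the function names, the per-allele count dictionary, and the choice to round down are all assumptions made for the example.

```python
from typing import Dict


def bases_to_remove(total_bases: int, contamination_fraction: float) -> int:
    """Number of bases to drop per alternate allele: N * contamination fraction.

    Rounding down is an assumption of this sketch, not documented GATK behavior.
    """
    if not 0.0 <= contamination_fraction <= 1.0:
        raise ValueError("contamination_fraction must be between 0 and 1")
    return int(total_bases * contamination_fraction)


def downsample_alt_counts(
    allele_counts: Dict[str, int],
    ref_allele: str,
    contamination_fraction: float,
) -> Dict[str, int]:
    """Remove up to N * fraction bases from every alternate allele in a pileup.

    The reference allele is left untouched; alternate counts never go below zero.
    """
    total = sum(allele_counts.values())
    k = bases_to_remove(total, contamination_fraction)
    return {
        allele: count if allele == ref_allele else max(count - k, 0)
        for allele, count in allele_counts.items()
    }


if __name__ == "__main__":
    pileup = {"A": 90, "T": 7, "G": 3}  # hypothetical pileup, reference allele "A"
    print(downsample_alt_counts(pileup, "A", 0.05))
```

With 100 total bases and a fraction of 0.05, up to 5 bases are taken from each alternate allele, so a low-level allele supported by only 3 reads disappears entirely while one supported by 7 reads keeps 2.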

    I believe this is general contamination and not specific to tumor/normal samples.
    @gauthier , feel free to add any additional comments.

  • dannykwells (San Francisco, Member)

    Hey @gauthier that makes sense, but it's a bit confusing because I don't know what it could compare to. We aren't putting a second sample in to compare. So how does this method differentiate between contamination and a somatic variant?

  • gauthier (Member, Broadie, Dev)

    Hi @dannykwells ,

    The contamination tools use a resource for common germline variants: HapMap or 1000G for older tools and gnomAD for the new CalculateContamination tool. The assumption is that the common germline variants that are in the contaminating sample but not in your individual of interest will far outnumber the truly somatic mutations at those same sites. (We also want the proportion of contamination to fit a model of a diploid contaminant, so for X% contamination the less common sites will account for ~X/2% of your coverage (het in contaminant) and the very common will be closer to X% (homVar in contaminant), which is different from what you would see from an X/2% or X% subclone with those same mutations.) You can get into a situation where you have X% contamination and some somatic calls at <=X%, and unfortunately it's very difficult to figure out whether those are truly somatic or rare variants in the contaminating sample. Such variants will likely get filtered, but if you have other criteria that make you more confident that the variants are somatic, like specific genes, you could ignore the filter and use the data anyway. Hopefully that gives you a little more confidence in our contamination estimation.
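
The diploid-contaminant reasoning above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm used by CalculateContamination or any other tool: the function names are hypothetical, the sample of interest is assumed to be homozygous reference at every site considered, and the contaminant's genotypes are assumed to follow the population allele frequency from the germline resource.

```python
from typing import Iterable, Tuple


def expected_alt_fraction(contamination: float, contaminant_genotype: str) -> float:
    """Expected alt-read fraction at a site where the sample itself is hom-ref.

    A het contaminant contributes half of its reads as alt (about X/2),
    a homVar contaminant contributes all of them (about X).
    """
    alt_copies = {"hom_ref": 0, "het": 1, "hom_var": 2}[contaminant_genotype]
    return contamination * alt_copies / 2


def naive_contamination_estimate(sites: Iterable[Tuple[int, int, float]]) -> float:
    """Naive estimate from (alt_reads, depth, population_allele_frequency) tuples.

    Assumption: on average a contaminant carries the alt allele in a fraction
    of its reads equal to the population allele frequency, so the expected alt
    count at a site is contamination * depth * allele_frequency.
    """
    sites = list(sites)
    alt_total = sum(alt for alt, _, _ in sites)
    expected_per_unit = sum(depth * af for _, depth, af in sites)
    if expected_per_unit == 0:
        raise ValueError("no informative sites")
    return alt_total / expected_per_unit


if __name__ == "__main__":
    print(expected_alt_fraction(0.10, "het"))      # less common site, het contaminant
    print(expected_alt_fraction(0.10, "hom_var"))  # very common site, homVar contaminant
    # Hypothetical sites: (alt reads, depth, population allele frequency)
    print(naive_contamination_estimate([(5, 100, 0.5), (10, 100, 1.0)]))
```

A subclone carrying the same variants would not show this dependence on population allele frequency across many common germline sites, which is the pattern the explanation above relies on.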

  • dannykwells (San Francisco, Member)

    Hey, I compared the output to contEst from CGA and the numbers are wildly different (and the contEst matches previously generated values on some of these samples) so I imagine I am likely using the contamination tool incorrectly. Would be happy to chat more, but we'll likely go with contEst for now.

  • dannykwells (San Francisco, Member)

    Interestingly, I'm not the only one to find that VerifyBamID gives wildly out-of-line estimates of contamination - this paper seems to report something similar.

  • davidben (Boston, Member, Broadie, Dev)

    @dannykwells You probably weren't using CalculateContamination incorrectly, as long as you were using a common germline variants VCF from the GATK resource bundle. The fact that contEst matches old results doesn't necessarily mean much, because basically every tool besides CalculateContamination has the same error modes when CNVs are present. ConPair is a partial exception. That said, there's no denying that CalculateContamination would overestimate contamination in some samples that contEst had no problem with, and we recently put in some big improvements to correct this. By the way, these improvements would never have happened without the generosity of users who take the time to let us know what our tools get wrong. It's much appreciated!

    We have put CalculateContamination through some very stringent validations with in silico spike-ins as our truth data. For example: the HCC1143 tumor (which is extremely aneuploid) without a matched normal with contaminating reads from two different samples adding up to 10%. In this case CalculateContamination gives about 10% while contEst gives 1.9%. I mean to put these up publicly on Firecloud some time soon because they are quite compelling.
