CalculateContamination output table

igor, New York, Member ✭✭
edited March 2018 in Ask the GATK team

The CalculateContamination description says:

Calculates the fraction of reads coming from cross-sample contamination, given results from GetPileupSummaries ... this tool estimates contamination based on the signal from ref reads at hom alt sites.

It produces a resulting contamination table that looks something like:

level   contamination   error
whole_bam   0.001012896128155539    1.8192151501912648E-4

I wasn't able to find an explanation of what the output actually is. Of course, I assume that "contamination" and "error" should be as close to 0 as possible, but what exactly are they? Is it contamination from potential other individuals based on population allele frequencies? What happens when there is a matched normal? Do they both need to be contaminated? What about contamination of the normal by the tumor? Those seem to be very different problems, but could both be labeled as contamination.
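
For anyone who wants to work with this table programmatically, here is a minimal Python sketch that reads it. It assumes the file is tab-separated with the header row shown above (`level`, `contamination`, `error`); the function name `read_contamination_table` is made up for illustration and is not part of GATK.

```python
import csv
import io


def read_contamination_table(text):
    """Parse a contamination table (assumed tab-separated, with a header row)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = []
    for row in reader:
        rows.append(
            {
                "level": row["level"],
                # float() accepts both plain and scientific notation (e.g. 1.8E-4).
                "contamination": float(row["contamination"]),
                "error": float(row["error"]),
            }
        )
    return rows


# The example table from the post, written out with explicit tabs.
example = (
    "level\tcontamination\terror\n"
    "whole_bam\t0.001012896128155539\t1.8192151501912648E-4\n"
)
rows = read_contamination_table(example)
for row in rows:
    print(row["level"], row["contamination"], row["error"])
```

Reading the values as floats makes it easy to compare estimates across runs or samples later on.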

Best Answers

  • Sheila, Broad Institute, admin
    edited April 2018 Accepted Answer

    @igor
    Hi,

    Is it contamination from potential other individuals based on population allele frequencies?

    Yes, exactly. Have a look at the hands-on tutorial in the Presentations section for more information.

    What happens when there is a matched normal?

    Nothing, as this is not estimating normal cells in the tumor sample. There is a tool called deTiN that estimates tumor contamination in normal, and our team is working on a new tool to replace that, but it is not ready yet.

    Do they both need to be contaminated?

    No, we only estimate contamination from other samples in the normal. This is used in the filtering process: if a site has an allele fraction (AF) lower than the contamination estimate, the call is most likely contamination rather than a real variant.

    What about contamination of the normal by the tumor?

    See above.

    Also have a look at the Mutect2 presentations in the Presentations section. Those may help, as may the hands-on tutorial.

    -Sheila
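
The rule of thumb in the answer above (an allele fraction below the contamination estimate is suspect) can be sketched in Python as follows. This is only an illustration of that comparison, not the filter GATK actually applies, which may be more involved; the function name and the read counts are hypothetical.

```python
def af_below_contamination(alt_reads, total_reads, contamination):
    """Return True if the observed allele fraction is below the contamination estimate."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    allele_fraction = alt_reads / total_reads
    return allele_fraction < contamination


# Contamination estimate taken from the example table in the question.
contamination = 0.001012896128155539

# Hypothetical read counts, for illustration only.
low_af_site = af_below_contamination(1, 2000, contamination)   # AF = 0.0005
high_af_site = af_below_contamination(10, 200, contamination)  # AF = 0.05

print(low_af_site, high_af_site)
```

A site flagged this way is one where the alt reads could plausibly be explained by contaminating DNA alone.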

Answers

  • igor, New York, Member ✭✭

    Can you link to the proper hands-on tutorial? I am not sure how to find the right one.

    If the contamination is only estimated for the normal, why use both the tumor and the normal?

  • sergey_ko13, Member

    Dear @Sheila ,

    We are a bit confused about getting a contamination of 0.082 with an error of 0.00264.

    We have 40 tumor FFPE WES samples run on 16 different lanes (10 flow cells). How should we interpret such a high contamination rate?

    Could that be because we used af-only-gnomad.hg38.vcf.gz from the Mutect2 bundle and did not restrict our GetPileupSummaries input to our population (NFE) allele frequencies only?

    Thank you in advance!
    Sergey

  • Sheila, Broad Institute, Member, Broadie, Moderator, admin

    @sergey_ko13
    Hi Sergey,

    Did you run per-sample?

    did not restrict our GetPileupSummaries input to our population (NFE) allele frequencies only?

    What do you mean by this? I don't think you need to do any restricting, as the tool does that for you. Have a look at this.

    I will ask the Mutect2 developer to jump in here as well, because he may have some insight on FFPE sample contamination.

    -Sheila

  • shlee, Cambridge, Member, Broadie ✭✭✭✭✭
    edited June 2018

    Hi @sergey_ko13,

    For GetPileupSummaries, the recommendation is to use the VCF resource that is subset to common population variant sites (small_exac_common_3_grch38.vcf.gz) and not the full af-only-gnomad.hg38.vcf.gz. The latter includes rare alleles. Please let us know if you still get high levels of contamination with the common sites list.

    For FFPE samples, Section#5 of Tutorial#11136 should interest you.

    -Soo Hee
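
As a rough illustration of the recommendation above, the following Python sketch assembles the two command lines without running them. The BAM and table file names are hypothetical, as is the helper name `pileup_and_contamination_commands`. `-I`, `-V` and `-O` are the input, variant and output arguments of GetPileupSummaries; `-L` is the general GATK intervals argument and is left optional here, on the assumption that some releases may require it.

```python
import shlex


def pileup_and_contamination_commands(
    bam, common_sites_vcf, pileup_table, contamination_table, intervals=None
):
    """Build (but do not run) GetPileupSummaries and CalculateContamination commands."""
    get_pileups = [
        "gatk", "GetPileupSummaries",
        "-I", bam,
        "-V", common_sites_vcf,
        "-O", pileup_table,
    ]
    if intervals is not None:
        # Optional: restrict the pileup to these intervals.
        get_pileups += ["-L", intervals]
    calculate = [
        "gatk", "CalculateContamination",
        "-I", pileup_table,
        "-O", contamination_table,
    ]
    return get_pileups, calculate


# Hypothetical per-sample file names; the VCF is the resource named in the post.
commands = pileup_and_contamination_commands(
    "sample1.bam",
    "small_exac_common_3_grch38.vcf.gz",
    "sample1.pileups.table",
    "sample1.contamination.table",
)
for command in commands:
    print(" ".join(shlex.quote(part) for part in command))
```

Building the argument lists first keeps the per-sample file names in one place; the lists can then be passed to `subprocess.run` once the paths are real.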

  • davidben, Boston, Member, Broadie, Dev ✭✭✭

    @shlee is right about using just the common sites. The resource she pointed to contains sites that are common -- and with similar prevalence -- in pretty much all human populations, and certainly in one like NFE which is well-represented in ExAC. Therefore that is definitely not something to worry about. FFPE might be a concern, but why don't you run with the recommended resource and we'll see from there.

    Also, did all 40 of your samples have the same high contamination? What was the distribution of estimates?

  • sergey_ko13, Member

    Thank you all!

    @Sheila, I ran it on a single BAM containing all 40 of our samples (read groups carefully preserved), and as a result I have only one line in the contamination table: whole_bam. Now I see there should be per-sample values. Is that the default behaviour? How can I change that?

    @shlee, thank you for the clarification. I did not know about the difference between how germline sources are used in Mutect2 itself and in GetPileupSummaries. P.S. small_exac_common_3_grch38.vcf.gz has more than AF in INFO. Should I clean it the way mutect_resources.wdl says?

    @davidben, sorry for bothering you with an unrelated issue. Now I see that FFPE has nothing to do with the contamination estimate; I just mentioned the nature of our samples. My true problem was the wrong choice of population resource and whole_bam instead of per-sample contamination output. By the way, thank you for M2! It's amazing!

    Sergey

  • Sheila, Broad Institute, Member, Broadie, Moderator, admin

    @sergey_ko13
    Hi Sergey,

    I think the tools are not read group aware, so you will need to split the BAM file per sample and run the tools on each resulting file.

    -Sheila
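
As a first step toward a per-sample split, the sketch below groups read group IDs by sample name from the text of a BAM header (for example, the output of `samtools view -H`). It relies only on the standard `@RG` fields `ID` and `SM`; the header content, the function name `read_groups_by_sample` and the `UNKNOWN` fallback label are hypothetical. How the reads are then split is left to whichever tool you prefer.

```python
from collections import defaultdict


def read_groups_by_sample(header_text):
    """Map each sample name (SM) to the list of read group IDs (ID) in a SAM header."""
    groups = defaultdict(list)
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs after the record type.
        fields = dict(
            field.split(":", 1) for field in line.split("\t")[1:] if ":" in field
        )
        if "ID" in fields:
            groups[fields.get("SM", "UNKNOWN")].append(fields["ID"])
    return dict(groups)


# Hypothetical header with two samples sequenced across two lanes.
header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@RG\tID:lane1.s1\tSM:sample1\tPL:ILLUMINA\n"
    "@RG\tID:lane2.s1\tSM:sample1\tPL:ILLUMINA\n"
    "@RG\tID:lane1.s2\tSM:sample2\tPL:ILLUMINA\n"
)
by_sample = read_groups_by_sample(header)
for sample, read_group_ids in by_sample.items():
    print(sample, read_group_ids)
```

Each sample's list of read group IDs is what a splitting step needs in order to write one BAM per sample.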

  • sergey_ko13, Member

    Thank you @Sheila! That was a bit of a headache for us.

    I would suggest mentioning that in the docs. They say "The resulting table provides the fraction contamination, one line per sample, e.g. SampleID--TAB--Contamination. The file has no header." But I received a whole_bam line, and with a header. Is that a bug?

  • shlee, Cambridge, Member, Broadie ✭✭✭✭✭

    Hi @sergey_ko13,

    Thanks for reporting the discrepancy in the docs. I assume you are referring to the tool document. That may have been true at one point for the tool. With each version update, details may change. You can see an example output for v4.0.0.0 at https://software.broadinstitute.org/gatk/documentation/article?id=11136.
