How to interpret output of ContEst

rebber (Stockholm, Member)

Hi!

I've recently started using ContEst and now I'd like to understand exactly what I get in the output file. What does each column mean?
Specifically:
- Are "contamination", "confidence_interval_95_width", "confidence_interval_95_low", "confidence_interval_95_high" fractions or percentages?
- What does "sites" represent?

I also have some questions regarding the following statistics printed to the screen:

INFO 14:38:48,617 ContEst - Population informed sites: 314
INFO 14:38:48,618 ContEst - Non homozygous variant sites: 277
INFO 14:38:48,618 ContEst - Homozygous variant sites: 37
INFO 14:38:48,619 ContEst - Passed coverage: 35
INFO 14:38:48,619 ContEst - Results: 10

- What is the coverage threshold that 35 sites have passed here? Can this threshold be set?
- What does the "Results" number refer to? I've noticed that it is the same number as in the "sites" column of the output file.

Issue · GitHub, by Sheila: issue number 1449, state closed, closed by chandrans

Answers

  • rebber (Stockholm, Member)

    I use GATK 3.6, btw.

  • shlee (Cambridge; Member, Broadie)
    edited November 2016

    Hi @rebber,

    Can you please share the output metrics you get? Let's tie your results back to the standard-out's metrics.

    Also, you can refer to the ContEst publication here (doi: 10.1093/bioinformatics/btr446).

  • rebber (Stockholm, Member)

    Thanks for your answers!

    I've read the publication but didn't find a specification of the output file in it. Earlier posts on this forum described the contamination as a fraction, yet I've had samples where the output contamination is >1, hence my confusion. From reading the source code and the linked page, though, it's clear that it is given as a percentage.

    Thanks also for the explanation of the site statistics. However, do you @Sheila really mean that "population informed sites" are the hom-var sites in my normal sample that are also present in the pop-file? What, then, is the difference from "homozygous variant sites"? I thought "population informed sites" were all the sites in the pop-file (non homozygous + homozygous variant sites).

    Where could I find more info about such parameters as --min-site-depth? I've read the page for ContEst (https://software.broadinstitute.org/gatk/documentation/tooldocs/org_broadinstitute_gatk_tools_walkers_cancer_contamination_ContEst.php) and the page with general command line arguments (https://software.broadinstitute.org/gatk/documentation/tooldocs/org_broadinstitute_gatk_engine_CommandLineGATK.php#--interval_set_rule), but found nothing about this. java -jar /path/to/GenomeAnalysisTK.jar -T ContEst -h didn't give any info about the min. depth either.

    One more question: Which genotyper is used for the normal bam?

    If it is still of interest, @shlee (though I think I already have the answers to my original questions):
    The command I run is:
    java -jar /path/to/GenomeAnalysisTK.jar -T ContEst -R /path/to/human_g1k_v37_decoy.fasta -I:eval /path/to/tumor.bam -I:genotype /path/to/normal.bam --popfile target_SNPs_AF.vcf -o /path/to/output_contest.txt

    The output for one of my samples looks like this:
    name population population_fit contamination confidence_interval_95_width confidence_interval_95_low confidence_interval_95_high sites
    META CEU n/a 0.3 0.3 0.2 0.5 11
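
For anyone post-processing this file, here is a minimal Python sketch that reads the two lines above into a dictionary and adds fraction-valued copies of the contamination columns. It assumes the columns are whitespace-delimited and, as concluded earlier in this post, that the values are percentages. The function name `parse_contest_output` and the `_fraction` key suffix are made up for illustration and are not part of GATK.

```python
def parse_contest_output(text):
    """Parse a ContEst-style results table (header line + data lines) into dicts."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        # Assumption from this thread: these columns are reported in percent.
        for key in ("contamination",
                    "confidence_interval_95_low",
                    "confidence_interval_95_high"):
            row[key + "_fraction"] = float(row[key]) / 100.0
        row["sites"] = int(row["sites"])
        rows.append(row)
    return rows


example = """name population population_fit contamination confidence_interval_95_width confidence_interval_95_low confidence_interval_95_high sites
META CEU n/a 0.3 0.3 0.2 0.5 11"""

rows = parse_contest_output(example)
print(rows[0]["contamination"], "percent ->", rows[0]["contamination_fraction"], "as a fraction")
```

Read this way, the reported 0.3 corresponds to a contamination fraction of about 0.003, estimated from 11 sites.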

    And for the same sample the site statistics look like this:
    INFO 14:55:44,162 ContEst - Population informed sites: 314
    INFO 14:55:44,163 ContEst - Non homozygous variant sites: 277
    INFO 14:55:44,163 ContEst - Homozygous variant sites: 37
    INFO 14:55:44,163 ContEst - Passed coverage: 37
    INFO 14:55:44,163 ContEst - Results: 11
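
To tie the log back to the output file, a small Python sketch can pull the counts out of the INFO lines and check how they relate. The regular expression and the function name `parse_site_stats` are illustrative assumptions based only on the five lines shown here. The observation that the non-homozygous and homozygous counts sum to the "Population informed sites" count holds for these numbers, but it is not a documented guarantee.

```python
import re

# Matches lines such as "INFO 14:55:44,162 ContEst - Population informed sites: 314"
LOG_PATTERN = re.compile(r"ContEst - (?P<label>[^:]+):\s*(?P<count>\d+)\s*$")


def parse_site_stats(log_text):
    """Return {label: count} for every ContEst statistics line in the log text."""
    stats = {}
    for line in log_text.splitlines():
        match = LOG_PATTERN.search(line)
        if match:
            stats[match.group("label").strip()] = int(match.group("count"))
    return stats


log = """INFO 14:55:44,162 ContEst - Population informed sites: 314
INFO 14:55:44,163 ContEst - Non homozygous variant sites: 277
INFO 14:55:44,163 ContEst - Homozygous variant sites: 37
INFO 14:55:44,163 ContEst - Passed coverage: 37
INFO 14:55:44,163 ContEst - Results: 11"""

stats = parse_site_stats(log)
adds_up = (stats["Non homozygous variant sites"]
           + stats["Homozygous variant sites"]
           == stats["Population informed sites"])
print(stats)
print("non-hom + hom-var == population informed:", adds_up)
```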

    /Rebecka

  • rebber (Stockholm, Member)

    Hi again!

    After discussing this with the rest of my team, we have another question: what exactly does the "contamination fraction c" in the publication mean? More specifically, how is the difference between a homozygous-reference contaminant and a heterozygous contaminant handled? An example: the reference is C and the variant is A. If the true genotype of a sample is A (i.e. it's hom-var) but 10% of the reads are C, this could be 10% contamination from a hom-ref sample or 20% contamination from a heterozygous sample. What would ContEst show in these two cases (contaminant is hom-ref vs. contaminant is het)? We cannot find a definition of this in the publication or the documentation.
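
The arithmetic behind this example can be written out as a short Python sketch. It only restates the two single-contaminant scenarios in the question and says nothing about what ContEst itself reports. The function name `implied_contamination` is made up for illustration.

```python
def implied_contamination(observed_contam_read_fraction, contaminant_allele_fraction):
    """Contamination level implied by the observed fraction of contaminating reads.

    contaminant_allele_fraction is the share of the contaminant's own reads that
    carry the contaminating allele: 1.0 for a homozygous contaminant, 0.5 for a het.
    """
    if not 0.0 < contaminant_allele_fraction <= 1.0:
        raise ValueError("contaminant_allele_fraction must be in (0, 1]")
    return observed_contam_read_fraction / contaminant_allele_fraction


# Sample is hom-var (A); 10% of reads show the reference allele C.
hom_ref_case = implied_contamination(0.10, 1.0)  # contaminant is hom-ref (C/C)
het_case = implied_contamination(0.10, 0.5)      # contaminant is het (C/A)
print(hom_ref_case, het_case)
```

The same 10% of reference reads is therefore compatible with different contamination levels, depending on the contaminant's genotype at that site.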

    /Rebecka

  • shlee (Cambridge; Member, Broadie)
    edited November 2016

    Hi @rebber,

    Thanks for the info. I've asked someone in the know to look into your latest question. While they are getting back to us, here's a slide that I think is relevant. The slide illustrates how ContEst’s underlying algorithm uses a Bayesian approach to calculate the posterior probability of the contamination level and determine the maximum a posteriori probability estimate of the contamination level.
    [Image: slide illustrating ContEst's Bayesian model for the contamination level]

    To work back from the numbers to how ContEst calculates contamination, we would also need the allele depths for the 11 sites that show the reference allele in the tumor sample where the normal sample is hom-var.
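
To make the idea of a maximum a posteriori estimate concrete, here is a deliberately simplified Python sketch. It is not ContEst's implementation. It assumes a uniform prior over the contamination level c, a single fixed sequencing error rate instead of per-base qualities, and a probability c * f that a read at a hom-var site carries the contaminating allele, where f is the population frequency of that allele. The site tuples are invented numbers, not data from this thread.

```python
import math


def log_likelihood(c, sites, error_rate=0.001):
    """Log-likelihood of contamination level c under the simplified model.

    Each site is (contaminating_reads, sample_reads, f), where f is the population
    frequency of the contaminating allele at that site.
    """
    total = 0.0
    for contam_reads, sample_reads, f in sites:
        true_contam = c * f  # share of reads truly carrying the contaminating allele
        # Allow a symmetric chance of misreading either allele as the other.
        p_contam = true_contam * (1.0 - error_rate) + (1.0 - true_contam) * error_rate
        total += contam_reads * math.log(p_contam)
        total += sample_reads * math.log(1.0 - p_contam)
    return total


def map_contamination(sites, grid_size=1000):
    """Grid search over c in [0, 1]; with a uniform prior the MAP is the likelihood maximum."""
    grid = [i / grid_size for i in range(grid_size + 1)]
    return max(grid, key=lambda c: log_likelihood(c, sites))


# Hypothetical allele depths at four hom-var sites (made up for illustration).
hypothetical_sites = [(2, 98, 0.8), (1, 99, 0.5), (3, 97, 0.9), (0, 100, 0.2)]
estimate = map_contamination(hypothetical_sites)
print("MAP contamination estimate (fraction):", estimate)
```

With real allele depths for the 11 reported sites and their population frequencies, the same kind of grid search would give a rough cross-check of the reported value, subject to the simplifications above.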

  • kcibul (Cambridge, MA; Member, Broadie, Dev)

    Hi -- effectively, ContEst assumes the contaminant is drawn from a pool of contaminants in which the population frequency of the contaminating allele is f (in the above picture). While you could build a more precise model based on het/hom genotypes (e.g. f being 0.5 or 1), you would have to know that there is a single contaminating sample AND you would have to know the genotypes of that contaminating sample. If you knew that... this would be a much easier problem!

    In our experience building and running ContEst, the contaminating allele much more frequently comes from multiple samples. However, the original paper includes simulated-data results showing that the accuracy of the pooled allele frequency approach is robust even in the single-sample case.
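
One way to see why a pooled frequency f can stand in for an unknown single contaminant is the following Python sketch. It assumes the contaminant's genotype at each site follows Hardy-Weinberg proportions for allele frequency f. That assumption is added here for illustration and is not taken from the paper. Under it, the expected share of contaminant reads carrying the contaminating allele works out to f, so the expected contaminating read fraction is c * f when averaged over many sites.

```python
def expected_contaminating_allele_fraction(f):
    """Expected share of a contaminant's reads carrying the contaminating allele.

    Assumes Hardy-Weinberg genotype proportions for allele frequency f
    (an assumption made here for illustration).
    """
    genotype_probs = {
        "zero_copies": (1.0 - f) ** 2,
        "one_copy": 2.0 * f * (1.0 - f),
        "two_copies": f ** 2,
    }
    allele_share = {"zero_copies": 0.0, "one_copy": 0.5, "two_copies": 1.0}
    return sum(genotype_probs[g] * allele_share[g] for g in genotype_probs)


def expected_contaminating_read_fraction(c, f):
    """Expected fraction of reads showing the contaminating allele at a site."""
    return c * expected_contaminating_allele_fraction(f)


for f in (0.1, 0.5, 0.9):
    print(f, expected_contaminating_allele_fraction(f),
          expected_contaminating_read_fraction(0.10, f))
```

At any single site a het or homozygous contaminant contributes a share of 0.5 or 1, as noted above. The pooled f matches only in expectation across sites.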

    hth

    - kristian