Mutect2 with contamination estimates


How many sites does ContEst need to get an accurate answer?
A couple of my samples give me results like this:

name  population  population_fit  contamination  confidence_interval_95_width  confidence_interval_95_low  confidence_interval_95_high  sites
META  CEU         n/a             57.3           0.8                           56.9                        57.7                         83

57% contamination seems very high. Other samples report using around 1000 sites, and their contamination comes out around 20%. I wonder whether the high result is inaccurate because ContEst is only using 83 sites?
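
For comparing samples side by side, the summary row above can be read into a small record. This is only a sketch: it assumes the ContEst output is whitespace-delimited with exactly the header shown in this post, and `parse_contest_summary` is a hypothetical helper, not part of ContEst or GATK.

```python
# Hypothetical helper: parse one ContEst summary row into a dict.
# Assumes whitespace-delimited output with the header shown in the post.
HEADER = (
    "name population population_fit contamination "
    "confidence_interval_95_width confidence_interval_95_low "
    "confidence_interval_95_high sites"
)
ROW = "META CEU n/a 57.3 0.8 56.9 57.7 83"

FLOAT_FIELDS = (
    "contamination",
    "confidence_interval_95_width",
    "confidence_interval_95_low",
    "confidence_interval_95_high",
)


def parse_contest_summary(header, row):
    """Return a dict mapping column names to typed values."""
    fields = header.split()
    values = row.split()
    if len(fields) != len(values):
        raise ValueError("header and row have different column counts")
    record = dict(zip(fields, values))
    for key in FLOAT_FIELDS:
        record[key] = float(record[key])  # percentages, as reported
    record["sites"] = int(record["sites"])
    return record


record = parse_contest_summary(HEADER, ROW)
print(record["contamination"], record["confidence_interval_95_width"], record["sites"])
```

Collecting the `contamination`, interval width, and `sites` values per sample this way makes it easier to see whether the high estimates coincide with low site counts.
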
How does Mutect2 use the output from ContEst? I would like to run Mutect2 with and without the ContEst results, as I am concerned that very few SNPs will pass if such a high level of contamination is assumed. However, Mutect2 is running very slowly and I don't have the compute resources to run it twice. Is there any way I can filter the output of Mutect2 to take into account the contamination estimates?

Any thoughts much appreciated
Thank you



  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    Hi Frances,

    83 sites should be enough for ContEst. Is there anything special about the samples giving high contamination compared to the ones giving lower contamination?

    It is true you may get very few SNPs passing if such a high level of contamination is assumed. The contamination level is basically going to be the lower threshold for the allele frequency of mutations you can detect. I am not sure what you mean by "filter the output of Mutect2 to take into account the contamination estimates". Do you mean filter on allele frequency?
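
If filtering on allele frequency is what is meant, a post-hoc filter could look roughly like the sketch below. It is an illustration only, not how Mutect2 uses the contamination estimate internally. It assumes the allele fraction is stored in a per-sample FORMAT field named `AF`, that the tumor is the first sample column, and that every record carries a numeric `AF` value. The two records are made-up examples, not real calls, and both function names are hypothetical.

```python
# Sketch: keep VCF records whose tumor allele fraction is at or above
# the contamination estimate. Assumptions (not confirmed in this thread):
# the FORMAT field is named "AF" and the tumor is the first sample column.

def tumor_allele_fraction(vcf_line, sample_index=0, key="AF"):
    """Return the largest allele fraction reported for the chosen sample."""
    cols = vcf_line.rstrip("\n").split("\t")
    format_keys = cols[8].split(":")                 # FORMAT column
    sample_values = cols[9 + sample_index].split(":")
    value = dict(zip(format_keys, sample_values))[key]
    # Multi-allelic sites list one comma-separated value per alternate allele.
    return max(float(v) for v in value.split(","))


def filter_by_allele_fraction(lines, min_af):
    """Keep header lines and records with allele fraction >= min_af."""
    kept = []
    for line in lines:
        if line.startswith("#") or tumor_allele_fraction(line) >= min_af:
            kept.append(line)
    return kept


# Made-up records for illustration only.
records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tTUMOR",
    "1\t1000\t.\tA\tT\t.\tPASS\t.\tGT:AF\t0/1:0.35",
    "1\t2000\t.\tG\tC\t.\tPASS\t.\tGT:AF\t0/1:0.12",
]

# Using a 20% contamination estimate as the lower allele-fraction bound.
kept = filter_by_allele_fraction(records, min_af=0.20)
print(len(kept) - 1, "record(s) kept")
```

Because this runs on an existing VCF, different thresholds can be tried without re-running the caller.
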

    Have a look at these threads as well:


  • hylei, MD (Member)

    Hi, Frances and Sheila:

    We have tumor-only exome-seq data, without a normal control. We know our tumor sample has normal contamination, so we first want to use ContEst to estimate the contamination percentage. I use the following command line for ContEst:

    java -Xmx2g -jar $CONTESTJARPATH/ContEst.jar \
    -I tumor.bam \
    -R human_g1k_v37.fasta \
    -B:pop,vcf hg19_population_stratified_af_hapmap_3.3.vcf \
    -T Contamination -B:genotypes,vcf hg00142.vcf \
    -BTI genotypes -o tumor.txt

    I don't know whether the above VCF files are right for my sample. Can you please suggest which files I should put in the command line?

    Also, can you please give me some suggestions on how to use the ContEst results with Mutect2? If I use the ContEst result during the Mutect2 run, will the output be the same as without the ContEst file? Because we know our tumor sample has around 20% normal contamination, is there any way to normalize the tumor data and lift the VAF in the output, or any way to correct for the normal contamination? We currently use the Mutect2 tumor-only mode for variant calling. Thanks very much for your suggestions.
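
As a rough illustration of what "lifting the VAF" could mean arithmetically, the sketch below rescales an observed VAF by tumor purity. It is a first-order approximation under strong assumptions: "normal contamination" is taken to mean admixture of the patient's own normal cells, normal cells carry no copies of the variant, and copy number is the same in tumor and normal cells. `purity_adjusted_vaf` is a hypothetical helper, not a GATK or Mutect2 feature.

```python
# Sketch: rescale an observed variant allele fraction (VAF) for normal-cell
# admixture. Assumes equal copy number in tumor and normal cells and that
# normal cells contribute only reference reads at the site.

def purity_adjusted_vaf(observed_vaf, normal_fraction):
    """Estimate the VAF expected in a pure tumor sample."""
    if not 0.0 <= observed_vaf <= 1.0:
        raise ValueError("observed_vaf must be between 0 and 1")
    if not 0.0 <= normal_fraction < 1.0:
        raise ValueError("normal_fraction must be in [0, 1)")
    purity = 1.0 - normal_fraction
    # Cap at 1.0 because noise can push the ratio above the valid range.
    return min(observed_vaf / purity, 1.0)


# With about 20% normal cells, an observed VAF of 0.20 rescales to 0.20 / 0.80.
adjusted = purity_adjusted_vaf(0.20, 0.20)
print(round(adjusted, 3))
```

This only rescales the reported numbers after calling. It does not change which variants the caller detects.
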



  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)
    edited May 2017

    Hi Haiyan,

    I think this thread will answer your questions.


    EDIT: For the command, have a look at the tool doc for tips.
