Mutect2 with contamination estimates

Hi GATK

How many sites does ContEst need to get an accurate answer?
A couple of my samples give me results like this:

name population population_fit contamination confidence_interval_95_width confidence_interval_95_low confidence_interval_95_high sites
META CEU n/a 57.3 0.8 56.9 57.7 83
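For reference, a minimal sketch of how a row like the one above could be read programmatically. The header and values are copied from the post. The function names are hypothetical, and the check only tests whether the row is internally consistent (reported width equals high minus low, and the estimate lies inside the interval). It does not say whether 83 sites are enough.

```python
# Header and data row copied verbatim from the ContEst output in the post.
HEADER = ("name population population_fit contamination "
          "confidence_interval_95_width confidence_interval_95_low "
          "confidence_interval_95_high sites")
ROW = "META CEU n/a 57.3 0.8 56.9 57.7 83"

# Columns that hold percentages and should be parsed as floats.
FLOAT_FIELDS = (
    "contamination",
    "confidence_interval_95_width",
    "confidence_interval_95_low",
    "confidence_interval_95_high",
)


def parse_contest_row(header, row):
    """Map whitespace-separated column names to values (hypothetical helper)."""
    record = dict(zip(header.split(), row.split()))
    for key in FLOAT_FIELDS:
        record[key] = float(record[key])
    record["sites"] = int(record["sites"])
    return record


def interval_is_consistent(record, tol=0.05):
    """Check that width == high - low (within tol) and the estimate is inside."""
    low = record["confidence_interval_95_low"]
    high = record["confidence_interval_95_high"]
    width_ok = abs((high - low) - record["confidence_interval_95_width"]) <= tol
    inside = low <= record["contamination"] <= high
    return width_ok and inside


record = parse_contest_row(HEADER, ROW)
print(record["contamination"], record["sites"], interval_is_consistent(record))
```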

57% contamination seems very high. Other samples report using around 1000 sites and the contamination comes out around 20%. I wonder if the high result is inaccurate as ContEst is only using 83 sites?
How does Mutect2 use the output from ContEst? I would like to run Mutect2 with and without the ContEst results, as I am concerned I will get very few SNPs passing if such a high level of contamination is assumed. However, Mutect2 is running very slowly and I don't have the compute resources to run it twice. Is there any way I can filter the output of Mutect2 to take into account the contamination estimates?

Any thoughts much appreciated
Thank you
Frances

Answers

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @fturner
    Hi Frances,

    83 sites should be enough for ContEst. Is there anything special about the samples giving high contamination compared to the ones giving lower contamination?

    It is true that you may get very few SNPs passing if such a high level of contamination is assumed. The contamination level is basically going to be the lower threshold for the allele frequency of mutations you can detect. I am not sure what you mean by "filter the output of Mutect2 to take into account the contamination estimates". Do you mean filtering on allele frequency?
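If filtering on allele frequency is what is meant, a rough post-hoc sketch could look like the following. This is not how Mutect2 itself uses the contamination estimate. It assumes that the VCF carries a per-sample `AF` value in the FORMAT column, that the tumor is the first sample column, and that the ContEst percentage has been converted to a fraction (for example 20% becomes 0.20). The function names and the example records are hypothetical.

```python
def tumor_af(vcf_line, sample_index=0):
    """Return the AF values for one sample on a VCF data line, or None.

    Assumes a FORMAT key named AF exists (an assumption about the input VCF).
    """
    cols = vcf_line.rstrip("\n").split("\t")
    if len(cols) < 10 + sample_index:
        return None
    fmt_keys = cols[8].split(":")
    sample_vals = cols[9 + sample_index].split(":")
    if "AF" not in fmt_keys:
        return None
    idx = fmt_keys.index("AF")
    if idx >= len(sample_vals):
        return None
    # Multi-allelic sites may list several comma-separated fractions.
    return [float(x) for x in sample_vals[idx].split(",") if x != "."]


def filter_by_contamination(lines, contamination_fraction, sample_index=0):
    """Keep header lines and records whose highest AF exceeds the threshold."""
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        afs = tumor_af(line, sample_index)
        if afs and max(afs) > contamination_fraction:
            kept.append(line)
    return kept


# Hypothetical records for illustration only; not real Mutect2 output.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tTUMOR",
    "1\t1000\t.\tA\tT\t.\tPASS\t.\tGT:AD:AF\t0/1:60,40:0.40",
    "1\t2000\t.\tG\tC\t.\tPASS\t.\tGT:AD:AF\t0/1:90,10:0.10",
]
kept = filter_by_contamination(example, contamination_fraction=0.20)
print(len(kept))
```

With a threshold of 0.20, the record at position 2000 (AF 0.10) is dropped and the one at position 1000 (AF 0.40) is kept, which illustrates why a 57% estimate would leave very few calls.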

    Have a look at these threads as well:
    http://gatkforums.broadinstitute.org/gatk/discussion/8071/contest-what-if-i-do-not-have-a-genotype-array-of-my-nomal-samples/p1

    http://gatkforums.broadinstitute.org/gatk/discussion/9345/empty-contest-output/p1

    http://gatkforums.broadinstitute.org/gatk/discussion/8588/how-to-interpret-output-of-contest/p1

    -Sheila

  • hylei, MD (Member)

    Hi, Frances and Sheila:

    We have tumor-only exome-seq data, without a normal control. We know our tumor sample has normal contamination, so we first want to use ContEst to estimate the contamination percentage. I use the following command line for ContEst:

    java -Xmx2g -jar $CONTESTJARPATH/ContEst.jar \
    -I tumor.bam \
    -R human_g1k_v37.fasta \
    -B:pop,vcf hg19_population_stratified_af_hapmap_3.3.vcf \
    -T Contamination -B:genotypes,vcf hg00142.vcf \
    -BTI genotypes -o tumor.txt

    I don't know whether the above VCF files are right for my sample. Can you please suggest which files I should put in the command line?

    Can you also please suggest how I should use the ContEst results with Mutect2? If I use the ContEst result during the Mutect2 run, will the output be the same as without the ContEst file? Because we know our tumor sample has around 20% normal contamination, is there any way to normalize the tumor data and lift the VAF in the output, or any way to correct for the normal contamination? We currently use the Mutect2 tumor-only mode for variant calling. Thanks very much for your suggestions.

    best

    haiyan

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)
    edited May 2017

    @hylei
    Hi Haiyan,

    I think this thread will answer your questions.

    -Sheila

    EDIT: For the command, have a look at the tool doc for tips.
