ContEst with GATK4

We would like to estimate intra-individual contamination in our data with ContEst. I've been trying all day to use ContEst with GATK4, but it looks like ContEst is no longer in GATK4, and I can't find it in the list of tools in the "Diagnostics and Quality Control" section.

I tried many command lines with the example dataset, but none of them worked.

For this command line I get an error:

java -Xmx2g -jar \
/Users/tools/gatk- \
-T ContEst.jar \
-I ContEst_example_data/chr20_sites.bam \
-R human_g1k_v37.fasta \
-B:pop,vcf hg18_population_stratified_af_hapmap_3.3.vcf \
-T Contamination \
-B:genotypes,vcf ContEst_example_data/hg00142.vcf \
-BTI genotypes \
-o contamination_results_chr20.txt

"A USER ERROR has occurred: '-T' is not a valid command."

I also tried the command line from this page:

java -jar ContEst.jar -T Contamination -h

but I get errors:
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.(
at org.broadinstitute.sting.gatk.CommandLineExecutable.(
at org.broadinstitute.sting.gatk.CommandLineGATK.(
at org.broadinstitute.sting.gatk.CommandLineGATK.main(
Caused by: java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: could not create class file from P11KeyAgreement$AllowKDF.class
at org.reflections.Reflections.scan(
at org.reflections.Reflections.(
at org.broadinstitute.sting.utils.classloader.PluginManager.(
... 4 more

I guess ContEst.jar is looking for GATK...

I also tried this, with the same error:

java -Xmx2g -jar ContEst.jar \
-I ContEst_example_data/chr20_sites.bam \
-R human_g1k_v37.fasta \
-B:pop,vcf hg19_population_stratified_af_hapmap_3.3.vcf \
-T Contamination \
-B:genotypes,vcf hg00142.vcf \
-BTI genotypes \
-o contamination_results_chr20.txt

From this page:
I've seen this command line:

java -jar GenomeAnalysisTK.jar \
-T ContEst \
-R hs37d.fa \
-I tumor.bam \
--genotypes Panel_of_normal.vcf \
--popfile hg19_population_stratified_af_hapmap_3.3.FIX.vcf.gz \
-L target.bed \
-isr INTERSECTION -o contamination_out.txt

where GenomeAnalysisTK.jar is used and ContEst is passed as an option, which is different...
But I've been unable to find any GenomeAnalysisTK.jar, only gatk-package-. Are these two the same?

Anyway, it still doesn't work...

So if you could help me by giving me the correct command-line syntax, that would be great.

  • Sheila (Broad Institute) Member, Broadie, Moderator, admin
    edited April 6


    ContEst is no longer the way to calculate cross-sample contamination in tumor samples. Two new tools, GetPileupSummaries and CalculateContamination, replace it. Have a look at the hands-on Mutect2 tutorial we present at workshops for more information; it is in the Presentations section. Also, you may find this tutorial helpful.
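
For readers landing here, the two-step replacement described above can be sketched roughly as follows. This is a minimal dry-run sketch, not a command taken from this thread: the file names (sample.bam, common_biallelic_sites.vcf.gz, the output tables) and the `run` helper are placeholders, and the exact arguments may differ between GATK4 releases, so check `gatk GetPileupSummaries --help` for your version.

```shell
#!/bin/sh
# Dry-run sketch of the GetPileupSummaries -> CalculateContamination workflow.
# All file names are hypothetical placeholders.
set -eu

BAM="sample.bam"                        # assumed input BAM
SITES="common_biallelic_sites.vcf.gz"   # assumed population VCF with allele frequencies
PILEUPS="sample.pileups.table"
CONTAM="sample.contamination.table"

# Print the command by default; set RUN=1 to execute it (needs GATK4 on PATH).
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "$*"
  fi
}

# Step 1: summarise read support at known common variant sites.
run gatk GetPileupSummaries -I "$BAM" -V "$SITES" -L "$SITES" -O "$PILEUPS"

# Step 2: estimate the contamination fraction from the pileup table.
run gatk CalculateContamination -I "$PILEUPS" -O "$CONTAM"
```

Running the script as-is only prints the two commands; `RUN=1 sh script.sh` would execute them.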


  • dprat Member

    @Sheila Thanks for your answer!

    After hours of research, I finally managed to make it work with GATK3 :smiley:
    However, it doesn't work on my data, and I only get this in my output file:

    Warning: We're throwing out lane META since it has fewer than 500 read bases at genotyped positions
    name population population_fit contamination confidence_interval_95_width confidence_interval_95_low confidence_interval_95_high sites

    I guess I don't have enough depth in my data...
    I haven't been able to find a solution for this so far. Do you have any ideas?

    So I tried the 2 new tools, GetPileupSummaries and CalculateContamination, but I'm not sure which VCF I'm supposed to use: the one from gnomAD, or my own VCF file?

    My data are low-depth WGS, around 1-2X. Should I download all the VCFs separately from here:
    and then concatenate them with "cat"?
    Or is there a single file for the entire genome?

    Thank you

  • Sheila (Broad Institute) Member, Broadie, Moderator, admin


    1-2X coverage is very low. I think ContEst does need to see more coverage than that.

    I also think the new workflow tools do a better job on low coverage. Have a look at this article for more information on which VCF to use.

    Let us know how things go :smile:
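
On the earlier question about joining per-chromosome VCFs with "cat": plain concatenation repeats each file's header in the middle of the output, which VCF parsers generally reject. Below is a small illustration using two made-up one-record VCFs. `concat_vcfs` is a hypothetical helper written for this example; real, compressed gnomAD files would normally be merged with a dedicated tool such as `bcftools concat` rather than by hand.

```shell
#!/bin/sh
# Header-aware concatenation of per-chromosome VCFs (illustration only).
# Assumes uncompressed files with compatible headers, given in contig order.
set -eu

concat_vcfs() {
  first=1
  for f in "$@"; do
    if [ "$first" = "1" ]; then
      cat "$f"                      # keep header and records of the first file
      first=0
    else
      grep -v '^#' "$f" || true     # later files: records only, drop header lines
    fi
  done
}

# Build two tiny synthetic VCFs in a temporary directory.
tmp=$(mktemp -d)
header='##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
printf "${header}1\t100\t.\tA\tG\t.\tPASS\tAF=0.1\n" > "$tmp/chr1.vcf"
printf "${header}2\t200\t.\tC\tT\t.\tPASS\tAF=0.2\n" > "$tmp/chr2.vcf"

concat_vcfs "$tmp/chr1.vcf" "$tmp/chr2.vcf" > "$tmp/merged.vcf"
cat "$tmp/merged.vcf"
```

The merged file keeps a single header followed by the records of both inputs, whereas `cat chr1.vcf chr2.vcf` would leave a second header between them.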


  • dprat Member


    So I tested the 2 new tools, GetPileupSummaries and CalculateContamination.
    I used the VCF files from gnomAD; for now I tested chr1, 2 and 3 separately,
    and I got 0% contamination in all 3 cases.

    I also tried concatenating chr1, 2 and 3 into a single file, but I still got 0%.

    I sorted my BAM by chromosome, but I still get 0%.

    Should I try all the chromosomes?

    Or maybe the sensitivity for my data is too high?

    I don't know what to do to solve the problem...

  • Sheila (Broad Institute) Member, Broadie, Moderator, admin


    Can you check on the entire BAM file?


  • dprat Member

    Hi @Sheila, like I said here, I'm working on hg19 and not hg38, so I have to modify all the gnomAD VCFs to make it work, and I don't have enough disk space right now to do it.
    I'm working on finding a way to download and uncompress all of them...

    I'm actually trying to make this work on my data, which are NIPT data. We are working on a way to calculate the amount of foetal DNA circulating in the mother's blood. Our idea was to transpose your program, which is able to detect "intra-individual" contamination, to our case, where the foetal DNA would correspond to a contamination.

    For now, though, it's not working.

  • Sheila (Broad Institute) Member, Broadie, Moderator, admin


    I'm working on hg19 and not hg38, so I have to modify all the gnomAD VCFs to make it work, and I don't have enough disk space right now to do it.

    I am not sure if it will help, but can you try running per chromosome? Or do you have a cluster or cloud access that you can run on? Also, FireCloud is giving free credits if you sign up :smiley:
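
The per-chromosome suggestion could look something like the dry-run sketch below. The file names, the contig names (1, 2, 3 as in b37-style references; UCSC hg19 uses chr1, chr2, ...) and the `pileup_cmd` helper are assumptions for illustration, and the interval arguments should be checked against your GATK4 version.

```shell
#!/bin/sh
# Dry-run sketch: print one GetPileupSummaries command per chromosome.
# File and contig names are hypothetical placeholders.
set -eu

BAM="sample.bam"                            # assumed input BAM
SITES="common_biallelic_sites.b37.vcf.gz"   # assumed population VCF

pileup_cmd() {
  # $1 = contig name; restrict the known sites to that contig only.
  echo "gatk GetPileupSummaries -I $BAM -V $SITES -L $SITES -L $1 --interval-set-rule INTERSECTION -O pileups.$1.table"
}

for chrom in 1 2 3; do
  pileup_cmd "$chrom"
done
```

Each printed command writes its own pileup table, so only one chromosome's worth of sites needs to be processed at a time.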


