Can I use ContEst on a tumor-only sample?

hyleihylei MDMember

Hi,

I want to use ContEst to estimate contamination in my exome-seq tumor sample. We do not have a normal control. Can I use these two VCF files in the command line? Thanks.
-B:pop,vcf example/hg19_population_stratified_af_hapmap_3.3.vcf \
-T Contamination -B:genotypes,vcf example/hg00142.vcf \

best

HY

Answers

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @hylei,

    I think our MuTect2 hands-on tutorial will be helpful to you. You can find it here. The sample you are trying to measure contamination in must be in SAM/BAM format.

  • hyleihylei MDMember

    Hi, shlee:

    Thanks for the suggestion. I read the tutorial, but I am still not sure whether my script is right. Could you please help me check it?

    java -Xmx2g -jar $CONTESTJARPATH/ContEst.jar \
    -I tumor.bam \
    -R human_g1k_v37.fasta \
    -B:pop,vcf hg19_population_stratified_af_hapmap_3.3.vcf \
    -T Contamination -B:genotypes,vcf hg00142.vcf \
    -BTI genotypes -o tumor.txt

    I am not sure whether it is right to use the above two VCF files for my data. I ran it on my tumor.bam file and got this result:

    name  population  population_fit  contamination  confidence_interval_95_width  confidence_interval_95_low  confidence_interval_95_high  sites
    META  CEU         n/a             100            0.1                           99.9                        100                          498
    META  CEU         n/a             87.5           1.6                           86.7                        88.3                         584

    I have several tumor samples, and all of them show very high contamination. I think it is not right for me to use the above VCF files.

    We have tumor exome-seq data with no normal control. We know some of our tumor samples have normal contamination. Could you please suggest how to correct for the normal contamination in a tumor sample, and is there a way to adjust the VAF in the MuTect2 output? Will the MuTect2 output differ with or without the ContEst result? Thanks very much for your suggestions.

    best

    Haiyan
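
    When several samples are screened, a small filter over the ContEst result tables can flag the ones that need a closer look. This is a minimal sketch, assuming the whitespace-delimited layout shown above with the contamination percentage in column 4. The function name and the threshold value are illustrative assumptions, not ContEst recommendations:

    ```shell
    # flag_high_contamination THRESHOLD FILE...
    # Prints file name, sample name, population and contamination for every
    # data row whose contamination (column 4, in percent) exceeds THRESHOLD.
    # The header line of each file is skipped.
    # Example: flag_high_contamination 5 contamination_*.txt
    flag_high_contamination() {
        threshold=$1
        shift
        awk -v max="$threshold" \
            'FNR > 1 && $4 + 0 > max + 0 { print FILENAME, $1, $2, $4 }' "$@"
    }
    ```

    A table like the one above, with values near 100, would be reported in full, which is a hint to check the inputs rather than the samples.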

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited May 2017

    @hylei, are you using a jar from the CGA or the GATK jar? I'm only familiar with the GATK conventions.

    Given your metrics, I suspect a sample swap. Are you certain your VCF is from the same individual as your BAM? You can check this using Picard's Fingerprinting tools, e.g. CrosscheckReadGroupFingerprints.

    Fingerprinting Tools:                            Tools for manipulating fingerprints, or related data.
        CheckFingerprint                             Computes a fingerprint from the supplied input (SAM/BAM or VCF) file and compares it to the provided genotypes
        CrosscheckReadGroupFingerprints              Checks if all read groups appear to come from the same individual
    

    For a tumor sample, be sure to consider the LOSS_OF_HET_RATE option (available with CrosscheckReadGroupFingerprints). I explain the haplotype map format here.
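
    To compare a genotype VCF against a BAM, Picard's CheckFingerprint takes both files plus a haplotype map. The following is a minimal sketch that only prints the command for review and executes nothing. The jar location and all file names are placeholders, and paths are assumed to contain no spaces:

    ```shell
    # check_fingerprint_cmd BAM GENOTYPES_VCF HAPLOTYPE_MAP OUTPUT_PREFIX
    # Prints a Picard CheckFingerprint command line; nothing is run.
    # PICARD_JAR defaults to picard.jar in the current directory (assumption).
    # Example: check_fingerprint_cmd tumor.bam genotypes.vcf haplotype_map.txt tumor_fp
    check_fingerprint_cmd() {
        printf 'java -jar %s CheckFingerprint INPUT=%s GENOTYPES=%s HAPLOTYPE_MAP=%s OUTPUT=%s\n' \
            "${PICARD_JAR:-picard.jar}" "$1" "$2" "$3" "$4"
    }
    ```

    Once the printed line looks right, it can be pasted into the shell or piped to sh.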

  • hyleihylei MDMember

    Hi, shlee:

    The VCF files are not from the same individual as my tumor.bam. Can you please tell me how I can get these two required VCF files for my tumor sample? I have the tumor BAM file and the VCF file called by MuTect2. Thanks.

    best

    Haiyan

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    @hylei, ideally you have a VCF from an array or other independent approach for the same sample. Otherwise, you can genotype the tumor BAM on the fly within your ContEst command using -I:genotype or, alternatively, produce a genotype file using another caller, e.g. HaplotypeCaller.
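
    The on-the-fly option can be sketched as follows for the GATK v3 jar, passing the tumor BAM both as the sample to evaluate (-I:eval) and as the genotype source (-I:genotype). The sketch only prints the command for review and makes several assumptions: the jar name, the file names, paths without spaces, and the omission of any interval arguments.

    ```shell
    # contest_cmd TUMOR_BAM REFERENCE POPFILE OUTPUT
    # Prints a GATK3 ContEst command line; nothing is run.
    # GATK_JAR defaults to GenomeAnalysisTK.jar in the current directory (assumption).
    # Example: contest_cmd tumor.bam human_g1k_v37.fasta \
    #              hg19_population_stratified_af_hapmap_3.3.vcf tumor.txt
    contest_cmd() {
        printf 'java -Xmx2g -jar %s -T ContEst -R %s -I:eval %s -I:genotype %s --popfile %s -o %s\n' \
            "${GATK_JAR:-GenomeAnalysisTK.jar}" "$2" "$1" "$1" "$3" "$4"
    }
    ```

    Printing first makes it easy to confirm that the same BAM appears under both tags before anything is run.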

  • hyleihylei MDMember

    Hi, shlee:

    Sorry, I am still not clear about which VCF files to use.
    1. According to https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_cancer_contamination_ContEst.php, --popfile is the only required parameter. From http://archive.broadinstitute.org/cancer/cga/contest_prepare, my understanding is that hg19_population_stratified_af_hapmap_3.3.vcf can be used for all human samples. Does every tumor sample need its own popfile, or do they all share the one hg19_population_stratified_af_hapmap_3.3.vcf file? Or should I put the VCF files from HaplotypeCaller here?

    2. I think I can skip the -genotype parameter, right? Sorry to keep asking you so many questions. Thanks very much.

    best

    Haiyan

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    No worries @hylei. I think you'll find this thread helpful. In the second half of the thread, we discuss the population allele frequency VCF, which you can use for every sample whose contamination you are measuring. In addition to this stratified AF file, you will need to provide ContEst with a VCF or BAM file that tells it what the expected genotypes are. The thread contains a number of example commands you can reference. The forum also discusses ContEst quite a bit; search using the upper-right search bar.

    Going forward, in GATK4, we'll be offering a different tool for contamination estimation.

  • hyleihylei MDMember

    Hi, shlee:

    Thanks very much for the suggestion. The thread is really helpful. I think it is right to use hg19_population_stratified_af_hapmap_3.3.vcf as the --popfile. I changed my script to this:
    java -Xmx2g -jar $CONTESTJARPATH/ContEst.jar \
    -I H0097_Post_22.recal.bam \
    -R $dir/human_g1k_v37.fasta \
    -B:pop,vcf $dir/hg19_population_stratified_af_hapmap_3.3.vcf \
    -T Contamination -B:genotypes,vcf $dir1/H0097_Post_22.sample.vcf \
    -BTI genotypes -o contamination_H0097_Post_22_1.txt
    -I is my tumor BAM file, and the -B:genotypes,vcf file is the VCF called by HaplotypeCaller. But I got this error message: "ERROR MESSAGE: Your input file has a malformed header: VCFv4.2 is not a supported version". I checked, and the hg19_population_stratified_af_hapmap_3.3.vcf file is VCFv4.0, so it seems ContEst accepts version 4.0. Do I need to convert my VCF file to version 4.0? Thanks.

    Haiyan
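
    The error refers to the version declared on the first header line of the VCF. A quick way to see what each input declares is sketched below, assuming uncompressed VCF files; the function name is illustrative:

    ```shell
    # vcf_version FILE
    # Prints the version string from the ##fileformat line, which is the
    # first line of a VCF file. Prints nothing if that line is missing.
    # Example: vcf_version H0097_Post_22.sample.vcf
    vcf_version() {
        sed -n '1s/^##fileformat=//p' "$1"
    }
    ```

    Running it on both the genotype VCF and the popfile shows at a glance whether the two inputs declare different versions.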

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @hylei,

    Please be sure to use the ContEst tool within the GATK v3 program jar and not the CGA release. FYI, in GATK v4, which we hope to release next month, the contamination estimation tool is called CalculateContamination. When we release, this new tool will be in beta status. That is, we would love for folks to test it out to help us fine-tune its parameters. You can get a sneak peek of the tool right now by downloading the alpha release jar here.
