Forum Login Issue:
Currently the "Log in with Google" button redirects you to a "Page not found." This is an issue that our forum vendors are working on fixing. In the meantime, while on the "Page not found" you can edit the URL to delete the second gatk, firecloud, or wdl (depending on what subforum you are acessing).
ex: https://gatkforums.broadinstitute.org/gatk/gatk/entry/...

Which one should I use for ContEst --popFile?

Haiying7Haiying7 Heidelberg, GermanyMember

Answers

  • Haiying7Haiying7 Heidelberg, GermanyMember

    I tried different files, but none of them are working.

    (1) 1000G_phase3_v4_20130502.sites.vcf
    kong@cloud-hpc-06-kong:~/Haiying/Projects/PrimaryMelanoma/ILSE1669_A22/temp> java -Xmx7G -Djava.io.tmpdir=tmp -jar ${GATK} -T ContEst \

    -I:genotype ${GATK_BQSR_dir}/${Blood}.recal.bam -I:eval ${GATK_BQSR_dir}/${Tumor}.recal.bam \
    -o ${ContEst_dir}/${Tumor}_${Blood}.vcf \
    -R ${GATK_hg} --popfile ${popFile} -L ${!Intervals} \
    -isr INTERSECTION

    INFO 13:24:29,436 HelpFormatter - --------------------------------------------------------------------------------
    INFO 13:24:29,439 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18
    INFO 13:24:29,440 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
    INFO 13:24:29,440 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
    INFO 13:24:29,440 HelpFormatter - [Wed Jan 25 13:24:29 CET 2017] Executing on Linux 3.11.10-29-default amd64
    INFO 13:24:29,440 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_40-b26
    INFO 13:24:29,445 HelpFormatter - Program Args: -T ContEst -I:genotype /home/kong/Haiying/Projects/PrimaryMelanoma/ILSE1669_A22/Lock/GATK_BaseQualityScoreRecalibration/B100327.recal.bam -I:eval /home/kong/Haiying/Projects/PrimaryMelanoma/ILSE1669_A22/Lock/GATK_BaseQualityScoreRecalibration/T3398.recal.bam -o /home/kong/Haiying/Projects/PrimaryMelanoma/AllBatches/QC/ContEst/BySamples/T3398_B100327.vcf -R /home/kong/Reference/GATK/Genome/human_g1k_v37.fasta --popfile /home/kong/Reference/GATK/SNP/1000G_phase3_v4_20130502.sites.vcf -L /home/kong/Reference/Agilent/SureSelectHumanAllExon_V5UTRs/S04380219_Covered_nochr.bed -isr INTERSECTION
    INFO 13:24:29,457 HelpFormatter - Executing as kong@cloud-hpc-06-kong on Linux 3.11.10-29-default amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_40-b26.
    INFO 13:24:29,458 HelpFormatter - Date/Time: 2017/01/25 13:24:29
    INFO 13:24:29,458 HelpFormatter - --------------------------------------------------------------------------------
    INFO 13:24:29,458 HelpFormatter - --------------------------------------------------------------------------------
    INFO 13:24:29,470 GenomeAnalysisEngine - Strictness is SILENT
    INFO 13:24:29,597 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO 13:24:29,605 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 13:24:29,666 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.06
    INFO 13:24:36,243 IntervalUtils - Processing 74569526 bp from intervals
    INFO 13:24:36,380 GenomeAnalysisEngine - Preparing for traversal over 2 BAM files
    INFO 13:24:36,679 GenomeAnalysisEngine - Done preparing for traversal
    INFO 13:24:36,680 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO 13:24:36,680 ProgressMeter - | processed | time | per 1M | | total | remaining
    INFO 13:24:36,681 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
    INFO 13:24:36,681 ContEst - Running in sequencing mode

    ERROR --
    ERROR stack trace

    java.lang.RuntimeException: No population frequency annotation for CEU in [VC popfile @ 1:668630-850204 Q100.00 of type=SYMBOLIC alleles=[G*, ] attr={AC=64, AF=0.0127796, AN=5008, CIEND=[-150, 150], CIPOS=[-150, 150], CS=DUP_delly, END=850204, IMPRECISE=true, NS=2504, SVLEN=181574, SVTYPE=DUP} GT=[]
    at org.broadinstitute.gatk.tools.walkers.cancer.contamination.ContEst.calcStats(ContEst.java:626)
    at org.broadinstitute.gatk.tools.walkers.cancer.contamination.ContEst.map(ContEst.java:401)
    at org.broadinstitute.gatk.tools.walkers.cancer.contamination.ContEst.map(ContEst.java:127)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:98)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:316)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
    at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)

    ERROR ------------------------------------------------------------------------------------------
    ERROR A GATK RUNTIME ERROR has occurred (version 3.7-0-gcfedb67):
    ERROR
    ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
    ERROR If not, please post the error message, with stack trace, to the GATK forum.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions https://software.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: No population frequency annotation for CEU in [VC popfile @ 1:668630-850204 Q100.00 of type=SYMBOLIC alleles=[G*, ] attr={AC=64, AF=0.0127796, AN=5008, CIEND=[-150, 150], CIPOS=[-150, 150], CS=DUP_delly, END=850204, IMPRECISE=true, NS=2504, SVLEN=181574, SVTYPE=DUP} GT=[]
    ERROR ------------------------------------------------------------------------------------------

    (2) hg19_population_stratified_af_hapmap_3.3.vcf
    kong@cloud-hpc-06-kong:~/Haiying/Projects/PrimaryMelanoma/ILSE1669_A22/temp> java -Xmx7G -Djava.io.tmpdir=tmp -jar ${GATK} -T ContEst \

    -I:genotype ${GATK_BQSR_dir}/${Blood}.recal.bam -I:eval ${GATK_BQSR_dir}/${Tumor}.recal.bam \
    -o ${ContEst_dir}/${Tumor}_${Blood}.vcf \
    -R ${GATK_hg} --popfile ${popFile} -L ${!Intervals} \
    -isr INTERSECTION

    INFO 13:26:07,075 HelpFormatter - --------------------------------------------------------------------------------
    INFO 13:26:07,078 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18
    INFO 13:26:07,078 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
    INFO 13:26:07,079 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
    INFO 13:26:07,079 HelpFormatter - [Wed Jan 25 13:26:07 CET 2017] Executing on Linux 3.11.10-29-default amd64
    INFO 13:26:07,079 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_40-b26
    INFO 13:26:07,084 HelpFormatter - Program Args: -T ContEst -I:genotype /home/kong/Haiying/Projects/PrimaryMelanoma/ILSE1669_A22/Lock/GATK_BaseQualityScoreRecalibration/B100327.recal.bam -I:eval /home/kong/Haiying/Projects/PrimaryMelanoma/ILSE1669_A22/Lock/GATK_BaseQualityScoreRecalibration/T3398.recal.bam -o /home/kong/Haiying/Projects/PrimaryMelanoma/AllBatches/QC/ContEst/BySamples/T3398_B100327.vcf -R /home/kong/Reference/GATK/Genome/human_g1k_v37.fasta --popfile /home/kong/Reference/ContEst/hg19_population_stratified_af_hapmap_3.3.vcf -L /home/kong/Reference/Agilent/SureSelectHumanAllExon_V5UTRs/S04380219_Covered_nochr.bed -isr INTERSECTION
    INFO 13:26:07,099 HelpFormatter - Executing as kong@cloud-hpc-06-kong on Linux 3.11.10-29-default amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_40-b26.
    INFO 13:26:07,100 HelpFormatter - Date/Time: 2017/01/25 13:26:07
    INFO 13:26:07,100 HelpFormatter - --------------------------------------------------------------------------------
    INFO 13:26:07,101 HelpFormatter - --------------------------------------------------------------------------------
    INFO 13:26:07,109 GenomeAnalysisEngine - Strictness is SILENT
    INFO 13:26:07,255 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO 13:26:07,265 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 13:26:07,326 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.06

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 3.7-0-gcfedb67):
    ERROR
    ERROR This means that one or more arguments or inputs in your command are incorrect.
    ERROR The error message below tells you what is the problem.
    ERROR
    ERROR If the problem is an invalid argument, please check the online documentation guide
    ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ERROR
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions https://software.broadinstitute.org/gatk
    ERROR
    ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ERROR
    ERROR MESSAGE: The provided VCF file is malformed at approximately line number 4: The VCF specification does not allow for whitespace in the INFO field. Offending field value was "AC=66;AF=0.02369;ALL={C=0.97629, T=0.02371};AN=2786;ASW={C=1.00000, T=0.00000};CEU={C=1.00000, T=0.00000};CHB={C=1.00000, T=0.00000};CHD={C=1.00000, T=0.00000};CHS={C=0.00000, T=0.00000};CLM={C=0.00000, T=0.00000};FIN={C=0.00000, T=0.00000};GBR={C=0.00000, T=0.00000};GIH={C=1.00000, T=0.00000};IBS={C=0.00000, T=0.00000};JPT={C=1.00000, T=0.00000};LWK={C=1.00000, T=0.00000};MKK={C=0.82044, T=0.17956};MXL={C=1.00000, T=0.00000};PUR={C=0.00000, T=0.00000};TSI={C=1.00000, T=0.00000};YRI={C=0.99752, T=0.00248};set=MKK-YRI GT"
    ERROR ------------------------------------------------------------------------------------------

    (3) hapmap_3.3.b37.vcf
    kong@cloud-hpc-06-kong:~/Haiying/Projects/PrimaryMelanoma/ILSE1669_A22/temp> java -Xmx7G -Djava.io.tmpdir=tmp -jar ${GATK} -T ContEst \

    -I:genotype ${GATK_BQSR_dir}/${Blood}.recal.bam -I:eval ${GATK_BQSR_dir}/${Tumor}.recal.bam \
    -o ${ContEst_dir}/${Tumor}_${Blood}.vcf \
    -R ${GATK_hg} --popfile ${popFile} -L ${!Intervals} \
    -isr INTERSECTION

    INFO 13:27:03,752 HelpFormatter - --------------------------------------------------------------------------------
    INFO 13:27:03,755 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18
    INFO 13:27:03,755 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
    INFO 13:27:03,756 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk
    INFO 13:27:03,756 HelpFormatter - [Wed Jan 25 13:27:03 CET 2017] Executing on Linux 3.11.10-29-default amd64
    INFO 13:27:03,756 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_40-b26
    INFO 13:27:03,761 HelpFormatter - Program Args: -T ContEst -I:genotype /home/kong/Haiying/Projects/PrimaryMelanoma/ILSE1669_A22/Lock/GATK_BaseQualityScoreRecalibration/B100327.recal.bam -I:eval /home/kong/Haiying/Projects/PrimaryMelanoma/ILSE1669_A22/Lock/GATK_BaseQualityScoreRecalibration/T3398.recal.bam -o /home/kong/Haiying/Projects/PrimaryMelanoma/AllBatches/QC/ContEst/BySamples/T3398_B100327.vcf -R /home/kong/Reference/GATK/Genome/human_g1k_v37.fasta --popfile /home/kong/Reference/GATK/HapMap/hapmap_3.3.b37.vcf -L /home/kong/Reference/Agilent/SureSelectHumanAllExon_V5UTRs/S04380219_Covered_nochr.bed -isr INTERSECTION
    INFO 13:27:03,766 HelpFormatter - Executing as kong@cloud-hpc-06-kong on Linux 3.11.10-29-default amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_40-b26.
    INFO 13:27:03,767 HelpFormatter - Date/Time: 2017/01/25 13:27:03
    INFO 13:27:03,767 HelpFormatter - --------------------------------------------------------------------------------
    INFO 13:27:03,767 HelpFormatter - --------------------------------------------------------------------------------
    INFO 13:27:03,776 GenomeAnalysisEngine - Strictness is SILENT
    INFO 13:27:03,897 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO 13:27:03,907 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 13:27:03,960 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.05
    INFO 13:27:10,265 IntervalUtils - Processing 74569526 bp from intervals
    INFO 13:27:10,391 GenomeAnalysisEngine - Preparing for traversal over 2 BAM files
    INFO 13:27:10,706 GenomeAnalysisEngine - Done preparing for traversal
    INFO 13:27:10,707 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO 13:27:10,708 ProgressMeter - | processed | time | per 1M | | total | remaining
    INFO 13:27:10,708 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
    INFO 13:27:10,709 ContEst - Running in sequencing mode

    ERROR --
    ERROR stack trace

    java.lang.RuntimeException: No population frequency annotation for CEU in [VC popfile @ 1:948921 Q. of type=SNP alleles=[T*, C] attr={AC=1633, AF=0.787, AN=2074} GT=[]
    at org.broadinstitute.gatk.tools.walkers.cancer.contamination.ContEst.calcStats(ContEst.java:626)
    at org.broadinstitute.gatk.tools.walkers.cancer.contamination.ContEst.map(ContEst.java:401)
    at org.broadinstitute.gatk.tools.walkers.cancer.contamination.ContEst.map(ContEst.java:127)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:98)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:316)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
    at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)

    ERROR ------------------------------------------------------------------------------------------
    ERROR A GATK RUNTIME ERROR has occurred (version 3.7-0-gcfedb67):
    ERROR
    ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
    ERROR If not, please post the error message, with stack trace, to the GATK forum.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions https://software.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: No population frequency annotation for CEU in [VC popfile @ 1:948921 Q. of type=SNP alleles=[T*, C] attr={AC=1633, AF=0.787, AN=2074} GT=[]
    ERROR ------------------------------------------------------------------------------------------

    kong@cloud-hpc-06-kong:~/Haiying/Projects/PrimaryMelanoma/ILSE1669_A22/temp> hapmap_3.3.b37.vcf
    If 'hapmap_3.3.b37.vcf' is not a typo you can use command-not-found to lookup the package that contains it, like this:
    cnf hapmap_3.3.b37.vcf

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Haiying7
    Hi,

    The hg19_population_stratified_af_hapmap_3.3.vcf file you are using has the correct allele frequencies for ContEst, but it looks like it is formatted incorrectly. Where did you get the file from? Can you try to fix the formatting? Or, let the people you got the file from that the format is incorrect?

    Thanks,
    Sheila

  • escaonescaon Limoges, FranceMember

    Same kind of error here using :

    java -jar $GATK -T ContEst -R GATK_bundle_h38/Homo_sapiens_assembly38.fasta -I:eval tumor.bam -I:genotype blood.bam --popfile GATK_bundle_h38/hapmap_3.3.hg38.vcf -isr INTERSECTION -o output.txt

    The files Homo_sapiens_assembly38.fasta, hapmap_3.3.hg38.vcf (& respective hapmap_3.3.hg38.vcf.tbi) comes straight from ftp://ftp.broadinstitute.org/bundle/hg38/

    ...
    INFO 11:34:29,889 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18
    ...
    INFO 11:34:29,899 GenomeAnalysisEngine - Strictness is SILENT
    INFO 11:34:30,735 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO 11:34:30,739 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
    INFO 11:34:30,857 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.12
    INFO 11:34:32,117 GenomeAnalysisEngine - Preparing for traversal over 2 BAM files
    INFO 11:34:32,808 GenomeAnalysisEngine - Done preparing for traversal
    INFO 11:34:32,809 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO 11:34:32,809 ProgressMeter - | processed | time | per 1M | | total | remaining
    INFO 11:34:32,809 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
    INFO 11:34:32,809 ContEst - Running in sequencing mode

    ERROR --
    ERROR stack trace

    java.lang.RuntimeException: No population frequency annotation for CEU in [VC popfile @ chr1:11241657 Q. of type=SNP alleles=[A*, G] attr={AC=1628, AF=0.583, AN=2794} GT=[]
    at org.broadinstitute.gatk.tools.walkers.cancer.contamination.ContEst.calcStats(ContEst.java:626)
    at org.broadinstitute.gatk.tools.walkers.cancer.contamination.ContEst.map(ContEst.java:401)
    at org.broadinstitute.gatk.tools.walkers.cancer.contamination.ContEst.map(ContEst.java:127)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:98)
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:316)
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
    at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)

    ERROR ------------------------------------------------------------------------------------------
    ERROR A GATK RUNTIME ERROR has occurred (version 3.7-0-gcfedb67):
    ERROR
    ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
    ERROR If not, please post the error message, with stack trace, to the GATK forum.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions https://software.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: No population frequency annotation for CEU in [VC popfile @ chr1:11241657 Q. of type=SNP alleles=[A*, G] attr={AC=1628, AF=0.583, AN=2794} GT=[]
    ERROR ------------------------------------------------------------------------------------------
  • Haiying7Haiying7 Heidelberg, GermanyMember

    @Sheila said:
    @Haiying7
    Hi,

    The hg19_population_stratified_af_hapmap_3.3.vcf file you are using has the correct allele frequencies for ContEst, but it looks like it is formatted incorrectly. Where did you get the file from? Can you try to fix the formatting? Or, let the people you got the file from that the format is incorrect?

    Thanks,
    Sheila

    It is downloaded from this website:
    http://archive.broadinstitute.org/cancer/cga/contest_download

    Issue · Github
    by Sheila

    Issue Number
    1704
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    sooheelee
  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @escaon
    Hi,

    Are you using the file from CGA as well?

    Thanks,
    Sheila

  • escaonescaon Limoges, FranceMember

    I was using hapmap_3.3.hg38.vcf from ftp://ftp.broadinstitute.org/bundle/hg38/ as the --popFile argument value.
    Problem is, INFO fields in hapmap_3.3.hg38.vcf doesn't contain enough stuff (like CEU fields, ect...)
    Only 3 INFO fields in hapmap_3.3.hg38.vcf (thus the ERROR MESSAGE: No population frequency annotation for CEU with ContEst) :

    INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">

    INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">

    INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">

    CGA files are for hg18/hg19. So i did not use them previously (since we are working with hg38).
    I tried a liftover on the hg19_population_stratified_af_hapmap_3.3.vcf from CGA after seeing your reply, but i still have some errors with $PICARD LiftoverVcf (i did manage this correction : http://gatkforums.broadinstitute.org/gatk/discussion/7904/empty-output-file-and-providing-malformed-vcf-file-error-when-using-gatk-contest, but not this one : http://gatkforums.broadinstitute.org/gatk/discussion/3962/genotypeandvalidate-error-key-callstatus-found-in-variantcontext-field-info).

    Best regards

  • shleeshlee CambridgeMember, Broadie, Moderator
    edited February 2017

    Hi @Haiying7 and @escaon,

    It appears that ContEst expects a VCF file with population stratification in the INFO field, e.g.:

    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
    1       566875  rs2185539       C       T       .       PASS    AC=66;AF=0.02369;ALL={C*=0.97629,T=0.02371};AN=2786;ASW={C*=1.00000,T=0.00000};CEU={C*=1.00000,T=0.00000};CHB={C*=1.00000,T=0.00000};CHD={C*=1.00000,T=0.00000};CHS={C*=0.00000,T=0.00000};CLM={C*=0.00000,T=0.00000};FIN={C*=0.00000,T=0.00000};GBR={C*=0.00000,T=0.00000};GIH={C*=1.00000,T=0.00000};IBS={C*=0.00000,T=0.00000};JPT={C*=1.00000,T=0.00000};LWK={C*=1.00000,T=0.00000};MKK={C*=0.82044,T=0.17956};MXL={C*=1.00000,T=0.00000};PUR={C*=0.00000,T=0.00000};TSI={C*=1.00000,T=0.00000};YRI={C*=0.99752,T=0.00248};set=MKK-YRI
    

    The CGA file that is labeled hg19 is actually b37--the chromosomes are [1, 2, ...] without the chr prefix. For use of this file in GRCh38, is it possible the lift-over errored because of the confusion between hg19 and b37?

    Another problem with the file appears to be spaces in the INFO field, which GATK dislikes. For example ASW={C*=1.00000, T=0.00000};CEU={C*=1.00000, T=0.00000}; is not acceptable and the two spaces need to be removed to ASW={C*=1.00000,T=0.00000};CEU={C*=1.00000,T=0.00000};. Note also that the file appears to use spaces as column separators and not tabs.

    I've taken steps to remove the white spaces in the INFO fields and ContEst appears to process the resulting file. I hope this is helpful.

  • shleeshlee CambridgeMember, Broadie, Moderator
    edited February 2017

    Hi @escaon,

    I also tested the lift over process on the reformatted CGA hapmap VCF. Broadly, I had to lift over hg37-->hg19, then hg19-->hg38. The former I did with awk and sed according to your biostars post (thanks for sharing that), and the latter I did with Picard's LiftOverVcf. LiftOverVcf does not require an index but does require a lot of memory and I got it to work on my third try with -Xmx16G. Here's the command I used:

    java -Xmx16G -jar $PICARD LiftoverVcf \
        I=hapmap_3.3_hg19_pop_stratified_af.vcf \
        O=hapmap_3.3_grch38_pop_stratified_af.vcf \
        CHAIN=hg19ToHg38.over.chain \
        REJECT=rejected_vars_liftoverhg19tohg38_hapmap_3.3.vcf \
        R=/hg38/Homo_sapiens_assembly38.fasta \
        ALLOW_MISSING_FIELDS_IN_HEADER=true
    

    Doing a sanity check:

    WMCF9-CB5:resources shlee$ cat rejected_vars_liftoverhg19tohg38_hapmap_3.3.vcf | grep -v '#' | wc -l
        4422
    WMCF9-CB5:resources shlee$ cat hapmap_3.3_grch38_pop_stratified_af.vcf | grep -v '#' | wc -l
     1451867
    

    This looks reasonable. The numbers add up to the original number of variant records:

    WMCF9-CB5:resources shlee$ cat hapmap_3.3_b37_pop_stratified_af.vcf | grep -v '#' | wc -l
     1456289
    WMCF9-CB5:resources shlee$ cat hapmap_3.3_hg19_pop_stratified_af.vcf | grep -v '#' | wc -l
     1456289
    

    This should allow for plenty of sites for ContEst.

  • escaonescaon Limoges, FranceMember

    @shlee

    It appears that ContEst expects a VCF file with population stratification in the INFO field

    Yes that's the core of the problem.

    I did something very similar to your steps :

    -> Goal = create adequate input --popfile for ContEst
    -> We will start with this file (since it contains CEU infos fields) : http://www.broadinstitute.org/~kcibul/contest/hg19_population_stratified_af_hapmap_3.3.vcf.gz

    -> Retrive file & gunzip
    wget http://www.broadinstitute.org/~kcibul/contest/hg19_population_stratified_af_hapmap_3.3.vcf.gz; gunzip hg19_population_stratified_af_hapmap_3.3.vcf.gz;

    -> Remove useless ' ___GT'
    sed -i 's/\sGT//g' hg19_population_stratified_af_hapmap_3.3.vcf;

    -> Remove spaces in lines not starting with '#'
    sed -i '/^#/!s/ //g' hg19_population_stratified_af_hapmap_3.3.vcf;

    -> Convert chr names from 1 to chr1
    sed -i '/^#/!s/^/chr/g' hg19_population_stratified_af_hapmap_3.3.vcf;

    -> Add missing INFO fields
    sed -i '2 a ##INFO=<ID=set,Number=1,Type=String,Description="set of population population">\n##INFO=<ID=AC,Number=1,Type=Integer,Description="Allele count in genotypes for ALT allele">\n##INFO=<ID=AN,Number=1,Type=Integer,Description="total number of alleles in called genotypes">\n##INFO=<ID=AF,Number=A,Type=Float,Description="Allele frequency">\n##INFO=<ID=ALL,Number=A,Type=String,Description="Allele frequency in all populations">\n##INFO=<ID=ASW,Number=A,Type=String,Description="Allele frequency in Americans of African Ancestry in SW USA population">\n##INFO=<ID=CEU,Number=A,Type=String,Description="Allele frequency in Utah Residents (CEPH) with Northern and Western Ancestry population">\n##INFO=<ID=CHB,Number=A,Type=String,Description="Allele frequency in Han Chinese in Bejing, China population">\n##INFO=<ID=CHD,Number=A,Type=String,Description="Allele frequency in Chinese in Metropolitan Denver, Colorado population">\n##INFO=<ID=CHS,Number=A,Type=String,Description="Allele frequency in Southern Han Chinese population">\n##INFO=<ID=CLM,Number=A,Type=String,Description="Allele frequency in Colombians from Medellin, Colombia population">\n##INFO=<ID=FIN,Number=A,Type=String,Description="Allele frequency in Finnish in Finland population">\n##INFO=<ID=GBR,Number=A,Type=String,Description="Allele frequency in British in England and Scotland population">\n##INFO=<ID=GIH,Number=A,Type=String,Description="Allele frequency in Gujarati Indian from Houston, Texas population">\n##INFO=<ID=IBS,Number=A,Type=String,Description="Allele frequency in Iberian Population in Spain population">\n##INFO=<ID=JPT,Number=A,Type=String,Description="Allele frequency in Japanese in Tokyo, Japan population">\n##INFO=<ID=LWK,Number=A,Type=String,Description="Allele frequency in Luhya in Webuye, Kenya population">\n##INFO=<ID=MKK,Number=A,Type=String,Description="Allele frequency in Maasai in Kinyawa, Kenya population">\n##INFO=<ID=MXL,Number=A,Type=String,Description="Allele frequency in Mexican Ancestry from Los Angeles USA population">\n##INFO=<ID=PUR,Number=A,Type=String,Description="Allele frequency in Puerto Ricans from Puerto Rico population">\n##INFO=<ID=TSI,Number=A,Type=String,Description="Allele frequency in Toscani in Italia population">\n##INFO=<ID=YRI,Number=A,Type=String,Description="Allele frequency in Yoruba in Ibadan, Nigeria population">' hg19_population_stratified_af_hapmap_3.3.vcf;

    -> Convert hg19 to hg38
    java -Xmx16G -jar $PICARD LiftoverVcf I=hg19_population_stratified_af_hapmap_3.3.vcf O=hg19_lifted_to_hg38_population_stratified_af_hapmap_3.3.vcf CHAIN=/hg19ToHg38.over.chain REJECT=hg19_lifted_to_hg38_rejected_population_stratified_af_hapmap_3.3.vcf R=Homo_sapiens_assembly38.fasta WARN_ON_MISSING_CONTIG=true

    -> Try ContEst
    java -jar $GATK -T ContEst -R Homo_sapiens_assembly38.fasta -I:eval tumor.bam -I:genotype normal.bam --popfile hg19_lifted_to_hg38_population_stratified_af_hapmap_3.3.vcf -L targets.interval_list -isr INTERSECTION -o test.txt

    -> Output (basically an empty file, not quite sure it worked then ...)
    Warning: We're throwing out lane META since it has fewer than 500 read bases at genotyped positions
    name population population_fit contamination confidence_interval_95_width confidence_interval_95_low confidence_interval_95_high sites

    Ps : How do you "highlight" the code in your posts ?

  • shleeshlee CambridgeMember, Broadie, Moderator
    edited February 2017

    Hi @escaon,

    To Markdown format code blocks, simply use three backticks ``` in lines above and below your code blocks. Within a line of text, single backticks surround the code. I'm actually writing a blogpost that shows how to do this.

    Sounds like we need to add an error message for when ContEst results produce an empty file. Can you double check if your hg19_lifted_to_hg38_population_stratified_af_hapmap_3.3.vcf file (from Convert hg19 to hg38 step) actually contains variants, e.g. by counting variants?

    I'm not sure if the file I start with is the same file as yours. The one I start with is the file labeled hg19 from the CGA website (you have to log in) shown:
    image

    However, the file is mislabeled and is actually GRCh37.

    So the steps I took to clean up this file is basically yours with the addition of a conversion between GRCh37-->hg19 so that I can use the hg19ToHg38.over.chain file. I also avoid sed, in case something should go silently wrong.

    Thanks for the INFO blocks. I'm going to incorporate them too. This obviates my previous need to use ALLOW_MISSING_FIELDS_IN_HEADER=true option.

    A few days ago, I was able to use the grch37 hapmap file to produce ContEst results. I've just generated some alignments to also test the grch38 hapmap file. Using -L chr20, I see

    name    population      population_fit  contamination   confidence_interval_95_width    confidence_interval_95_low      confidence_interval_95_high     sites
    META    CEU     n/a     1.6     0.3     1.4     1.7     56
    

    I'm doing on-the-fly genotyping of the normal. Here's the command I use:

    java -jar $GATK -T ContEst \
        -R GRCh38_1kg/GRCh38_full_analysis_set_plus_decoy_hla.fa \
        -I:eval hcc1143_T_md.bam \
        -I:genotype hcc1143_N_md.bam \
        --popfile hapmap_3.3_grch38_pop_stratified_af.vcf.gz \
        --interval_set_rule INTERSECTION \
        -o hcc1143_T_contest.txt \
        -L grch38_primary.intervals \
        -L chr20
    

    If you don't mind reviewing it, I can give it to you the file I generated (hapmap_3.3_grch38_pop_stratified_af.vcf.gz). I can also post my cleanup process; however, I think you are more adept at such things than I.

  • fortunofortuno ChicagoMember

    Hi @shlee,

    I've read this thread and checked some of the solutions in here. I don't think my issue is related with the liftover procedure given that I completed that step correctly. Anyway I checked I don't have any whitespace in my VCF or 'chr' format is correct regarding hg38 reference.

    However, I'm still having the same error I reported in the other thread. Maybe I could try with your own liftover file to discard that could be the problem. As I mentioned, it is a little weird because I could complete some runs with specific bam files but not with others.

    Let me know if you could need any additional information. Thank you very much for your kind assistance.
    Francisco

  • shleeshlee CambridgeMember, Broadie, Moderator

    @fortuno, are you genotyping the normal on-the-fly? That is, is the normal in BAM or VCF format? @escaon also has an odd error:

    Warning: We're throwing out lane META since it has fewer than 500 read bases at genotyped positions
    

    I'd have to ask what this particular warning refers to as the wording is ambiguous. What I do have in my notes is the following for the Normal sample:

    ContEst will call any site with >80% bases showing an alternate allele with at least 50x coverage, it will call these homozygous variant.

    So if coverage is low across the normal BAM, where not enough sites offer 50x coverage, then ContEst may end up with insufficient number of sites to examine in the tumor BAM. To get around this, you can provide ContEst with a genotype VCF for the normal. The argument is --genotypes normalGenotypes.vcf .

  • shleeshlee CambridgeMember, Broadie, Moderator

    @fortuno and @escaon,

    I've uploaded cleaned up resource files to ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/beta. Let me know if these do or do not work for you.

  • escaonescaon Limoges, FranceMember

    Thank you @shlee for the Markdown format tips and for the beta bundle !!

    Yes we started from the same CGA file.

    Doing on-the-fly genotyping with your file (hapmap_3.3_grch38_pop_stratified_af.vcf), i am still getting an empty output & the lane META warning. I'll investigate if this is related to low coverage in .bam.

    Warning: We're throwing out lane META since it has fewer than 500 read bases at genotyped positions
    name    population  population_fit  contamination   confidence_interval_95_width    confidence_interval_95_low  confidence_interval_95_high sites
    

    Command line (GATK version 3.7-0, openjdk version 1.8.0_121) :

    java -XX:ParallelGCThreads=8 -jar $GATK -T ContEst \
         -R GATK_bundle_hg38/Homo_sapiens_assembly38.fasta \
         -I:eval BISCEm/Pipeline_V1_outputs/R14_12_tt.4.bam \
         -I:genotype BISCEm/Pipeline_V1_outputs/R14_10_sang.4.bam \
         --popfile Hg38_files/hapmap_3.3_grch38_pop_stratified_af.vcf
         -L ContEst/targets.interval_list \
         -isr INTERSECTION -o out.txt
    
  • fortunofortuno ChicagoMember
    edited February 2017

    Thank you very much for the information @shlee!

    Yes, I am genotyping the normal on-the-fly, I have both normal and tumor bams. I'm still getting the same error with your liftover hapmap_3.3_grch38_pop_stratified_af.vcf file.

    I'll continue debugging this error to see what is going on. Any other ideas are also welcomed!

    Thanks,
    Francisco

  • shleeshlee CambridgeMember, Broadie, Moderator

    @escaon and @fortuno,

    If the issue ends up being low coverage, and you feel that the tool should handle lower coverage regions, then let us know the coverage depth you deem appropriate for a cutoff. Remember that this tool was originally written in 2011 for exomes. I think with today's PCR-free preps, better exome capture technology, and better tools, we can get more out of lower coverage data than before. ContEst is slated for some attention for GATK v4 and we definitely appreciate feedback. I'll be sure to convey your thoughts to our developers. Thanks.

  • fortunofortuno ChicagoMember

    Hi @shlee,

    Is that low coverage limit referred to the whole normal BAM file or is it calculated for specific regions?

    I'm asking because after a few tests I saw that my error is produced in specific regions. I mean, for the same normal/tumor BAM files it works depending on the interval I use with -L option. For instance:

    Using -L chr1 -L chr20 -isr UNION worked correctly, whereas -L chr2 failed. Is this helpful to know what could be happening?

    Thanks,
    Francisco

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @fortuno
    Hi Francisco,

    The coverage requirement refers to the minimum overall number of bases that are at the sites in the interval. It's basically the number of observations required for the statistic to be valid.

    Can you tell us how much coverage is at the sites in chr1 compared to chr2?

    Thanks,
    Sheila

  • escaonescaon Limoges, FranceMember

    The issue was indeed low coverage in my "markedDuplicates" BAMs.

    ContEst did work for me when omitting the MarkDuplicates step.
    I have to point out that the data i am handling ended up containing A LOT of reads marked as duplicates.

    I did run ContEst for the same T/N BAMs pair, with or without marked duplicates :

    ~~~ Without markedDups
    java -jar $GATK -T ContEst \ -R ucsc.hg19.fasta \ -I:eval p3.2.bam \ -I:genotype p3_nor.2.bam \ --popfile hapmap_3.3_hg19_pop_stratified_af.vcf \ -L targets.interval_list \ --interval_set_rule INTERSECTION \ --base_report br_noDups.txt \ -o noDups.txt

    ~~ Output
    name population population_fit contamination confidence_interval_95_width confidence_interval_95_low confidence_interval_95_high sites META CEU n/a 0.4 0.4 0.2 0.6 3

    ~~~ With markedDups
    java -jar $GATK -T ContEst \ -R ucsc.hg19.fasta \ -I:eval p3.3.bam \ -I:genotype p3_nor.3.bam \ --popfile hapmap_3.3_hg19_pop_stratified_af.vcf \ -L targets.interval_list \ --interval_set_rule INTERSECTION \ --min_mapq 15 \ --min_qscore 15 \ --minimum_base_count 100 \ --min_genotype_depth 35 \ --base_report br_dups.txt \ -o dups.txt

    ~~ Output
    Warning: We're throwing out lane META since it has fewer than 100 read bases at genotyped positions name population population_fit contamination confidence_interval_95_width confidence_interval_95_low confidence_interval_95_high sites

    As you can notice i tried some arguments (min_mapq, min_qscore, minimum_base_count, min_genotype_depth), trying to enable as many reads as possible in the markedDups BAMs, but the warning remained.

    Ps : I am giving the contamination percentage (from ContEst) (0.4 here) to Mutect2 "--contamination_fraction_to_filter" argument, that is correct, right ?

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @escaon,

    Depending on whether your coverage profile for targets generally have flattops __| |__ (amplicon) or mountain peaks __/ \__ (target-capture), the extent of duplicate marking can vary drastically. This is because MarkDuplicates defines duplicates as sets of read pairs that have the same unclipped alignment start and unclipped alignment end.

    In section 3 of this tutorial, I give two instances in which you may not care about duplicate marking—amplicon sequencing and RNA-Seq. In these instances, you can use -drf DuplicateRead in downstream commands to enable use of duplicate reads. The drf stands for disable read filter. This is a GATK engine parameter so you can apply it to any GATK tool that processes read data. Our best practices apply generally to common data types, and so if your data falls outside this, then you should tweak your workflow appropriately.

    If your data is amplicon-based, and you will be using MuTect2, then this discussion should also interest you.


    You need to convert the percentage given by ContEst into a fraction for MuTect2. So if you have 0.4 % contamination, then for MuTect this becomes —contamination_fraction_to_filter 0.004.

    I’ve been writing a MuTect2 tutorial for a workshop and I have two additional points I’d like to share. First, remember that MuTect1 and MuTect2 handle the contamination fraction differently. MuTect1 incorporates the contamination estimate as a hard cutoff for calling in all samples. For the same parameter, v3.7 MuTect2 uses the fraction to downsample reads for each alternate allele. For example, if a site contains X coverage depth, then MuTect2 will remove X times the contamination fraction of reads for each alternate allele.

    Second, consider MuTect2’s -minPruning option. The default is set to two. What this means is that at least two reads must support a path in the assembly graph for consideration, whether in the normal or tumor. Thus, the depth of coverage at a site, which can vary greatly or not much at all across a target locus depending on data type, has implications for calling contaminant alleles as well as low fraction alleles.

    I hope this is helpful!

  • escaonescaon Limoges, FranceMember

    @shlee, again, thanks for the detailed answer.

    My dataset is indeed a gene panel (amplicon) sequenced on Ion Proton.

    With that in mind, I ran the variant calling pipeline with both marked and non-marked duplicates (I was not aware of the -drf DuplicateRead argument at the time (thx for this), thus I processed the BAMs in "two separate ways"). Non-marked duplicates yield more results ofc.

    Regarding the Mutect2 discussion for amplicon-based sequencing that you linked, I notice that @cooperjam is skipping both the MarkDuplicates and the BQSR steps. Why should BQSR be skipped ? I did read this in the doc :

    Small targeted experiments
    The same guidelines as for whole exome analysis apply except you do not run BQSR on small datasets.

    Are those "small datasets" still defined by ? :

    Small datasets have fewer than 100M total bases

    —contamination_fraction_to_filter :

    You need to convert the percentage given by ContEst into a fraction for MuTect2

    Good catch, I was erroneously giving the percentage with :
    -contamination $(awk 'NR==2 {print $4}' ContEst.txt")

    I’ve been writing a MuTect2 tutorial for a workshop

    I am very much looking forward to reading it !
    I have been reading the latest GATK Best Practices for variant discovery - Mutect2 beta I could found through the blog. However there is so much doc out there, between actual doc, forum posts, tutorial & presentation slides. I always feel silly for asking questions because most of the time they have already been addressed somewhere, but sometimes it feels like you need to be a search engine / GATK forum keyword guru or an already well experienced bioinformatician to find the information you are looking for in a reasonable amount of time.

    Regarding the -minPruning option, since i am expecting a high depth with targeted sequencing, this option should be set > 2, right ?

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @escaon,

    If you are interested in our latest tutorials, please invite us to give a workshop! Also, you can see the first iteration of the MuTect2 tutorial in one of the links in this post.

    I'm not familiar with Ion technology--I don't believe we test our tools with data from Ion sequencing. There are some discussions of using Ion data and BQSR within the forum and the conclusion appears to be that it is up to you whether you use it or not. Best to test and judge for yourself if it brings improvements to calling on your particular experimental setup and data. In @cooperjam's case, I expect the large PoN helps to filter out many sequencing and mapping artifacts and they must be comfortable with the quality of their sequencing data enough to omit BQSR.

    I believe the small dataset being defined by <100M total bases as discussed here and elsewhere is still the rough definition we use for effective BQSR.

    Yes, we have a lot of documents! I also find it challenging finding relevant docs and so I empathize. As a result of these difficulties, a couple of months ago we switched how the upper right search bar searches the forum. It's now Google-powered. So hopefully this improves search results. But if you're still finding it difficult getting to the documentation you need, let me know. I will keep this in mind towards future plans to improve the forum.

    Yes, best to keep the -minPruning option set to at least two or greater. Otherwise, you will get too many spurious calls.

  • fortunofortuno ChicagoMember
    edited October 2017

    Hi @shlee,

    After a while I finally had time to get back to this issue with ContEst. I am still stuck with the same error.

    Debugging a little more in deep I realized that I can identify some contigs which are causing the exception. Just to refresh, this is what i am getting:

    java -jar GenomeAnalysisTK.jar -T ContEst --lane_level_contamination SAMPLE -I:genotype C317.TCGA-AB-2941-11A-01W-0745-08.8_gdc_realn.bam -I:eval C317.TCGA-AB-2941-03A-01W-0745-08.14_gdc_realn.bam -R GRCh38.d1.vd1.fa -L chr1 --out test.txt --popfile hg38_population_allele_frequency_hapmap_3.3.vcf.gz

    ##### ERROR stack trace 
    java.lang.NullPointerException
        at org.broadinstitute.gatk.tools.walkers.cancer.contamination.ContEst.map(ContEst.java:375)
        at org.broadinstitute.gatk.tools.walkers.cancer.contamination.ContEst.map(ContEst.java:127)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
        at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
        at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
        at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
        at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:98)
        at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:316)
        at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:123)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:256)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:158)
        at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:108)
    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR A GATK RUNTIME ERROR has occurred (version exported):
    ##### ERROR
    ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
    ##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
    ##### ERROR Visit our website and forum for extensive documentation and answers to 
    ##### ERROR commonly asked questions https://software.broadinstitute.org/gatk
    ##### ERROR
    ##### ERROR MESSAGE: Code exception (see stack trace for error itself)
    ##### ERROR ------------------------------------------------------------------------------------------
    

    I also tested it with other GATK 3.7 and 3.8 versions with identical results. However, ContEst successfully finished when using a different region, for instance:

    java -jar GenomeAnalysisTK.jar -T ContEst --lane_level_contamination SAMPLE -I:genotype C317.TCGA-AB-2941-11A-01W-0745-08.8_gdc_realn.bam -I:eval C317.TCGA-AB-2941-03A-01W-0745-08.14_gdc_realn.bam -R GRCh38.d1.vd1.fa -L chr1:1-3849190 --out test.txt --popfile hg38_population_allele_frequency_hapmap_3.3.vcf.gz

    Rereading this thread I saw you mentioned the limitation of >80% with at least 50x coverage. I am not sure if I am using the correct tool but running:

    java -jar /usr/local/bin/GenomeAnalysisTK.jar -T DepthOfCoverage -I C317.TCGA-AB-2941-11A-01W-0745-08.8_gdc_realn.bam -R GRCh38.d1.vd1.fa -o /home/ubuntu/data/GATKdepth -L chr1

    I got that the sample_cumulative_coverage_proportions file shows very low values for gte_50 column. But these low values are in both -L chr1 and -L chr1:1-3849190. Is that what you meant? If so, it seems that the problem wouldn't be the coverage.

    If interested I can provide access to some of my test BAMs so you can reproduce the error. I can send you an email with the information.

    Thanks in advance for the help.
    Francisco

    Post edited by fortuno on
  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @fortuno,

    I'm preparing for a conference and so @Sheila will help you.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @fortuno
    Hi Francisco,

    There will be no more bug fixing in GATK3, so can you try the GATK4 contamination estimation tool? There are two tools involved in the workflow: GetPileupSummaries and CalculateContamination. You can also have a look at the Mutect2 tutorials in the presentations section for more information.

    -Sheila

  • fortunofortuno ChicagoMember

    Hi @Sheila,

    Thank you very much for your response. I will test and try to reproduce my previous runs in the new GATK4 workflow for contamination calculation.

    Having a first look, i don't see a straight-forward way to run that workflow in on-the-fly genotyping mode as previously. Am i right? Is there a way to do on-the-fly genotyping using the normal bam in the new GATK4 workflow?

    Thanks in advance,
    Francisco

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @fortuno
    Hi Francisco,

    Ah, no I don't think there is a way to do on-the-fly genotyping in the GATK4 workflow. In fact, there are no genotypes needed in this contamination calculation workflow (the genotypes used are in the sites VCF you provide from a known resource file) . You can read about the algorithm here. I think the tool doc will help too.

    -Sheila

Sign In or Register to comment.