Empty output file and "malformed VCF file" error when using GATK ContEst

Hi all,
Recently I have been using ContEst to estimate cross-sample contamination. First, I downloaded all the example data from the CGA website http://www.broadinstitute.org/cancer/cga/contest_download and tested it with contest-1.0.24530-bin, which worked great:
java -jar contest-1.0.24530-bin/ContEst.jar
-I ContEst_example_data/chr20_sites.bam
-R human_g1k_v37.fasta
-B:pop,vcf hg19_population_stratified_af_hapmap_3.3.vcf.gz
-T Contamination
-B:genotypes,vcf ContEst_example_data/hg00142.vcf
-BTI genotypes
-o contamination_results_chr20_1.txt

However, when I ran the same example data through GATK 3.6 or 3.5, it failed with the following error:
../jdk1.8.0_91/bin/java -jar ../GenomeAnalysisTK-3.6.jar
-T ContEst
-R human_g1k_v37.fasta
-I ContEst_example_data/chr20_sites.bam
--genotypes ContEst_example_data/hg00142.vcf
--popfile ../hg19_population_stratified_af_hapmap_3.3.vcf.gz
-isr INTERSECTION
-o contamination_results_chr20_2.txt

ERROR MESSAGE: The provided VCF file is malformed at approximately line number 4: The VCF specification does not allow for whitespace in the INFO field. Offending field value was "AC=1239;AF=0.44377;ALL={G=0.55627,T=0.44373};AN=2792;ASW={G=0.50575, T=0.49425};CEU={G=0.69091, T=0.30909};CHB={G=0.57721, T=0.42279};CHD={G=0.66514, T=0.33486};CHS={G=0.00000, T=0.00000};CLM={G=0.00000, T=0.00000};FIN={G=0.00000, T=0.00000};GBR={G=0.00000, T=0.00000};GIH={G=0.61386, T=0.38614};IBS={G=0.00000, T=0.00000};JPT={G=0.57080, T=0.42920};LWK={G=0.45413, T=0.54587};MKK={G=0.47826, T=0.52174};MXL={G=0.53488, T=0.46512};PUR={G=0.00000, T=0.00000};TSI={G=0.63725, T=0.36275};YRI={G=0.45320, T=0.54680};set=Intersection GT", for input source: /pub6/Temp/liaojianlong/contamination_test1/../hg19_population_stratified_af_hapmap_3.3.vcf.gz

Following the error message, I removed the whitespace in the INFO field using R and reran the test, but got the error again:

ERROR MESSAGE: The provided VCF file is malformed at approximately line number 4: The VCF specification does not allow for whitespace in the INFO field. Offending field value was "AC=1239;AF=0.44377;ALL={G=0.55627,T=0.44373};AN=2792;ASW={G=0.50575,T=0.49425};CEU={G=0.69091,T=0.30909};CHB={G=0.57721,T=0.42279};CHD={G=0.66514,T=0.33486};CHS={G=0.00000,T=0.00000};CLM={G=0.00000,T=0.00000};FIN={G=0.00000,T=0.00000};GBR={G=0.00000,T=0.00000};GIH={G=0.61386,T=0.38614};IBS={G=0.00000,T=0.00000};JPT={G=0.57080,T=0.42920};LWK={G=0.45413,T=0.54587};MKK={G=0.47826,T=0.52174};MXL={G=0.53488,T=0.46512};PUR={G=0.00000,T=0.00", for input source: /pub6/Temp/liaojianlong/contamination_test1/../population_files/hg19_population_stratified_af_hapmap_3.3.vcf.gz
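
Since the error persists after the cleanup, it may help to list exactly which records still carry whitespace in INFO. The following is a minimal sketch, not part of GATK: the function names `open_text` and `info_whitespace_records` are made up for illustration, and it assumes the columns are tab-separated so that INFO is the eighth field.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def info_whitespace_records(lines):
    """Yield (line_number, info_value) for records whose INFO column has whitespace."""
    for number, line in enumerate(lines, start=1):
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        fields = line.rstrip("\r\n").split("\t")
        # INFO is the 8th tab-separated column of a VCF record (assumption:
        # the file really is tab-separated)
        if len(fields) >= 8 and any(ch.isspace() for ch in fields[7]):
            yield number, fields[7]
```

Passing `open_text("hg19_population_stratified_af_hapmap_3.3.vcf.gz")` to `info_whitespace_records` and printing the first few results would show whether the remaining whitespace comes from the `{G=..., T=...}` values or from something else, such as the trailing ` GT` visible in the first error message.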

Separately, I tested GATK ContEst in another mode, supplying eval and genotype BAM files instead of a genotypes VCF, but the output file was empty apart from the header:
../jdk1.8.0_91/bin/java -jar
../GenomeAnalysisTK-3.6.jar
-T ContEst
-R ../reference_genome/hg19_complete.fasta
-I:eval G01H_chr22.recal.bam
-I:genotype G01N_chr22.recal.bam
--popfile hg19_population_stratified_af_hapmap_3.3.vcf.gz
-isr INTERSECTION
-o contamination_output.txt
The run completed successfully with the following information:
image
But contamination_output.txt was empty:
image

Thank you very much for any recommendations!

Answers

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @xiaolongge
    Hi,

    Where did you get the HapMap VCF you are using? Did you try with the file from our bundle? Did you manipulate the file in any way after you received it?

    -Sheila

  • xiaolongge (China; Member)

    Thank you for your reply. I downloaded the HapMap VCF from http://www.broadinstitute.org/cancer/cga/contest_download and it worked with the old version of ContEst, which I also downloaded from the CGA website. I think the malformed VCF error comes from a difference in how the popfile is handled between the old ContEst and the new version integrated into GATK, because when I removed the last column of hg19_population_stratified_af_hapmap_3.3.vcf, it worked with the new version of ContEst!
    image
    Did you mean the 1000G_phase3_v4_20130502.sites.vcf.gz file in your bundle? I think it is an ordinary VCF file, and it doesn't contain population frequency information.
    image

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @xiaolongge
    Hi,

    Ah okay. So, you got it to work with the HapMap VCF, but you had to manipulate it? A user at the end of this thread also got ContEst to work with the HapMap VCF, but it does not look like he had to manipulate the file.

    -Sheila

  • Sheila (Broad Institute; Member, Broadie, Moderator, admin)

    @xiaolongge
    Hi,

    Ah, okay. Thanks for reporting your solution! :smile:

    -Sheila

  • ym_wang (Liaoning, China; Member)

    @xiaolongge
    I ran into the same problem when using ContEst the same way you did, but my HapMap VCF doesn't have a 'GT' column. I was wondering if you could share your hg19_population_stratified_af_hapmap_3.3.vcf.gz with me. I would appreciate your help.
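
For readers trying to reproduce the workaround xiaolongge reported (removing the last column of the population VCF), here is a minimal sketch. The thread does not say how the column was removed or whether the stray `GT` is separated by a tab or a space, so this assumes a tab-separated file and simply keeps the eight fixed VCF columns (CHROM through INFO). The function names are hypothetical, and the sketch works on an uncompressed copy of the file.

```python
def keep_site_columns(line, n_columns=8):
    """Truncate a VCF header or record line to its first n_columns tab-separated fields."""
    if line.startswith("##"):
        return line  # meta-information lines are left untouched
    ending = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\r\n").split("\t")
    return "\t".join(fields[:n_columns]) + ending


def strip_extra_columns(src_path, dst_path):
    """Copy an uncompressed VCF, dropping every column after INFO."""
    with open(src_path, "rt") as src, open(dst_path, "wt") as dst:
        for line in src:
            dst.write(keep_site_columns(line))
```

If the `GT` token turns out to be separated from INFO by a space rather than a tab, this truncation will not remove it, and the INFO value itself would need to be trimmed instead.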
