GenotypeConcordance Error

Hi,
I have generated multi-sample SNP calls in VCF format following the GATK Best Practices (gVCF generation + CombineGVCFs + GenotypeGVCFs + VQSR for SNPs), and I am trying to run GATK GenotypeConcordance between a genotype (multi-sample) VCF and a HaplotypeCaller (multi-sample) VCF, as below:

java -jar /GATK/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T GenotypeConcordance -R /genomes/hg19/hg19.fa -eval 2500Sample_new_filtered.vcf.gz --comp Genotype_SNPs.vcf.gz --moltenize -o output_GenotypeConcordance_filtered.grp

I cleaned up the genotype VCF with vcftools to remove monomorphic sites. The header of the genotype VCF declares the following format, and the error I get is shown after it:

##fileformat=VCFv4.2
##fileDate=20150515
##source=PLINKv1.90

The provided VCF file is malformed at approximately line number 205771: Insertions/Deletions are not supported when reading 3.x VCF's. Please convert your file to VCF4 using VCFTools, available at http://vcftools.sourceforge.net/index.html, for input source: 2500Sample_new_filtered.vcf.gz -comp Genotype_SNPs.vcf.gz

Could you please shed some light on this error? Thanks.

Regards
Lavanya

Answers

  • Sheila, Broad Institute (Member, Broadie admin)

    @Lavanya
    Hi Lavanya,

    It looks like one or more of your VCFs is in an older VCF version that GATK does not expect. I suspect it is the -eval file, 2500Sample_new_filtered.vcf.gz, if you generated the comp file from HaplotypeCaller.

    Can you try running vcf-convert on the eval file? http://vcftools.sourceforge.net/perl_module.html#vcf-convert

    -Sheila
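Before converting anything, it can help to confirm which VCF version each input actually declares. The sketch below is only an illustration: the function name `vcf_fileformat` is hypothetical, and it assumes the `##fileformat=` line is the first line of the file, as the VCF specification requires. It reads plain or gzip/bgzip-compressed files.

```python
import gzip


def vcf_fileformat(path):
    """Return the declared VCF version (e.g. 'VCFv4.2'), or None if missing."""
    # Choose the opener from the file suffix; bgzip output is gzip-compatible.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        first_line = handle.readline().strip()
    # Assumption: the fileformat declaration is the very first header line.
    prefix = "##fileformat="
    if first_line.startswith(prefix):
        return first_line[len(prefix):]
    return None
```

Calling `vcf_fileformat("2500Sample_new_filtered.vcf.gz")` and `vcf_fileformat("Genotype_SNPs.vcf.gz")` would show whether either file declares a version older than 4.x. The declared version does not guarantee that every record in the file is well formed.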

  • Lavanya (Member)

    Thanks Sheila. I have rectified the problem...

    Could you please let me know whether the sample IDs need to be in the same order in the two VCFs being compared? I am getting some weird results. The eval file contains 2531 samples and the comp file contains 2466 samples (the same IDs are used in both the eval and comp sets). Thanks.

    Regards

  • Lavanya (Member)

    However, the sample columns are not in the same order in the two VCFs.

  • Sheila, Broad Institute (Member, Broadie admin)

    @Lavanya
    Hi Lavanya,

    Sorry for the late response. I don't think it matters whether the samples are in the same order in the two files, as long as the sample names match exactly. What do you mean by "weird results"?

    -Sheila
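Since the comparison depends on sample names matching exactly, one way to check this is to compare the sample columns of the two `#CHROM` header lines. The following is a minimal sketch, not part of GATK: the function names `read_sample_names` and `compare_samples` are hypothetical, and it assumes tab-delimited VCFs in which sample names start at the tenth column. It ignores column order and also reports names that occur more than once within a file.

```python
import gzip
from collections import Counter


def read_sample_names(path):
    """Return the sample names from the #CHROM header line of a VCF."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # The first nine columns are fixed (CHROM through FORMAT).
                return line.rstrip("\n").split("\t")[9:]
    return []


def duplicated(names):
    """Return the sorted names that occur more than once."""
    return sorted(name for name, count in Counter(names).items() if count > 1)


def compare_samples(eval_names, comp_names):
    """Summarise how two sample lists differ, ignoring column order."""
    eval_set, comp_set = set(eval_names), set(comp_names)
    return {
        "eval_duplicates": duplicated(eval_names),
        "comp_duplicates": duplicated(comp_names),
        "only_in_eval": sorted(eval_set - comp_set),
        "only_in_comp": sorted(comp_set - eval_set),
        "shared": sorted(eval_set & comp_set),
    }
```

For example, `compare_samples(read_sample_names("2500Sample_new_filtered.vcf.gz"), read_sample_names("Genotype_SNPs.vcf.gz"))` returns a dictionary whose duplicate and "only in" lists should be inspected before trusting a multi-sample comparison.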

  • Lavanya (Member)

    @Sheila
    Thanks for your reply.
    I extracted the same single sample from both the genotype calls VCF and the HaplotypeCaller VCF and ran GenotypeConcordance. The results make sense, with a 99% concordance rate, etc.
    However, when I ran GenotypeConcordance on the multi-sample VCFs (both the genotype calls and the HaplotypeCaller VCF), the results differ and show very poor concordance.
    Hence I am not able to rely on this module when starting from multi-sample VCFs. What could be the reasons? Thanks.
    Regards

  • Sheila, Broad Institute (Member, Broadie admin)

    @Lavanya
    Hi,

    Can you post an example comparing single sample and multisample outputs where you think a mistake is happening? I may need you to submit a bug report.

    Thanks,
    Sheila

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    @Lavanya, when you mention "both Genotypecalls and HaplotypeCaller VCF", what do you mean by "Genotypecalls"? We have no tool with that name. Please be precise in your descriptions.

  • Lavanya (Member)

    @Geraldine_VdAuwera
    As I explained earlier, I have two multi-sample VCFs to compare in order to calculate genotype concordance.
    One of the VCFs is from genotype calls, i.e. from array data, and I will be using it as the truth set. It was not generated with a GATK module.
    The other VCF was generated from multi-sample HaplotypeCaller calls.

  • Lavanya (Member)

    @Sheila and @Geraldine_VdAuwera

    The issue has been resolved. The VCF had 3 duplicate sample names which created this issue.

    I have a question about interpreting the output. The results are below. Why do some of the results show negative numbers? Thanks.

    :GATKTable:4:46607:%s:%s:%s:%.3f:;

    :GATKTable:GenotypeConcordance_EvalProportions:Per-sample concordance tables: proportions of genotypes called in eval

    Sample Eval_Genotype Comp_Genotype Proportion
    ALL HET HET 0.240
    ALL HET HOM_REF 0.001
    ALL HET HOM_VAR 0.004
    ALL HET MIXED 0.000
    ALL HET NO_CALL 0.000
    ALL HET UNAVAILABLE 0.754
    ALL HOM_REF HET -0.000
    ALL HOM_REF HOM_REF -0.029
    ALL HOM_REF HOM_VAR -0.000
    ALL HOM_REF MIXED -0.000
    ALL HOM_REF NO_CALL -0.000
    ALL HOM_REF UNAVAILABLE 1.029
    ALL HOM_VAR HET 0.001
    ALL HOM_VAR HOM_REF 0.000
    ALL HOM_VAR HOM_VAR 0.288
    ALL HOM_VAR MIXED 0.000
    ALL HOM_VAR NO_CALL 0.000
    ALL HOM_VAR UNAVAILABLE 0.711
    ALL Mismatching_Alleles Mismatching_Alleles -0.000
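The rows above can be checked mechanically. The sketch below is only a sanity check on the numbers as posted and does not explain why GATK produced them; the helper names are hypothetical. It flags every row printed with a minus sign (including "-0.000", which is a negative value rounded to three decimals) and sums the proportions for each eval genotype.

```python
from collections import defaultdict

# The proportion rows exactly as posted above.
TABLE = """\
ALL HET HET 0.240
ALL HET HOM_REF 0.001
ALL HET HOM_VAR 0.004
ALL HET MIXED 0.000
ALL HET NO_CALL 0.000
ALL HET UNAVAILABLE 0.754
ALL HOM_REF HET -0.000
ALL HOM_REF HOM_REF -0.029
ALL HOM_REF HOM_VAR -0.000
ALL HOM_REF MIXED -0.000
ALL HOM_REF NO_CALL -0.000
ALL HOM_REF UNAVAILABLE 1.029
ALL HOM_VAR HET 0.001
ALL HOM_VAR HOM_REF 0.000
ALL HOM_VAR HOM_VAR 0.288
ALL HOM_VAR MIXED 0.000
ALL HOM_VAR NO_CALL 0.000
ALL HOM_VAR UNAVAILABLE 0.711
ALL Mismatching_Alleles Mismatching_Alleles -0.000
"""


def parse_rows(text):
    """Return (sample, eval_gt, comp_gt, raw_value) for each data row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue
        try:
            float(fields[3])
        except ValueError:
            continue  # skip the column-name row
        rows.append(tuple(fields))
    return rows


def negative_rows(rows):
    """Rows printed with a minus sign; float('-0.000') < 0 would miss these."""
    return [row for row in rows if row[3].startswith("-")]


def sums_by_eval_genotype(rows):
    """Sum the proportions for each eval genotype."""
    totals = defaultdict(float)
    for _sample, eval_gt, _comp_gt, value in rows:
        totals[eval_gt] += float(value)
    return dict(totals)
```

On the posted rows, `negative_rows(parse_rows(TABLE))` returns six rows, five of them for eval genotype HOM_REF, while `sums_by_eval_genotype` gives about 0.999 for HET and 1.000 for HOM_REF and HOM_VAR. In other words, each eval genotype still sums to roughly 1, but the HOM_REF proportions fall outside the 0 to 1 range, which is the kind of detail worth including in a bug report.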

  • Sheila, Broad Institute (Member, Broadie admin)

    @Lavanya
    Hi Lavanya,

    This is odd. Can you upload a bug report? Instructions are here: http://gatkforums.broadinstitute.org/discussion/1894/how-do-i-submit-a-detailed-bug-report

    Thanks,
    Sheila
