Using vcf files supplied in the GATK Resource Bundle with Picard GenotypeConcordance

cottrell (University of Delaware; Member)

I want to compare my vcf file to a vcf file supplied in the GATK bundle using Picard GenotypeConcordance. In the terminology used by Picard GenotypeConcordance I want to use a vcf file in the bundle as the "truth sample."

The problem is that the vcf files in the bundle lack the sample name needed by Picard GenotypeConcordance.

That is, these supplied vcf files contain no value that satisfies the required Picard GenotypeConcordance option:
TRUTH_SAMPLE (String): The name of the truth sample within the truth VCF. Required.

Take dbsnp_138.hg19.vcf.gz as an example:

$ zcat dbsnp_138.hg19.vcf.gz | grep CHROM

Based on the description of the vcf file format elsewhere on this GATK site (https://broadinstitute.org/gatk/guide/article?id=1268), I expected to see a FORMAT field and a sample name field following the INFO field.

How should I proceed?
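
A quick way to check whether a given VCF carries any sample columns is to read its #CHROM header line: the first eight columns are fixed, FORMAT is the ninth, and sample names start at the tenth. The function below is a minimal sketch of that check; the name `vcf_samples` is made up for illustration, and it assumes a tab-delimited header and a `gzip` that passes uncompressed input through with `-cdf` (GNU gzip does).

```shell
# List the sample names in a VCF (plain or gzip-compressed).
# Prints nothing for a sites-only file such as a dbSNP release.
# vcf_samples is a hypothetical helper name, not part of GATK or Picard.
vcf_samples() {
    # gzip -cdf decompresses .gz input and copies plain text through unchanged
    gzip -cdf "$1" | awk -F'\t' '
        /^#CHROM/ {
            # Columns 1-8 are fixed; column 9 is FORMAT; samples start at column 10
            for (i = 10; i <= NF; i++) print $i
            exit
        }'
}

# Example call (file name taken from the post above):
#   vcf_samples dbsnp_138.hg19.vcf.gz
```

If the function prints nothing, the file has no sample name that could be passed as TRUTH_SAMPLE.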

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    You can only evaluate genotype concordance between files that have sample genotypes, but resources like dbSNP don't contain sample genotypes. Are you sure it's genotype concordance you want to analyze?

  • cottrell (University of Delaware; Member)

    Okay, which files in the GATK Resource Bundle do have sample genotypes?

    The vcf files in the bundle include:

    $ ls GATK_bundle_2.8_hg19/*.vcf

  • cottrell (University of Delaware; Member)

    I am conducting a study to compare the variant calls that one can obtain from Whole Genome Bisulfite Sequencing (WGBS), Reduced Representation Bisulfite Sequencing (RRBS) and Methylation-sensitive Restriction Enzyme digestion followed by sequencing (MRE-seq).

    Variant calling of the MRE-seq data is being done using GATK best practices. Variant calling of the WGBS and RRBS sequences is being done with Bis-SNP, which is essentially the GATK best practices tweaked to deal with the bisulfite treatment.

    I'm using the human embryonic stem cell (H1) sequence data that was published by Harris et al. (2010). That study performed all three of these sequencing approaches on H1 genomic DNA.

    My goal is to compare the three vcf files generated from these sequence datasets to some kind of a standard set of known variants. I assume that at least one of the vcf files provided by the GATK resource bundle could serve as such a standard.

    My approach to comparing the vcf files includes vcf-compare (VCFtools) and Picard GenotypeConcordance. The statistics that I want to compare are sensitivity, specificity and the number of variants in common among approaches.

    If you could advise me on which vcf file in the bundle would be the appropriate standard for such a comparison, that would be really great.
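
For the "variants in common" part of the comparison, a site-level overlap can be computed without any genotypes. The sketch below keys each record on CHROM, POS, REF and ALT and counts how many call-set sites also appear in a reference set. The function names are hypothetical, and the script assumes both files use the same contig naming and allele representation. Note that against a sites-only resource the resulting fraction is a known-site rate, not sensitivity or specificity, since those need a genotyped truth set for the same sample.

```shell
# Print one key per variant allele (CHROM:POS:REF:ALT), sorted and de-duplicated.
# Both function names here are hypothetical helpers for illustration.
vcf_site_keys() {
    gzip -cdf "$1" | awk -F'\t' '
        /^#/ { next }                      # skip meta and header lines
        {
            n = split($5, alts, ",")       # one key per ALT allele at multi-allelic sites
            for (i = 1; i <= n; i++) print $1 ":" $2 ":" $4 ":" alts[i]
        }' | LC_ALL=C sort -u
}

# Usage: vcf_site_overlap calls.vcf known_sites.vcf
# The body runs in a subshell so its variables do not leak into the caller.
vcf_site_overlap() (
    calls=$(mktemp)
    known=$(mktemp)
    vcf_site_keys "$1" > "$calls"
    vcf_site_keys "$2" > "$known"
    # comm -12 keeps only lines present in both sorted inputs
    shared=$(( $(LC_ALL=C comm -12 "$calls" "$known" | wc -l) ))
    total=$(( $(wc -l < "$calls") ))
    frac=$(awk -v s="$shared" -v t="$total" 'BEGIN { printf "%.4f", (t ? s / t : 0) }')
    printf 'calls=%d shared=%d call_only=%d fraction_known=%s\n' \
        "$total" "$shared" "$((total - shared))" "$frac"
    rm -f "$calls" "$known"
)
```

Running it once per call set (WGBS, RRBS, MRE-seq) against the same reference file gives directly comparable counts.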

  • cottrell (University of Delaware; Member)

    All of the analyses were done using genomic DNA from the same passage of human embryonic stem cells (H1), so it's going to be a clean comparison of the different sequencing approaches.

    Okay, I'll use dbSNP. Documentation elsewhere on this site suggests that VariantEval would be the tool to use, right?

    And I found a very useful page describing the different vcf files in the bundle here. It indicates that when using VariantEval with dbSNP a version of dbSNP subsetted to only sites discovered in or before dbSNP BuildID 129 should be used. I see that one in the bundle and I'll use it.

    Well, I've got my work cut out for me. Thanks for pointing me in the right direction.

