
# Using vcf files supplied in the GATK Resource Bundle with Picard GenotypeConcordance

University of Delaware, Member, Posts: 4

I want to compare my vcf file to a vcf file supplied in the GATK bundle using Picard GenotypeConcordance. In Picard GenotypeConcordance terminology, I want to use a vcf file from the bundle as the "truth sample."

The problem is the vcf files in the bundle lack the sample name needed by Picard GenotypeConcordance.

That is, these supplied vcf files contain no value that satisfies the required Picard GenotypeConcordance option:
TRUTH_SAMPLE (String): The name of the truth sample within the truth VCF. Required.

Take dbsnp_138.hg19.vcf.gz as an example:

    $ zcat dbsnp_138.hg19.vcf.gz | grep CHROM
    #CHROM POS ID REF ALT QUAL FILTER INFO

Based on the description of the vcf file format elsewhere on this GATK site (https://broadinstitute.org/gatk/guide/article?id=1268), I expect to see a FORMAT field and a sample name field following the INFO field.

How should I proceed?

## Answers

• Administrator, Dev, Posts: 11,117

You can only evaluate genotype concordance between files that have sample genotypes, but resources like dbSNP don't contain sample genotypes. Are you sure it's genotype concordance you want to analyze?

Geraldine Van der Auwera, PhD

• University of Delaware, Member, Posts: 4

Okay, which files in the GATK Resource Bundle do have sample genotypes? The vcf files in the bundle include:

    $ ls GATK_bundle_2.8_hg19/*.vcf
    1000G_omni2.5.hg19.sites.vcf
    1000G_phase1.indels.hg19.sites.vcf
    1000G_phase1.snps.high_confidence.hg19.sites.vcf
    CEUTrio.HiSeq.WGS.b37.bestPractices.hg19.vcf
    dbsnp_138.hg19.excluding_sites_after_129.vcf
    dbsnp_138.hg19.vcf
    hapmap_3.3.hg19.sites.vcf
    Mills_and_1000G_gold_standard.indels.hg19.sites.vcf
    NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.hg19.sites.vcf
    NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.hg19.vcf
    NA12878.knowledgebase.snapshot.20131119.hg19.vcf
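
One way to check which of these files carry sample genotypes is to read the `#CHROM` header line of each: in the VCF format, sample names appear as extra columns after the eight fixed columns and `FORMAT`. The following is a minimal Python sketch of that check, not a GATK or Picard feature; the function names `sample_names_from_header` and `vcf_sample_names` are made up for illustration.

```python
import gzip


def sample_names_from_header(header_line):
    """Return the sample names found on a VCF '#CHROM' header line.

    A sites-only VCF has just the eight fixed columns (#CHROM .. INFO),
    so the returned list is empty. Files with genotypes add a FORMAT
    column followed by one column per sample.
    """
    columns = header_line.rstrip("\n").split("\t")
    # Columns 0-7 are fixed, column 8 is FORMAT, samples start at column 9.
    return columns[9:]


def vcf_sample_names(path):
    """Open a plain or gzipped VCF (hypothetical helper) and list its samples."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                return sample_names_from_header(line)
            if not line.startswith("#"):
                break  # reached data lines without seeing a column header
    raise ValueError("no #CHROM header line found in " + path)
```

An empty list from `vcf_sample_names("dbsnp_138.hg19.vcf.gz")` would match the header shown earlier in the thread, while a non-empty list gives candidate values for `TRUTH_SAMPLE`.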

• University of Delaware, Member, Posts: 4

I am conducting a study to compare the variant calls that one can obtain from Whole Genome Bisulfite Sequencing (WGBS), Reduced Representation Bisulfite Sequencing (RRBS) and Methylation-sensitive Restriction Enzyme digestion followed by sequencing (MRE-seq).

Variant calling of the MRE-seq data is being done using GATK best practices. Variant calling of the WGBS and RRBS sequences is being done with Bis-SNP, which is essentially the GATK best practices tweaked to deal with the bisulfite treatment.

I'm using the human embryonic stem cell (H1) sequence data that was published by Harris et al. (2010). That study performed all three of these sequencing approaches on H1 genomic DNA.

My goal is to compare the three vcf files generated from these sequence datasets to some kind of a standard set of known variants. I assume that at least one of the vcf files provided by the GATK resource bundle could serve as such a standard.

My approach to comparing the vcf files includes vcf-compare (VCFtools) and Picard GenotypeConcordance. The statistics that I want to compare are sensitivity, specificity and the number of variants in common among approaches.
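
As a rough illustration of these statistics at the site level (ignoring genotypes), the sketch below keys each variant by chromosome, position, REF and ALT and compares two sets. It is a simplification and does not reproduce what vcf-compare or GenotypeConcordance compute; the function names are hypothetical, multi-allelic ALT strings are compared as-is, and the true-negative count needed for specificity is an assumption the caller has to supply, since a VCF of variant sites does not record it.

```python
def site_key(vcf_line):
    """Build a (chrom, pos, ref, alt) key from one tab-separated VCF data line."""
    chrom, pos, _id, ref, alt = vcf_line.rstrip("\n").split("\t")[:5]
    return (chrom, int(pos), ref, alt)


def compare_sites(call_keys, truth_keys):
    """Count shared and private sites and compute site-level sensitivity."""
    calls, truth = set(call_keys), set(truth_keys)
    shared = calls & truth
    return {
        "shared": len(shared),                # true positives
        "only_in_calls": len(calls - truth),  # false positives
        "only_in_truth": len(truth - calls),  # false negatives
        # sensitivity = TP / (TP + FN); undefined for an empty truth set
        "sensitivity": len(shared) / len(truth) if truth else float("nan"),
    }


def specificity(true_negatives, false_positives):
    """specificity = TN / (TN + FP); the TN count must come from elsewhere."""
    total = true_negatives + false_positives
    return true_negatives / total if total else float("nan")
```

The number of variants in common among the three approaches would follow the same pattern, with a set intersection over all three key sets.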

If you could advise me on which vcf file in the bundle would be the appropriate standard for such a comparison, that would be really great.

• University of Delaware, Member, Posts: 4

All of the analyses were done using genomic DNA from the same passage of human embryonic stem cells (H1), so it's going to be a clean comparison of the different sequencing approaches.

Okay, I'll use dbSNP. Documentation elsewhere on this site suggests that VariantEval would be the tool to use, right?

And I found a very useful page describing the different vcf files in the bundle here. It indicates that when using VariantEval with dbSNP a version of dbSNP subsetted to only sites discovered in or before dbSNP BuildID 129 should be used. I see that one in the bundle and I'll use it.
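
To make concrete what "subsetted to build 129" means, here is a small Python sketch that keeps only records first seen in or before a given dbSNP build. It assumes the build is stored in an INFO key named `dbSNPBuildID`, which should be checked against the file's own header; the function names are hypothetical, and this is not how the bundle file was produced. The bundle already ships the subset as `dbsnp_138.hg19.excluding_sites_after_129.vcf`, so this is only for illustration.

```python
def first_dbsnp_build(info_field):
    """Return the integer dbSNPBuildID from a VCF INFO string, or None if absent."""
    for entry in info_field.split(";"):
        key, _, value = entry.partition("=")
        # Assumption: the build number is stored under the key 'dbSNPBuildID'.
        if key == "dbSNPBuildID" and value.isdigit():
            return int(value)
    return None


def keep_sites_up_to_build(vcf_lines, max_build=129):
    """Yield header lines unchanged and data lines whose build is <= max_build."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        info = line.rstrip("\n").split("\t")[7]  # INFO is the eighth column
        build = first_dbsnp_build(info)
        if build is not None and build <= max_build:
            yield line
```

Records without a parsable build number are dropped in this sketch; keeping them instead would be an equally defensible choice.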

Well, I've got my work cut out for me. Thanks for pointing me in the right direction.