# Using vcf files supplied in the GATK Resource Bundle with Picard GenotypeConcordance

University of Delaware (Member)

I want to compare my vcf file to a vcf file supplied in the GATK bundle using Picard GenotypeConcordance. In the terminology used by Picard GenotypeConcordance I want to use a vcf file in the bundle as the "truth sample."

The problem is the vcf files in the bundle lack the sample name needed by Picard GenotypeConcordance.

That is, there is no value in these supplied vcf files to satisfy the Picard GenotypeConcordance required option:
TRUTH_SAMPLE (String) The name of the truth sample within the truth VCF Required.

Take dbsnp_138.hg19.vcf.gz as an example:

```
$ zcat dbsnp_138.hg19.vcf.gz | grep CHROM
#CHROM POS ID REF ALT QUAL FILTER INFO
```

Based on the description of the vcf file format described elsewhere on this GATK site https://broadinstitute.org/gatk/guide/article?id=1268 I expect to see a FORMAT field and a sample name field following the INFO field.

How should I proceed?

## Answers

• Cambridge, MA (Member, Administrator, Broadie)

You can only evaluate genotype concordance between files that have sample genotypes, but resources like dbSNP don't contain sample genotypes. Are you sure it's genotype concordance you want to analyze?

• University of Delaware (Member)

Okay, which files in the GATK Resource Bundle do have sample genotypes? The vcf files in the bundle include:

```
$ ls GATK_bundle_2.8_hg19/*.vcf
1000G_omni2.5.hg19.sites.vcf
1000G_phase1.indels.hg19.sites.vcf
1000G_phase1.snps.high_confidence.hg19.sites.vcf
CEUTrio.HiSeq.WGS.b37.bestPractices.hg19.vcf
dbsnp_138.hg19.excluding_sites_after_129.vcf
dbsnp_138.hg19.vcf
hapmap_3.3.hg19.sites.vcf
Mills_and_1000G_gold_standard.indels.hg19.sites.vcf
NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.hg19.sites.vcf
NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.hg19.vcf
NA12878.knowledgebase.snapshot.20131119.hg19.vcf
```
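
One way to answer the "which files have sample genotypes?" question directly is to inspect the `#CHROM` header line of each file: in the VCF format, the eight fixed columns are followed by `FORMAT` and then one column per sample. The sketch below is only an illustration; the function name `list_vcf_samples` is made up for this example, and it assumes uncompressed `.vcf` files like those in the listing.

```shell
# Print each VCF path followed by its sample names, or "none" if the
# file carries sites only. list_vcf_samples is a hypothetical helper.
# Assumes uncompressed VCFs; decompress .vcf.gz files first.
list_vcf_samples() {
  for vcf in "$@"; do
    # Columns 1-8 of the #CHROM line are fixed, column 9 is FORMAT,
    # and columns 10 onward are sample names.
    samples=$(grep -m1 '^#CHROM' "$vcf" | cut -f10- | tr '\t' ',')
    printf '%s\t%s\n' "$vcf" "${samples:-none}"
  done
}

# Example call (directory name taken from the listing above):
#   list_vcf_samples GATK_bundle_2.8_hg19/*.vcf
```

Any file reported as `none` cannot supply a `TRUTH_SAMPLE` value to Picard GenotypeConcordance.
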
• University of Delaware (Member)

I am conducting a study to compare the variant calls that one can obtain from Whole Genome Bisulfite Sequencing (WGBS), Reduced Representation Bisulfite Sequencing (RRBS) and Methylation-sensitive Restriction Enzyme digestion followed by sequencing (MRE-seq).

Variant calling of the MRE-seq data is being done using GATK Best Practices. Variant calling of the WGBS and RRBS sequences is being done with Bis-SNP, which is essentially the GATK Best Practices tweaked to deal with the bisulfite treatment.

I'm using the human embryonic stem cell (H1) sequence data that was published by Harris et al. (2010). That study performed all three of these sequencing approaches on H1 genomic DNA.

My goal is to compare the three vcf files generated from these sequence datasets to some kind of a standard set of known variants. I assume that at least one of the vcf files provided by the GATK resource bundle could serve as such a standard.

My approach to comparing the vcf files includes vcf-compare (vcf-tools) and Picard GenotypeConcordance. The statistics that I want to compare are sensitivity, specificity and the number of variants in common among approaches.

If you could advise me on which vcf file in the bundle would be the appropriate standard for such a comparison, that would be really great.
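
A rough sketch of the site-level part of such a comparison, using only standard Unix tools rather than vcf-compare or Picard: it counts the sites two VCFs share and the fraction of the known set that the call set recovers. The function names `site_keys` and `compare_sites` are invented for this illustration, and the matching rule (exact CHROM, POS, REF, ALT, with multi-allelic ALT strings treated as a single key) is an assumption. Specificity is not computed, because it needs true negatives, which site lists alone do not provide.

```shell
# Reduce a VCF to sorted, unique site keys: CHROM, POS, REF, ALT.
# Multi-allelic ALT strings (e.g. "G,T") are kept as one key here.
site_keys() {
  grep -v '^#' "$1" | cut -f1,2,4,5 | LC_ALL=C sort -u
}

# Compare a call set ($1) against a set of known variants ($2).
# Both function names are hypothetical helpers for this sketch.
compare_sites() {
  calls=$(mktemp) && known=$(mktemp) || return 1
  site_keys "$1" > "$calls"
  site_keys "$2" > "$known"
  # comm -12 keeps only the lines present in both sorted inputs.
  shared=$(LC_ALL=C comm -12 "$calls" "$known" | wc -l)
  n_calls=$(wc -l < "$calls")
  n_known=$(wc -l < "$known")
  rm -f "$calls" "$known"
  awk -v s="$shared" -v c="$n_calls" -v k="$n_known" 'BEGIN {
    printf "calls=%d known=%d shared=%d\n", c, k, s
    if (k > 0) printf "fraction_of_known_recovered=%.3f\n", s / k
    if (c > 0) printf "fraction_of_calls_known=%.3f\n", s / c
  }'
}

# Example call (file names are placeholders):
#   compare_sites my_calls.vcf known_sites.vcf
```

Running it once per call set (WGBS, RRBS, MRE-seq) against the same known-sites file gives numbers that can be compared across approaches.
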

• University of Delaware (Member)

All of the analyses were done using genomic DNA from the same passage of human embryonic stem cells (H1), so it's going to be a clean comparison of the different sequencing approaches.

Okay, I'll use dbSNP. Documentation elsewhere on this site suggests that VariantEval would be the tool to use, right?

And I found a very useful page describing the different vcf files in the bundle here. It indicates that when using VariantEval with dbSNP a version of dbSNP subsetted to only sites discovered in or before dbSNP BuildID 129 should be used. I see that one in the bundle and I'll use it.

Well, I've got my work cut out for me. Thanks for pointing me in the right direction.
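
For readers following the same route, a VariantEval run against the pre-129 dbSNP subset might look roughly like the sketch below. It uses the GATK 3 command-line style (`-T VariantEval`, `--eval`, `--dbsnp`, `-o`); the wrapper name `run_variant_eval` and all file names in the example call other than the bundle's dbSNP file are placeholders, not values from this thread.

```shell
# Hypothetical wrapper around a GATK 3 VariantEval call.
# Arguments: GATK jar, reference FASTA, call-set VCF, dbSNP VCF, output report.
run_variant_eval() {
  gatk_jar=$1
  reference=$2
  eval_vcf=$3
  dbsnp_vcf=$4
  out_report=$5
  java -jar "$gatk_jar" \
    -T VariantEval \
    -R "$reference" \
    --eval "$eval_vcf" \
    --dbsnp "$dbsnp_vcf" \
    -o "$out_report"
}

# Example call (placeholder file names except the bundle's dbSNP subset):
#   run_variant_eval GenomeAnalysisTK.jar reference.fasta my_calls.vcf \
#     dbsnp_138.hg19.excluding_sites_after_129.vcf my_calls.eval.grp
```

The report compares sites in the call set against dbSNP, so it does not require sample genotypes in the dbSNP file.
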