Using vcf files supplied in the GATK Resource Bundle with Picard GenotypeConcordance

cottrell | University of Delaware | Member | Posts: 4

I want to compare my vcf file to a vcf file supplied in the GATK bundle using Picard GenotypeConcordance. In the terminology used by Picard GenotypeConcordance I want to use a vcf file in the bundle as the "truth sample."

The problem is the vcf files in the bundle lack the sample name needed by Picard GenotypeConcordance.

That is, there is no value in these supplied vcf files to satisfy the Picard GenotypeConcordance required option:
TRUTH_SAMPLE (String) The name of the truth sample within the truth VCF Required.

Take dbsnp_138.hg19.vcf.gz as an example:

$ zcat dbsnp_138.hg19.vcf.gz | grep CHROM
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO

Based on the description of the VCF file format elsewhere on this GATK site (https://broadinstitute.org/gatk/guide/article?id=1268), I expected to see a FORMAT field and a sample name field following the INFO field.

How should I proceed?

Answers

  • Geraldine_VdAuwera | Cambridge, MA | Member, Administrator, Broadie | Posts: 11,390

    You can only evaluate genotype concordance between files that have sample genotypes, but resources like dbSNP don't contain sample genotypes. Are you sure it's genotype concordance you want to analyze?

    Geraldine Van der Auwera, PhD

  • cottrell | University of Delaware | Member | Posts: 4

    Okay, which files in the GATK Resource Bundle do have sample genotypes?

    The vcf files in the bundle include:

    $ ls GATK_bundle_2.8_hg19/*.vcf
    1000G_omni2.5.hg19.sites.vcf
    1000G_phase1.indels.hg19.sites.vcf
    1000G_phase1.snps.high_confidence.hg19.sites.vcf
    CEUTrio.HiSeq.WGS.b37.bestPractices.hg19.vcf
    dbsnp_138.hg19.excluding_sites_after_129.vcf
    dbsnp_138.hg19.vcf
    hapmap_3.3.hg19.sites.vcf
    Mills_and_1000G_gold_standard.indels.hg19.sites.vcf
    NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.hg19.sites.vcf
    NA12878.HiSeq.WGS.bwa.cleaned.raw.subset.hg19.vcf
    NA12878.knowledgebase.snapshot.20131119.hg19.vcf
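
One way to check this directly is to look at each file's `#CHROM` header line: in the VCF format, a file with genotypes has a FORMAT column followed by one column per sample, while a sites-only file stops at INFO. The following is a minimal sketch, not an official GATK or Picard utility. The function names `vcf_samples` and `report_vcf_samples` are made up for illustration, and the sketch assumes tab-delimited headers and that `gzip`, `awk` and `paste` are available.

```shell
# Print the sample names declared on the #CHROM header line of a VCF.
# Works for plain or gzip-compressed files: `gzip -cdf` passes
# uncompressed input through unchanged.
vcf_samples() {
    gzip -cdf "$1" | awk -F'\t' '
        /^#CHROM/ {
            # Columns 1-8 are fixed, column 9 is FORMAT, samples start at 10.
            for (i = 10; i <= NF; i++) print $i
            exit
        }'
}

# Report, for each VCF given, whether it is sites-only or carries genotypes.
report_vcf_samples() {
    for vcf in "$@"; do
        samples=$(vcf_samples "$vcf" | paste -sd, -)
        if [ -n "$samples" ]; then
            printf '%s\tsamples: %s\n' "$vcf" "$samples"
        else
            printf '%s\tsites-only (no sample columns)\n' "$vcf"
        fi
    done
}
```

Running `report_vcf_samples GATK_bundle_2.8_hg19/*.vcf` would then list which files declare samples. Any name it reports is a candidate value for TRUTH_SAMPLE; whether that sample is an appropriate truth for a given dataset is a separate question.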
    
  • cottrell | University of Delaware | Member | Posts: 4

    I am conducting a study to compare the variant calls that one can obtain from Whole Genome Bisulfite Sequencing (WGBS), Reduced Representation Bisulfite Sequencing (RRBS) and Methylation-sensitive Restriction Enzyme digestion followed by sequencing (MRE-seq).

    Variant calling of the MRE-seq data is being done using the GATK Best Practices. Variant calling of the WGBS and RRBS sequences is being done with Bis-SNP, which is essentially the GATK Best Practices tweaked to deal with the bisulfite treatment.

    I'm using the human embryonic stem cell (H1) sequence data that was published by Harris et al. (2010). That study performed all three of these sequencing approaches on H1 genomic DNA.

    My goal is to compare the three vcf files generated from these sequence datasets to some kind of a standard set of known variants. I assume that at least one of the vcf files provided by the GATK resource bundle could serve as such a standard.

    My approach to comparing the vcf files includes vcf-compare (VCFtools) and Picard GenotypeConcordance. The statistics that I want to compare are sensitivity, specificity and the number of variants in common among approaches.
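
As a rough illustration of the "variants in common" and sensitivity figures, the sketch below reduces each VCF to CHROM:POS:REF:ALT keys and intersects the two key sets. It is only a site-level comparison under simplifying assumptions (uncompressed, tab-delimited VCFs; exact string match on REF and ALT; no normalization or splitting of multi-allelic records), and it is not a substitute for vcf-compare or GenotypeConcordance. The function names `vcf_site_keys` and `compare_sites` are invented for this example. Specificity is left out because it depends on true negatives, which a list of called sites alone does not define.

```shell
# Reduce a VCF to one sorted, unique CHROM:POS:REF:ALT key per record.
vcf_site_keys() {
    awk -F'\t' '!/^#/ { print $1 ":" $2 ":" $4 ":" $5 }' "$1" | LC_ALL=C sort -u
}

# Compare call-set sites ($1) against truth-set sites ($2).
# Prints the shared, call-only and truth-only site counts, plus
# sensitivity = shared / (shared + truth_only).
compare_sites() {
    calls=$(mktemp); truth=$(mktemp)
    vcf_site_keys "$1" > "$calls"
    vcf_site_keys "$2" > "$truth"
    shared=$(LC_ALL=C comm -12 "$calls" "$truth" | wc -l)
    call_only=$(LC_ALL=C comm -23 "$calls" "$truth" | wc -l)
    truth_only=$(LC_ALL=C comm -13 "$calls" "$truth" | wc -l)
    rm -f "$calls" "$truth"
    awk -v s="$shared" -v c="$call_only" -v t="$truth_only" 'BEGIN {
        printf "shared\t%d\ncall_only\t%d\ntruth_only\t%d\n", s, c, t
        if (s + t > 0) printf "sensitivity\t%.4f\n", s / (s + t)
    }'
}
```

A call such as `compare_sites my_calls.vcf truth.vcf` (both file names are placeholders) prints one tab-separated line per statistic, which makes it easy to tabulate the three sequencing approaches side by side.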

    If you could advise me on which vcf file in the bundle would be the appropriate standard for such a comparison, that would be really great.

  • cottrell | University of Delaware | Member | Posts: 4

    All of the analyses were done using genomic DNA from the same passage of human embryonic stem cells (H1), so it's going to be a clean comparison of the different sequencing approaches.

    Okay, I'll use dbSNP. Documentation elsewhere on this site suggests that VariantEval would be the tool to use, right?

    And I found a very useful page describing the different vcf files in the bundle here. It indicates that when using VariantEval with dbSNP a version of dbSNP subsetted to only sites discovered in or before dbSNP BuildID 129 should be used. I see that one in the bundle and I'll use it.

    Well, I've got my work cut out for me. Thanks for pointing me in the right direction.
