Mills reference for indel VQSR

Hi all --

This should be a simple problem -- I cannot find a valid version of the Mills indel reference in the resource bundle, or anywhere else online!

All versions of the reference VCF are stripped of genotypes and do not contain a FORMAT column or any additional annotations.

I am accessing the Broad's public FTP, and none of the Mills VCF files in bundle folders 2.5 or 2.8 is a full VCF. I understand that there are "sites only" VCFs, but I can't seem to find anything else.

Can anyone link me to a version that contains the recommended annotations for indel VQSR, or that can be annotated?

Answers

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)
    edited March 2014

    Hi @vasya,

    You don't need to have the annotations in the reference VCF for VQSR; you only have to have them in your test VCF. Are you experiencing issues running the tool?
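
As a rough way to confirm that the test VCF, rather than the reference, carries the annotations, the sketch below counts how many records have each annotation in their INFO column. The function name `annotations_present` and the two example records are hypothetical; the annotation names are the ones passed via `--use_annotation` in the command quoted later in this thread.

```python
def annotations_present(vcf_lines, wanted):
    """Count, per annotation, the records whose INFO column carries it."""
    counts = {name: 0 for name in wanted}
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and column header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        # INFO is the 8th column: semicolon-separated KEY=VALUE pairs or flags
        keys = {entry.split("=", 1)[0] for entry in fields[7].split(";")}
        for name in wanted:
            if name in keys:
                counts[name] += 1
    return counts


# Hypothetical two-record test VCF, for illustration only.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tAT\t50\tPASS\tDP=12;MQRankSum=0.3;ReadPosRankSum=-0.1",
    "1\t200\t.\tGC\tG\t40\tPASS\tDP=8",
]
counts = annotations_present(example, ["DP", "MQRankSum", "ReadPosRankSum"])
print(counts)
```

An annotation with a count of zero in the test VCF is worth investigating before running the recalibration; the same check on a sites-only reference file is expected to come back empty.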

  • vasya (Member)

    Thanks for the quick reply Geraldine! I misunderstood what data were being used to train the model.

    And yes, I am having trouble with the VQSR of indels. GATK throws the following error:

    ERROR MESSAGE: Your input file has a malformed header: The FORMAT field was provided but there is no genotype/sample data

    The Mills reference is missing a FORMAT column.
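
To narrow down where the header and the records disagree, a small check along these lines can help. It is only a sketch: the function name `format_header_problems` is hypothetical, and it is an assumption that either of the two conditions it reports is what triggers the GATK error above.

```python
def format_header_problems(vcf_lines):
    """Report mismatches between FORMAT declarations and the column header."""
    problems = []
    has_format_defs = False
    for line in vcf_lines:
        if line.startswith("##FORMAT"):
            has_format_defs = True
        elif line.startswith("#CHROM"):
            columns = line.rstrip("\n").split("\t")
            has_format_col = "FORMAT" in columns
            # The first 8 columns are fixed, the 9th is FORMAT, the rest are samples.
            n_samples = max(0, len(columns) - 9)
            if has_format_col and n_samples == 0:
                problems.append("FORMAT column present but no sample columns")
            if has_format_defs and not has_format_col:
                problems.append("##FORMAT definitions present but no FORMAT column")
            break  # only the header is inspected
    return problems


# Hypothetical headers, for illustration only.
bad_header = [
    "##fileformat=VCFv4.1",
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT",
]
clean_header = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]
print(format_header_problems(bad_header))
print(format_header_problems(clean_header))
```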

    The relevant parts of the command line (GATK v2.8):

    --use_annotation "DP"
    --use_annotation "MQRankSum"
    --use_annotation "ReadPosRankSum"
    --mode "INDEL"
    --input:input_0,vcf "/ephemeral/0/condor/dir_20949/tmp-gatk-_t7uu2/input_variants_0.vcf"
    --resource:Unknown,vcf,known=true,training=true,truth=true,bad=false,prior=12.0 "/ephemeral/0/condor/dir_20949/tmp-gatk-_t7uu2/input_Unknown_0.vcf"

    Here the file "input_Unknown_0.vcf" is pulled directly from the Broad's FTP (/bundle/2.8/hg19/Mills_and_1000G_gold_standard.indels.hg19.vcf.gz).

    We have successfully run the VQSR on SNPs contained in my input VCF using the recommended reference files, and nearly identical annotations.

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    Hm. Can you try deleting the FORMAT definition line in the header? That might do the trick. The file shouldn't need a FORMAT column as far as I can remember. Not sure how we ended up emitting a file with a malformed header, will check that.

  • vasya (Member)

    I have tried a couple of modified versions of the reference:

    • Removing the FORMAT fields from the header.
    • Removing all but a GT FORMAT field from the header, and adding a GT FORMAT column.

    Unfortunately, both of these produced the same error. Is this the same file as the one on the internal FTP server? Can I get a copy that has previously been used successfully?
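
For anyone attempting the first modification by script, the sketch below drops the `##FORMAT` definitions and also trims the column header and every record to the eight fixed columns, so that no FORMAT column is left behind without sample data. The function name `to_sites_only` is hypothetical, and nothing in this thread confirms that this change resolves the error; the poster reports success only after switching to a different file.

```python
def to_sites_only(vcf_lines):
    """Return a copy of the VCF lines with all FORMAT/sample content removed."""
    out = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        if line.startswith("##FORMAT"):
            continue  # drop FORMAT definitions from the header
        if line.startswith("##"):
            out.append(line)  # keep every other meta-information line as-is
            continue
        # Column header (#CHROM ...) and records: keep CHROM through INFO only.
        out.append("\t".join(line.split("\t")[:8]))
    return out


# Hypothetical input, for illustration only.
example = [
    "##fileformat=VCFv4.1",
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT",
    "1\t100\t.\tA\tAT\t50\tPASS\tset=Intersect\tGT",
]
cleaned = to_sites_only(example)
for row in cleaned:
    print(row)
```

Compressed bundle files would need to be decompressed first (for example with Python's `gzip.open` in text mode) before the lines are passed in.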

    On an unrelated note -- congrats on 5,000 posts! It's a real testament to the support that you provide!

  • vasya (Member)

    Thanks for your help Geraldine. The recalibration worked perfectly with the file that you specified.
