Mills reference for indel VQSR

vasya | Posts: 5 | Member

Hi all --

This should be a simple problem -- I cannot find a valid version of the Mills indel reference in the resource bundle, or anywhere else online!

All versions of the reference VCF are stripped of genotypes and do not contain a FORMAT column or any additional annotations.

I am accessing the Broad's public FTP, and none of the Mills files in bundle folders 2.5 or 2.8 is a full VCF. I understand that these are "sites-only" VCFs, but I can't seem to find anything else.

Can anyone link me to a version that contains the recommended annotations for indel VQSR, or that can be annotated?

Answers

  • Geraldine_VdAuwera | Posts: 6,176 | Administrator, GATK Developer
    edited March 27

    Hi @vasya,

    You don't need to have the annotations in the reference VCF for VQSR; you only need them in your test VCF. Are you experiencing issues running the tool?
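
One way to confirm that the test VCF carries the annotations passed to the recalibrator is a quick scan of its INFO column. The following is a minimal sketch in plain Python, not part of GATK; the function name `missing_annotations` and the example records are made up for illustration, and it assumes tab-separated records with INFO in the eighth column.

```python
def missing_annotations(vcf_lines, wanted):
    """Count, per annotation, how many records lack it in the INFO column."""
    missing = {name: 0 for name in wanted}
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\r\n").split("\t")
        # INFO is the 8th column; entries are KEY=VALUE or bare flags, separated by ';'
        keys = {entry.split("=", 1)[0] for entry in fields[7].split(";")}
        for name in wanted:
            if name not in keys:
                missing[name] += 1
    return missing


# Hypothetical records, for illustration only.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tAT\t50\t.\tDP=20;MQRankSum=0.1;ReadPosRankSum=-0.3",
    "1\t200\t.\tGC\tG\t40\t.\tDP=15",
]
print(missing_annotations(example, ["DP", "MQRankSum", "ReadPosRankSum"]))
```

The same function can be pointed at a real file by passing an open file handle instead of the example list.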

    Geraldine Van der Auwera, PhD

  • vasya | Posts: 5 | Member

    Thanks for the quick reply Geraldine! I misunderstood what data were being used to train the model.

    And yes, I am having trouble with the VQSR of indels. GATK throws the following error:

    ERROR MESSAGE: Your input file has a malformed header: The FORMAT field was provided but there is no genotype/sample data

    The Mills reference is missing a FORMAT column.
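
Because the recalibrator reads both the input VCF and the resource VCF, it may help to inspect each header to see which file the message applies to. The sketch below is plain Python, not GATK code; `summarize_vcf_header` is a hypothetical helper. It rests on an assumption not confirmed in this thread: that the message concerns a FORMAT column in the `#CHROM` line with no sample columns after it.

```python
def summarize_vcf_header(lines):
    """Report FORMAT-related facts from a VCF header."""
    format_defs = 0
    columns = None
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("##FORMAT="):
            format_defs += 1  # a FORMAT definition in the meta-information
        elif line.startswith("#CHROM"):
            columns = line.lstrip("#").split("\t")
            break  # the column header line ends the header
    if columns is None:
        raise ValueError("no #CHROM column header line found")
    has_format_column = len(columns) > 8 and columns[8] == "FORMAT"
    samples = columns[9:] if has_format_column else []
    return {
        "format_definitions": format_defs,
        "has_format_column": has_format_column,
        "sample_count": len(samples),
    }


# Hypothetical header, for illustration only: a FORMAT column but no samples.
header = [
    "##fileformat=VCFv4.1",
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT",
]
print(summarize_vcf_header(header))
```

For a compressed file, the lines can come from `gzip.open(path, "rt")`. A result with `has_format_column` true and `sample_count` zero is the pattern assumed above.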

    The relevant parts of the command line (GATK v2.8):

    --use_annotation "DP" --use_annotation "MQRankSum" --use_annotation "ReadPosRankSum" --mode "INDEL" --input:input_0,vcf "/ephemeral/0/condor/dir_20949/tmp-gatk-_t7uu2/input_variants_0.vcf" --resource:Unknown,vcf,known=true,training=true,truth=true,bad=false,prior=12.0 "/ephemeral/0/condor/dir_20949/tmp-gatk-_t7uu2/input_Unknown_0.vcf"

    Here the file "input_Unknown_0.vcf" is pulled directly from the Broad's FTP (/bundle/2.8/hg19/Mills_and_1000G_gold_standard.indels.hg19.vcf.gz).

    We have successfully run the VQSR on SNPs contained in my input VCF using the recommended reference files, and nearly identical annotations.

  • Geraldine_VdAuwera | Posts: 6,176 | Administrator, GATK Developer

    Hm. Can you try deleting the FORMAT definition line in the header? That might do the trick. The file shouldn't need a FORMAT column as far as I can remember. Not sure how we ended up emitting a file with a malformed header, will check that.
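
Deleting the FORMAT definition lines can be done by hand in a text editor, or with a few lines of Python. This is a minimal sketch and not a GATK utility; the names `drop_format_definitions` and `rewrite_vcf` are made up here, and it assumes the definitions are the header lines that start with `##FORMAT=`.

```python
import gzip


def drop_format_definitions(lines):
    """Yield VCF lines, skipping ##FORMAT definition lines in the header."""
    for line in lines:
        if line.startswith("##FORMAT="):
            continue
        yield line


def rewrite_vcf(src_path, dst_path):
    """Write a copy of src_path (plain or .gz) to dst_path without ##FORMAT lines."""
    opener = gzip.open if src_path.endswith(".gz") else open
    with opener(src_path, "rt") as src, open(dst_path, "w") as dst:
        dst.writelines(drop_format_definitions(src))


# Hypothetical header, for illustration only.
header = [
    "##fileformat=VCFv4.1\n",
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
]
cleaned = list(drop_format_definitions(header))
print("".join(cleaned), end="")
```

`rewrite_vcf` writes uncompressed output and leaves the `#CHROM` line and the records untouched, so it only addresses the header definitions.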

    Geraldine Van der Auwera, PhD

  • vasya | Posts: 5 | Member

    I have tried a couple of modified versions of the reference:

    • Removing the FORMAT fields from the header.
    • Removing all but a GT FORMAT field from the header, and adding a GT FORMAT column.

    Unfortunately, both of these produced the same error. Is this the same file on the internal FTP server? Can I get a copy that has been successfully used previously?

    On an unrelated note -- congrats on 5,000 posts! It's a real testament to the support that you provide!

  • vasya | Posts: 5 | Member

    Thanks for your help Geraldine. The recalibration worked perfectly with the file that you specified.