Understanding ValidateVariants output

sf21sf21 Member
edited December 5 in Ask the GATK team

Hi,

I am trying to understand what the number of records means in the ValidateVariants output. In the output below, the tool says it checked 3814206 records for the VCF file, but only 1 record for the GVCF. I am using GATK version v3.7-0-gcfedb67. Thanks!

Output when I run ValidateVariants on a VCF (multi-sample)

Successfully validated the input file. Checked 3814206 records with no failures.
Done. There were 1 WARN messages, the first 1 are repeated below.
WARN 17:01:10,487 IndexDictionaryUtils - Track variant doesn't have a sequence dictionary built in, skipping dictionary validation

Output when running on a GVCF (multi-sample) file.

Successfully validated the input file. Checked 1 records with no failures.
There were 2 WARN messages, the first 2 are repeated below.
WARN 22:00:24,094 IndexDictionaryUtils - Track variant doesn't have a sequence dictionary built in, skipping dictionary validation
WARN 22:00:24,159 ValidateVariants - GVCF format is currently incompatible with allele validation. Not validating Alleles.

Answers

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Hi there @sf21 - Could you please send me the headers of your two files? That might help determine whether your GVCF has a GVCFBlock header and is combining variant sites into blocks. A little more information about the types and number of samples that you are running would also be helpful. Were the VCF files generated by HaplotypeCaller in GATK 3.7?

    Also, what is the exact command you ran when validating the GVCF? Did you use the option "--validateGVCF"?

    Here is a description of the parameter:

    --validateGVCF / -gvcf
    Validate this file as a GVCF
    This validation option REQUIRES that the input GVCF satisfies the following conditions: (1) every variant record must feature a <NON_REF> allele in the list of ALT alleles, and (2) every position in the genomic territory under consideration must be covered by a record, whether a single-position record or a reference block record. If the analysis that produced the file was restricted to a subset of genomic regions (for example using the -L or -XL arguments), the same intervals must be provided for validation. Otherwise, the validation tool will find positions that are not covered by records and will fail.
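The two conditions in that description can be sketched in a few lines of Python. This is only an illustration, not how ValidateVariants is implemented: the function name `check_gvcf_records` is hypothetical, the required ALT allele is assumed to be the `<NON_REF>` symbolic allele, and the sketch only looks for gaps between consecutive records on the same contig (it does not check contig starts or ends, or any -L / -XL intervals).

```python
def check_gvcf_records(lines):
    """Return a list of problem descriptions for an iterable of GVCF text lines.

    Hypothetical helper: approximates the two --validateGVCF conditions
    on tab-separated VCF data lines sorted by position.
    """
    problems = []
    prev_chrom, prev_end = None, 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt, info = (
            fields[0], int(fields[1]), fields[3], fields[4], fields[7]
        )

        # Condition 1 (assumed): <NON_REF> must be among the ALT alleles.
        if "<NON_REF>" not in alt.split(","):
            problems.append(f"{chrom}:{pos} has no <NON_REF> ALT allele")

        # A reference block carries END= in INFO; otherwise the record
        # spans the length of the REF allele.
        end = pos + len(ref) - 1
        for item in info.split(";"):
            if item.startswith("END="):
                end = int(item[4:])

        # Condition 2 (simplified): no uncovered positions between
        # consecutive records on the same contig.
        if chrom == prev_chrom:
            if pos > prev_end + 1:
                problems.append(f"{chrom}:{prev_end + 1}-{pos - 1} not covered")
            prev_end = max(prev_end, end)
        else:
            prev_chrom, prev_end = chrom, end
    return problems
```

For a bgzipped file, the lines could be supplied with `gzip.open(path, "rt")` from the standard library; an empty result means neither condition was violated under the assumptions above.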

    Please provide the command for both the VCF and GVCF.

    Here is a discussion about the differences between VCF and gVCF.

  • sf21sf21 Member

    I need to correct my original post. Looking at the logs, I noticed that GVCF validation was run using 3.7, whereas VCF validation was run using version v3.6-0-g89b7209. I then ran ValidateVariants from GATK versions 3.4 through 3.7 on the VCF and found that the number of records reported in the output differs between versions.

    Version            Number of records
    v3.4-0-g7e26428    4958138
    v3.5-0-g36282e4    4958138
    v3.6-0-g89b7209    6823860
    v3.7-0-gcfedb67    0
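One way to get an independent number to compare with the counts above is to count the data lines in the file directly. The sketch below is a minimal example; `count_records` is a hypothetical helper, and it is an assumption that one non-header line corresponds to one "record" as ValidateVariants reports it.

```python
def count_records(lines):
    """Count non-empty lines that are not VCF header lines (starting with '#')."""
    return sum(1 for line in lines if line.strip() and not line.startswith("#"))
```

For a compressed file this could be called as `count_records(gzip.open(path, "rt"))` using the standard-library `gzip` module.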

    ValidateVariants command on VCF
    /nfs/sw/java/jdk-1.8.0.45/bin/java -Djava.io.tmpdir=$PWD -Xmx32g -jar /nfs/sw/gatk/gatk-3.7/GenomeAnalysisTK.jar -T ValidateVariants -R /resources/GRCh38_1000genomes/GRCh38_full_analysis_set_plus_decoy_hla.fa -V b38_NA12878_2018-03-19.recalibrated_variants.vcf.gz > $PWD/b38_NA12878_2018-03-19.recalibrated_variants.vcf.gz.vv

    ValidateVariants command on gVCF
    /nfs/sw/java/jdk-1.8.0.45/bin/java -Djava.io.tmpdir=$PWD -Xmx32g -jar /nfs/sw/gatk/gatk-3.7/GenomeAnalysisTK.jar -T ValidateVariants -R /resources/GRCh38_1000genomes/GRCh38_full_analysis_set_plus_decoy_hla.fa -V b38_NA12878_2018-03-19.raw.g.vcf.gz -gvcf > $PWD/b38_NA12878_2018-03-19.raw.g.vcf.gz.vv

    Find the headers attached.

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @sf21,

    It appears you used the b38_NA12878_2018-03-19.raw.g.vcf.gz file instead of the b38_NA12878_2018-03-19.recalibrated_variants.vcf.gz file in the second command.

  • sf21sf21 Member

    The first command is for running on the VCF, whereas the second one is for running on the gVCF, so the inputs are different.

  • bhanuGandhambhanuGandham Member, Administrator, Broadie, Moderator admin
    edited December 10

    Hi @sf21

    Both of the file headers that you sent contain GVCFBlock lines, so it looks like both are gVCF files. Would you please check that and get back to us? Thank you.
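A quick way to check a header for this is to scan the meta-information lines for a GVCFBlock entry. The following is a rough sketch only; `has_gvcf_block_header` is a hypothetical name, and it assumes the relevant header lines begin with `##GVCFBlock`.

```python
def has_gvcf_block_header(lines):
    """Return True if any '##' header line starts with '##GVCFBlock'."""
    for line in lines:
        if not line.startswith("##"):
            break  # reached the #CHROM line or the first data line
        if line.startswith("##GVCFBlock"):
            return True
    return False
```

Running it on both the .vcf.gz and the .g.vcf.gz (for example via `gzip.open(path, "rt")`) would show whether each file carries these header lines.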
