
Understanding ValidateVariants output

sf21 Member
edited December 2018 in Ask the GATK team

Hi,

I am trying to understand what the number of records in the ValidateVariants output means. In the output below, for the VCF file it says it checked 3814206 records, whereas for the GVCF it says 1 record. I am using GATK version v3.7-0-gcfedb67. Thanks!

Output when I run ValidateVariants on a VCF (multi-sample)


Successfully validated the input file. Checked 3814206 records with no failures.
Done. There were 1 WARN messages, the first 1 are repeated below.
WARN 17:01:10,487 IndexDictionaryUtils - Track variant doesn't have a sequence dictionary built in, skipping dictionary validation

Output when running on a GVCF (multi-sample) file.

Successfully validated the input file. Checked 1 records with no failures.
There were 2 WARN messages, the first 2 are repeated below.
WARN 22:00:24,094 IndexDictionaryUtils - Track variant doesn't have a sequence dictionary built in, skipping dictionary validation
WARN 22:00:24,159 ValidateVariants - GVCF format is currently incompatible with allele validation. Not validating Alleles.

Answers

  • AdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Hi there @sf21 - Could you please send me the headers of your two files? That might help determine whether your GVCF is creating a #GVCFBlock that combines variant sites. A little more information about the types and number of samples you are running would also be helpful. Also, were the VCF files generated by HaplotypeCaller in GATK 3.7?

    Also, what is the exact command you ran when validating the GVCF? Did you use the option "--validateGVCF"?

    Here is a description of the parameter:

    --validateGVCF / -gvcf
    Validate this file as a GVCF
    This validation option REQUIRES that the input GVCF satisfies the following conditions: (1) every variant record must feature an allele in the list of ALT alleles, and (2) every position in the genomic territory under consideration must be covered by a record, whether a single-position record or a reference block record. If the analysis that produced the file was restricted to a subset of genomic regions (for example using the -L or -XL arguments), the same intervals must be provided for validation. Otherwise, the validation tool will find positions that are not covered by records and will fail.
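
    For example, if the GVCF was called over a restricted set of intervals, a validation command would look roughly like this (just a sketch; the reference, GVCF, and interval file names below are placeholders):

    java -Xmx32g -jar GenomeAnalysisTK.jar \
        -T ValidateVariants \
        -R reference.fasta \
        -V sample.raw.g.vcf.gz \
        -L calling_intervals.interval_list \
        -gvcf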

    Please provide the command for both the VCF and GVCF.

    Here is a discussion about the differences between VCF and gVCF.

  • sf21 Member

    I need to correct my original post. Looking at the logs, I noticed the GVCF validation was run using 3.7, whereas the VCF validation was run using version v3.6-0-g89b7209. I ran ValidateVariants from GATK versions 3.4 through 3.7 on the VCF and found that the number of records reported in the output differs:

    Version            Number of records
    v3.4-0-g7e26428    4958138
    v3.5-0-g36282e4    4958138
    v3.6-0-g89b7209    6823860
    v3.7-0-gcfedb67    0
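
    The comparison above can be scripted along these lines (just a sketch; only the 3.7 jar path is quoted elsewhere in this post, the 3.4-3.6 paths are placeholders for my local install layout):

    for jar in /nfs/sw/gatk/gatk-3.4/GenomeAnalysisTK.jar \
               /nfs/sw/gatk/gatk-3.5/GenomeAnalysisTK.jar \
               /nfs/sw/gatk/gatk-3.6/GenomeAnalysisTK.jar \
               /nfs/sw/gatk/gatk-3.7/GenomeAnalysisTK.jar
    do
        echo "== ${jar} =="
        /nfs/sw/java/jdk-1.8.0.45/bin/java -Xmx32g -jar "${jar}" -T ValidateVariants \
            -R /resources/GRCh38_1000genomes/GRCh38_full_analysis_set_plus_decoy_hla.fa \
            -V b38_NA12878_2018-03-19.recalibrated_variants.vcf.gz 2>&1 | grep "Checked"
    done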

    ValidateVariants command on VCF
    /nfs/sw/java/jdk-1.8.0.45/bin/java -Djava.io.tmpdir=$PWD -Xmx32g -jar /nfs/sw/gatk/gatk-3.7/GenomeAnalysisTK.jar -T ValidateVariants -R /resources/GRCh38_1000genomes/GRCh38_full_analysis_set_plus_decoy_hla.fa -V b38_NA12878_2018-03-19.recalibrated_variants.vcf.gz > $PWD/b38_NA12878_2018-03-19.recalibrated_variants.vcf.gz.vv

    ValidateVariants command on gVCF
    /nfs/sw/java/jdk-1.8.0.45/bin/java -Djava.io.tmpdir=$PWD -Xmx32g -jar /nfs/sw/gatk/gatk-3.7/GenomeAnalysisTK.jar -T ValidateVariants -R /resources/GRCh38_1000genomes/GRCh38_full_analysis_set_plus_decoy_hla.fa -V b38_NA12878_2018-03-19.raw.g.vcf.gz -gvcf > $PWD/b38_NA12878_2018-03-19.raw.g.vcf.gz.vv

    Find the headers attached.

  • AdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @sf21,

    It appears you used the b38_NA12878_2018-03-19.raw.g.vcf.gz file instead of the b38_NA12878_2018-03-19.recalibrated_variants.vcf.gz file in the second command.

  • sf21 Member

    The first command is for running on the VCF, whereas the second one is for running on the gVCF, so the inputs are different.

  • bhanuGandham Cambridge MA, Member, Administrator, Broadie, Moderator admin
    edited December 2018

    Hi @sf21

    Both of the file headers you sent have the GVCFBlock lines in them, so it looks like both are gVCF files. Would you please check that and get back to us? Thank you.

  • sf21sf21 Member

    Hi @bhanuGandham, that is because running the gVCF generated by HaplotypeCaller through the variant calling steps outlined in the Best Practices (GenotypeGVCFs, VQSR, etc.) does not remove the GVCFBlock lines from the header. If you look at the bcftools_viewCommand line in each header, you will notice they were generated from different files.

    And just to be clear:
    b38_NA12878_2018-03-19.raw.g.vcf.gz - the gVCF generated by HaplotypeCaller version 3.5-0-g36282e4
    b38_NA12878_2018-03-19.recalibrated_variants.vcf.gz - the VCF file produced by running GenotypeGVCFs and VQSR on the above gVCF
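
    For anyone who wants to check the same thing, the relevant header lines can be pulled out like this (a quick sketch, assuming bcftools is on the PATH):

    bcftools view -h b38_NA12878_2018-03-19.raw.g.vcf.gz | grep -E "GVCFBlock|bcftools_viewCommand"
    bcftools view -h b38_NA12878_2018-03-19.recalibrated_variants.vcf.gz | grep -E "GVCFBlock|bcftools_viewCommand"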

  • AdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @sf21 Could you please provide the command that was used to generate the gVCF and the commands that were used to generate the VCF?

    What is the source of your sequencing data? How many samples and what is the coverage depth?

  • AdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @sf21 Also, what is your reference file? I see from the warning that no sequence dictionary was available. Are you using the latest resource bundle from the Google Cloud bucket?

    I see mention of gVCF record keeping being slightly different in this thread.

  • sf21 Member
    edited December 2018

    The sample was sequenced on an Illumina HiSeq X and processed using the CCDG standardized pipeline. Target coverage was 30X. What dictionary file are you referring to? The reference dictionary (*.dict) file was available in the same directory as the fasta file.

    GenotypeGVCFs command:
    ${java} -Djava.io.tmpdir=${tmpdir} -Xmx90g -XX:ParallelGCThreads=1 \
        -jar ${gatk} \
        -T GenotypeGVCFs \
        -R ${reference} \
        -nt 5 \
        --disable_auto_index_creation_and_locking_when_reading_rods \
        --variant ${gvcf} -o ${output_vcf}

    VQSR - VariantRecalibration (SNP)
    ${java} -Djava.io.tmpdir=${tmpdir} -Xmx90g -XX:ParallelGCThreads=1 \
        -jar ${gatk} \
        -T VariantRecalibrator \
        -R ${reference} \
        -nt 5 \
        -input ${raw_vcfs} \
        -mode SNP \
        -recalFile ${tmpdir}/${analysis_name}.recalibrate_${vqsr_mode}.recal \
        -tranchesFile ${tmpdir}/${analysis_name}.recalibrate_${vqsr_mode}.tranches \
        -rscriptFile ${tmpdir}/${analysis_name}.recalibrate_${vqsr_mode}_plots.R \
        -resource:hapmap,known=false,training=true,truth=true,prior=15.0 ${hapmap} \
        -resource:omni,known=false,training=true,truth=true,prior=12.0 ${kg_omni} \
        -resource:1000G,known=false,training=true,truth=false,prior=10.0 ${kg_snps} \
        -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ${dbsnp} \
        -an QD -an MQ -an FS -an MQRankSum -an ReadPosRankSum -an SOR -an DP \
        -tranche 100.0 -tranche 99.8 -tranche 99.6 -tranche 99.4 -tranche 99.2 -tranche 99.0 -tranche 95.0 -tranche 90.0 \

    VQSR - VariantRecalibration (INDEL)
    ${java} -Djava.io.tmpdir=${tmpdir} -Xmx90g -XX:ParallelGCThreads=1 \
        -jar ${gatk} \
        -T VariantRecalibrator \
        -R ${reference} \
        -nt 5 \
        -input ${raw_vcfs} \
        -mode INDEL \
        -recalFile ${tmpdir}/${analysis_name}.recalibrate_${vqsr_mode}.recal \
        -tranchesFile ${tmpdir}/${analysis_name}.recalibrate_${vqsr_mode}.tranches \
        -rscriptFile ${tmpdir}/${analysis_name}.recalibrate_${vqsr_mode}_plots.R \
        -resource:mills,known=true,training=true,truth=true,prior=12.0 ${kg_mills} \
        -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 ${dbsnp} \
        -an QD -an FS -an ReadPosRankSum -an MQRankSum -an SOR -an DP \
        -tranche 100.0 -tranche 99.0 -tranche 95.0 -tranche 92.0 -tranche 90.0 \
        --maxGaussians 4

    VQSR - ApplyRecalibration (SNP)
    ${java} -Djava.io.tmpdir=${tmpdir} -Xmx90g -XX:ParallelGCThreads=1 \
        -jar ${gatk} \
        -T ApplyRecalibration \
        -R ${reference} \
        -nt 5 \
        -input ${raw_vcf} \
        -mode SNP \
        --ts_filter_level 99.60 \
        -recalFile ${tmpdir}/${analysis_name}.recalibrate_SNP.recal \
        -tranchesFile ${tmpdir}/${analysis_name}.recalibrate_SNP.tranches \
        -o ${output_vcf}

    VQSR - ApplyRecalibration (INDEL)
    ${java} -Djava.io.tmpdir=${tmpdir} -Xmx90g -XX:ParallelGCThreads=1 \
        -jar ${gatk} \
        -T ApplyRecalibration \
        -R ${reference} \
        -nt 5 \
        -input ${raw_vcf} \
        -mode INDEL \
        --ts_filter_level 99.60 \
        -recalFile ${tmpdir}/${analysis_name}.recalibrate_INDEL.recal \
        -tranchesFile ${tmpdir}/${analysis_name}.recalibrate_INDEL.tranches \
        -o ${output_vcf}

    I am using resource files that came with the GATK bundle.
    reference=GRCh38_full_analysis_set_plus_decoy_hla.fa
    dbsnp=Homo_sapiens_assembly38.dbsnp138.vcf
    hapmap=/GatkBundle/hg38bundle/hapmap_3.3.hg38.vcf.gz
    kg_omni=/GatkBundle/hg38bundle/1000G_omni2.5.hg38.vcf.gz
    kg_snps=/GatkBundle/hg38bundle/1000G_phase1.snps.high_confidence.hg38.vcf.gz
    kg_mills=/GatkBundle/hg38bundle/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz

  • bhanuGandham Cambridge MA, Member, Administrator, Broadie, Moderator admin
    edited December 2018

    Hi @sf21

    It looks like this may be a bug in GATK3. Would you please try ValidateVariants in the latest GATK4 version? Please get back to us if this does not resolve the issue.
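
    For example, a GATK4 run would look roughly like this (a sketch, assuming the gatk launcher script that ships with GATK4 is on your PATH; adjust the memory setting and file paths to your setup):

    gatk --java-options "-Xmx32g" ValidateVariants \
        -R /resources/GRCh38_1000genomes/GRCh38_full_analysis_set_plus_decoy_hla.fa \
        -V b38_NA12878_2018-03-19.recalibrated_variants.vcf.gz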

    Thank you!

  • sf21 Member

    Sorry about the delay in responding, I was on vacation. I ran it with GATK4 and the number of variants reported in the log is consistent with versions 3.4 and 3.5 (4958138). Below is the output:
    10:17:08.628 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/nfs/sw/gatk/gatk-4.0.5.1/gatk-package-4.0.5.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
    10:17:08.759 INFO ValidateVariants - ------------------------------------------------------------
    10:17:08.760 INFO ValidateVariants - The Genome Analysis Toolkit (GATK) v4.0.5.1
    10:17:08.760 INFO ValidateVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
    10:17:08.760 INFO ValidateVariants - Executing as [email protected] on Linux v3.10.0-862.el7.x86_64 amd64
    10:17:08.760 INFO ValidateVariants - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_45-b14
    10:17:08.760 INFO ValidateVariants - Start Date/Time: January 14, 2019 10:17:08 AM EST
    10:17:08.760 INFO ValidateVariants - ------------------------------------------------------------
    10:17:08.760 INFO ValidateVariants - ------------------------------------------------------------
    10:17:08.761 INFO ValidateVariants - HTSJDK Version: 2.15.1
    10:17:08.761 INFO ValidateVariants - Picard Version: 2.18.2
    10:17:08.761 INFO ValidateVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
    10:17:08.761 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
    10:17:08.761 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
    10:17:08.762 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
    10:17:08.762 INFO ValidateVariants - Deflater: IntelDeflater
    10:17:08.762 INFO ValidateVariants - Inflater: IntelInflater
    10:17:08.762 INFO ValidateVariants - GCS max retries/reopens: 20
    10:17:08.762 INFO ValidateVariants - Using google-cloud-java patch 6d11bef1c81f885c26b2b56c8616b7a705171e4f from https://github.com/droazen/google-cloud-java/tree/dr_all_nio_fixes
    10:17:08.762 INFO ValidateVariants - Initializing engine
    10:17:09.644 INFO FeatureManager - Using codec VCFCodec to read file file:///b38_NA12878_2018-03-19.recalibrated_variants.vcf.gz
    10:17:09.959 INFO ValidateVariants - Done initializing engine
    10:17:09.959 INFO ProgressMeter - Starting traversal
    10:17:09.960 INFO ProgressMeter - Current Locus  Elapsed Minutes  Variants Processed  Variants/Minute
    10:17:19.971 INFO ProgressMeter - chr3:54628839  0.2  842000  5048466.1
    10:17:29.972 INFO ProgressMeter - chr6:87608874  0.3  1863000  5585927.7
    10:17:39.978 INFO ProgressMeter - chr10:59140712  0.5  2839000  5674784.3
    10:17:57.654 INFO ProgressMeter - chr12:69671346  0.8  3314000  4169077.9
    10:18:17.776 INFO ProgressMeter - chr16:2956240  1.1  3889000  3440780.9
    10:18:27.777 INFO ProgressMeter - chrX:119383104  1.3  4822000  3717953.7
    10:18:29.794 INFO ProgressMeter - chrUn_JTFH01001972v1_decoy:993  1.3  4958138  3726382.3
    10:18:29.794 INFO ProgressMeter - Traversal complete. Processed 4958138 total variants in 1.3 minutes.
    10:18:29.794 INFO ValidateVariants - Shutting down engine
    [January 14, 2019 10:18:29 AM EST] org.broadinstitute.hellbender.tools.walkers.variantutils.ValidateVariants done. Elapsed time: 1.35 minutes.
    Runtime.totalMemory()=5762449408

  • bhanuGandham Cambridge MA, Member, Administrator, Broadie, Moderator admin
    edited January 14

    Hi @sf21

    Looks like that was probably caused by a bug in version 3.7. Most bugs are resolved in the latest versions, so we recommend you use the latest release. Version 4.0.12.0 is already out, and version 4.1 will be released in a few weeks; keep an eye out for it, as it has interesting new features.
