We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
We will be out of the office for a Broad Institute event from Dec 10th to Dec 11th 2019. We will be back to monitor the GATK forum on Dec 12th 2019. In the meantime we encourage you to help out other community members with their queries.
Thank you for your patience!
Genotype qualities/Qual scores for All sites

Hi
I am trying analyze 8 exome datasets. In the final output, I want to get calibrated QUAL or Genotype scores for each position (even if it is homozygous ref) in the target region.
I am following the following workflow to analyze this dataset: Calling variants on cohorts of samples using the HaplotypeCaller in GVCF mode.
As per the above document, I ran haplotypecaller on individual bams to generate gvcf. Then I run genotypeGVCF on all the gvcfs. However for each position, I do not see any annotations in the INFO field which will be then used by VQSR step. How can I add more annotations to the vcf? Below is a copy of the command I used and snapshot of the vcf. Please note that currently for testing I used only 8 datasets in genotypeGVCF but in final run, I plan to include 20 exome datsets from 1000 genomes as specified in best practices.
I appreciate any help.
/usr/java/jre1.7.0_67/bin/java -Xmx4g -jar ~/software/gatk/GenomeAnalysisTK.jar \ -R ~/software/gatk/resource_bundle/v2.8/ucsc.hg19.fasta \ -T GenotypeGVCFs \ -D ~/software/gatk/resource_bundle/v2.5/dbsnp_137.hg19.vcf \ --disable_auto_index_creation_and_locking_when_reading_rods \ -stand_call_conf 30 \ -stand_emit_conf 0 \ -L /home/target_hg19.bed \ --includeNonVariantSites \ --variant ../1hc_cohort.gvcf \ --variant ../2hc_cohort.gvcf \ ..... --variant ../8hc_cohort.gvcf \ -o genotyped.vcf ##fileformat=VCFv4.1 ##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location"> ##FILTER=<ID=LowQual,Description="Low quality"> ##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed"> ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)"> ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality"> ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype"> ##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block"> ##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification"> ##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias."> ##GATKCommandLine=<ID=GenotypeGVCFs,Version=3.2-2-gec30cee,Date="Wed Oct 15 11:45:34 PDT 2014",Epoch=1413398734405,CommandLineOptions="analysis_type=GenotypeGVCFs input_file=[] showFullBamList=false read_buf fer_size=null phone_home=AWS gatk_key=null tag=NA read_filter=[] intervals=[/home/rjain/2014/2014_08_ExomeCompetition/SRP004917/TargetRegion/chips/AGWE1.0.target_hg19.bed] excludeIntervals=null interval_set_ rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/home/rjain/software/gatk/resource_bundle/v2.8/ucsc.hg19.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxR untimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allo w_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 disable_indel_quals=false emit_original_quals=false preserve_qs cores_less_than=6 globalQScorePrior=-1.0 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locki ng_when_reading_rods=true num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] ped igreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 logging_level=INFO log_to_file=null help=false version=f alse variant=[(RodBindingCollection [(RodBinding name=variant source=../SRR089532_SRR089535/hc_cohort.gvcf)]), (RodBindingCollection [(RodBinding name=variant2 source=../SRR089538_SRR089541/hc_cohort.gvcf)]) , (RodBindingCollection [(RodBinding name=variant3 source=../SRR089553_SRR089555/hc_cohort.gvcf)]), (RodBindingCollection [(RodBinding name=variant4 source=../SRR089575_SRR089576/hc_cohort.gvcf)]), (RodBindi ngCollection [(RodBinding name=variant5 source=../SRR089579_SRR089581/hc_cohort.gvcf)]), (RodBindingCollection [(RodBinding name=variant6 source=../SRR089582_SRR089584/hc_cohort.gvcf)]), (RodBindingCollectio n [(RodBinding name=variant7 source=../SRR089585_SRR089587/hc_cohort.gvcf)]), (RodBindingCollection [(RodBinding name=variant8 source=../SRR089588_SRR089590/hc_cohort.gvcf)])] out=org.broadinstitute.gatk.eng ine.io.stubs.VariantContextWriterStub no_cmdline_in_header=org.broadinstitute.gatk.engine.io.stubs.VariantContextWriterStub sites_only=org.broadinstitute.gatk.engine.io.stubs.VariantContextWriterStub bcf=org .broadinstitute.gatk.engine.io.stubs.VariantContextWriterStub includeNonVariantSites=true annotateNDA=false heterozygosity=0.001 indel_heterozygosity=1.25E-4 standard_min_confidence_threshold_for_calling=30. 0 standard_min_confidence_threshold_for_emitting=0.0 max_alternate_alleles=6 input_prior=[] sample_ploidy=2 annotation=[InbreedingCoeff, FisherStrand, QualByDepth, ChromosomeCounts, GenotypeSummaries] dbsnp= (RodBinding name=dbsnp source=/home/rjain/software/gatk/resource_bundle/v2.5/dbsnp_137.hg19.vcf) filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false"> ##GATKCommandLine=<ID=HaplotypeCaller,Version=3.2-2-gec30cee,Date="Tue Oct 14 15:39:08 PDT 2014",Epoch=1413326348917,CommandLineOptions="analysis_type=HaplotypeCaller input_file=[/home/rjain/2014/2014_08_Exo meCompetition/SRP004917/bowtie2_v1/SRR089553_SRR089555/SRR089553_filtered_gatk/withrg_reorder_realign_recal.bam, /home/rjain/2014/2014_08_ExomeCompetition/SRP004917/bowtie2_v1/SRR089553_SRR089555/SRR089555_f iltered_gatk/withrg_reorder_realign_recal.bam] showFullBamList=false read_buffer_size=null phone_home=AWS gatk_key=null tag=NA read_filter=[] intervals=[] excludeIntervals=null interval_set_rule=UNION interv al_merging=ALL interval_padding=0 reference_sequence=/home/rjain/software/gatk/resource_bundle/v2.8/ucsc.hg19.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUT ES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=250 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_mise ncoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_ro ds=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationTy pe=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=LINEAR variant_index_parameter=128000 logging_level=INFO log_to_file=null help=false version=false likelihoodCalc ulationEngine=PairHMM heterogeneousKmerSizeResolution=COMBO_MIN graphOutput=null bamOutput=null bam_compression=null disable_bam_indexing=null generate_md5=null simplifyBAM=null bamWriterType=CALLED_HAPLOTYP ES dbsnp=(RodBinding name= source=UNBOUND) dontTrimActiveRegions=false maxDiscARExtension=25 maxGGAARExtension=300 paddingAroundIndels=150 paddingAroundSNPs=20 comp=[] annotation=[ClippingRankSumTest, DepthP erSampleHC, StrandBiasBySample] excludeAnnotation=[SpanningDeletions, TandemRepeatAnnotator, ChromosomeCounts, FisherStrand, QualByDepth] debug=false useFilteredReadsForAnnotations=false emitRefConfidence=GV CF annotateNDA=false heterozygosity=0.001 indel_heterozygosity=1.25E-4 standard_min_confidence_threshold_for_calling=-0.0 standard_min_confidence_threshold_for_emitting=-0.0 max_alternate_alleles=6 input_pri or=[] sample_ploidy=2 genotyping_mode=DISCOVERY alleles=(RodBinding name= source=UNBOUND) contamination_fraction_to_filter=0.0 contamination_fraction_per_sample_file=null p_nonref_model=EXACT_INDEPENDENT exa ctcallslog=null output_mode=EMIT_VARIANTS_ONLY allSitePLs=true kmerSize=[10, 25] dontIncreaseKmerSizesForCycles=false allowNonUniqueKmersInRef=false numPruningSamples=1 recoverDanglingHeads=false doNotRecove rDanglingTails=false consensus=false GVCFGQBands=[5, 20, 60] indelSizeToEliminateInRefModel=10 min_base_quality_score=10 minPruning=2 gcpHMM=10 includeUmappedReads=false useAllelesTrigger=false phredScaledGl obalReadMismappingRate=45 maxNumHaplotypesInPopulation=128 mergeVariantsViaLD=false pair_hmm_implementation=VECTOR_LOGLESS_CACHING keepRG=null justDetermineActiveRegions=false dontGenotype=false errorCorrect Kmers=false debugGraphTransformations=false dontUseSoftClippedBases=false captureAssemblyFailureBAM=false allowCyclesInKmerGraphToGeneratePaths=false noFpga=false errorCorrectReads=false kmerLengthForReadErr orCorrection=25 minObservationsForKmerToBeSolid=20 pcr_indel_model=CONSERVATIVE activityProfileOut=null activeRegionOut=null activeRegionIn=null activeRegionExtension=null forceActive=false activeRegionMaxSi ze=null bandPassSigma=null min_mapping_quality_score=20 filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false"> ##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive) ##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed"> ##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed"> ##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes"> ##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities"> ##INFO=<ID=CCC,Number=1,Type=Integer,Description="Number of called chromosomes"> ##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases"> ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership"> ##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered"> ##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?"> ##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval"> ##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias"> ##INFO=<ID=GQ_MEAN,Number=1,Type=Float,Description="Mean of all GQ values"> ##INFO=<ID=GQ_STDDEV,Number=1,Type=Float,Description="Standard deviation of all GQ values"> ##INFO=<ID=HWP,Number=1,Type=Float,Description="P value from test of Hardy Weinberg Equilibrium"> ##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes"> ##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation"> ##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed"> ##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed"> ##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality"> ##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads"> ##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities"> ##INFO=<ID=NCC,Number=1,Type=Integer,Description="Number of no-called samples"> ##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth"> ##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias"> ........... ##reference=file:///home/rjain/software/gatk/resource_bundle/v2.8/ucsc.hg19.fasta #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA19238_SRR089553_SRR089555 NA19238_SRR089575_SRR089576 NA19238_SRR089585_SRR089587 NA19238_SRR089588_SRR089590 NA19240 _SRR089532_SRR089535 NA19240_SRR089538_SRR089541 NA19240_SRR089579_SRR089581 NA19240_SRR089582_SRR089584 chr1 30276 . C . . . NCC=8 GT:DP 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 chr1 30277 . A . . . NCC=8 GT:DP 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 chr1 30278 . A . . . NCC=8 GT:DP 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 chr1 30279 . A . . . NCC=8 GT:DP 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 chr1 30280 . G . . . NCC=8 GT:DP 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 chr1 30281 . T . . . NCC=8 GT:DP 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 chr1 30282 . C . . . NCC=8 GT:DP 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 chr1 30283 . C . . . NCC=8 GT:DP 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 chr1 30284 . A . . . NCC=8 GT:DP 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 0/0:0 ...... chr1 874475 . T . . . DP=129;NCC=8 GT:DP 0/0:22 0/0:20 0/0:7 0/0:9 0/0:21 0/0:34 0/0:8 0/0:8 chr1 874476 . T . . . DP=129;NCC=8 GT:DP 0/0:22 0/0:20 0/0:7 0/0:9 0/0:21 0/0:34 0/0:8 0/0:8 chr1 874477 . C . . . DP=129;NCC=8 GT:DP 0/0:22 0/0:20 0/0:7 0/0:9 0/0:21 0/0:34 0/0:8 0/0:8 chr1 874478 . A . . . DP=129;NCC=8 GT:DP 0/0:22 0/0:20 0/0:7 0/0:9 0/0:21 0/0:34 0/0:8 0/0:8 chr1 874479 . G . . . DP=129;NCC=8 GT:DP 0/0:22 0/0:20 0/0:7 0/0:9 0/0:21 0/0:34 0/0:8 0/0:8 chr1 874480 . A . . . DP=129;NCC=8 GT:DP 0/0:22 0/0:20 0/0:7 0/0:9 0/0:21 0/0:34 0/0:8 0/0:8
Best Answers
-
Geraldine_VdAuwera Cambridge, MA admin
These are all invariant sites. In your final VCF, you will only see GQ in variant sites.
-
Geraldine_VdAuwera Cambridge, MA admin
@ymc We have added the capability to output a measure of confidence in hom-ref genotypes, represented by the new "RGQ" annotation (which stands for "reference genotype quality"). It will be annotated by default by GenotypeGVCFs. This is available in the current nightly builds and will be in the official release of GATK v 3.4 (to be released next week).
Answers
@newbie16
Hi,
You can use VariantAnnotator to add annotations to your vcf. Please read about VariantAnnotator here: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_annotator_VariantAnnotator.php
-Sheila
Thanks Sheila
But my main issue is that I dont even see the GenotypeQuality. I dont think I can add using VariantAnnotator. Can I?
These are all invariant sites. In your final VCF, you will only see GQ in variant sites.
So GQ is not available at invariant sites.
Then how do we assess how confident the genotyper calls a homo ref site? Is there a better way then coverage count?
@ymc We have added the capability to output a measure of confidence in hom-ref genotypes, represented by the new "RGQ" annotation (which stands for "reference genotype quality"). It will be annotated by default by GenotypeGVCFs. This is available in the current nightly builds and will be in the official release of GATK v 3.4 (to be released next week).
Hi, I'm interested in producing a VCF file with annotations for the non-variant sites. I am able to create a gVCF with all sites using gatk HaplotypeCaller. However, when I use GenotypeGVCFs (GATK3 not 4 - which does not emit reference sites), I cannot get annotations for the variant sites. Is there a way to get these? Thank you!
java -jar ${GATK_PATH} \
-T GenotypeGVCFs \
-R ${REF_DIR}${ref} \
--variant ${gvcf} \
-ploidy 1 \
-allSites \
-o ${gvcf%.g.vcf}"_allsite.vcf" \
-A BaseQualityRankSumTest \
-A Coverage \
-A FisherStrand \
-A MappingQualityRankSumTest \
-A QualByDepth \
-A RMSMappingQuality \
-A ReadPosRankSumTest \
-A StrandOddsRatio \
-A DepthPerAlleleBySample \
-A RGQ
When I try to use VariantAnnotator as an additional step, I get MQ=Infinity.
@Katie
Hi,
What do you mean "I cannot get annotations for the variant sites."? Are all of the annotations not present in the final VCF, or do you get an error message? Note, some of those annotations (like RanSumTests) cannot be calculated for non-variant sites because they require a mix of ref and alt reads.
-Sheila
@Sheila,
Thanks for getting back to me - sorry about the vague description. I am trying to get an allsites VCF which includes MQ scores for the invariant sites. When I use the GenotypeGVCF tool, with the below commands, I get a VCF file that does not include MQ scores for invariant sites (GCA_000008585.1_10_1.bowtie2_gatk_allsite.vcf.gz, attached). When I subsequently use AnnotateVariants to try to annotate MQ onto the file, many but not all of the invariant sites have MQ=Infinity ( GCA_000008585.1_10_1.bowtie2_gatk_allann.vcf.gz, attached). Do you have suggestions about how best to get correct annotations for these invariant sites? I am using GATK3 GenotypeGVCF and VariantAnnotator because I couldn't figure out how to produce an all sites VCF with GATK4. Thank you!
My commands here:
java -jar ${GATK_PATH} \
-T GenotypeGVCFs \
-R ${REF_DIR}${ref} \
--variant ${gvcf} \
-ploidy 1 \
-allSites \
-o ${gvcf%.g.vcf}"_allsite.vcf" \
-A BaseQualityRankSumTest \
-A Coverage \
-A FisherStrand \
-A MappingQualityRankSumTest \
-A QualByDepth \
-A RMSMappingQuality \
-A ReadPosRankSumTest \
-A StrandOddsRatio \
-A DepthPerAlleleBySample \
-A RGQ
java -jar $GATK_PATH \
-T VariantAnnotator \
-R ${REF_DIR}${ref} \
-V ${gvcf%.g.vcf}"_allsite.vcf" \
-A BaseQualityRankSumTest \
-A Coverage \
-A FisherStrand \
-A MappingQualityRankSumTest \
-A QualByDepth \
-A RMSMappingQuality \
-A ReadPosRankSumTest \
-A StrandOddsRatio \
-A DepthPerAlleleBySample \
-I ${BAMS_DIR}${bam} \
-o ${gvcf%.g.vcf}"_allann.vcf"
@Katie
Hi,
Hmm. Can you confirm that the GVCFs have the MQ annotation in them? I am not sure why INFINITY would be present after VariantAnnotator. Unfortunately, I don't think the team will fix any bugs in GATK3, as the focus is on GATK4 now. Can you also try running GATK4 to produce GVCFs and see if the MQ annotation is correctly output in the GVCFs? Right now, GenotypeGVCFs in GATK4 does not have the ability to output all sites in the final VCF, but there are plans to do so in the future.
-Sheila