Genotype qualities/Qual scores for All sites

Hi

I am trying analyze 8 exome datasets. In the final output, I want to get calibrated QUAL or Genotype scores for each position (even if it is homozygous ref) in the target region.
I am following the following workflow to analyze this dataset: Calling variants on cohorts of samples using the HaplotypeCaller in GVCF mode.
As per the above document, I ran haplotypecaller on individual bams to generate gvcf. Then I run genotypeGVCF on all the gvcfs. However for each position, I do not see any annotations in the INFO field which will be then used by VQSR step. How can I add more annotations to the vcf? Below is a copy of the command I used and snapshot of the vcf. Please note that currently for testing I used only 8 datasets in genotypeGVCF but in final run, I plan to include 20 exome datsets from 1000 genomes as specified in best practices.

I appreciate any help.

/usr/java/jre1.7.0_67/bin/java -Xmx4g -jar ~/software/gatk/GenomeAnalysisTK.jar \
-R ~/software/gatk/resource_bundle/v2.8/ucsc.hg19.fasta \
-T GenotypeGVCFs \
-D ~/software/gatk/resource_bundle/v2.5/dbsnp_137.hg19.vcf \
--disable_auto_index_creation_and_locking_when_reading_rods \
-stand_call_conf 30 \
-stand_emit_conf 0 \
-L /home/target_hg19.bed \
--includeNonVariantSites \
--variant ../1hc_cohort.gvcf \
--variant ../2hc_cohort.gvcf \
.....
--variant ../8hc_cohort.gvcf \
-o genotyped.vcf


##fileformat=VCFv4.1
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele at this location">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GATKCommandLine=<ID=GenotypeGVCFs,Version=3.2-2-gec30cee,Date="Wed Oct 15 11:45:34 PDT 2014",Epoch=1413398734405,CommandLineOptions="analysis_type=GenotypeGVCFs input_file=[] showFullBamList=false read_buf
fer_size=null phone_home=AWS gatk_key=null tag=NA read_filter=[] intervals=[/home/rjain/2014/2014_08_ExomeCompetition/SRP004917/TargetRegion/chips/AGWE1.0.target_hg19.bed] excludeIntervals=null interval_set_
rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=/home/rjain/software/gatk/resource_bundle/v2.8/ucsc.hg19.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxR
untimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=1000 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allo
w_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 disable_indel_quals=false emit_original_quals=false preserve_qs
cores_less_than=6 globalQScorePrior=-1.0 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locki
ng_when_reading_rods=true num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] ped
igreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 logging_level=INFO log_to_file=null help=false version=f
alse variant=[(RodBindingCollection [(RodBinding name=variant source=../SRR089532_SRR089535/hc_cohort.gvcf)]), (RodBindingCollection [(RodBinding name=variant2 source=../SRR089538_SRR089541/hc_cohort.gvcf)])
, (RodBindingCollection [(RodBinding name=variant3 source=../SRR089553_SRR089555/hc_cohort.gvcf)]), (RodBindingCollection [(RodBinding name=variant4 source=../SRR089575_SRR089576/hc_cohort.gvcf)]), (RodBindi
ngCollection [(RodBinding name=variant5 source=../SRR089579_SRR089581/hc_cohort.gvcf)]), (RodBindingCollection [(RodBinding name=variant6 source=../SRR089582_SRR089584/hc_cohort.gvcf)]), (RodBindingCollectio
n [(RodBinding name=variant7 source=../SRR089585_SRR089587/hc_cohort.gvcf)]), (RodBindingCollection [(RodBinding name=variant8 source=../SRR089588_SRR089590/hc_cohort.gvcf)])] out=org.broadinstitute.gatk.eng
ine.io.stubs.VariantContextWriterStub no_cmdline_in_header=org.broadinstitute.gatk.engine.io.stubs.VariantContextWriterStub sites_only=org.broadinstitute.gatk.engine.io.stubs.VariantContextWriterStub bcf=org
.broadinstitute.gatk.engine.io.stubs.VariantContextWriterStub includeNonVariantSites=true annotateNDA=false heterozygosity=0.001 indel_heterozygosity=1.25E-4 standard_min_confidence_threshold_for_calling=30.
0 standard_min_confidence_threshold_for_emitting=0.0 max_alternate_alleles=6 input_prior=[] sample_ploidy=2 annotation=[InbreedingCoeff, FisherStrand, QualByDepth, ChromosomeCounts, GenotypeSummaries] dbsnp=
(RodBinding name=dbsnp source=/home/rjain/software/gatk/resource_bundle/v2.5/dbsnp_137.hg19.vcf) filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">
##GATKCommandLine=<ID=HaplotypeCaller,Version=3.2-2-gec30cee,Date="Tue Oct 14 15:39:08 PDT 2014",Epoch=1413326348917,CommandLineOptions="analysis_type=HaplotypeCaller input_file=[/home/rjain/2014/2014_08_Exo
meCompetition/SRP004917/bowtie2_v1/SRR089553_SRR089555/SRR089553_filtered_gatk/withrg_reorder_realign_recal.bam, /home/rjain/2014/2014_08_ExomeCompetition/SRP004917/bowtie2_v1/SRR089553_SRR089555/SRR089555_f
iltered_gatk/withrg_reorder_realign_recal.bam] showFullBamList=false read_buffer_size=null phone_home=AWS gatk_key=null tag=NA read_filter=[] intervals=[] excludeIntervals=null interval_set_rule=UNION interv
al_merging=ALL interval_padding=0 reference_sequence=/home/rjain/software/gatk/resource_bundle/v2.8/ucsc.hg19.fasta nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUT
ES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=250 baq=OFF baqGapOpenPenalty=40.0 refactor_NDN_cigar_string=false fix_misencoded_quality_scores=false allow_potentially_mise
ncoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6
globalQScorePrior=-1.0 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_ro
ds=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationTy
pe=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=LINEAR variant_index_parameter=128000 logging_level=INFO log_to_file=null help=false version=false likelihoodCalc
ulationEngine=PairHMM heterogeneousKmerSizeResolution=COMBO_MIN graphOutput=null bamOutput=null bam_compression=null disable_bam_indexing=null generate_md5=null simplifyBAM=null bamWriterType=CALLED_HAPLOTYP
ES dbsnp=(RodBinding name= source=UNBOUND) dontTrimActiveRegions=false maxDiscARExtension=25 maxGGAARExtension=300 paddingAroundIndels=150 paddingAroundSNPs=20 comp=[] annotation=[ClippingRankSumTest, DepthP
erSampleHC, StrandBiasBySample] excludeAnnotation=[SpanningDeletions, TandemRepeatAnnotator, ChromosomeCounts, FisherStrand, QualByDepth] debug=false useFilteredReadsForAnnotations=false emitRefConfidence=GV
CF annotateNDA=false heterozygosity=0.001 indel_heterozygosity=1.25E-4 standard_min_confidence_threshold_for_calling=-0.0 standard_min_confidence_threshold_for_emitting=-0.0 max_alternate_alleles=6 input_pri
or=[] sample_ploidy=2 genotyping_mode=DISCOVERY alleles=(RodBinding name= source=UNBOUND) contamination_fraction_to_filter=0.0 contamination_fraction_per_sample_file=null p_nonref_model=EXACT_INDEPENDENT exa
ctcallslog=null output_mode=EMIT_VARIANTS_ONLY allSitePLs=true kmerSize=[10, 25] dontIncreaseKmerSizesForCycles=false allowNonUniqueKmersInRef=false numPruningSamples=1 recoverDanglingHeads=false doNotRecove
rDanglingTails=false consensus=false GVCFGQBands=[5, 20, 60] indelSizeToEliminateInRefModel=10 min_base_quality_score=10 minPruning=2 gcpHMM=10 includeUmappedReads=false useAllelesTrigger=false phredScaledGl
obalReadMismappingRate=45 maxNumHaplotypesInPopulation=128 mergeVariantsViaLD=false pair_hmm_implementation=VECTOR_LOGLESS_CACHING keepRG=null justDetermineActiveRegions=false dontGenotype=false errorCorrect
Kmers=false debugGraphTransformations=false dontUseSoftClippedBases=false captureAssemblyFailureBAM=false allowCyclesInKmerGraphToGeneratePaths=false noFpga=false errorCorrectReads=false kmerLengthForReadErr
orCorrection=25 minObservationsForKmerToBeSolid=20 pcr_indel_model=CONSERVATIVE activityProfileOut=null activeRegionOut=null activeRegionIn=null activeRegionExtension=null forceActive=false activeRegionMaxSi
ze=null bandPassSigma=null min_mapping_quality_score=20 filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">
##GVCFBlock=minGQ=0(inclusive),maxGQ=5(exclusive)
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=CCC,Number=1,Type=Integer,Description="Number of called chromosomes">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=GQ_MEAN,Number=1,Type=Float,Description="Mean of all GQ values">
##INFO=<ID=GQ_STDDEV,Number=1,Type=Float,Description="Standard deviation of all GQ values">
##INFO=<ID=HWP,Number=1,Type=Float,Description="P value from test of Hardy Weinberg Equilibrium">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=NCC,Number=1,Type=Integer,Description="Number of no-called samples">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
...........
##reference=file:///home/rjain/software/gatk/resource_bundle/v2.8/ucsc.hg19.fasta
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA19238_SRR089553_SRR089555     NA19238_SRR089575_SRR089576     NA19238_SRR089585_SRR089587     NA19238_SRR089588_SRR089590     NA19240
_SRR089532_SRR089535    NA19240_SRR089538_SRR089541     NA19240_SRR089579_SRR089581     NA19240_SRR089582_SRR089584
chr1    30276   .       C       .       .       .       NCC=8   GT:DP   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0
chr1    30277   .       A       .       .       .       NCC=8   GT:DP   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0
chr1    30278   .       A       .       .       .       NCC=8   GT:DP   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0
chr1    30279   .       A       .       .       .       NCC=8   GT:DP   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0
chr1    30280   .       G       .       .       .       NCC=8   GT:DP   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0
chr1    30281   .       T       .       .       .       NCC=8   GT:DP   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0
chr1    30282   .       C       .       .       .       NCC=8   GT:DP   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0
chr1    30283   .       C       .       .       .       NCC=8   GT:DP   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0
chr1    30284   .       A       .       .       .       NCC=8   GT:DP   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0   0/0:0
......
chr1    874475  .       T       .       .       .       DP=129;NCC=8    GT:DP   0/0:22  0/0:20  0/0:7   0/0:9   0/0:21  0/0:34  0/0:8   0/0:8
chr1    874476  .       T       .       .       .       DP=129;NCC=8    GT:DP   0/0:22  0/0:20  0/0:7   0/0:9   0/0:21  0/0:34  0/0:8   0/0:8
chr1    874477  .       C       .       .       .       DP=129;NCC=8    GT:DP   0/0:22  0/0:20  0/0:7   0/0:9   0/0:21  0/0:34  0/0:8   0/0:8
chr1    874478  .       A       .       .       .       DP=129;NCC=8    GT:DP   0/0:22  0/0:20  0/0:7   0/0:9   0/0:21  0/0:34  0/0:8   0/0:8
chr1    874479  .       G       .       .       .       DP=129;NCC=8    GT:DP   0/0:22  0/0:20  0/0:7   0/0:9   0/0:21  0/0:34  0/0:8   0/0:8
chr1    874480  .       A       .       .       .       DP=129;NCC=8    GT:DP   0/0:22  0/0:20  0/0:7   0/0:9   0/0:21  0/0:34  0/0:8   0/0:8

Best Answers

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @newbie16‌

    Hi,

    You can use VariantAnnotator to add annotations to your vcf. Please read about VariantAnnotator here: https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_annotator_VariantAnnotator.php

    -Sheila

  • Thanks Sheila

    But my main issue is that I dont even see the GenotypeQuality. I dont think I can add using VariantAnnotator. Can I?

  • ymcymc Member

    So GQ is not available at invariant sites.

    Then how do we assess how confident the genotyper calls a homo ref site? Is there a better way then coverage count?

  • KatieKatie United StatesMember

    Hi, I'm interested in producing a VCF file with annotations for the non-variant sites. I am able to create a gVCF with all sites using gatk HaplotypeCaller. However, when I use GenotypeGVCFs (GATK3 not 4 - which does not emit reference sites), I cannot get annotations for the variant sites. Is there a way to get these? Thank you!

    java -jar ${GATK_PATH} \
    -T GenotypeGVCFs \
    -R ${REF_DIR}${ref} \
    --variant ${gvcf} \
    -ploidy 1 \
    -allSites \
    -o ${gvcf%.g.vcf}"_allsite.vcf" \
    -A BaseQualityRankSumTest \
    -A Coverage \
    -A FisherStrand \
    -A MappingQualityRankSumTest \
    -A QualByDepth \
    -A RMSMappingQuality \
    -A ReadPosRankSumTest \
    -A StrandOddsRatio \
    -A DepthPerAlleleBySample \
    -A RGQ

  • KatieKatie United StatesMember

    When I try to use VariantAnnotator as an additional step, I get MQ=Infinity.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Katie
    Hi,

    What do you mean "I cannot get annotations for the variant sites."? Are all of the annotations not present in the final VCF, or do you get an error message? Note, some of those annotations (like RanSumTests) cannot be calculated for non-variant sites because they require a mix of ref and alt reads.

    -Sheila

  • KatieKatie United StatesMember
    edited February 26

    @Sheila,
    Thanks for getting back to me - sorry about the vague description. I am trying to get an allsites VCF which includes MQ scores for the invariant sites. When I use the GenotypeGVCF tool, with the below commands, I get a VCF file that does not include MQ scores for invariant sites (GCA_000008585.1_10_1.bowtie2_gatk_allsite.vcf.gz, attached). When I subsequently use AnnotateVariants to try to annotate MQ onto the file, many but not all of the invariant sites have MQ=Infinity ( GCA_000008585.1_10_1.bowtie2_gatk_allann.vcf.gz, attached). Do you have suggestions about how best to get correct annotations for these invariant sites? I am using GATK3 GenotypeGVCF and VariantAnnotator because I couldn't figure out how to produce an all sites VCF with GATK4. Thank you!

    My commands here:

    java -jar ${GATK_PATH} \
    -T GenotypeGVCFs \
    -R ${REF_DIR}${ref} \
    --variant ${gvcf} \
    -ploidy 1 \
    -allSites \
    -o ${gvcf%.g.vcf}"_allsite.vcf" \
    -A BaseQualityRankSumTest \
    -A Coverage \
    -A FisherStrand \
    -A MappingQualityRankSumTest \
    -A QualByDepth \
    -A RMSMappingQuality \
    -A ReadPosRankSumTest \
    -A StrandOddsRatio \
    -A DepthPerAlleleBySample \
    -A RGQ

    java -jar $GATK_PATH \
    -T VariantAnnotator \
    -R ${REF_DIR}${ref} \
    -V ${gvcf%.g.vcf}"_allsite.vcf" \
    -A BaseQualityRankSumTest \
    -A Coverage \
    -A FisherStrand \
    -A MappingQualityRankSumTest \
    -A QualByDepth \
    -A RMSMappingQuality \
    -A ReadPosRankSumTest \
    -A StrandOddsRatio \
    -A DepthPerAlleleBySample \
    -I ${BAMS_DIR}${bam} \
    -o ${gvcf%.g.vcf}"_allann.vcf"

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Katie
    Hi,

    Hmm. Can you confirm that the GVCFs have the MQ annotation in them? I am not sure why INFINITY would be present after VariantAnnotator. Unfortunately, I don't think the team will fix any bugs in GATK3, as the focus is on GATK4 now. Can you also try running GATK4 to produce GVCFs and see if the MQ annotation is correctly output in the GVCFs? Right now, GenotypeGVCFs in GATK4 does not have the ability to output all sites in the final VCF, but there are plans to do so in the future.

    -Sheila

Sign In or Register to comment.