Complete this survey about your research needs and be entered to win an Amazon gift card or FireCloud credit.
Download the latest Picard release at https://github.com/broadinstitute/picard/releases.
GATK version 4.beta.5 is out. See the GATK4 beta page for download and details.

Problem with allele specific annotation AS_QualByDepth (AS_QD) during variant calling

Hi GATK team,

First a big thank you for all your hard work in developing the tool and supporting the users!

I am trying out the allelic specific(AS) annotations in version 3.6. While I have gotten a few other AS annotations to properly show up in my VCF, I am having trouble getting the AS_QualByDepth in particular.

For example, I tried to call variant on a few samples at a specific locus with a "T" homopolymer run. I first ran HaplotypeCaller in the GVCF mode for each sample:

java -jar GenomeAnalysisTK.jar\
  -T HaplotypeCaller \
  --emitRefConfidence GVCF -variant_index_type LINEAR -variant_index_parameter 128000 \
  -R ref_fasta \
  -I sample_$i \
  -L chr1:10348759-10348801 \
  -A AS_StrandOddsRatio -A AS_FisherStrand -A AS_QualByDepth \
  -A AS_BaseQualityRankSumTest -A AS_ReadPosRankSumTest -A AS_MappingQualityRankSumTest
  -o sample_$i.gvcf

I then did GenotypeGVCFs on all the samples together:

java -jar GenomeAnalysisTK.jar\
  -T GenotypeGVCFs \
  -R ref_fasta \
  -V gvcf_list \
  -L chr1:10348759-10348801 \
  -A AS_StrandOddsRatio -A AS_FisherStrand -A AS_QualByDepth \
  -A AS_BaseQualityRankSumTest -A AS_ReadPosRankSumTest -A AS_MappingQualityRankSumTest
  -o out.vcf

In the final joint-called VCF header, the following AS annotations all showed up.

##INFO=<ID=AS_BaseQRankSum,Number=A,Type=Float,Description="allele specific Z-score from Wilcoxon rank sum test of each Alt Vs. Ref base qualities">
##INFO=<ID=AS_FS,Number=A,Type=Float,Description="allele specific phred-scaled p-value using Fisher's exact test to detect strand bias of each alt allele">
##INFO=<ID=AS_MQRankSum,Number=A,Type=Float,Description="Allele-specific Mapping Quality Rank Sum">
##INFO=<ID=AS_QD,Number=1,Type=Float,Description="Allele-specific Variant Confidence/Quality by Depth">
##INFO=<ID=AS_RAW_BaseQRankSum,Number=1,Type=String,Description="raw data for allele specific rank sum test of base qualities">
##INFO=<ID=AS_RAW_MQRankSum,Number=1,Type=String,Description="Allele-specific raw data for Mapping Quality Rank Sum">
##INFO=<ID=AS_RAW_ReadPosRankSum,Number=1,Type=String,Description="allele specific raw data for rank sum test of read position bias">
##INFO=<ID=AS_ReadPosRankSum,Number=A,Type=Float,Description="allele specific Z-score from Wilcoxon rank sum test of each Alt vs. Ref read position bias">
##INFO=<ID=AS_SB_TABLE,Number=1,Type=String,Description="Allele-specific forward/reverse read counts for strand bias tests">
##INFO=<ID=AS_SOR,Number=A,Type=Float,Description="Allele specific strand Odds Ratio of 2x|Alts| contingency table to detect allele specific strand bias">

However, in the INFO column, I only got the other AS annotations but not AS_QD.

chr1    10348779        .       AT      A,ATT   981.29  .       AC=4,2;AF=0.333,0.167;AN=12;AS_BaseQRankSum=-1.087,-2.521;AS_FS=3.986,7.378;AS_MQRankSum=-1.130,-2.349;AS_ReadPosRankSum=-1.192,-1.396;AS_SOR=0.415,0.254;BaseQRankSum=-6.350e-01;ClippingRankSum=0.00;DP=627;ExcessHet=14.6052;FS=6.378;MLEAC=4,2;MLEAF=0.333,0.167;MQ=59.95;MQRankSum=0.00;QD=1.94;ReadPosRankSum=-1.050e-01;SOR=0.352        GT:AD:DP:GQ:PL  0/1:44,9,7:63:81:81,0,1033,93,844,1165  0/1:71,11,8:99:47:47,0,1659,110,1414,1803       0/1:54,15,7:81:99:205,0,1239,280,1087,1635      0/1:69,25,12:106:99:311,0,1603,336,1306,2058    0/2:55,11,22:94:99:291,233,1636,0,943,1294      0/2:61,11,14:91:14:92,14,1473,0,1071,1468

I also checked the individual sample gVCFs. Similarly, there is AS_QD in the header but not in the INFO column. I wondering if this might be a bug or I am doing something wrong.

Another curious thing I noticed is that in the VCF header, the other AS annotations all have "Number=A" but AS_QD has "Number=1". Don't know if this might be causing some problem.

Issue · Github
by Sheila

Issue Number
976
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
chandrans

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @rjiang03
    Hi,

    Yes, I just tried this myself and see the same thing. Let me check with the team and get back to you. I suspect the annotation needs to be "enabled" in the code.

    -Sheila

  • rjiang03rjiang03 bostonMember

    Thanks a lot for looking into this !
    rj

  • rjiang03rjiang03 bostonMember

    Hi Sheila,

    Just wondering if there is been any update on this issue. Thanks !

    rj

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @rjiang03
    Hi RJ,

    Yes, it seems the annotation only appears in the final VCF. So, you will need to run GenotypeGVCFs on all of your GVCFs to get AS_QD.

    -Sheila

  • bgrenierbgrenier FranceMember

    Hi,

    I have the same observations than rjiang03 :
    1) when using only the parameter -A AS_QualByDepth during the HaplotypeCaller, CombineGVCFs and GenotypeGVCFs, the AS_QD values are missing in the final vcf. To have this annotation, you have to use -G AS_StandardAnnotation
    2) the vcf format of AS_QD as "Number=1" is wrong. This should be Number=A since we have one value per alternate allele. (By the way this is also the case of AS_MateMQRankSum annotation).

    For point 1, this is not so important because we can use the parameter -G AS_StandardAnnotation instead but the behavior of the parameter -A AS_QualByDepth is intriguing.

    For point 2, this is more important, especially for the AS_QD annotation, because it is used in the VQSR allele specific process. How can we be sure that this is handled correctly by VQSR ? Also, to normalize multi-allelic sites by splitting them into bi-allelic sites, this annotation will not be handled correctly by software like bcftools because of the wrong format and, as a result, we will have the full AS_QD values reported for every resulting bi-allelic variants.

    Best,

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @bgrenier
    Hi,

    I think there were some issues with annotations in earlier versions, but they should be fixed now. Are you using the latest version of GATK? Can you also try with GATK4 latest beta?

    Thanks,
    Sheila

  • bgrenierbgrenier FranceMember
    edited October 17

    @Sheila ,

    Hi,

    The fact that the AS_QD is not produce when only using the -A AS_QualByDepth parameter along with -G StandardAnnotation (and -G StandardHCAnnotation during the HaplotypeCaller step) without the -G AS_StandardAnnotation is indeed a bug but can be bypassed when producing all the standard and allele specific annotations (with the -G StandardAnnotation, -G AS_StandardAnnotation and -G StandardHCAnnotation). This is the case with the GATK-3.8 version.

    I tried to use the latest GATK4 beta version but this leads me to a bug at the GenomicsDBImport step before GenotypeGVCFs which is related to the incorrect format of some the AS_ annotations in the vcf header as pointed in my last post (point 2).

    For example, I produced the gvcf files of 2 samples using HaplotypeCaller of GATK4 and indexed them with IndexFeatureFile (because the index is not created by HaplotypeCaller, I know the bug is open).
    Then, I tried to combine those 2 samples using GenomicsDBImport which failed with the error :

    htsjdk.tribble.TribbleException: Could not decode field AS_RAW_MQ with value 74989.00|33586.00|0.00 of declared type Float

    As far as I understand, AS_RAW_xxx annotations represent intermediate annotations produced by HaplotypeCaller which be be used by GenotypeGVCFs to compute the "real" AS annotation (for example, AS_RAW_MQ will be used to compute the final AS_MQ and so on). These AS_RAW_xxx annotations have format where group of values are separated by the "|" character instead of the "," character.

    So in conclusion, all AS_RAW_xxx annotations must be declared as Number=1 and Format=String in the header. This is not the case for AS_RAW_MQ which is declared as Number=A and Format=Float both in GATK3.8 and GATK4 versions.

    Then I tried to correct the format in the header for AS_RAW_MQ. The GenomicsDBImport now throws the error :
    htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: Discordant field size detected for field AS_RAW_ReadPosRankSum at 22:16051453. Field had 35 values but the header says this should have 1 values based on header record INFO=<ID=AS_RAW_ReadPosRankSum,Number=1,Type=String,Description="allele specific raw data for rank sum test of read position bias">

    The content for these annotations ofr both sample are :

    AS_RAW_ReadPosRankSum=1,1,4,1,8,1,9,1,10,1,12,2,13,2,14,1,15,1,17,1,18,1,19,1,20,39|6,1,8,1,15,1,16,1,20,17|
    AS_RAW_ReadPosRankSum=7,1,20,16|2,1,6,1,11,1,12,1,20,12|
    

    This annotation is declared as a string format. Why does GenomicsDBImport throw this error ? The combineGVCFs of GATK3.8 does not complain about this.

    To sum up about format bugs :
    - AS_RAW_MQ should be declared as Number=1, Format=String
    - AS_QD should be declared as Number=A, Format=Float
    - AS_MateMQRankSum should be declared as Number=A, Format=Float

    To sum up about software problems
    - How does GenomicsDBImport deal with AS_RAW_ReadPosRanksum (which can also be the case with other AS or non AS annotations, I don't know)
    - How can we be sure that CombinGVCFs and GenotypeGVCFs of GATK3.8 handle correctly these AS annotations with these format errors ?

    I hope I was clear enough.

    Best,

    Post edited by bgrenier on
Sign In or Register to comment.