Variants with AD 0,0 and DP 0

OakmanOakman Member
edited November 7 in Ask the GATK team

I was investigating why some of the variants in my vcf produced with HaplotypeCaller miss some fields such as QD, DP, MQ, MQRankSum, BaseQRankSum.

I understood why by searching this forum, some of these values need other parameter to be present to be calculated, such as AD which is needed to calculate QD and so on.

So I realized that for all the variants that miss these fields I have the same GT:AD:DP values of 1/1:0,0:0, therefore AD is always 0,0 and DP 0

I read about informative and uninformative reads; some posts suggest that AD do not include uninformative reads, but DP does. By reading the documentation of DepthPerSampleHC seems like that DP in the FORMAT only includes informative reads like AD, but in the INFO has all unfiltered reads supporting that call.

why there is not the DP tag in the INFO column but only in the FORMAT for some calls?
why are these variants called if AD:DP 0,0:0 and DP in the INFO is missing?

I report an example:

Qrob_H2.3_Sc0001210 5578 . G T 18.59 . AC=2;AF=1.00;AN=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;SOR=0.693 GT:AD:DP:GQ:PL 1/1:0,0:0:3:45,3,0

A call with complete info is:
Qrob_H2.3_Sc0001407 1861 . A G 739.78 . AC=1;AF=0.500;AN=2;BaseQRankSum=2.371;DP=22;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=32.87;MQRankSum=-1.787;QD=30.87;ReadPosRankSum=0.240;SOR=1.609 GT:AD:DP:GQ:PL 0/1:2,19:21:27:768,0,27

It is only one sample, I know it is not enough; I am actually trying and learning how to use the variant calling pipeline for the first time before I run it on all my data; it just a test but I would like to understand this.

I apologize if this is a repetitive question; I have searched the forums and got a few hints but did not obtain a satisfactory answer.

Thank you

Best Answers

Answers

  • bhanuGandhambhanuGandham Member, Administrator, Broadie, Moderator admin

    Hi @Oakman

    Here is a thread that will be helpful to you.

    Regards
    Bhanu

  • vruanovruano Member ✭✭
    edited November 8

    Hi @Oakman... I could start throwing in possible hypothesis that may explain the output (a part from the always lurking possibility of a bug) but I think it simply makes more sense for one of us to "debug" it and see what is going on exactly.

    Could you first make sure that you are using the latest GATK (I believe is 4.0.11), if not please try to reproduce this and report your findings here, either way.

    If this is still happening and If you can share your data with us only for the purpose of debugging it, then re-do the analysis with a small region around the variant(s) using -L (eg. Qrob_H2.3_Sc0001210:5378-5778). Hopefully the result is the same. Then extract that region from the input BAM/CRAM using PrintReads -L ... , and try again... hopefully the issue is still there. Now we have a small dataset that we can "play" with over here.

    Please post the location of the reference that you are using as it does not seem to be a standard one.

  • OakmanOakman Member
    edited November 9

    Hi @vruano,

    First of all Thank You!
    I am not using the latest version because I am working on a HPC and I do not have control over the installation of new software on it; the version that I am using is 4.0.8.1.

    After an initial check following the information in the thread pointed by bhanuGandham I can see that I have other variants in very close proximity (less than 10 bases) to my variant Qrob_H2.3_Sc0001210 5578, however these do not pass the hard filters because they have all annotations, the one who passes is the one I reported because it does not have the values I filter for.
    I will try to reproduce it on the small region like you suggested to check if it still persists.

    I am working on a non-model organism, European Oak, Quercus robur.
    The genome has been recently sequenced and assembled.
    The haploid assembly I am using is made of 12 chromosomes (constructed with 871 scaffolds) and 538 unassigned scaffolds; the variant I reported is in one of the unassigned scaffolds.

    The paper of the genome publication is : Oak Genome reveals facets of long lifespan
    Assembly download: .oakgenome.fr/?page_id=587

    Add www to it; I am new to the forum and I can not post link.

    It is the first file with ID PM1N

    Basically what I am trying to do is create a known set of variants and indels to use for BQSR and VQSR, that's why I ended up looking inside the variants generated for a sample; I found this as I was trying to plot QD and the other annotations with R to check how I can decide appropriate hard filters values but I realized some variants missed some annotations and my investigation started.

    Therefore I was wondering what I need to do with these calls, as they do not have the annotations I filter for therefore they would pass hard filters.
    If I may ask an extra question as all of this is actually related to it, how many samples would you use to create a known set of variants?

    I have 500 samples for about 9 tera of data (whole genome 22x coverage); I would like to avoid using all of them to find a true set of variants and then repeat analysis as I think it would be too computationally expensive.

    I was thinking to use 20 samples, call the variants without BQSR, hard filter variants and repeat a few times. Then using the resulting hard filtered vcf file from these 20 samples as known set for the analysis of the 500 samples.

    In the publication I posted they actually assessed the heterozygosity of the reference and they have a vcf but It has been constructed using the haploid reference before mapping to the chromosomes, I don t know why they did not map the reads to the anchored assembly. So these known variants are mapped to 1409 scaffolds; I mapped my reads to the reference with chromosomes therefore I can not use their set.
    Well, I could map my reads to the scaffolds like they did, but I am not too sure about that as we have the map.

  • OakmanOakman Member
    edited November 9

    I repeated the variant call on the small region as you suggested. There is a close variant after 30 nucleotides.

    In my previous post I said within 10 because I checked another of the variants which reported the same issues, in that case the adjacent variant was 7 bases away, in this example is 30.

    The results of the small region:

    Qrob_H2.3_Sc0001210 5578 . G T 18.59 . AC=2;AF=1.00;AN=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;SOR=0.693 GT:AD:DP:GQ:PL 1/1:0,0:0:3:45,3,0
    Qrob_H2.3_Sc0001210 5608 . G T 18.59 . AC=2;AF=1.00;AN=2;DP=1;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=23.00;QD=18.59;SOR=1.609 GT:AD:DP:GQ:PL 1/1:0,1:1:3:45,3,0
    Qrob_H2.3_Sc0001210 5623 . G A 18.59 . AC=2;AF=1.00;AN=2;DP=1;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=23.00;QD=18.59;SOR=1.609 GT:AD:DP:GQ:PL 1/1:0,1:1:3:45,3,0
    Qrob_H2.3_Sc0001210 5630 . G T 18.59 . AC=2;AF=1.00;AN=2;DP=1;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=23.00;QD=18.59;SOR=1.609 GT:AD:DP:GQ:PL 1/1:0,1:1:3:45,3,0
    Qrob_H2.3_Sc0001210 5637 . T C 62.74 . AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=23.00;QD=31.37;SOR=0.693 GT:AD:DP:GQ:PL 1/1:0,2:2:6:90,6,0
    Qrob_H2.3_Sc0001210 5645 . C G 62.74 . AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=23.00;QD=31.37;SOR=0.693 GT:AD:DP:GQ:PL 1/1:0,2:2:6:90,6,0
    Qrob_H2.3_Sc0001210 5649 . C CAA 53.70 . AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=23.00;QD=26.85;SOR=0.693 GT:AD:DP:GQ:PL 1/1:0,2:2:6:90,6,0
    Qrob_H2.3_Sc0001210 5661 . T TC 53.70 . AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=23.00;QD=26.85;SOR=0.693 GT:AD:DP:GQ:PL 1/1:0,2:2:6:90,6,0
    Qrob_H2.3_Sc0001210 5662 . T TC 53.70 . AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=23.00;QD=26.85;SOR=0.693 GT:AD:DP:GQ:PL 1/1:0,2:2:6:90,6,0
    Qrob_H2.3_Sc0001210 5666 . T C 62.74 . AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=23.00;QD=31.37;SOR=0.693 GT:AD:DP:GQ:PL 1/1:0,2:2:6:90,6,0
    Qrob_H2.3_Sc0001210 5669 . A G 62.74 . AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=23.00;QD=31.37;SOR=0.693 GT:AD:DP:GQ:PL 1/1:0,2:2:6:90,6,0
    Qrob_H2.3_Sc0001210 5704 . T A 87.28 . AC=2;AF=1.00;AN=2;DP=3;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=23.00;QD=29.09;SOR=1.179 GT:AD:DP:GQ:PL 1/1:0,3:3:9:115,9,0
    Qrob_H2.3_Sc0001210 5770 . A G 27.75 . AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=23.00;QD=13.87;SOR=2.303 GT:AD:DP:GQ:PL 1/1:0,2:2:6:55,6,0

    It seems like the first 4 calls (the first is the investigated call, and the following 3) have been called using the same evidence (only 1 read). is this right?

    But the following 3 have all annotations, and they are discarded when I hard filters as the MQ is very low. The first however pass all filters because does not have all annotations.

  • OakmanOakman Member
    edited November 9

    This is the bamout of the region:

    HC1 16 Qrob_H2.3_Sc0001210 5426 60 337M * 0 0 AGGAATTAACTTTTTTAATTTTTTGAAAATTTTTGCTATTTTTTTTGGAATTTTCTAGCTCGGGTCTAGTATGAACGGAATTCGGGACTAAAATTTTTTTTTCTCTTTTTCGGCCTATCTCATCCCGTATTGGGTGAAAATAATGCCGGAAAGAGGTTTTTTATTTTTTTGCTATTTTTTTCGGAATTTTCTTCCCCGGGCCGTGTATGAATGGAATTCCGGACATTTTTTTTTTTTTTTTTCAGCCTATCTAATCCCGTTTTGGCAAAAAATAATGCTAGAAAGAGGTTTTTTAATTTTTTGAATTTTTTTGCTATTTGTTTCGGAATTTTCATAC BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB HC:i:1569307690 RG:Z:ArtificialHaplotypeRG
    HC2 16 Qrob_H2.3_Sc0001210 5426 60 224M2I12M1I1M1I100M * 0 0 AGGAATTAACTTTTTTAATTTTTTGAAAATTTTTGCTATTTTTTTTGGAATTTTCTAGCTCGGGTCTAGTATGAACGGAATTCGGGACTAAAATTTTTTTTTCTCTTTTTCGGCCTATCTCATCCCGTATTGGGTGAAAATAATGCCGGAAATAGGTTTTTTATTTTTTTGCTATTTTTTTCTGAATTTTCTTCCCCAGGCCGTTTATGAACGGAATTCGGGACAAATTTTTTTTTTTCTCTTTCTCGGCCTATCTAATCCCGTTTTGGCAAAAAATAATGCAAGAAAGAGGTTTTTTAATTTTTTGAATTTTTTTGCTATTTGTTTCGGAATTTTCATAC BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB HC:i:-2038458152 RG:Z:ArtificialHaplotypeRG
    A00183:160:H2FWNDSXX:1:2653:31006:12727 163 Qrob_H2.3_Sc0001210 5579 23 71M2I12M1I1M1I62M = 5688 259 AGGTTTTTTATTTTTTTGCTATTTTTTTCTGAATTTTCTTCCCCAGGCCGTTTATGAACGGAATTCGGGACAAATTTTTTTTTTTCTCTTTCTCGGCCTATCTAATCCCGTTTTGGCAAAAAATAATGCAAGAAAGAGGTTTTTTAATTT FFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFF5555555555555555555555,55,555555,5555 HC:i:-2038458152 MC:Z:150M MD:Z:29G14G6G6T7C14T1T3T2A34T20 RG:Z:1 XG:i:4 NM:i:14 XM:i:10 XN:i:0 XO:i:1 MQ:i:23 AS:i:-67 YS:i:-10 YT:Z:CP ms:i:5502
    A00183:160:H2FWNDSXX:4:2363:12210:13432 147 Qrob_H2.3_Sc0001210 5637 23 13M2I12M1I1M1I100M19H = 5482 -300 CGGAATTCGGGACAAATTTTTTTTTTTCTCTTTCTCGGCCTATCTAATCCCGTTTTGGCAAAAAATAATGCAAGAAAGAGGTTTTTTAATTTTTTGAATTTTTTTGCTATTTGTTTCGGAATTTTCATAC FFFFFFFF:FFF:F:,FFFFFFF,F,FFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF HC:i:-2038458152 MC:Z:149M MD:Z:0T7C5T0T13T2A34T65A11 RG:Z:1 XG:i:4 NM:i:12 XM:i:8 XN:i:0 XO:i:1 MQ:i:23 AS:i:-53 XS:i:-39 YS:i:-29 YT:Z:CP ms:i:5332
    A00183:160:H2FWNDSXX:1:2653:31006:12727 83 Qrob_H2.3_Sc0001210 5688 23 75M75H = 5579 -259 TGGCAAAAAATAATGCAAGAAAGAGGTTTTTTAATTTTTTGAATTTTTTTGCTATTTGTTTCGGAATTTTCATAC 5555555555555555555555555555555555555FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FF HC:i:-2038458152 MC:Z:72M4I74M MD:Z:16T65A67 RG:Z:1 XG:i:0 NM:i:2 XM:i:2 XN:i:0 XO:i:0 MQ:i:23 AS:i:-10 YS:i:-67 YT:Z:CP ms:i:5330
    A00183:160:H2FWNDSXX:1:2653:31006:12727 83 Qrob_H2.3_Sc0001210 5738 23 50H53M47H = 5579 -259 GCTATTTGTTTCGGAATTTTCATACCCGGGTCGAGTATGAACGGAATTCGGGA FFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF HC:i:-251543107 MC:Z:72M4I74M MD:Z:16T65A67 RG:Z:1 XG:i:0 NM:i:2 XM:i:2 XN:i:0 XO:i:0 MQ:i:23 AS:i:-10 YS:i:-67 YT:Z:CP ms:i:5330
    A00183:160:H2FWNDSXX:4:2363:12210:13432 147 Qrob_H2.3_Sc0001210 5738 23 101H44M = 5482 -300 GCTATTTGTTTCGGAATTTTCATACCCGGGTCGAGTATGAACGG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF HC:i:-251543107 MC:Z:149M MD:Z:0T7C5T0T13T2A34T65A11 RG:Z:1 XG:i:4 NM:i:12 XM:i:8 XN:i:0 XO:i:1 MQ:i:23 AS:i:-53 XS:i:-39 YS:i:-29 YT:Z:CP ms:i:5332
    HC3 16 Qrob_H2.3_Sc0001210 5738 60 53M * 0 0 GCTATTTGTTTCGGAATTTTCATACCCGGGTCAAGTATGAACGGAATTCGGGA BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB HC:i:-1519648585 RG:Z:ArtificialHaplotypeRG
    HC4 16 Qrob_H2.3_Sc0001210 5738 60 53M * 0 0 GCTATTTGTTTCGGAATTTTCATACCCGGGTCGAGTATGAACGGAATTCGGGA BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB HC:i:-251543107 RG:Z:ArtificialHaplotypeRG

    It seems like for the first 4 variants called there is only one read A00183:160:H2FWNDSXX:1:2653:31006:12727 right?

  • OakmanOakman Member

    I don't know if these snapshots can be of any help.
    I have to attach them as files because I haven t been here long enough to post links.

    The first alignment is the one from Haplotype caller bam-out.
    The second is my original alignment I have used in HaplotypeCaller.

    I selected the read close to the Variant which has problems (the first you can see in the alignment debugging.bam) and selected view as pairs.

    I see that in the original alignment and also in the HapCaller alignment that this read does not overlap the variant of interest? It starts one base after that.

    In the original alignment I have a few reads that reported that variant but those are not present in the alignment produced by Hap Caller which has only a few reads for that area.

    I am trying to understand this, forgive me if it get confusing. I hope I have provided enough info,

  • vruanovruano Member ✭✭
    edited November 9

    About your method questions. I bet there is some threads in the forum that address using BQSR/VQSR without known variants.... example: https://gatkforums.broadinstitute.org/gatk/discussion/1247/what-should-i-use-as-known-variants-sites-for-running-tool-x

    If you want to use those variants mapped to the scaffolds I would suggest that you map those scaffolds to the reference and then localize the variants on the reference thru the scaffold alignment.

  • OakmanOakman Member

    Yes I think it makes sense. Yes I do have many because plants genomes are full of repetitions and repeated elements, so some reads align to multiple places.

    I think your explanation makes sense, I am looking at IGV again and I can see that in my original Bam for the variant in question (Qrob_H2.3_Sc0001210 5578) I have 5 reads as evidence; 3 of them are in white, 1 has MQ=3 and 1 has MQ=23, and all of them are not present in the outbam alignemnt so I think they are disqualified.

    What I wonder however is, why it calls the genotype 1/1 when there is not evidence after realignment?

    Thanks, yes I apologize if I uploaded my data in the wrong way. I do have the data in this format (the original reference and the section of my original alignment) if you would like to reproduce it but I can't upload it here because the reference exceed the size limit to upload in a post.

  • OakmanOakman Member
    edited November 9

    @vruano said:
    About your method questions. I bet there is some threads in the forum that address using BQSR/VQSR without known variants.... example:

    LINK REMOVED

    If you want to use those variants mapped to the scaffolds I would suggest that you map those scaffolds to the reference and then localize the variants on the reference thru the scaffold alignment.

    Yes I am actually following these threads and the recommendation for creating a true set for non model organisms:
    "First, do an initial round of SNP calling on your original, unrecalibrated data. Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator. Finally, do a real round of SNP calling with the recalibrated data. These steps could be repeated several times until convergence"

    I am just a bit unsure on how many samples to use to be able to create a new set, as I found threads suggesting 10 as not many; anyway this is another story, let's just keep this thread about the annotation question and thanks for pointing the threads, I ll search some more about this.

    And thanks for the idea to map scaffolds to reference and then localize the variants; I did not think about that but could actually work, even though I still have to assess whether the heterozygosity set available is to be trusted.

    Thanks for everything and if you are interested in reproducing my issue(bug or whatever we can call it) I can provide you with the small bam of the area in question and reference with dictionary already created, I will search the forum to see if I find how to upload debug data.

  • bhanuGandhambhanuGandham Member, Administrator, Broadie, Moderator admin

    Hi @Oakman

    To send us your data please follow all the instructions provided here. Also once you do that please tag us: @bhanuGandham and @vruano to let us know that you have uploaded the data.

    Regards
    Bhanu

  • OakmanOakman Member

    Hi @bhanuGandham and @vruano ,

    Sorry for the wait.

    I have uploaded the data in your ftp server.
    The archive is: annotation_issue.zip

    I only included a small snippet of the area which we discussed in this thread in which the variant in question is Qrob_H2.3_Sc0001210 5578 . G T 18.59 . AC=2;AF=1.00;AN=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;SOR=0.693 GT:AD:DP:GQ:PL 1/1:0,0:0:3:45,3,0; snippet generated from my original bam with:

    gatk PrintReads -I original.bam -L Qrob_H2.3_Sc0001210:5300-5800 -O subset.bam

    I included reference genome with index and dictionary in the archive.

    Let me know how it goes once you have time to try it.

  • vruanovruano Member ✭✭

    Hi @Oakman,

    Thanks for the upload. As a note the reference provided in the zip file seems to be somehow corrupted
    as the fasta (.fa) file only includes a singe sequence (Qrob_Chr01) and most of it would contain nucleotide lines of different sizes (including blank lines).

    Fortunately I found it online https://urgi.versailles.inra.fr/download/oak/Qrob_PM1N.fa.gz. So perhaps you can confirm that it is alright to use that version of the reference.

    Is important that before uploading the zip/tar.gz you put yourself in our position; and try to reproduce the error using only the files in zip rather than your originals.

  • vruanovruano Member ✭✭

    Hi @Oakman

    I can confirm that in the enclosing active region HC only consider two read pairs (*:12210:13432, and *:3106:1272) and the reason seems to be very low MQ (0s, 2s 6s ...) for the other reads.

    Of those four, 1 is discarded because is too different vs the reference. (*:1:2363:12210:13432). You have only 3 reads.

    There is a total number of 2 haplotypes, one is the reference and the other is the one containing all the variants in the regions so they all seem to be totally linked. That is way the same PLs.

    Now it seems that we may have a bug here.... the only read that allegedly supports the 5578 variant star mapping location in in fact 5579.... this does not mean that is wrong to consider it but it seems that we do it inconsistently .... for the PLs we apply a padding around the variant location to consider a read as relevant (and 1bp distance is clearly within the margins) but it comes to AD/DP there is no padding. That explains the discrepancy that you are observing here. If these were consistency either the AD/DP would have that read in or the variant would never be output. This also explains and would resolve the lack of DP and other annotations in the INFO field...

    Unfortunately I cannot say what is the best option of the two here.... but I'm confident that we should seek to make it consistent.

    As for the lack of INFO annotations ... if there is no DP in the genotype then these are not included. Ensuring consistency on the padding overlap for PL and AD would solve the issue here as well.

  • vruanovruano Member ✭✭

    I have filed the issue. Eventually someone will address this. I put a note that when that happens that someone will come back to this thread to give the final answer on the matter.

  • vruanovruano Member ✭✭
    edited November 19

    Please whatever the course of action, report it back on the forum thread using the link above.
    Oups... posted in the wrong place and there is no way to delete it, ... well.

  • OakmanOakman Member
    edited November 19

    Hi @vruano ,

    I apologize for the broken reference; I have just checked the zipped folder which I uploaded, and after extracting the reference is fine on my side. I am not sure how this happened, maybe some issue in the upload.

    Anyway yes, you downloaded the correct reference, thank you.

    Thank you for following up on this, I will be waiting to hear from your team.

  • phhphh Member

    Hi @vruano and @bhanuGandham

    The command I used is:

    gatk --java-options "-Xmx40g" HaplotypeCaller -A AlleleFraction -R ref_genome.fasta -I input.bam -O output.vcf --minimum-mapping-quality 30 -mbq 30

    I also observed the same bug. Here are some examples:

    Chr1 83934 . A G 1697.77 . AC=2;AF=1.00;AN=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;SOR=0.693 GT:AD:AF:DP:GQ:PL 1/1:0,0:NaN:0:99:1726,129,0
    Chr1 83935 . A G 62.74 . AC=2;AF=1.00;AN=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;SOR=0.693 GT:AD:AF:DP:GQ:PL 1/1:0,0:NaN:0:6:90,6,0
    Chr5 450249 . C G 10.26 . AC=2;AF=1.00;AN=2;DP=102;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=56.86;QD=0.10;SOR=0.693 GT:AD:AF:DP:GQ:PL 1/1:0,0:NaN:0:18:38,18,0
    Chr5 927458 . A ATATACTATACTATAC 54.19 . AC=1;AF=0.500;AN=2;DP=129;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=59.04;QD=0.42;SOR=0.693 GT:AD:AF:DP:GQ:PL 0/1:0,0:NaN:0:4:90,0,4

    The version I used is GATK 4.0.11.0. Thanks,

Sign In or Register to comment.