Recommendation / best practices GVCF GQ bands

rernstrernst UMC UtrechtMember
edited February 2016 in Ask the GATK team

Dear GATK team,

Do you have any recommendation or best practice guidelines on the GVCFGQBands setting? This used to be (gatk < 3.2) [0,5,20,60] but has been changed one block per GQ value in gatk > 3.3. We would like to decrease the file size of the g.vcf files without loosing to much resolution / sensitivity. In the gvcf docs the 'old' defaults are still mentioned: https://www.broadinstitute.org/gatk/guide/article?id=4017

Thanks,
Robert

Tagged:

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @rernst
    Hi Robert,

    We don't provide recommendations for the blocks, as it is up to you to determine what works for your analysis. The reason we changed from the original default blocks was to provide more resolution. However, if the file is too big with the current default blocks, you can try experimenting with the band size to find a size that gives you good resolution and small file size.

    Good luck!

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    You may notice that the bands are only single position up to a certain multiple of 10 (I think 50 but I could be misremembering). The idea is to find the optimal balance between resolution in the low-confidence areas, and good compression in the high-confidence areas. The happy trade off point will vary with the quality of your sequencing data, so don't hesitate to adapt the scheme to the data you have (look at distribution of GQ values in a bp_resolution dataset to get a sense of what you're dealing with).

    We will make a note to update the doc.

  • rernstrernst UMC UtrechtMember
    edited February 2016

    Thanks for the answers.

    Because I wanted to see how small a gvcf approximately could get I wanted to use the following bins:
    GVCFBlock0-100=minGQ=0(inclusive),maxGQ=100(exclusive)
    GVCFBlock100-2147483647=minGQ=100(inclusive),maxGQ=2147483647(exclusive)

    However I see a lot of positions that are not binned together even though they fall in the same bin, for example:
    1 12808 . G . . END=12994 GT:DP:GQ:MIN_DP:PL 0/0:23:21:16:0,21,615
    1 12995 . C . . END=12996 GT:DP:GQ:MIN_DP:PL 0/0:34:99:34:0,102,1460
    1 12997 . C . . END=12998 GT:DP:GQ:MIN_DP:PL 0/0:35:99:34:0,99,1485
    1 12999 . C . . END=13024 GT:DP:GQ:MIN_DP:PL 0/0:40:99:37:0,100,1438
    1 13025 . T . . END=13025 GT:DP:GQ:MIN_DP:PL 0/0:42:99:42:0,99,1721
    1 13026 . G . . END=13078 GT:DP:GQ:MIN_DP:PL 0/0:52:99:43:0,114,1750

    I also see this behaviour when using a more sophisticated binning strategy (I only show 2 blocks):
    GVCFBlock60-70=minGQ=60(inclusive),maxGQ=70(exclusive)
    GVCFBlock90-2147483647=minGQ=90(inclusive),maxGQ=2147483647(exclusive)
    1 12872 . G . . END=12881 GT:DP:GQ:MIN_DP:PL 0/0:24:61:24:0,61,969
    1 12882 . C . . END=12882 GT:DP:GQ:MIN_DP:PL 0/0:25:60:25:0,60,1057

    1 14457 . A . . END=14457 GT:DP:GQ:MIN_DP:PL 0/0:35:90:35:0,90,1398
    1 14458 . G . . END=14463 GT:DP:GQ:MIN_DP:PL 0/0:36:92:35:0,92,1364

    To my understanding these all fall in the same block, nevertheless they are printed as separate blocks. Do you have any clue what the cause of this is?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Can you post the list of blocks defined in the GVCF header?

  • rernstrernst UMC UtrechtMember

    For the first example they were already in my previous post (now in code blocks):
    ##GVCFBlock0-100=minGQ=0(inclusive),maxGQ=100(exclusive)
    ##GVCFBlock100-2147483647=minGQ=100(inclusive),maxGQ=2147483647(exclusive)

    Second example:
    ##GVCFBlock0-1=minGQ=0(inclusive),maxGQ=1(exclusive)
    ##GVCFBlock1-2=minGQ=1(inclusive),maxGQ=2(exclusive)
    ##GVCFBlock10-11=minGQ=10(inclusive),maxGQ=11(exclusive)
    ##GVCFBlock11-12=minGQ=11(inclusive),maxGQ=12(exclusive)
    ##GVCFBlock12-13=minGQ=12(inclusive),maxGQ=13(exclusive)
    ##GVCFBlock13-14=minGQ=13(inclusive),maxGQ=14(exclusive)
    ##GVCFBlock14-15=minGQ=14(inclusive),maxGQ=15(exclusive)
    ##GVCFBlock15-16=minGQ=15(inclusive),maxGQ=16(exclusive)
    ##GVCFBlock16-17=minGQ=16(inclusive),maxGQ=17(exclusive)
    ##GVCFBlock17-18=minGQ=17(inclusive),maxGQ=18(exclusive)
    ##GVCFBlock18-19=minGQ=18(inclusive),maxGQ=19(exclusive)
    ##GVCFBlock19-20=minGQ=19(inclusive),maxGQ=20(exclusive)
    ##GVCFBlock2-3=minGQ=2(inclusive),maxGQ=3(exclusive)
    ##GVCFBlock20-25=minGQ=20(inclusive),maxGQ=25(exclusive)
    ##GVCFBlock25-30=minGQ=25(inclusive),maxGQ=30(exclusive)
    ##GVCFBlock3-4=minGQ=3(inclusive),maxGQ=4(exclusive)
    ##GVCFBlock30-40=minGQ=30(inclusive),maxGQ=40(exclusive)
    ##GVCFBlock4-5=minGQ=4(inclusive),maxGQ=5(exclusive)
    ##GVCFBlock40-50=minGQ=40(inclusive),maxGQ=50(exclusive)
    ##GVCFBlock5-6=minGQ=5(inclusive),maxGQ=6(exclusive)
    ##GVCFBlock50-60=minGQ=50(inclusive),maxGQ=60(exclusive)
    ##GVCFBlock6-7=minGQ=6(inclusive),maxGQ=7(exclusive)
    ##GVCFBlock60-70=minGQ=60(inclusive),maxGQ=70(exclusive)
    ##GVCFBlock7-8=minGQ=7(inclusive),maxGQ=8(exclusive)
    ##GVCFBlock70-80=minGQ=70(inclusive),maxGQ=80(exclusive)
    ##GVCFBlock8-9=minGQ=8(inclusive),maxGQ=9(exclusive)
    ##GVCFBlock80-90=minGQ=80(inclusive),maxGQ=90(exclusive)
    ##GVCFBlock9-10=minGQ=9(inclusive),maxGQ=10(exclusive)
    ##GVCFBlock90-2147483647=minGQ=90(inclusive),maxGQ=2147483647(exclusive)

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    I wanted to see what was in the GVCF header for confirmation that your requested blocks had been interpreted correctly. Looks ok.

    I'm not sure why this is happening; would you say this affects a lot of sites or a minority?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Also, are you using any sort of parallelism for execution?

  • rernstrernst UMC UtrechtMember
    edited February 2016

    Well GQ never gets higher then 99, so I was expecting to have only 1 bin between each variant in the first example. However I have 4588077 positions where there is evidence for a variant and 186349328 blocks where there is no evidence. So thats about 40 blocks per variant.

    For the second example it is a bit harder to quantify, unless we do some scripting. But I would see that in the first 500 lines or so there are 10/20 bins that can be merged (just some eyeballing). So yes, I think it is a lot.

    Yes, I am using queue and scattering (1000). However these 'breaks' in bins are not only located at the 'breaks' of scatters, this was my first thought as well.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @rernst
    Hi Robert,

    I just tried changing the GVCF block size, and it worked correctly for me. I added -GQB 50 and -GQB 99 to my usual HaplotypeCaller command. Can you tell us the exact command you are running?
    Can you also confirm you are using the latest version of GATK?

    Thanks,
    Sheila

  • rernstrernst UMC UtrechtMember

    @Sheila
    Yes I am using 3.5.

    This is my queue command:
    java -Xmx4G -Djava.io.tmpdir=/path/to/tmp -jar /path/to/GenomeAnalysisTK-3.5/Queue.jar -jobQueue all.q -jobNative " -P cog_bioinf -pe threaded 2 -q all.q -l h_rt=4:0:0,h_vmem=9G" -jobRunner GridEngine -jobReport /path/to/VariantCaller.jobReport.txt -memLimit 5 -S /path/to/HaplotypeCaller_gVCF.scala -R /path/to/genome.fasta -O test_run -mem 5 -nct 2 -nsc 1000 -stand_call_conf 30 -stand_emit_conf 15 -I /path/to/sample.bam -D /path/to/dbsnp.vcf -retry 1 -run

    This is the scala script: https://github.com/CuppenResearch/IAP/blob/master/QScripts/HaplotypeCaller_gVCF.scala to which I added the line: haplotypeCaller.GVCFGQBands = List(100)

    This is translated to the following command (logs queue scatter):
    -T HaplotypeCaller -I /path/to/sample.bam -L /path/to/tmp/.queue/scatterGather/HaplotypeCaller_gVCF-1-sg/temp_0909_of_1000/scatter.intervals -R /path/to/genome.fasta -nct 2 -variant_index_type LINEAR -variant_index_parameter 128000 -o /path/to/sample.g.vcf.gz -ERC GVCF -stand_call_conf 30.0 -stand_emit_conf 15.0 -ploidy 2 -GQB 100

    I also tried running the folowing gatk command:
    java -Xmx4G -jar /path/to/GenomeAnalysisTK-3.5/GenomeAnalysisTK.jar -T HaplotypeCaller -I sample.bam -R /path/to/genome.fasta -o test.g.vcf.gz RC GVCF -stand_call_conf 30.0 -stand_emit_conf 15.0 -ploidy 2 -GQB 100
    This gives the same result as my queue script:
    1 12808 . G . . END=12994 GT:DP:GQ:MIN_DP:PL 0/0:23:21:16:0,21,615
    1 12995 . C . . END=12996 GT:DP:GQ:MIN_DP:PL 0/0:34:99:34:0,102,1460
    1 12997 . C . . END=12998 GT:DP:GQ:MIN_DP:PL 0/0:35:99:34:0,99,1485
    1 12999 . C . . END=13024 GT:DP:GQ:MIN_DP:PL 0/0:40:99:37:0,100,1438
    1 13025 . T . . END=13025 GT:DP:GQ:MIN_DP:PL 0/0:42:99:42:0,99,1721
    1 13026 . G . . END=13078 GT:DP:GQ:MIN_DP:PL 0/0:52:99:43:0,114,1750

    I also added a -GQB 50 just like you, but this will also give non binned positions that should be binned:
    1 12995 . C . . END=12996 GT:DP:GQ:MIN_DP:PL 0/0:34:99:34:0,102,1460
    1 12997 . C . . END=12998 GT:DP:GQ:MIN_DP:PL 0/0:35:99:34:0,99,1485
    1 12999 . C . . END=13024 GT:DP:GQ:MIN_DP:PL 0/0:40:99:37:0,100,1438
    1 13025 . T . . END=13025 GT:DP:GQ:MIN_DP:PL 0/0:42:99:42:0,99,1721
    1 13026 . G . . END=13078 GT:DP:GQ:MIN_DP:PL 0/0:52:99:43:0,114,1750

    Issue · Github
    by Sheila

    Issue Number
    668
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • rernstrernst UMC UtrechtMember
    edited March 2016

    @Sheila
    Can you give an update on this issue? Do you require more information from my side?

    Thanks,
    Robert

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @rernst
    Hi Robert,

    Sorry for the delay. We have been busy with a workshop. I have tried all kinds of combinations of your command and I still get the correct results! Let me ask Geraldine to have a look. She will get back to you soon.

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi @rernst, I think I know what's happening. What's being used to bin the calls are the GQs as they are before capping at 99. If you look at the corresponding PL values, they alternate neatly between being below or equal/above 100. I'll check with the devs whether that's the desired behavior or just an oversight.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Actually I just realized I don't need to ask them -- it's just not something they would have considered. In our default blocking scheme the highest block is "90 and above" so it wouldn't matter either way. I don't think we'd want to change the behavior since it does allow anyone who wants to get more resolution in the higher confidence levels to do so.

  • rernstrernst UMC UtrechtMember

    Hi @Geraldine_VdAuwera , thanks for your answer.

    In your default blocking scheme, the highest block is 99 and above, when using queue at least it is. I do understand what your are saying but I don't think that this explains everything I see. The problem is that it also happens in 'intermediate blocks', which can not be caused by capping of GQ scores. In the lines below (they are also in my 2nd post in this topic), both the GQ and PL of the positions fall in the same block.

    Furthermore, how would someone get more resolution at sides with a GQ >99 if they are capped anyway? Do you use PL's when genotyping the gvcf's or can the capping value be set using a parameter?

    A more sophisticated binning strategy (I only show 2 blocks):
    GVCFBlock60-70=minGQ=60(inclusive),maxGQ=70(exclusive)
    GVCFBlock90-2147483647=minGQ=90(inclusive),maxGQ=2147483647(exclusive)
    1 12872 . G . . END=12881 GT:DP:GQ:MIN_DP:PL 0/0:24:61:24:0,61,969
    1 12882 . C . . END=12882 GT:DP:GQ:MIN_DP:PL 0/0:25:60:25:0,60,1057

    1 14457 . A . . END=14457 GT:DP:GQ:MIN_DP:PL 0/0:35:90:35:0,90,1398
    1 14458 . G . . END=14463 GT:DP:GQ:MIN_DP:PL 0/0:36:92:35:0,92,1364

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi Robert, yes genotyping uses the PLs. GQ is really just a convenience annotation that is used later in the pipeline. And when we have the final VCF we don't normally care about higher resolution where GQ is above 99, but if we want it we could still get it from the PLs.

    For those other sites, I can't say with certainty but I expect this to be an effect from rounding. The decision to block sites together or keep them separate is made before the PLs are rounded to integers, if I recall correctly.

    None of this seems particularly concerning to me, to be honest. Do you have any particular worries about this, beyond the superficial inconsistency (which I agree is not ideal)? Do you see a lot of these cases or is it fairly anecdotal?

  • rernstrernst UMC UtrechtMember

    Hi Geraldine,

    I see quite a lot of cases, in the first 100 lines, I see about 10 lines that could be merged. I did not systematically check this, just eye balling, this number increases dramatically if I lower the number of bins. I don't know how these bins are used when performing the genotypegvcf step and as this clearly is not the behaviour as described in the docs so I thought it would be good idea to report it.

    The initial goal of this undertaking was to see whether we can decrease the file size of the gvcf files. Although we can decrease the file size by using a binning strategy instead of none, there is stil quite some 'space' to win if binning was performed in the way it is described. If the current behaviour however is better for genotyping etc, than that is fine for us as well and we should leave it as it is.

    Issue · Github
    by Sheila

    Issue Number
    703
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hmm, maybe I dismissed this a little too quickly. Would you be able to give us a test case to reproduce this and debug locally? Instructions are here: https://www.broadinstitute.org/gatk/guide/article?id=1894

  • rernstrernst UMC UtrechtMember
    edited March 2016

    I have uploaded a test case to your ftp: rernst_31032016_gvcf_bin_test.zip.

    It contains a slice of bam, my scala script and the final g.vcf.

    It should contain the following (wrongly binned) lines:
    ##GVCFBlock0-5=minGQ=0(inclusive),maxGQ=5(exclusive)
    ##GVCFBlock10-15=minGQ=10(inclusive),maxGQ=15(exclusive)
    ##GVCFBlock15-20=minGQ=15(inclusive),maxGQ=20(exclusive)
    ##GVCFBlock20-25=minGQ=20(inclusive),maxGQ=25(exclusive)
    ##GVCFBlock25-30=minGQ=25(inclusive),maxGQ=30(exclusive)
    ##GVCFBlock30-40=minGQ=30(inclusive),maxGQ=40(exclusive)
    ##GVCFBlock40-50=minGQ=40(inclusive),maxGQ=50(exclusive)
    ##GVCFBlock5-10=minGQ=5(inclusive),maxGQ=10(exclusive)
    ##GVCFBlock50-60=minGQ=50(inclusive),maxGQ=60(exclusive)
    ##GVCFBlock60-70=minGQ=60(inclusive),maxGQ=70(exclusive)
    ##GVCFBlock70-80=minGQ=70(inclusive),maxGQ=80(exclusive)
    ##GVCFBlock80-90=minGQ=80(inclusive),maxGQ=90(exclusive)
    ##GVCFBlock90-2147483647=minGQ=90(inclusive),maxGQ=2147483647(exclusive)
    1 14038 . G <NON_REF> . . END=14038 GT:DP:GQ:MIN_DP:PL 0/0:10:30:10:0,30,413
    1 14039 . G <NON_REF> . . END=14039 GT:DP:GQ:MIN_DP:PL 0/0:10:30:10:0,30,421
    1 14040 . C <NON_REF> . . END=14051 GT:DP:GQ:MIN_DP:PL 0/0:12:30:10:0,30,433

    1 14441 . C <NON_REF> . . END=14446 GT:DP:GQ:MIN_DP:PL 0/0:15:33:15:0,33,495
    1 14447 . A <NON_REF> . . END=14447 GT:DP:GQ:MIN_DP:PL 0/0:15:30:15:0,30,584

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @rernst
    Hi Robert,

    I just submitted a bug report. I found a similar issue without using -GQB. I made a note that the issue also reproduces with -GQB. Let's see what the team says. In the meantime, you can monitor the issue here.

    -Sheila

  • rernstrernst UMC UtrechtMember

    Thanks for looking into this!

  • rernstrernst UMC UtrechtMember

    Hi,

    Can you give an update on this issue (fortunately the bug tracker does not allow us to view progress)?

    Thanks,
    Robert

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @rernst
    Hi Robert,

    Sorry, it looks like not much progress has been made. I know the developer assigned is working on a fix that will probably fix many bugs at once. However, I'm not sure if this bug is one of those many bugs. I will bring this up at our meeting on Tuesday and see what we can do to expedite the fix.

    -Sheila

  • rernstrernst UMC UtrechtMember

    Hi Sheila,

    Great! Any news after the meeting?

    Robert

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @rernst
    Hi Robert,

    Yes, one of the developers will take this bug on right after he finishes a quick other fix :smiley:

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @rernst
    Hi Robert,

    The fix is in! You can download the nightly build tomorrow :smiley:

    -Sheila

  • rernstrernst UMC UtrechtMember

    Hi Sheila,

    Thanks for the fix. I have run my test sample again and it seems to work. The example lines above are now correctly merged.

    However there are still some cases left, I think this has to do with setting GQ values which are > 99 to 99 after creating the bins.

    For example:
    ##GVCFBlock0-100=minGQ=0(inclusive),maxGQ=100(exclusive)
    ##GVCFBlock100-2147483647=minGQ=100(inclusive),maxGQ=2147483647(exclusive)
    1 13705 . C <NON_REF> . . END=13710 GT:DP:GQ:MIN_DP:PL 0/0:38:99:36:0,105,1575
    1 13711 . C <NON_REF> . . END=13712 GT:DP:GQ:MIN_DP:PL 0/0:38:88:38:0,88,1493
    1 13713 . G <NON_REF> . . END=13812 GT:DP:GQ:MIN_DP:PL 0/0:59:99:38:0,105,1496

    These lines are correctly binned when setting the bins to:
    ##GVCFBlock0-99999999=minGQ=0(inclusive),maxGQ=99999999(exclusive)
    ##GVCFBlock99999999-2147483647=minGQ=99999999(inclusive),maxGQ=2147483647(exclusive)

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @rernst
    Hi Robert,

    Ah, I see what you are saying. Okay. I will get back to the team and see what they think.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @rernst
    Hi Robert,

    This should be fixed in the nightly build. Try downloading the nightly build tomorrow, and let us know how it goes :smile:

    -Sheila

  • rernstrernst UMC UtrechtMember

    @Sheila

    Hi Sheila,

    Thanks again for the update, I have run my test set again. Unfortunately when I try to bin everything together I get the following error:

    ERROR MESSAGE: Invalid command line: Argument GVCFGQBands has a bad value: The value 100 in the list of GQ partitions is greater than VCFConstants.MAX_GENOTYPE_QUAL = 99.

    The error itself is quite logical however gatk interprets the -GQB settings as 'create a bin until (not including) the gqb value'. Therefor it is at the moment not possible to create a bin which includes all quality values or for example a bin from 90 to and including 99. Setting the -GQB settings to 99 results in the following bins:

    GVCFBlock0-99=minGQ=0(inclusive),maxGQ=99(exclusive)
    GVCFBlock99-100=minGQ=99(inclusive),maxGQ=100(exclusive)

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @rernst
    Hi Robert.

    Ah, so you do not care about resolution at all!

    I am going to talk to the team and will get back to you asap.

    -Sheila

  • rernstrernst UMC UtrechtMember

    @Sheila

    Actually I do care, but asking for one bin is really useful for debugging this problem... With this last 'bug' it is also not possible to create a gvcf with for example bins with steps of 10, you will always have one extra bin from 99 to 100 which you can't define / change.

    Robert

  • sleeslee Member, Broadie, Dev

    @rernst

    Apologies for the last "bug"---I've put in a pull request that bumps up the maximum allowed GQB value to 100, which should resolve the problem. I overlooked your use case when making the last fix, so thanks for getting back to us about it!

  • rernstrernst UMC UtrechtMember

    @slee

    Great! Can you let me know when it is accepted and in the nightly release? I just tested the one from 08-05 and there it seems not yet fixed.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @rernst
    Hi Robert,

    As soon as someone reviews the fix, it will be in the nightly. Very soon!

    -Sheila

  • rernstrernst UMC UtrechtMember

    @Sheila @slee

    I just tested the latest nightly (GenomeAnalysisTK-nightly-2016-08-11-g9a77889) and the bug seems fixed! Thanks! Do you have an eta on a new release containing this fix?

    p.s. It would be great if we/users are able to track issues and get notifications when things change in the nightly! It saves 'us' testing nightly's and you answering questions about when bugs are fixed.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @rernst Glad to hear your problem is resolved by the fix. I'm not sure we have enough new material to justify cutting 3.7 anytime soon (most effort is going into GATK4), but we may do a point release (3.6-x) to get a few of these fixes out to people as a supported version, as well as adding support for SRA. There's one more fix I'd like to get in before we do that though, so it could be a few more weeks before the release happens.

    We'd love to be able to get you direct notifications re: nightlies, but doing it would be difficult given the development setup we have for GATK3. When we switch to GATK4 it will be much easier.

Sign In or Register to comment.