Attention: Want an end-to-end pipelining solution for GATK Best Practices?


Check out Terra here! For more details on whether this is the right fit for you checkout our blogs here.

Minimum depth requirement when creating gVCFs files?

Dear all,

I am interested in creating genome-wide gVCFs from various sources, including exome samples. The rationale is that it is not completely clear what the target region actually is all the time, and it would be good to keep some flexibility about what I capture.

Nevertheless, these single sample gVCF files can become quite large, but most of the lines in the file are used to store very low depth data ( DP <= 2) which is probably not critical for downstream analysis. I would be interested to have, in the HaplotypeCaller gVCF data generation process, an additional parameter that sets as missing regions with, say, 2 reads or less. That would considerably reduce the storage at little cost in terms of discarded information.

Does such a parameter exist already? Would it make sense at all to add this? I am not sure but right now it seems to me it might be useful.

Vincent

Best Answer

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Vincent,

    Currently there is no parameter for setting a minimum depth requirement for HC gVCFs. One way to achieve what you want would be to do a preprocessing step with DepthOfCoverage or DiagnoseTargets and generate an intervals file that includes regions with coverage that satisfy your requirements.

    If there is interest from others in the community for a built-in HC parameter to do this, we can discuss a possible feature request, but my instinct is that this runs counter to the purpose of the gVCF format, which is to have data about every single site (individually with BP_RESOLUTION or as part of a GQ-based band with GVCF) in the gVCF.

    This brings to mind an alternative option if your purpose is just to reduce file size. You could increase the compression level by specifying different thresholds for the GQ-based banding. You'd lose some resolution but in the lower ranges you may not care, since the lower GQs may be associated with low DP.

  • vplagnolvplagnol Member

    OK I think I see your point and this is helpful, thanks.

    So I looked a little bit (and I will experiment more) but I am however a bit puzzled by the way the GQ based compression works. Can you confirm the default values for these bands? For example I see things like this in my gVCF:

    10 103528682 . G . . END=103528682 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,33

    10 103528683 . G . . END=103528684 GT:DP:GQ:MIN_DP:PL 0/0:2:6:2:0,6,54

    10 103528685 . G . . END=103528687 GT:DP:GQ:MIN_DP:PL 0/0:2:0:1:0,0,15

    I find this inconsistent with what I think the default is. Line 1 and 2 have GQ of 3 and 6. If I believe the haplotype caller doc page, the bands are (1, 10, 20, 30, 40, 50). So why does it break down between line 1 and 2? Even though one remains within the same specified GQ band?

    V

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hmm, I agree this looks inconsistent -- I'll check with the team. FYI you can also check the VCF header for the definition of the GQ banding thresholds that were applied to your data.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Ah, it turns out the doc is out of date -- the bands are (0,5,20,60). We're going to remove the values in the doc paragraph. The up-to-date default values are given right below that (see the grey rectangle). And of course like I said before, you can check the VCF header to be sure.

  • vplagnolvplagnol Member

    Yes I was about to reply. Indeed the default values from the headers are: [5, 20, 60].

    I think that if I replace the 5 with a 10 or 15 it should do the job for me. I'll try that out.
    Again, many thanks.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Great, glad that's sorted. Let me know what differences you observe in the file sizes, if you don't mind. This trick may be useful to include in our documentation as a pro-tip.

  • vplagnolvplagnol Member

    And sorry to be dumb, but how do you specify the band values as a GATK argument?
    I tried various combinations of
    --GVCFGQBands [10,20,60] or
    --GVCFGQBands 10 20 60
    and a bit more but I can't figure it out from the documentation. I must read it poorly but I don't think it is very clearly stated.

Sign In or Register to comment.