Trimming a GVCF with "-L"

GATK team,

I currently have many WES gVCFs called with GATK 3.x HaplotypeCaller, and I'm now looking to combine them and run GenotypeGVCFs. Unfortunately, I forgot to add the "-L" argument to HC to reduce the size of the resulting gVCFs, and CombineGVCFs is taking much longer than I expected.

Is there any potential problem with using the "-L" argument to SelectVariants to reduce the size of my gVCFs and then using those smaller gVCFs in the CombineGVCFs stage (and beyond), or do I have to re-run HaplotypeCaller? Would it be better to extend the boundaries of the target file by a certain amount to avoid re-running HaplotypeCaller?

Thanks,

John Wallace

GitHub issue #878, filed by Geraldine_VdAuwera. State: closed (closed by vdauwera).

Answers

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Hi @johnwallace123 ,

    The simplest option may be to run the CombineGVCFs step with -L. But if your priority is to trim down the GVCFs, you can indeed use SelectVariants with -L. You may want to add some interval padding using the -ip argument (see engine args), because off the top of my head I don't recall whether the tool will include block records that start before the interval starts. That should be easy to test, of course.

  • thibault · Broad Institute · Member, Broadie, Moderator, Dev ✭✭

    Yes, it's important to be careful when using -L on GVCFs, for that reason. -L only considers the start of GVCF blocks, so it will miss any blocks that start before the interval.
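To make the point above concrete, here is a small toy model in Python. It is only a sketch of the behaviour described in this thread (selection by block start position), not GATK's actual implementation. The function names `select_by_start` and `select_by_overlap`, the block coordinates, and the target interval are all made up for illustration.

```python
from typing import List, Tuple

# A gVCF reference block reduced to (POS, END), 1-based and inclusive.
Block = Tuple[int, int]


def select_by_start(blocks: List[Block], start: int, end: int, padding: int = 0) -> List[Block]:
    """Keep blocks whose POS lies inside the (optionally padded) interval.

    This mimics the behaviour described in the thread: only the start of
    each block is compared against the interval.
    """
    lo, hi = start - padding, end + padding
    return [b for b in blocks if lo <= b[0] <= hi]


def select_by_overlap(blocks: List[Block], start: int, end: int) -> List[Block]:
    """Keep every block that overlaps the interval at all (for comparison)."""
    return [b for b in blocks if b[0] <= end and b[1] >= start]


# Hypothetical blocks: the first one starts before the target but covers its start.
blocks = [(900, 1049), (1050, 1120), (1121, 1300)]
target = (1000, 1200)

no_padding = select_by_start(blocks, *target)
with_padding = select_by_start(blocks, *target, padding=100)
overlapping = select_by_overlap(blocks, *target)

print(no_padding)    # the block starting at 900 is dropped
print(with_padding)  # padding of 100 reaches back to 900, so it is kept
print(overlapping)   # an overlap test keeps it without any padding
```

In this toy case, positions 1000 to 1049 of the target lose their reference block unless the padding reaches back at least as far as the start of that block.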

  • johnwallace123 · Member ✭✭

    @thibault, @Geraldine_VdAuwera,

    Thanks so much for the answers; they confirm my suspicion about how gVCFs and SelectVariants work. I'm not sure that using -L with CombineGVCFs will work well, as it seems that "-L" makes GenotypeGVCFs run much less efficiently. I've seen other forum posts to that effect, and it matches my initial testing.

    With that in mind, there are a few potential workarounds that come to mind (in order of increasing complexity):

    1. We can use the -ip argument to include padding. Is there a maximum size of a gVCF band, either technical or practical? I think it would be OK if we missed out on up to 1% of the leading blocks.
    2. CombineGVCFs comes with a "-bpResolution" argument to convert a set of merged gVCFs to base pair resolution, which we could then pass through SelectVariants. However, we would still like to save the disk space, so is there a way to go back to banded mode?
    3. Is it feasible to add a "--gvcfMode" argument to SelectVariants so that when using the "-L" argument, it also emits the previous variant if the leading variant in any given "-L" block is not aligned with the start of the block? I'd be willing to give it a shot, but I don't want to open that can of worms if it wouldn't be possible given the architecture of the SelectVariants command.
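For the first workaround, one way to get a feel for how large the padding would need to be is to measure the block lengths in an existing gVCF. The Python sketch below is not a GATK tool: the function names `block_lengths` and `padding_for_coverage` and the example records are hypothetical. It assumes reference blocks carry an `END=` entry in the INFO column, and it treats "padding that covers a given fraction of blocks" as a rough guide only, since longer blocks are more likely to straddle an interval start than their share of records suggests.

```python
import math
from typing import Iterable, List


def block_lengths(vcf_lines: Iterable[str]) -> List[int]:
    """Return END - POS + 1 for every record that has END= in its INFO field."""
    lengths = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        pos, info = int(fields[1]), fields[7]
        for entry in info.split(";"):
            if entry.startswith("END="):
                lengths.append(int(entry[4:]) - pos + 1)
                break
    return lengths


def padding_for_coverage(lengths: List[int], fraction: float = 0.99) -> int:
    """Smallest padding that reaches the start of `fraction` of the blocks.

    A block of length n that overlaps an interval start begins at most
    n - 1 bases before it, hence the "- 1".
    """
    if not lengths:
        return 0
    ordered = sorted(lengths)
    idx = max(0, math.ceil(fraction * len(ordered)) - 1)
    return ordered[idx] - 1


# Hypothetical records: two reference blocks and one variant site.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1",
    "chr1\t900\t.\tA\t<NON_REF>\t.\t.\tEND=1049\tGT:DP:GQ\t0/0:30:60",
    "chr1\t1050\t.\tC\tT,<NON_REF>\t500\t.\tDP=40\tGT:AD:DP:GQ\t0/1:20,20,0:40:99",
    "chr1\t1051\t.\tG\t<NON_REF>\t.\t.\tEND=1120\tGT:DP:GQ\t0/0:28:55",
]

lengths = block_lengths(example)
print(lengths)
print(padding_for_coverage(lengths, fraction=0.99))
```

To run this on a real file, pass an open file handle instead of the list, for example `gzip.open(path, "rt")` for a bgzipped gVCF.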

    Thanks so much for all of your help!

    -John Wallace
