Trimming a GVCF with "-L"

GATK team,

I currently have many WES gVCFs called with GATK 3.x HaplotypeCaller, and I'm now looking to combine them and run GenotypeGVCFs. Unfortunately, I forgot to add the "-L" argument to HC to reduce the size of the resulting gVCFs, and CombineGVCFs looks like it's taking much longer than I expect it to.

Is there any potential problem with using the "-L" argument to SelectVariants to reduce the size of my gVCFs and then using those smaller gVCFs in the CombineGVCFs stage (and beyond), or do I have to re-run HaplotypeCaller? Would it be better to extend the boundaries of the target file by a certain amount to avoid re-running HaplotypeCaller?

Thanks,

John Wallace

Issue · Github
by Geraldine_VdAuwera

Issue Number: 878
State: closed
Closed By: vdauwera

Answers

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi @johnwallace123 ,

    The simplest may be to run the CombineGVCFs step with -L. But if your priority is to trim down the GVCFs, you can indeed use SelectVariants with -L. You may want to add some interval padding using the -ip argument (see engine args) because off the top of my head I don't recall whether the tool will include block records that start before the interval starts. Should be easy to test of course.
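
For reference, the SelectVariants suggestion above might look roughly like the following. This is only a sketch: the file names, the jar path, and the 100 bp padding are hypothetical placeholders, and the helper function is not part of GATK. It just assembles a GATK 3.x style command line using the -T, -R, -V, -L, -ip and -o arguments.

```python
def select_variants_cmd(gvcf, reference, intervals, out, padding=100,
                        gatk_jar="GenomeAnalysisTK.jar"):
    """Assemble a GATK 3.x SelectVariants call that trims a gVCF to padded intervals.

    All paths and the default padding are illustrative assumptions.
    """
    return [
        "java", "-jar", gatk_jar,
        "-T", "SelectVariants",
        "-R", reference,       # reference FASTA used for calling
        "-V", gvcf,            # input gVCF to trim
        "-L", intervals,       # target intervals (e.g. exome capture regions)
        "-ip", str(padding),   # interval padding, in bases, on both sides
        "-o", out,             # trimmed output gVCF
    ]


cmd = select_variants_cmd(
    "sample1.g.vcf.gz",          # hypothetical input
    "reference.fasta",           # hypothetical reference
    "targets.interval_list",     # hypothetical target file
    "sample1.trimmed.g.vcf.gz",  # hypothetical output
)
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the placeholders point at real files.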

  • thibault (Broad Institute; Member, Broadie, Dev)

    Yes, it's important to be careful when using -L on GVCFs, for that reason. -L only considers the start of GVCF blocks, so it will miss any blocks that start before the interval.
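
To make the problem concrete, here is a small toy model of the behaviour described above. It is not GATK code: the `Block` type and both selection functions are hypothetical, and each gVCF record is reduced to its chromosome, POS, and END (1-based, inclusive). It compares selecting records by start position only with selecting every record that overlaps the interval.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    chrom: str
    start: int  # POS of the record
    end: int    # INFO/END for a reference block, or POS for a single-site record


def select_by_start(blocks: List[Block], chrom: str, lo: int, hi: int) -> List[Block]:
    """Keep records whose start falls inside [lo, hi] (the behaviour described above)."""
    return [b for b in blocks if b.chrom == chrom and lo <= b.start <= hi]


def select_by_overlap(blocks: List[Block], chrom: str, lo: int, hi: int) -> List[Block]:
    """Keep every record that overlaps [lo, hi] at all."""
    return [b for b in blocks if b.chrom == chrom and b.start <= hi and b.end >= lo]


blocks = [
    Block("1", 900, 1049),   # reference block that begins before the interval
    Block("1", 1050, 1050),  # single-site record
    Block("1", 1051, 1300),  # reference block that runs past the interval
]
interval = ("1", 1000, 1200)

by_start = select_by_start(blocks, *interval)
by_overlap = select_by_overlap(blocks, *interval)

print(len(by_start))    # 2: the block starting at 900 is dropped
print(len(by_overlap))  # 3: positions 1000-1049 are still covered
```

In the start-only case, positions 1000 to 1049 end up with no record in the trimmed file, which is why padding the intervals matters.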

  • @thibault, @Geraldine_VdAuwera,

    Thanks so much for the answers; they confirm my suspicion about how gVCFs and SelectVariants work. I'm not sure that using "-L" with CombineGVCFs will work well, as it seems that "-L" makes GenotypeGVCFs run much less efficiently. I've seen other forum posts to that effect, and it mirrors my initial testing.

    With that in mind, there are a few potential workarounds that come to mind (in order of increasing complexity):

    1. We can use the -ip argument to include padding. Is there a maximum size of a gVCF band, either technical or practical? I think it would be OK if we missed out on up to 1% of the leading blocks.
    2. CombineGVCFs comes with a "-bpResolution" argument to convert a set of merged gVCFs to base pair resolution, which we could then pass through SelectVariants. However, we would still like to save the disk space, so is there a way to go back to banded mode?
    3. Is it feasible to add a "--gvcfMode" argument to SelectVariants so that when using the "-L" argument, it also emits the previous variant if the leading variant in any given "-L" block is not aligned with the start of the block? I'd be willing to give it a shot, but I don't want to open that can of worms if it wouldn't be possible given the architecture of the SelectVariants command.
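
Options 1 and 3 can be illustrated with a short sketch. Nothing here is an existing GATK feature: records are modelled as hypothetical `(start, end)` tuples on a single contig, sorted by start and non-overlapping. `padding_needed` gives the padding that would guarantee a start-only "-L" still catches any block spanning an interval start, and `select_with_leading_block` mimics the proposed "--gvcfMode" by also emitting the preceding record when it reaches into the interval.

```python
from bisect import bisect_left
from typing import List, Tuple

Record = Tuple[int, int]  # (start, end), 1-based inclusive, one contig


def padding_needed(records: List[Record]) -> int:
    """Padding that guarantees any block spanning an interval start is picked up.

    A block overlapping position lo can start at most (length - 1) bases before lo.
    """
    return max(end - start + 1 for start, end in records) - 1


def select_with_leading_block(records: List[Record], lo: int, hi: int) -> List[Record]:
    """Select records starting in [lo, hi], plus a preceding record that spans lo."""
    starts = [start for start, _ in records]
    first = bisect_left(starts, lo)  # index of first record starting at or after lo
    selected = []
    if first > 0 and records[first - 1][1] >= lo:
        selected.append(records[first - 1])  # leading block that begins before lo
    for rec in records[first:]:
        if rec[0] > hi:
            break
        selected.append(rec)
    return selected


records = [(900, 1049), (1050, 1050), (1051, 1300), (1301, 1500)]

print(padding_needed(records))                         # 249
print(select_with_leading_block(records, 1000, 1200))  # [(900, 1049), (1050, 1050), (1051, 1300)]
print(select_with_leading_block(records, 1050, 1200))  # [(1050, 1050), (1051, 1300)]
```

Scanning real gVCFs for the longest reference block in this way would answer the practical half of question 1; whether there is a hard technical limit on band size is a separate question for the developers.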

    Thanks so much for all of your help!

    -John Wallace
