
Haplotype Caller Active Region

pauljpemberton Posts: 6 Member

I am getting the following output on the progress monitor of Haplotype Caller:

INFO 13:50:30,687 ProgressMeter - 17:73500593 0.00e+00 19.9 h 15250.3 w 100.0% 19.9 h 0.0 s

WARN 13:50:56,090 DiploidExactAFCalc - this tool is currently set to genotype at most 6 alternate alleles in a given context, but the context at 17:73335276 has 7 alternate alleles so only the top alleles will be used; see the --max_alternate_alleles argument

INFO 13:51:30,706 ProgressMeter - 17:73500593 0.00e+00 19.9 h 15250.3 w 100.0% 19.9 h 0.0 s

Why is the active region listed as 17:73500593 when Haplotype Caller is apparently looking at 17:73335276? A position roughly 165,000 bases away does not seem like it would belong to the same active region. What does the active region column (-f3) actually display or represent?

Thanks, -Paul Pemberton

Answers

  • Geraldine_VdAuwera Posts: 5,230 Administrator, GSA Member

    Hi there, sorry to get back to you so late; your post slipped through my net.

    The progress meter is showing how far the ActiveRegionTraversalEngine has progressed. Those active regions are added to a queue and then sent to the HaplotypeCaller's map function. What gets printed out is the region where the action was happening at the time the ProgressMeter call was triggered; not every region gets mentioned in that output, if that makes sense. If you have a list of regions A, B, C, D, E, F, G, it is quite possible, depending on rate of progression, that you might get the following output:

    INFO ... ProgressMeter ... region A
    INFO ... ProgressMeter ... region D
    WARN ... problem somewhere in region E
    INFO ... ProgressMeter ... region G
    

    Geraldine Van der Auwera, PhD
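
The sampling behaviour described in the answer above can be illustrated with a small simulation. This is only a sketch, not GATK code: the function name `simulate_progress`, the per-region durations, and the reporting interval are all hypothetical, and the only assumption is that the meter prints at fixed time intervals whichever region is being processed at that moment.

```python
def simulate_progress(regions, durations, report_interval):
    """Return the regions a periodic progress meter would print.

    regions: region names in traversal order (hypothetical labels).
    durations: simulated time units spent on each region.
    report_interval: simulated time units between printouts.
    """
    reported = []
    clock = 0        # simulated time at which the current region starts
    next_report = 0  # simulated time of the next scheduled printout
    for name, duration in zip(regions, durations):
        end = clock + duration
        # Emit every printout scheduled while this region is being processed.
        while next_report < end:
            reported.append(name)
            next_report += report_interval
        clock = end
    return reported


regions = ["A", "B", "C", "D", "E", "F", "G"]
durations = [10, 10, 10, 10, 10, 10, 10]

# With a printout every 30 time units, only some regions appear.
print(simulate_progress(regions, durations, report_interval=30))
# → ['A', 'D', 'G']
```

Under these assumed timings the meter never mentions B, C, E, or F, even though every region was processed, so a warning logged while region E is being handled would appear between the printouts for D and G.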

  • pauljpemberton Posts: 6 Member

    That makes perfect sense. My question is more along the lines of why it is possible for the progress meter to output something like the following:

    Assume that we have the same regions (A, B, C, D, E, F, G). If the progress meter outputs

    INFO ... ProgressMeter ... region A
    INFO ... ProgressMeter ... region G
    INFO ... ProgressMeter ... region G
    INFO ... ProgressMeter ... region G
    INFO ... ProgressMeter ... region G
    WARN ... problem somewhere in region B
    INFO ... ProgressMeter ... region G
    .......

    where region B is ~200,000 bp away from region G, isn't it slightly strange that the progress meter jumped ahead and then went back to region B after analyzing region G? Could it be due to multiple threads moving at different rates but all reporting to the same progress meter? It just seemed like a strange output, and it did not allow for accurate progress prediction. I understand that accurate progress prediction is always difficult, but I thought the active region output would be closer to reality than 200,000 bp. Does this make sense, and could the multi-threading be the issue?

    Thanks,

    -Paul

  • pdexheimer Posts: 297 Member

    pedaling through the sauerkraut

    I love it! I've never heard that one before, the imagery is fabulous

  • Geraldine_VdAuwera Posts: 5,230 Administrator, GSA Member

    Heh, it's from my native French, "pédaler dans la choucroute". It's too good a phrase to keep confined to a single language.

    Geraldine Van der Auwera, PhD
