
What exactly does the --minReadsPerAlignmentStart flag specify in HaplotypeCaller?

Specifically, what does the 'start' component of this flag mean? Do the reads all have to start in exactly the same location? Alternatively, does the flag specify the total number of reads that must overlap a putative variant before that variant will be considered for calling?

Answers

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @gwilymh

    Hi,

    It means that the reads will all have to start at the beginning of the active region. Please read more about active regions here: https://www.broadinstitute.org/gatk/guide/article?id=4147

    For example, let's say we have an active region at positions 1-200. If you set --minReadsPerAlignmentStart 250, 250 reads that start at position 1 will be used. If you have more than 250 reads, random downsampling will occur. If you have fewer than 250 reads at position 1, all the reads will be used. Please note that not all of the reads used will necessarily span the entire active region.
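
The capping behavior described in this example can be sketched in a few lines of Python. This is only a toy illustration of "keep at most N reads per start position, chosen at random", not GATK's actual downsampling code; the function name `downsample_by_start` and the `(name, start, end)` tuple layout are assumptions made for the sketch.

```python
import random
from collections import defaultdict


def downsample_by_start(reads, max_per_start, seed=0):
    """Keep at most `max_per_start` reads for each alignment start.

    `reads` is a list of (name, start, end) tuples (hypothetical layout).
    Groups larger than the cap are randomly downsampled; smaller groups
    are kept whole.
    """
    rng = random.Random(seed)

    # Group reads by their alignment start position.
    by_start = defaultdict(list)
    for read in reads:
        by_start[read[1]].append(read)

    kept = []
    for start in sorted(by_start):
        group = by_start[start]
        if len(group) > max_per_start:
            group = rng.sample(group, max_per_start)
        kept.extend(group)
    return kept


# 300 reads start at position 1, 10 reads start at position 50.
reads = [(f"r{i}", 1, 101) for i in range(300)]
reads += [(f"s{i}", 50, 150) for i in range(10)]

kept = downsample_by_start(reads, max_per_start=250)
at_1 = sum(1 for _, start, _ in kept if start == 1)
at_50 = sum(1 for _, start, _ in kept if start == 50)
print(at_1, at_50)
```

In this sketch the 300 reads at position 1 are cut down to 250, while the 10 reads at position 50 are all kept because they are under the cap.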

    -Sheila

  • chlangley, UCD (Member)

    hello:

    The default minReadsPerAlignmentStart is 5.
    Is there an explanation or discussion of this choice, and of the circumstances under which another value would be optimal?

    In a deep WGS with PCR duplicates removed, the number of reads starting at the boundary of the active region (indeed at any random position) should be well below 5.

    Am I missing something? I did look at https://www.broadinstitute.org/gatk/guide/article?id=4147 .

    Is there a procedure for optimizing the minReadsPerAlignmentStart parameter?

    Thanks for the high-quality and timely support.

    Cheers,
    Chuck

  • AlexanderV, Berlin (Member)

    I have a followup on that.

    Is -minReadsPerAlignStart used to decide if a region is defined as active?
    Or just as a threshold for the downsampling algorithm?

    If the former, I have the following situation (---- are reads, good base quality):

        -------|-----------------------------|----
               |-----------------------------|-----------
      ---------|-----------------------------|--
             --|-----------------------------|---------
    -----------|-----------------------------|
    

    Will this be considered an active region?
    The marked area has 5 reads covering it, but only 1 starts at the active region's start.
    So, which is the case? I would not consider this a bad site.

    I think this is also what @chlangley meant:

    In a deep WGS with PCR duplicates removed, the number of reads starting at the boundary of the active region (indeed at any random position) should be well below 5.

    @Sheila, reading your answer (Nr. 2) again, it seems that it is just for protecting the region from too much downsampling.

    As a followup to this: Is there a minimum # of reads to trigger an active region?

    [...] the per-position score is the probability that the position contains a variant as calculated using the reference-confidence model applied to the original alignment.

    Is there anything written about this reference-confidence model that could answer my question?

    Thanks!

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)
    edited August 2015

    @AlexanderV
    Hi,

    No, minReadsPerAlignStart is not used to determine if a region is active. Have a look at how active regions are defined here: https://www.broadinstitute.org/gatk/guide/article?id=4147
    The minReadsPerAlignStart is simply to make sure that the start position of an active region is covered by a certain number of reads. So, it is possible that a region could be marked as active, but not output because there are not enough reads at the start position.

    In your example, the region would be marked as active. The reads themselves do not have to start at the active region start, they simply have to cover the active region start.
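
The distinction between reads that start at a position and reads that merely cover it can be made concrete with a toy version of the diagram above. The coordinates below are invented for illustration (the active region is assumed to span positions 100 to 130, inclusive), and the helper names are hypothetical; nothing here reflects how GATK itself counts reads.

```python
def count_covering(reads, pos):
    """Number of reads whose inclusive [start, end] interval contains `pos`."""
    return sum(1 for start, end in reads if start <= pos <= end)


def count_starting(reads, pos):
    """Number of reads whose alignment start is exactly `pos`."""
    return sum(1 for start, _ in reads if start == pos)


# Five reads, loosely mirroring the diagram: all span the region 100-130,
# but only the second one begins exactly at the region start.
region_start = 100
reads = [
    (93, 134),
    (100, 141),
    (91, 132),
    (98, 139),
    (89, 130),
]

covering = count_covering(reads, region_start)
starting = count_starting(reads, region_start)
print(covering, starting)
```

With these made-up coordinates, five reads cover the region start but only one starts there, which is the situation the question describes.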

    I think that only 1 read is necessary to tag a region as active; however, I will have to check and get back to you.

    A document on the Reference Confidence Model is in progress.

    -Sheila

    Issue · Github (by Sheila): Issue Number 148; State: closed; Closed By: chandrans

  • Sheila, Broad Institute (Member, Broadie, Moderator, admin)

    @AlexanderV
    Hi again.

    With only one read, the calculation will be done to see if it's active, but one read may not be enough evidence for triggering an active region.

    -Sheila
