CollectSequencingArtifactMetrics PreAdapterDetailMetrics question

Dear GATK support,
I'm trying to figure out how the error_rate and qscore is calculated in the Picard sequencing artifact metrics. I notice that in the PreAdapterDetailMetrics, for two reverse compliment context, the pro and con ref bases and alt bases are exact opposite, which lead to one to be scored 100 and the other in the 30s. And I got confused now are the pro_ref_bases and con_ref_bases generated? Shouldn't they be the same because they are just reverse comp copy of each other? Example shown below.

REF_BASE ALT_BASE CONTEXT PRO_REF_BASES PRO_ALT_BASES CON_REF_BASES CON_ALT_BASES ERROR_RATE QSCORE
C T TCT 33803861 34212 33838997 62648 0 100
G A AGA 33838997 62648 33803861 34212 0.00042 34

Another related question is, is there any guideline that we can use as recommended cutoff score to define whether a certain base change is artifact instead of true variant?

Thanks,
Wen

Best Answers

Answers

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    Hi @wenluo,

    What version of Picard are you using and are you saying the results of PreAdapterDetailMetrics are unexpected? If so, and if you are not using a newer version of Picard, can you please try the latest release and see if you get the same results?

    Picard metrics are explained at https://broadinstitute.github.io/picard/picard-metric-definitions.html. Look under the section titled SequencingArtifactMetrics.PreAdapterDetailMetrics.

    To answer your second question, we have a tool that filters variant calls based on orientation bias. The tool is FilterByOrientationBias. Its use is explained in the context of somatic variant discovery, in section 5 of Tutorial#11136. It takes as an input the results of CollectSequencingArtifactMetrics.pre_adapter_detail_metrics.

  • wenluowenluo Member

    Hi Shlee,

    Thanks for pointing the FilterByOrientationBias tools to me!
    I'm not saying anything wrong with picard PreAdapterDetailMetrics. I'm just trying to understand the mechanism of how it is defining PRO_REF_BASES, PRO_ALT_BASES, CON_REF_BASES, CON_ALT_BASES, that two reverse complementary context have opposite number of bases for pro and con.

    Thanks,
    Wen

    Issue · Github
    by shlee

    Issue Number
    1179
    State
    closed
    Last Updated
    Assignee
    Array
    Closed By
    meganshand
  • joneskm4joneskm4 NCIMember

    Thanks so much for this detailed explanation!

    Wen and I work together, and while this helps, we are still trying to wrap our heads around something.

    If you look at the numbers for the example, it is clearly using the same genomic locations for both the C>T and G>A changes. It is just binning them in different categories and thus coming up with very different Qscores.

    The problem is that conceptually this doesn't really make sense. This seems to be telling us that for G>A base changes (with the context AGA), we may be concerned about a high rate of artifactual changes. But for C>T changes (with the context TCT), that the rate of artifactual changes would be much lower. This example shown is from an FFPE sample, and when we perform variant calling on this sample, we see highly elevated levels of C>T and G>A base changes relative to other types of base changes/variants called. This is also a well-documented artifact in FFPE samples. So why would we see elevated base changes in the variant calling for both types (C>T, G>A), but only the scores for the G>A base changes reflect that (all C>T base changes for all contexts in this example are 100, while the G>A scores are in the 20s/30s). If we are seeing evidence that both C>T and G>A base changes are potentially problematic (due to the unexpectedly high variant counts for these types of changes), why does the qscore for only 1 set reflect that?

    Thank you!

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @joneskm4
    Hi,

    I will ask Megan to get back to you soon.

    -Sheila

  • mshandmshand Member, Broadie, Dev ✭✭

    Hello @wenluo and @joneskm4,

    Exactly as you've observed, this metric is balancing the error rates of an original (eg before adapters or PCR) C->T error vs G->A error, sometimes referred to as an Orientation Bias error (for example the artifact you're observing in the FFPE sample).

    One side of this particular artifact (C->T error on a single strand of DNA, before adapters or PCR) is eventually sequenced either as C->T on F1R2 reads or G->A on F2R1 reads depending on if it came from the top strand or bottom strand (a detailed explanation of this point is in the attached slides). As you can see in the slides, this comes from the fact that since this error only occurred on one strand it is only propagated through PCR, after adapters are ligated, on either F1R2 or F2R1 reads.

    All we can observe is the number of times we see a C->T in F1R2 vs a C->T on F2R1 (which would indicate different original (pre-adapter) artifacts [either C->T or G->A respectively]). The final reported metric shows the error rate for a pre-adapter C->T error (which will show up in the sequenced data as both C->T and G->A artifacts) balanced against a pre-adapter G->A error (which will also show up in sequenced/aligned data as both C->T and G->A artifacts). In other words, what we know is that we see one more than the other. Therefore the metric is reported as the error rate for the more likely original error (G->A in your example case) and since the original error of C->T is less likely (and therefore somewhat unobservable since the error rate turns out to be negative) it is rounded to an arbitrarily high Qscore of 100.

    You could potentially try using GATK's orientation bias filter (https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_exome_FilterByOrientationBias.php) to see if it will filter both of your observed artifacts (C->T and G->A). This particular filter is an area of active development so there might also be a new orientation bias filter in GATK in the future.

Sign In or Register to comment.