How can I filter variants at the end of reads?

airtime Member
edited March 2014 in Ask the GATK team

Hello,
I have a dataset with 12 samples and many variants.
We have validated approximately 15 mutations by Sanger sequencing, and 6 of them are false positives.
All of the false positives are located at the ends of reads, within the last 5 bases.
I searched for a parameter to filter these variants. Did I overlook it, or is there none?
I don't want to clip the ends of all reads; I only want to filter out variants that are called only from the last bases of the reads.
Is this possible?

I read about the ReadPosRankSum annotation, but samples with good reads at this position have a large influence on it, so I couldn't remove the false positives with this parameter.

Best airtime

Answers

  • ebanks Broad Institute Member, Broadie, Dev ✭✭✭✭

    If the other samples don't have the false-positive bases, they shouldn't affect the ReadPosRankSum (or rather, their reads should only count toward the reference bases).

  • airtime Member

    Hi,
    So should I try raising the ReadPosRankSum threshold to filter out false-positive variants at the ends of reads?
    Or is there another parameter?
    best airtime

  • airtime Member

    Hi,
    I experimented with the MQRankSum and ReadPosRankSum thresholds in VariantFiltration.
    At first the results looked good, but after another round of Sanger validation we again got too many false positives.

    Our first validation_list_1 included 25 candidates (11 false positives and 14 validated).
    Based on those results, I decided to set ReadPosRankSum to -1.5 and MQRankSum to -4.5 for the filtration step.
    The new filtered_list contains 15 of the previous candidates (2 false positives and 13 validated; 1 previously validated variant is missed).

    Our second validation_list_2 contains 7 candidates so far (2 false positives and 5 validated).
    This interim result doesn't sound so bad either, but again the false positives are mainly located at the ends of reads, and candidates identified in more than one sample consistently turn out to be false positives (10 of 11 in validation_list_1 and 2 of 2 so far in validation_list_2). Further results for validation_list_2 should be finished in the next few days, but I expect that again half of our candidates will be false positives, because they are identified in more than one sample.
    Did I overlook something in the case of multiple hits?

    Best,
    airtime

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Hi @airtime, it's very difficult to comment on this without seeing your data. Maybe you need to be more stringent when filtering on read position. I'm not sure there's much you can do about the false positives that are found in several samples. You could decide to filter out things that are not unique, if you know that you expect only unique variants, but I don't know how safe that is. It sounds like a question of study design.
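
For readers following the thread, the two ideas discussed above (thresholding ReadPosRankSum and MQRankSum, and counting how many samples carry a variant) can be sketched in plain Python. This is only an illustration of the logic, not GATK code: the cutoffs -1.5 and -4.5 come from airtime's post, while the function names, the assumption that a record fails when its value is below the cutoff, and the choice to let records without the annotation pass are all hypothetical choices made for this example.

```python
# Cutoffs quoted in the thread; the comparison direction (fail when below)
# is an assumption made for this sketch.
READ_POS_RANK_SUM_MIN = -1.5
MQ_RANK_SUM_MIN = -4.5


def parse_info(info_field):
    """Turn a VCF INFO string such as 'DP=10;ReadPosRankSum=-2.1' into a dict."""
    info = {}
    for entry in info_field.split(";"):
        if "=" in entry:
            key, value = entry.split("=", 1)
            info[key] = value
        elif entry and entry != ".":
            info[entry] = True  # flag-style entry without a value
    return info


def failed_filters(info_field):
    """Return the names of the rank-sum annotations that fall below their cutoff.

    A record that lacks an annotation is not failed on it (an assumption of
    this sketch, not a statement about how any particular tool behaves).
    """
    info = parse_info(info_field)
    failed = []
    for key, minimum in (("ReadPosRankSum", READ_POS_RANK_SUM_MIN),
                         ("MQRankSum", MQ_RANK_SUM_MIN)):
        if key in info and float(info[key]) < minimum:
            failed.append(key)
    return failed


def carrier_count(sample_fields):
    """Count samples whose genotype (GT) has at least one non-reference allele.

    Each element is a VCF sample column such as '0/1:10,5'; GT is assumed to
    be the first colon-separated field.
    """
    count = 0
    for sample in sample_fields:
        gt = sample.split(":", 1)[0]
        alleles = gt.replace("|", "/").split("/")
        if any(allele not in ("0", ".") for allele in alleles):
            count += 1
    return count
```

Calling `failed_filters` on a record's INFO column lists which cutoffs it misses, and `carrier_count` on its sample columns shows whether the variant is unique to one sample. Whether dropping non-unique variants is appropriate depends on the study design, as noted above.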
