Does Unified Genotyper use information from soft clipped portions of reads?

Hello all,

I have put 9 samples through a variant calling pipeline that closely follows the standard recommended one, and I am puzzling over some insertion calls from GATK 2.3-9. They are called as het in samples whose reads appear to show no support for the insertion. There are some interesting features:

  • several of the other samples called concurrently contain reads supporting the insertion
  • there are two reads in the sample in question that were soft clipped, and the soft clipped portion could be realigned in such a way as to support the insertion
  • if I call the sample on its own, no insertion call is made

It seems to me that GATK is using information from the soft clipped portions of the reads (and performing a realignment on them) and, in addition, is inferring evidence for the insertion from haplotyping and linkage with other samples. I couldn't find anything in the documentation about how soft clipping is handled, so I was wondering if anybody could clarify whether soft clipped regions are used and, of course, share any other thoughts on how calls like this might arise.

Many thanks!


Best Answers


  • ssadedin (Member)

    Hi Geraldine,

    Thanks for the answer. If GATK is using the soft clipped portions of the reads, then that definitely helps to explain what I'm seeing.

    However I'm still puzzled why I get the calls if I call this sample along with other samples but not if I call it by itself. Can you give me a hint about how this interaction between samples happens?


  • ssadedin (Member)

    @Geraldine_VdAuwera said:
    The UG keeps track of when the same variant is present in many samples, which it takes to mean that the variant is more likely to be real than an artefact. ... Maybe that's what's happening with your samples?

    That definitely would explain this, and I can see it would increase sensitivity. It poses a bit of a conundrum for us because in some sense it means our results are not reproducible at the sample level, only when we call all the original samples together. I would be interested to know if there's a way to disable this feature (at the cost, obviously, of sensitivity).

    Thanks again for your help!
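To make the sample interaction described above concrete, here is a toy sketch of how a shared allele-frequency prior can flip a genotype call for a sample with weak evidence. It is not the UnifiedGenotyper's actual model: the Hardy-Weinberg prior, the function names, the likelihood values and the two allele frequencies are all assumptions chosen for illustration.

```python
def genotype_posteriors(likelihoods, alt_freq):
    """Combine per-genotype likelihoods with a Hardy-Weinberg prior.

    likelihoods: P(reads | genotype) for (0/0, 0/1, 1/1) -- hypothetical values.
    alt_freq:    assumed frequency of the alternate allele (illustrative only).
    """
    ref_freq = 1.0 - alt_freq
    priors = (ref_freq * ref_freq, 2 * ref_freq * alt_freq, alt_freq * alt_freq)
    joint = [lik * prior for lik, prior in zip(likelihoods, priors)]
    total = sum(joint)
    return [value / total for value in joint]


def best_genotype(posteriors):
    """Return the label of the genotype with the highest posterior."""
    labels = ("0/0", "0/1", "1/1")
    return labels[max(range(len(labels)), key=lambda i: posteriors[i])]


# Weak, ambiguous evidence, e.g. from a couple of soft clipped reads (made up).
weak_evidence = (0.2, 0.7, 0.1)

# Called alone: nothing else suggests the insertion exists, so the prior is low.
alone = genotype_posteriors(weak_evidence, alt_freq=0.001)

# Called with other samples that carry the insertion: the prior is much higher.
with_cohort = genotype_posteriors(weak_evidence, alt_freq=0.3)

print("alone:      ", best_genotype(alone))
print("with cohort:", best_genotype(with_cohort))
```

The read evidence is identical in both runs; only the prior changes, which is why the same sample can get a het call in a joint run and no call on its own.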

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    No, if you don't want this feature you'll need to run your samples separately.

    Glad to be of help!
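Running the samples separately could be scripted along these lines. This is only a sketch: the jar name, the file names and the output naming scheme are hypothetical, restricting the model to indels with `-glm INDEL` is an assumption based on the question being about insertions, and the arguments should be checked against the UnifiedGenotyper documentation for your GATK version.

```python
def per_sample_commands(bams, reference, gatk_jar="GenomeAnalysisTK.jar"):
    """Build one UnifiedGenotyper command line per BAM file.

    All paths are placeholders; nothing is executed here.
    """
    commands = []
    for bam in bams:
        # Hypothetical naming scheme: sample1.bam -> sample1.indels.vcf
        output_vcf = bam.rsplit(".", 1)[0] + ".indels.vcf"
        commands.append([
            "java", "-jar", gatk_jar,
            "-T", "UnifiedGenotyper",
            "-R", reference,
            "-I", bam,          # a single sample per run, so no cross-sample prior
            "-glm", "INDEL",    # assumption: only indel calls are of interest
            "-o", output_vcf,
        ])
    return commands


if __name__ == "__main__":
    for command in per_sample_commands(["sample1.bam", "sample2.bam"], "ref.fasta"):
        print(" ".join(command))
```

Each printed line can be run as is, or the lists can be passed to `subprocess.run` once the paths point at real files.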
