Bug Bulletin: we have identified a bug that affects indexing when producing gzipped VCFs. This will be fixed in the upcoming 3.2 release; in the meantime you need to reindex gzipped VCFs using Tabix.
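
A minimal sketch of that reindexing step, assuming the VCF is bgzip-compressed and that pysam (or the tabix binary it wraps) is installed; the file name calls.vcf.gz is illustrative:

    # Rebuild the .tbi index for a bgzip-compressed VCF.
    # Equivalent to running `tabix -p vcf calls.vcf.gz` on the command line.
    import pysam

    # force=True overwrites the stale index left behind by the buggy writer.
    pysam.tabix_index("calls.vcf.gz", preset="vcf", force=True)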

reduceReads removes reads that should not be removed

vplagnol (Member)

I am having issues making sense of the behaviour of ReduceReads in an example where I know the call is real (Sanger validated...). In the non-reduced pileup, the variant is pretty solid:

11 47364249 G 12 aA..A.A,.A.^~.

The compressed pileup looks like:

11 47364249 G 2 .A

which I guess is OK, but the depth associated with the A call is 1 instead of the expected 5 (as evidenced when I run samtools view). Accordingly, after running the UG, I get 7 reads supporting the reference and 1 supporting the alternative (0/1:7,1:8:15:15,0,203), but really it should be 7/5. I am losing 4 precious reads, and their loss turns this variant into a missing call.
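
For reference, a quick sketch of how I tally the 7/5 count from the bases column of the non-reduced pileup line (the count_pileup helper is mine, and it handles only the features present in this example, not the full pileup grammar):

    # Count reference matches (. and ,) versus a given alt allele in a
    # samtools pileup bases column. Handles only the read-start marker (^)
    # plus match/mismatch characters, which is all this example contains.
    def count_pileup(bases, alt):
        ref = alt_n = 0
        i = 0
        while i < len(bases):
            c = bases[i]
            if c == "^":        # read-start marker: also skip the mapping-quality char
                i += 2
                continue
            if c in ".,":
                ref += 1
            elif c.upper() == alt:
                alt_n += 1
            i += 1
        return ref, alt_n

    print(count_pileup("aA..A.A,.A.^~.", "A"))  # -> (7, 5): 7 ref reads, 5 alt reads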

I really can't make sense of it. I tweaked all the options I could find to keep as many reads as possible when reducing: -minqual 0 -minmap 0 --dont_hardclip_low_qual_tails -noclip_ad

I am using GenomeAnalysisTK-2.7-4, and I can upload a slice of the BAM file to illustrate the problem. I also attach my debugging code (as a txt file) in case someone wants to check whether I am missing a key option.

Is that a bug? Or am I missing something obvious?

Attachment: debug_reduce.txt (763 B)


Answers

  • ebanks (GSA Member, mod)

    Hi there, how do you know that the reduced depth is only 1 at that position? (FYI, if you look at it in IGV it will tell you the representative depth.)

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • vplagnol (Member)

    Good question. I guessed it because, when running samtools view, there is a long set of values at the end of each reduced read, which looks like this:

    RR:B:c,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1, ...

    and for the other synthetic read:

    RR:B:c,1,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,

    so I presumed that this must be the depth at every position; it has to be stored somewhere, and the numbers look right (also consistent with the pileup).

  • ebanks (GSA Member, mod)

    Good. Those numbers are the depths relative to the first position in the read, so a 1 in the middle of the read means "1 more than the number in the first position". It looks, then, like you have a depth of 3 at the position in question, right?

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT
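
    In code, that decoding would look like the following minimal sketch (taking the description above at face value; decode_rr and the short sample tag are made up for illustration, not taken from GATK):

        # Decode an RR:B:c tag under the interpretation above: the first
        # value is the depth at the read's first position, and every later
        # value is an offset added to that first value.
        def decode_rr(values):
            first = values[0]
            return [first] + [first + v for v in values[1:]]

        # Illustrative short tag: an offset of 2 implies a depth of 3.
        print(decode_rr([1, 0, 0, 1, 2, 2, 1]))  # -> [1, 1, 1, 2, 3, 3, 2]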

  • vplagnol (Member)

    So the read with the 1s and 2s is the one with the reference call; I'm pretty sure of that. It eventually goes up to 6-7, which matches the reference read count of 7. So that one makes sense.

    The problem is the second read. I will paste the full sequence below, but it never goes above 1, so I presume the depth varies between 1 and 2. Nothing like the 5 I expect.

    RR:B:c,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0

  • Geraldine_VdAuwera (Administrator, GSA Member)

    Hi Vincent,

    Try writing the reads that end up missing to a separate file and validate that file, or run them through RR and look for the filter summary. This is to check if there is something wrong with those reads that leads to them getting filtered out somehow.

    Geraldine Van der Auwera, PhD
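
    A sketch of those two checks, assuming a combined picard.jar and a GATK 2.x-style invocation; the jar paths and file names are placeholders, not the poster's actual setup:

        # Two sanity checks on the snippet of reads, as suggested above.
        import subprocess

        # 1. Validate the BAM with Picard (older Picard releases ship a
        #    standalone ValidateSamFile.jar instead of a combined jar).
        subprocess.run(["java", "-jar", "picard.jar", "ValidateSamFile",
                        "INPUT=snippet.bam"], check=True)

        # 2. Re-run ReduceReads and inspect the MicroScheduler filter
        #    summary that GATK prints at the end of the traversal.
        subprocess.run(["java", "-jar", "GenomeAnalysisTK.jar",
                        "-T", "ReduceReads",
                        "-R", "ref.fasta",
                        "-I", "snippet.bam",
                        "-o", "snippet.reduced.bam"], check=True)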

  • vplagnol (Member)

    Thanks Geraldine, on it now. Sorry to be dumb, but what validation do you mean: the Picard validation tools, or something that is part of GATK? By filter summary, I presume you mean I should extract my 12 reads (7 reference and 5 alternate) and run ReduceReads? I get the following very clean output:

    INFO 18:35:23,731 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 12 total reads (0.00%)
    INFO 18:35:23,731 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter
    INFO 18:35:23,731 MicroScheduler - -> 0 reads (0.00% of total) failing DuplicateReadFilter
    INFO 18:35:23,731 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
    INFO 18:35:23,732 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
    INFO 18:35:23,732 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
    INFO 18:35:23,732 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter
    INFO 18:35:24,964 GATKRunReport - Uploaded run statistics report to AWS S3

    So no filtering is happening.

  • Geraldine_VdAuwera (Administrator, GSA Member)

    Yep, that's what I meant, and that is indeed a very clean output. So filtering's not the issue, OK. By validation I meant Picard ValidateSamFile, yes. If that also runs cleanly, I'll ask you to upload your snippet with the 12 reads so we can have a closer look at it.

    Geraldine Van der Auwera, PhD

  • vplagnol (Member)

    Yes, Picard validation is OK too. Every single read gives me:

    ERROR: Read name HWUSI-EAS483_0025_FC:1:4:2500:9893#TGACCA, Mate not found for paired read

    but that happens for all the reads, simply because I sliced only one read of each pair. I presume this is normal.

    I uploaded the BAM file plus the test scripts to your public FTP just now. The file name is Plagnol_debug_GATK_team.tar.gz.

    Hopefully it is all straightforward; I put the arguments at the top. Just run the ReduceReads command line and you will see what I mean. I am guessing this is a bug, but I give it a 50% chance that it's me being stupid.

  • Geraldine_VdAuwera (Administrator, GSA Member)

    Good news, the devs have a proposed fix for this bug. It might take a little longer than usual to go through code review and into the codebase due to the impending holiday, but it should make it into the nightlies early next week.

    Geraldine Van der Auwera, PhD

  • vplagnol (Member)

    Thanks Geraldine. Obviously curious to hear the reason. Let me know as soon as the fix has made its way to the nightly builds. That will make some clinicians happy.

  • ebanks (GSA Member, mod)

    Mapping Quality was being treated as a byte instead of an int (which is fine for most aligners, but not Novoalign).

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT
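
    To illustrate the failure mode (my reading of that fix, not the devs' own write-up): Novoalign can emit mapping qualities above 127, and a signed 8-bit byte wraps such values to negative numbers, at which point even a permissive minimum-mapping-quality check would discard the read. A minimal sketch of the wraparound:

        # What storing a large MAPQ in a signed 8-bit byte does to it.
        def as_signed_byte(mapq):
            return (mapq + 128) % 256 - 128  # two's-complement wraparound

        print(as_signed_byte(60))   # ->   60: fine for aligners that cap MAPQ at 60
        print(as_signed_byte(150))  # -> -106: a Novoalign-style MAPQ goes negative
                                    #    and fails any minimum-MAPQ threshold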

  • vplagnol (Member)

    Thanks Eric. I did indeed suspect it was quite specific to my setup (otherwise it would have been found before).
