

Member Posts: 51

I am having issues making sense of the behaviour of ReduceReads in an example where I know the call is real (Sanger-validated).
In the non-reduced pileup the variant is pretty solid:

11 47364249 G 12 aA..A.A,.A.^~.

The compressed pileup looks like:
11 47364249 G 2 .A

which I guess is OK, but the depth associated with the A call is 1 instead of the expected 5 (as evidenced when I run samtools view).
Accordingly, after running the UnifiedGenotyper, I get 7 reads supporting the reference and 1 supporting the alternative:
0/1:7,1:8:15:15,0,203
But really it should be 7/5. I am losing 4 precious reads, which turns this variant call into a miss.
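For what it's worth, the 7/5 split can be read straight off the non-reduced pileup string. Below is a minimal sketch (a helper of my own, not a GATK or samtools tool) that tallies reference matches (`.`/`,`) and a given alt base, skipping the mapping-quality character that follows each `^` read-start marker; indels and other pileup symbols are ignored for simplicity:

```python
# Hypothetical helper to tally a samtools pileup base column.
# '.' and ',' are reference matches; '^' starts a read and is
# followed by one mapping-quality character; '$' ends a read.
def tally_pileup(bases, alt):
    counts = {"ref": 0, "alt": 0}
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":            # skip the mapping-quality char that follows
            i += 2
            continue
        if c in ".,":
            counts["ref"] += 1
        elif c.upper() == alt:
            counts["alt"] += 1
        i += 1                  # '$' and anything else just falls through
    return counts

print(tally_pileup("aA..A.A,.A.^~.", "A"))  # → {'ref': 7, 'alt': 5}
print(tally_pileup(".A", "A"))              # → {'ref': 1, 'alt': 1}
```

On the strings quoted above, the non-reduced pileup gives exactly the expected 7 reference / 5 alternate reads, while the reduced pileup gives 1/1.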

I really can't make sense of it. I tweaked all the options I could find to keep as many reads as possible when reducing:
-minqual 0 -minmap 0 --dont_hardclip_low_qual_tails -noclip_ad

I am using GenomeAnalysisTK-2.7-4, and I can upload a slice of the BAM file to illustrate the problem.
I also attach my debugging code (as a txt file) if someone wants to see if I am missing a key option.

Is that a bug? Or am I missing something obvious?


Hi there,
How do you know that the reduced depth is only 1 at that position? (FYI If you look at it in IGV it will tell you the representative depth)

Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT

• Member Posts: 51

Good question. I inferred it from the samtools view output: there is a long set of values at the end of each synthetic read, which looks like this:
RR:B:c,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1, ...

and for the other synthetic read
RR:B:c,1,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,1,1,1,1,

so I presumed this must be the depth at every position: it has to be stored somewhere, and the numbers look right (they are also consistent with the pileup).

Good. Those numbers are the depths relative to the first position in the read. So a 1 in the middle of the read means "1 more than the number in the first position". It looks, then, like you have a depth of 3 at the position in question, right?
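To make the encoding concrete, here is a small sketch of that decoding as I understand it from the explanation above: the first value is taken as the depth at the read's first position, and every later value as an offset added to it. `decode_rr` is a hypothetical helper, not part of any GATK API:

```python
def decode_rr(tag):
    """Turn an RR:B:c tag string into absolute per-position depths."""
    values = [int(x) for x in tag.split(",")[1:]]   # drop the "RR:B:c" prefix
    first = values[0]
    # positions after the first are offsets relative to the first value
    return [first] + [first + delta for delta in values[1:]]

# A shortened example in the spirit of the arrays above:
depths = decode_rr("RR:B:c,1,0,0,0,1,1,2,2,1")
print(depths)       # → [1, 1, 1, 1, 2, 2, 3, 3, 2]
print(max(depths))  # → 3
```

Under this reading, a read whose deltas peak at 2 represents a maximum depth of 3, and a read whose deltas never exceed 1 represents a maximum depth of 2.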

Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT

• Member Posts: 51

So the read with the 1s and 2s is the one with the reference call; I'm pretty sure of that. It eventually goes up to 6-7, which matches the reference read count of 7. So that one makes sense.

The problem is the second read. I will paste the full array below, but it never goes above 1, so the depth varies between 1 and 2, I presume. Nothing like the 5 I expect.

RR:B:c,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0

Hi Vincent,

Try writing the reads that end up missing to a separate file and validate that file, or run them through RR and look for the filter summary. This is to check if there is something wrong with those reads that leads to them getting filtered out somehow.

Geraldine Van der Auwera, PhD

• Member Posts: 51
edited November 2013

Thanks Geraldine, on it now. Sorry to be dense, but what validation do you mean: the Picard validation tools, or something that is part of GATK?
By filter summary, I presume you mean extract my 12 reads (7 reference and 5 alternate) and run ReduceReads on them? I get the following very clean output:

INFO 18:35:23,731 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 12 total reads (0.00%)
INFO 18:35:23,731 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter
INFO 18:35:23,731 MicroScheduler - -> 0 reads (0.00% of total) failing DuplicateReadFilter
INFO 18:35:23,731 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter
INFO 18:35:23,732 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
INFO 18:35:23,732 MicroScheduler - -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter
INFO 18:35:23,732 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter
INFO 18:35:24,964 GATKRunReport - Uploaded run statistics report to AWS S3

So no filtering is happening.
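As an aside, a filter summary like the one above is easy to check programmatically. This is a quick sketch against the log format as it appears here; the regex is my own, and the layout is simply what MicroScheduler happened to print, not a documented interface:

```python
import re

def parse_filter_summary(log):
    """Map each read-filter name to the number of reads it removed."""
    pattern = re.compile(r"->\s+(\d+) reads \([\d.]+% of total\) failing (\w+)")
    return {name: int(count) for count, name in pattern.findall(log)}

log = (
    "INFO 18:35:23,731 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter\n"
    "INFO 18:35:23,731 MicroScheduler - -> 0 reads (0.00% of total) failing DuplicateReadFilter\n"
)
summary = parse_filter_summary(log)
print(summary)                                # → {'BadCigarFilter': 0, 'DuplicateReadFilter': 0}
print(all(n == 0 for n in summary.values()))  # → True: nothing was filtered
```

A non-zero count for any filter would point at reads being silently dropped before reduction, which the clean summary above rules out.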

Yep, that's what I meant, and that is indeed a very clean output. So filtering is not the issue, OK. By validation I meant Picard ValidateSamFile, yes. If that also runs cleanly, I'll ask you to upload your snippet with the 12 reads so we can have a closer look at it.

Geraldine Van der Auwera, PhD

• Member Posts: 51
edited November 2013

Yes, Picard validation is OK too.

But that applies to all the reads, simply because I only sliced one read of each pair. I presume this is normal.

I uploaded the BAM file plus the test scripts to your public FTP just now. The file name is
Plagnol_debug_GATK_team.tar.gz

Hopefully it is all straightforward; I put the arguments at the top. Just run the ReduceReads command line and you will see what I mean.
I am guessing this is a bug, but I give it a 50% chance that it's me being stupid.

Good news, the devs have a proposed fix for this bug. It might take a little longer than usual to go through code review and into the codebase due to the impending holiday, but it should make it into the nightlies early next week.

Geraldine Van der Auwera, PhD

• Member Posts: 51

Thanks Geraldine. Obviously curious to hear the reason. Let me know as soon as the fix has made its way to the nightly builds. That will make some clinicians happy.