Why does -dcov with PrintReads not filter out any reads from my amplicon data?

kjpr London Member
edited October 2015 in Ask the GATK team

This question seems to have been asked before (http://gatkforums.broadinstitute.org/discussion/3361/dcov-on-a-bam-file-to-generate-bam-file-output), but after reading to the end of the thread I did not see an answer to the final question: if you have amplicon data in which many reads all start at the same position, why does the dcov setting not filter down these reads?

I have run PrintReads as such:

    java -Xmx20g \
        -jar GenomeAnalysisTK.jar \
        -T PrintReads \
        -R GRCh37.fa \
        -I examplesort.bam \
        -o exampledownsample.bam \
        -dcov 1

and the output given is:
    INFO 16:57:05,223 ProgressMeter - Total runtime 262.65 secs, 4.38 min, 0.07 hours
    INFO 16:57:05,228 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 5965722 total reads (0.00%)
    INFO 16:57:05,229 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter
    INFO 16:57:05,229 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter
    INFO 16:57:06,215 GATKRunReport - Uploaded run statistics report to AWS S3

When I look at the reads in certain highly covered regions, this is what I see:

[image]

Is there a reason why these reads are not being filtered down? Possibly I am not understanding how the dcov function works.

Answers

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Hi,

    A couple of misunderstandings here. First, the downsampling system is distinct from the filtering system, so reads discarded due to downsampling would not be reported in the filtering summary anyway. Also, filtering is done upfront and reported overall, whereas downsampling is done per site or region (depending on the tool), so it cannot be clearly reported as an overall result of the run.

    But you should see fewer reads in the output of that command, unless I'm missing something. Can you show the region viewed after downsampling?
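
The distinction above can be illustrated with a toy model. This is only a sketch and not GATK's actual code: the function `run_traversal`, the dictionary-based reads, and the stand-in filter are all hypothetical. It shows why a filtering summary can report 0 reads filtered even when a separate downsampling stage discards reads.

```python
def run_traversal(reads, filters, downsample):
    """Toy two-stage model: filter failures are counted, downsampling is not."""
    # One counter per named filter, like the per-filter lines in a run summary.
    filtered_counts = {name: 0 for name in filters}
    passed = []
    for read in reads:
        # Find the first filter this read fails, if any.
        failed = next((name for name, keep in filters.items() if not keep(read)), None)
        if failed is None:
            passed.append(read)
        else:
            filtered_counts[failed] += 1
    # Downsampling runs as a separate stage and never touches the counters.
    kept = downsample(passed)
    return kept, filtered_counts


# Five well-formed reads stacked on the same start position (hypothetical data).
reads = [{"name": f"r{i}", "start": 100, "cigar": "50M"} for i in range(5)]
# Hypothetical stand-in for a real read filter: reject reads with an empty CIGAR.
filters = {"BadCigarFilter": lambda r: r["cigar"] != ""}

kept, summary = run_traversal(reads, filters, downsample=lambda rs: rs[:1])
print(len(kept), summary)
```

In this model four of the five reads are dropped by the downsampling stage, yet the summary still shows zero reads failing the filter, because only filter failures are counted.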

  • kjpr London Member

    Thank you Geraldine, that at least explains the filtering summary. However, there is no reduction in my file size; if anything, it has slightly increased, from 838.9 MB to 856.8 MB. When I view both BAMs, I see no reduction in these high-coverage regions.
    Here are the two regions before and after (dcov 1). As you can see, the overall read depth is the same: 38722 reads.

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Hmm, now that I think about it, the dcov functionality may not work the way you expect for this use case. The system may not be able to write out a downsampled BAM; it's only intended to handle short regions internally. But the dfrac function should work. That's a fractional downsampling, so it may not be what you want, though.
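
To make the idea of fractional downsampling concrete, here is a minimal sketch of the concept in plain Python. It is not GATK's implementation: `downsample_fraction` and the tuple-based reads are hypothetical, and the assumption is simply that each read is kept independently with the same probability.

```python
import random


def downsample_fraction(reads, fraction, seed=0):
    """Keep each read independently with probability `fraction` (conceptual sketch)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # seeded so repeated runs give the same subset
    return [read for read in reads if rng.random() < fraction]


# Hypothetical amplicon data: one very deep amplicon and one shallow one.
reads = [("ampliconA", i) for i in range(40000)] + [("ampliconB", i) for i in range(400)]
subset = downsample_fraction(reads, 0.01)

depth_a = sum(1 for amplicon, _ in subset if amplicon == "ampliconA")
depth_b = sum(1 for amplicon, _ in subset if amplicon == "ampliconB")
print(depth_a, depth_b)
```

Because every read has the same chance of being kept, the deep amplicon stays roughly 100 times deeper than the shallow one after downsampling. The overall depth goes down, but the peaks are not levelled, which is why a fraction alone may not be what is wanted for amplicon data.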

  • kjpr London Member

    OK, that's a shame, but thank you for getting back to me so quickly. The sample I am trying to downsample is amplicon data, and I am trying to level off the peaks in read depth in high-coverage regions so that I can then use dfrac to generate downsampled BAMs that reflect a specific coverage across all regions. Any suggestions on how I might do this?

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    Not in any straightforward way, sorry. Others may have tools to do this that I'm not thinking of. Good luck!
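
For readers with the same need, one possible approach to levelling the peaks is to cap the number of reads kept per alignment start position. The sketch below is not a GATK feature and was not suggested in this thread; `cap_per_start` is a hypothetical helper, reads are assumed to be simple `(contig, start, name)` tuples already parsed from the BAM by some other tool, and paired-end mates are not kept together. It uses reservoir sampling so that each read at a position has an equal chance of surviving.

```python
import random
from collections import defaultdict


def cap_per_start(reads, max_per_start, seed=0):
    """Keep at most `max_per_start` reads per (contig, start), chosen uniformly."""
    rng = random.Random(seed)
    reservoirs = defaultdict(list)  # (contig, start) -> reads currently kept
    seen = defaultdict(int)         # (contig, start) -> reads seen so far
    for read in reads:
        key = (read[0], read[1])
        seen[key] += 1
        bucket = reservoirs[key]
        if len(bucket) < max_per_start:
            bucket.append(read)
        else:
            # Reservoir sampling: replace a kept read with probability cap / seen.
            j = rng.randrange(seen[key])
            if j < max_per_start:
                bucket[j] = read
    kept = [read for bucket in reservoirs.values() for read in bucket]
    kept.sort(key=lambda r: (r[0], r[1]))  # restore coordinate order
    return kept


# Hypothetical amplicon pile-up: 1000 reads at one start, 3 at another.
reads = [("chr1", 100, f"peak{i}") for i in range(1000)]
reads += [("chr1", 500, f"low{i}") for i in range(3)]

levelled = cap_per_start(reads, max_per_start=10)
print(len(levelled))  # 13: ten from the peak plus all three low-coverage reads
```

Positions that are already below the cap are left untouched, so only the peaks are reduced. A fractional downsampling step could then be applied to the levelled reads to reach a target overall coverage.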
