Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Attention:
We will be out of the office on October 14, 2019, due to the U.S. holiday. We will return to monitoring the forum on October 15.

Cannot downsample using printReads an -dcov parameter. No effect over original bam observed.

David_UnaDavid_Una SpainMember

Cannot downsample using printReads an -dcov parameter (and -dt ALL_READS or BY_SAMPLE). No effect over original bam observed. Only -dfrac seems work. I'm using GATK 3.1-1-g07a4bf8. I read it is posible. What could be the problem?

Thanks a lot.

 David.
Tagged:

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera admin Cambridge, MAMember, Administrator, Broadie admin

    Hi David,

    Could you please tell me a little more about what you tried and what you observed? Specifically, do you have a before/after screenshot of a region that you think should get downsampled at a certain value of dcov (please specify what that is), but doesn't?

  • droazendroazen ✭✭ Cambridge, MAMember, Broadie, Dev ✭✭

    From the -dcov documentation:

    The downsampling process takes two different forms depending on the type of analysis it is used with. For locus-based traversals (LocusWalkers like UnifiedGenotyper and ActiveRegionWalkers like HaplotypeCaller), downsample_to_coverage controls the maximum depth of coverage at each locus. For read-based traversals (ReadWalkers like BaseRecalibrator), it controls the maximum number of reads sharing the same alignment start position. For ReadWalkers you will typically need to use much lower dcov values than you would with LocusWalkers to see an effect.

    PrintReads is a read-based traversal, so it falls into the second case.

  • wchenwchen Member

    Can I use -dfrac 1 in indel realignment and base recalibration for amplicon based sequencing data? Thanks!

  • wchenwchen Member

    @droazen

    What is the default dcov value? Can I use -dfrac 1 in indel realignment and base recalibration for amplicon based sequencing data? Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera admin Cambridge, MAMember, Administrator, Broadie admin

    @‌wchen

    Each tool has its own -dcov downsampling defaults; see the per-tool docs for each. You can override those defaults by setting either -dcov or -dfrac to the desired value.

    Using -dfrac 1 will not do anything. If you want to turn off downsampling, you can use -dt NONE. But note that we don't recommend modifying the defaults for HaplotypeCaller because of some unwanted interactions between the engine downsampler and the HC's internal requirements.

  • KStammKStamm Member

    Same problem with -dcov still seems to exist with GATK version 3.2-2-gec30cee

    I want to downsample a bam to a given coverage depth. This is like whole-exome, where we have many regions of zero depth, and some of 100 or sometimes 1,000 reads over an exon. I need to cap high coverage regions to about 100. Because the coverage is so varied, a direct -dfrac 10% (or Picard downsample by fraction, or samtools downsample by fraction) won't treat regions of 100 and 1000 the way I want to.

    I found PrintReads -dcov, which seems to do what I want, cap any site to a given coverage. It doesn't seem to work.

    I used this command:

    java -Xmx4g -jar $JAR \
    -R $REF \
    -T PrintReads \
    -o sub.bam \
    -I input.bam \
    -dcov 200

    It runs for a moment and the output file is the same as the input file. It started with 367312 reads mapped out of 395908, and the output is also 367312 reads mapped out of 395908.

    So it doesn't look like PrintReads did anything with the -dcov.

    The run log is attached.

  • pdexheimerpdexheimer ✭✭✭✭ Member ✭✭✭✭

    @KStamm‌ -

    Given droazen's earlier comment in this thread, I'm not too surprised. It sounds like you don't have any locations where 250 reads start at the same place, which is what -dcov looks for in PrintReads (and all other read-based walkers). Assuming that you have uniform-ish coverage in your "exons", you would probably want to specify 250 / (read length)

  • KStammKStamm Member

    @pdexheimer‌ I shouldn't have said it was like whole-exome.

    samtools view input.BAM | cut -f4 | sort | uniq | wc
        794     794    7291
    

    These 395 thousand reads have in total less than 800 alignment start positions (so if they were uniform, about 500x coverage, but in reality some are 50x and other 10,000x). Either way, shocking that the input and output files have the same read count. the dcov didn't trigger anywhere.

    I'll try a lower threshold to confirm.

    samtools flagstat   input.BAM | head -n 1
    395908 + 0 in total (QC-passed reads + QC-failed reads)
    
     java -Xmx4g -jar $JAR \
       -R $REF \
       -T PrintReads \
       -o output_100.BAM \
       -I input.BAM \
       -dcov 100
    
    samtools flagstat   output_100.BAM | head -n 1
    395908 + 0 in total (QC-passed reads + QC-failed reads)       
    
     java -Xmx4g -jar $JAR \
       -R $REF \
       -T PrintReads \
       -o output_002.BAM \
       -I input.BAM \
       -dcov 2
    
    samtools flagstat   output_002.BAM | head -n 1
    395889 + 0 in total (QC-passed reads + QC-failed reads)
    

    Haha! wow, nineteen reads dropped. Okay, I'll try to go lower than dcov=2.

     java -Xmx4g -jar $JAR \
       -R $REF \
       -T PrintReads \
       -o output_0002.BAM \
       -I input.BAM \
       -dcov 0.2
    
     ERROR MESSAGE: Invalid command line: Failed to parse value org[email protected]2937e38c for argument downsampleCoverage.  This is most commonly caused by providing an incorrect data type (e.g. a double when an int is required)
    

    Nope. There's definitely something going on.

  • Geraldine_VdAuweraGeraldine_VdAuwera admin Cambridge, MAMember, Administrator, Broadie admin

    The usage of -dcov for non-locus walkers is kind of tricky. There's probably another subtlety at work here which hopefully @droazen‌ can help us address when he gets back from his vacation. But to be clear, your last attempt (with 0.2) is not working for a good reason: decimal values are not a valid value type for that argument (as stated in the error message). Nothing fishy going on there.

  • TechnicalVaultTechnicalVault ✭✭✭ Cambridge, UKMember ✭✭✭

    Be warned that last time I tried -dcov I realised that it doesn't downsample pairs, it downsamples individual reads, so you may end up with unpairable reads in your output BAMs. If you're trying to simulate lower coverages by running all the way through your pipeline this is a "bad thing" as this will mess with your mapper.

Sign In or Register to comment.