
Downsampling Experiment

TristanTristan La Jolla, CAMember ✭✭


Trying to downsample in an orderly fashion in the name of experimentation, and in doing so would like to specify just one chromosome for the experiment - so I picked chromosome 17 with -L and a coverage of 30x with -dcov 30. This came up:

ERROR MESSAGE: Locus-based traversals (ie., Locus and ActiveRegion walkers) require a minimum -dcov value of 200 when downsampling to coverage. Values less than this can produce problematic downsampling artifacts while providing only insignificant improvements in memory usage in most cases.

I was hoping to poke through HaplotypeCaller results at many different simulated depths of coverage for several samples. I read that one can use -dfrac instead, and that it might even be more appropriate, but since I want to know what level of coverage leads to what level of results, -dfrac feels much less specific: it appears to toss a fraction of however many reads were at a given position, rather than tossing reads above a certain coverage. So with -dfrac, I could say my sample averaged 30x on this chromosome, I tossed half the reads, and therefore I've theoretically simulated 15x depth of coverage...

Which approach would be more representative of reality? Using -dfrac to simulate a certain depth of coverage, or -dcov assuming I didn't have the 200 restriction?
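To make the distinction concrete, here is a minimal Python sketch of the two strategies (this is a toy model with made-up depths, not GATK's actual implementation): -dcov caps each locus at a fixed depth, flattening coverage variation, while -dfrac thins every read independently, scaling the whole depth distribution down proportionally.

```python
import random

random.seed(0)

# Hypothetical per-locus read depths for one chromosome (made-up data):
# coverage varies around a mean, as it would in a real BAM.
depths = [random.randint(5, 60) for _ in range(10_000)]

def dcov_style(depth, cap):
    """-dcov semantics (toy model): keep at most `cap` reads at each locus."""
    return min(depth, cap)

def dfrac_style(depth, fraction):
    """-dfrac semantics (toy model): keep each read with probability `fraction`."""
    return sum(1 for _ in range(depth) if random.random() < fraction)

capped = [dcov_style(d, 15) for d in depths]
thinned = [dfrac_style(d, 0.5) for d in depths]

def mean(xs):
    return sum(xs) / len(xs)

print(f"original mean depth:   {mean(depths):.1f}")
print(f"-dcov 15 mean depth:   {mean(capped):.1f}")   # deep loci all clipped to 15
print(f"-dfrac 0.5 mean depth: {mean(thinned):.1f}")  # roughly half the original mean
```

Under this model, -dfrac preserves the shape of the coverage distribution (deep regions stay relatively deep), which is why it behaves more like a genuinely lower-coverage sequencing run.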

Thanks for any help/discussion!

Best Answers


  • TristanTristan La Jolla, CAMember ✭✭

Thank you both! That does make more sense: "20x coverage" normally implies that some regions have 100x and others 5x, so finding the average coverage for my chromosome and setting -dfrac accordingly is the right approach. The only issue now is that I'm hoping to show this for a group of samples, and the depth of coverage varies per sample. It still works out reasonably, since most of my samples have the same DoC, except for the included NA12878, which is much deeper.

    I'm assuming I should use the -dt (--downsampling_type) BY_SAMPLE to ensure that only 10 or 50 percent of the reads per sample are kept, and I'm assuming that there isn't a way to specify a -dfrac for each sample individually.

One workaround I'm not aware of would be a tool that creates BAM files downsampled to a certain level of coverage (akin to -dfrac), sort of like the SelectVariants tool but for BAM files. Do we know of any such tool?

    Thanks again!

  • pdexheimerpdexheimer Member ✭✭✭✭

    -dfrac is an engine-level argument, so you could use it with PrintReads

  • TristanTristan La Jolla, CAMember ✭✭
    edited August 2013

Excellent, it appears that -L for intervals is also an engine-level argument. Very well, I can run PrintReads with -dfrac and -L to obtain samples from my chromosome of interest at the various depths of coverage. That should take a tad longer but result in better science. I'll share the results when we're done. Thank you both!
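For planning those PrintReads runs, the -dfrac value for each simulated depth follows directly from the chromosome's average coverage. A quick sketch (the 30x average is from this thread; the target depths are just example values):

```python
# Average depth of coverage for the chromosome of interest (from the thread).
average_depth = 30.0

# Hypothetical target depths to simulate; each -dfrac is target / average.
targets = [5, 10, 15, 20, 25]
dfracs = {t: round(t / average_depth, 3) for t in targets}

for target, frac in dfracs.items():
    # Each line corresponds to one PrintReads invocation
    # with -dfrac <frac> and -L for the chosen chromosome.
    print(f"target {target}x -> -dfrac {frac}")
```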

  • vyellapavyellapa Member
    edited December 2016

When using -dfrac to downsample a WGS BAM, can I set the seed value? I want to get multiple subsampled BAMs, each of which is different.

    Thank you,

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @vyellapa I think if you use --nonDeterministicRandomSeed that should do what you want, but I can't guarantee it. Try it out and let me know?
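To illustrate why the seed matters here, a toy Python model of per-read Bernoulli thinning (again, not GATK's actual implementation): the same fraction with different seeds keeps different read subsets, while repeating a seed reproduces the same subset exactly.

```python
import random

def thin_reads(read_ids, fraction, seed):
    """Keep each read with probability `fraction`, using a fixed seed (toy model)."""
    rng = random.Random(seed)
    return [r for r in read_ids if rng.random() < fraction]

reads = list(range(1000))  # stand-ins for read names in a BAM

a = thin_reads(reads, 0.5, seed=1)
b = thin_reads(reads, 0.5, seed=2)
c = thin_reads(reads, 0.5, seed=1)

print(len(a), len(b))  # both roughly 500
print(a == c)          # same seed reproduces the same subset
print(a == b)          # different seeds give different subsets
```

This is why a fixed default seed always yields the same downsampled BAM, and why varying the seed (as --nonDeterministicRandomSeed effectively does) produces distinct subsamples.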
