
# Downsampling Experiment

La Jolla, CA | Posts: 14

Hello!

I'm trying to downsample in an orderly fashion in the name of experimentation, and would like to restrict the experiment to just one chromosome, so I picked chromosome 17 with -L and a coverage of 30x with -dcov 30. This came up:

##### ERROR MESSAGE: Locus-based traversals (ie., Locus and ActiveRegion walkers) require a minimum -dcov value of 200 when downsampling to coverage. Values less than this can produce problematic downsampling artifacts while providing only insignificant improvements in memory usage in most cases.

I was hoping to poke through results from running HaplotypeCaller at many different simulated depths of coverage for several samples. I read that one can use -dfrac instead, and that it might even be more appropriate. However, I was hoping to find out what level of coverage led to what level of results, and -dfrac feels much less specific: it appears to toss a fraction of however many reads were at a given position, rather than tossing reads above a certain coverage. Thus with -dfrac, I could say that my sample had an average of 30x for this chromosome and I tossed half, so theoretically I've simulated 15x depth of coverage...

Which approach would be more representative of reality? Using -dfrac to simulate a certain depth of coverage, or -dcov assuming I didn't have the 200 restriction?

Thanks for any help/discussion!
-Tristan
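To make the difference between the two approaches concrete, here is a toy sketch in Python. It is not GATK's actual downsampling algorithm: `thin_by_fraction` simply keeps each read with a fixed probability (a stand-in for -dfrac), `cap_at_coverage` discards everything above a ceiling (a stand-in for -dcov), and the per-position depths are made up for illustration.

```python
import random

def thin_by_fraction(depths, fraction, rng):
    """Keep each read independently with probability `fraction` (toy stand-in for -dfrac)."""
    return [sum(1 for _ in range(d) if rng.random() < fraction) for d in depths]

def cap_at_coverage(depths, cap):
    """Discard reads above `cap` at each position (toy stand-in for -dcov)."""
    return [min(d, cap) for d in depths]

rng = random.Random(0)
# Hypothetical per-position depths with the uneven coverage typical of real data.
depths = [5, 12, 30, 30, 31, 45, 100]

by_fraction = thin_by_fraction(depths, 0.5, rng)
by_cap = cap_at_coverage(depths, 15)

print("original    ", depths)
print("fraction 0.5", by_fraction)  # low and high regions stay low and high
print("cap at 15   ", by_cap)       # every position at or above 15 is flattened to 15
```

In this toy model, thinning by a fraction preserves the relative unevenness of coverage, whereas capping flattens every well-covered position to the same depth, which a genuinely lower-coverage sequencing run would not do.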


• La Jolla, CA | Posts: 14

Thank you both! That does make more sense: "20x coverage" normally implies that some regions will have 100x and others 5x, so finding the average coverage for my chromosome and setting -dfrac accordingly makes the most sense. The only issue now is that I'm hoping to show this for a group of samples, and depth of coverage varies per sample. This is still reasonable, as most of my samples have the same DoC, except for the included NA12878, which is much deeper.

I'm assuming I should use the -dt (--downsampling_type) BY_SAMPLE to ensure that only 10 or 50 percent of the reads per sample are kept, and I'm assuming that there isn't a way to specify a -dfrac for each sample individually.

One way around this would be a tool that creates BAM files at only a certain level of coverage (akin to -dfrac), sort of like the SelectVariants tool but for BAM files. Do we know of any such tool?

Thanks again!
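If each sample ends up being downsampled separately, the per-sample fraction follows from the sample's own mean coverage. A minimal sketch, assuming the mean coverage of each sample over the region is already known; the sample names and coverage values below are hypothetical.

```python
def dfrac_for_target(mean_coverage, target_coverage):
    """Fraction of reads to keep so that the mean coverage drops to the target."""
    if mean_coverage <= 0:
        raise ValueError("mean_coverage must be positive")
    # A sample already below the target cannot be "upsampled", so cap at 1.0.
    return min(1.0, target_coverage / mean_coverage)

# Hypothetical mean depths of coverage on the chromosome of interest.
mean_coverage = {"sampleA": 30.0, "sampleB": 32.0, "NA12878": 60.0}
target = 15.0

fractions = {name: dfrac_for_target(cov, target) for name, cov in mean_coverage.items()}
for name, fraction in fractions.items():
    print(f"{name}: keep fraction {fraction:.3f} for ~{target:g}x")
```

The deeper sample simply gets a smaller fraction, so all samples land near the same simulated mean depth.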

• Posts: 544 ✭✭✭✭

-dfrac is an engine-level argument, so you could use it with PrintReads.

• La Jolla, CA | Posts: 14
edited August 2013

Excellent, it appears that -L for intervals is also an engine-level argument. Very well, I can run PrintReads with -dfrac and -L to obtain samples from my chromosome of interest at the various depths of coverage. That should take a tad longer but result in better science. I'll make sure to share the results when we're done. Thank you both!
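The plan above could be scripted roughly as follows. This is only a sketch that assembles GATK 3-style PrintReads command lines and prints them; the jar path, file names, interval name and fractions are all placeholders, and nothing is executed.

```python
def print_reads_command(bam, reference, interval, fraction, out_bam,
                        jar="GenomeAnalysisTK.jar"):
    """Assemble a PrintReads command that downsamples one interval by a fraction."""
    return [
        "java", "-jar", jar,
        "-T", "PrintReads",
        "-R", reference,
        "-I", bam,
        "-L", interval,            # restrict to the chromosome of interest
        "-dfrac", str(fraction),   # engine-level downsampling fraction
        "-o", out_bam,
    ]

# Placeholder inputs; adjust to the real files and contig naming.
commands = []
for fraction in (0.1, 0.5):
    out_bam = f"sample.chr17.dfrac{fraction}.bam"
    cmd = print_reads_command("sample.bam", "reference.fasta", "17", fraction, out_bam)
    commands.append(cmd)
    print(" ".join(cmd))
```

Each printed line can be checked by eye before being run, for example through `subprocess.run(cmd, check=True)`.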

• Posts: 33
edited December 2016

When using -dfrac to downsample a WGS BAM, can I set the seed value? I want to get multiple subsampled BAMs, each of which is different.

Thank you,
Teja

• Cambridge, MA | Posts: 11,743 | admin

@vyellapa I think if you use `--nonDeterministicRandomSeed` that should do what you want, but I can't guarantee it. Try it out and let me know?

Geraldine Van der Auwera, PhD
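Why the seed matters for getting distinct replicates can be illustrated with a toy Python example. It does not reproduce GATK's random number handling; it assumes only that a fixed seed makes the same keep/discard decisions on every run, which is why a non-deterministic seed would be needed to get different subsampled BAMs. The `subsample` helper and the read names are hypothetical.

```python
import random

def subsample(read_names, fraction, seed=None):
    """Keep each read with probability `fraction`, using a seeded generator."""
    rng = random.Random(seed)
    return [name for name in read_names if rng.random() < fraction]

reads = [f"read{i}" for i in range(1000)]

# Same seed: the two "runs" keep exactly the same reads.
fixed_a = subsample(reads, 0.5, seed=42)
fixed_b = subsample(reads, 0.5, seed=42)

# Different seeds: each replicate keeps a different subset.
replicate_1 = subsample(reads, 0.5, seed=1)
replicate_2 = subsample(reads, 0.5, seed=2)

print("same seed gives identical subsets:", fixed_a == fixed_b)
print("different seeds give identical subsets:", replicate_1 == replicate_2)
```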