The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

#### ☞ Get notifications!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

#### ☞ Did you remember to?

1. Search using the upper-right search box, e.g. using the error message.
2. Try the latest version of tools.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

#### ☞ Did we ask for a bug report?

Then follow instructions in Article#1894.

#### ☞ Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks (  ) each to make a code block as demonstrated here.

Picard 2.9.0 is now available. Download and read release notes here.
GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

# Downsampling with HaplotypeCaller

Washington DC

Hello,

I'm running the Haplotype Caller on 80 bams, multithreaded (-nt 14), with either vcfs or bed files as the intervals file. I am trying to downsample to approximately 200 reads (-dcov 200). However, my output vcfs clearly have much higher depth (over 300-500 reads) in some areas. Also, HC drastically slows down whenever it hits a region of very deep pile-ups (over 1000 reads).

Is there any way to speed up HC on regions of very deep pileups? Is there a way to make downsampling stricter?

Thank you for your help.
Elise

Tagged:

• Washington DC

I am running GATK v2.6-5-gba531bd, compiled 2013/07/18.

• Cambridge, MA

Hi Elise,

To clarify, the downsampling is applied per sample, so if you have 80 samples you should downsample quite a bit more than to -dcov 200, since that allows up to 200 reads per sample.

• Washington DC

Sorry, I wasn't clear. I am seeing over 300-500 reads per sample. And the regions of very deep pileups that I mentioned are over 1000 reads per sample.

• Cambridge, MA

I see. That shouldn't happen; downsampling can lead to a somewhat smaller number of reads than the targeted coverage in some conditions, but never more. Can you tell me how you're assessing the depth in the VCF? Also, were your bam files compressed with ReduceReads?

• Washington DC

I'm assessing the depth by using the DP annotation in the sample columns of the VCF. They seem to match with the AD columns as well. Can you see the screenshot that I uploaded with my original post? There were some examples there.

I thought that the problem might have been that there were multiple bams per sample, and HC was taking 200 reads per bam. However, even my samples with only one bam associated with them have coverage over 500 in some locations.

• Cambridge, MA

I'm not seeing any attached file in your original post, can you please upload it again?

Can you also post your command line and console output?

• Washington DC

Can you see the photo now? It's a png.

• Cambridge, MA

Yes, I see it now. Can you also post your command line and console output?

• Washington DC

Here they are. There was an error which we have been intermittently getting in the console output (identical runs will produce it one time but not another), but that's an issue for another day. :-)

• Cambridge, MA

OK, you might have a bug here. Could you isolate a snippet of the bam file that reproduces the downsampling issue and upload it to our FTP server? We'll need to take a closer look at this.

• Washington DC

Yeah, sure. I'll upload that later today.

• Washington DC

Okay, six .bam snippets, a ped file, a vcf intervals file, and the command line argument I used are all uploaded in the folder Downsampling_with_HC on the ftp server. I reproduced the errors myself with those six bam files in their entirety using that VCF as the intervals file, so I think you should be able to reproduce them as well.

Please let me know if there is anything else that you need, and thanks again for your help.

• Washington DC

Did you find the files on the ftp-server? Any progress on the issue?

• Cambridge, MA

Ack, my bad, I failed to log these into the tracker. Will do asap. Sorry for the delay.

• Cambridge, MA

@eflynn90, I found your files but it seems the bams are invalid. Samtools view returns

[bam_header_read] EOF marker is absent. The input is probably truncated.
[bam_header_read] invalid BAM binary header (this is not a BAM file).
[main_samview] fail to read the header from "IND1/IND1.short.bam".
`

The files must have been damaged in transfer. Can you please re-upload these, but this time zip the whole lot of them into an archive?

• Washington DC

Hi, sorry, my fault! I sent you unheadered files, just with the reads of interest. I'll upload headered files by the end of the day.

• Cambridge, MA

Ah, no worries. I probably won't get to them anymore today, but I'll process them first thing on Monday.

• Washington DC

Oops, forgot to do this Friday afternoon. It should all be on the ftp server now though. Downsampling_with_HC.tar.gz

• Cambridge, MA

It looks like there's a problem with your new files too. I tried to generate index files for the bams since you didn't upload any, but both samtools and Picard flipped out over them. Even Picard ValidateSAMFile errors out on the files. Please validate and test your files fully before uploading them to us.

• Washington DC

Sorry for wasting your time. I checked the new files so they should work, and I included the index files. Downsampling_with_HC.tar.gz

• Cambridge, MA

I think your new upload might not have overwritten the previous one, because the archive named "Downsampling_with_HC.tar.gz" still contains broken bams without index files. Can you please try uploading again with a different file name?

• Washington DC

Oh, my. Alright, Downsampling_with_HC_2013-08-27.tar.gz

• Cambridge, MA

Hurray, they work

OK, I've been able to replicate your issue, so I'll have the devs look at it more closely. I'll let you know what they find.

• Washington DC

Yayy! Thank you!

• Cambridge, MA

Hi @eflynn90,

Sorry it's taken a while, but we have finally found the time to confirm that this is indeed a bug in downsampling that specifically affects HaplotypeCaller processing. We'll let you know when we fix the issue.

• I can see in the documentation (https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php) that HaplotypeCaller has a "--maxReadsInRegionPerSample 1000" default setting. However, changing this parameter has no effect on the output. Is this a work in progress parameter?

• Cambridge, MA

@freeseek No, it should be functional. Keep in mind that this applies to the number of reads in an entire ActiveRegion, not per site.