Downsampling with HaplotypeCaller

eflynn90eflynn90 Washington DCPosts: 56Member

Hello,

I'm running the Haplotype Caller on 80 bams, multithreaded (-nt 14), with either vcfs or bed files as the intervals file. I am trying to downsample to approximately 200 reads (-dcov 200). However, my output vcfs clearly have much higher depth (over 300-500 reads) in some areas. Also, HC drastically slows down whenever it hits a region of very deep pile-ups (over 1000 reads).

Is there any way to speed up HC on regions of very deep pileups? Is there a way to make downsampling stricter?

Thank you for your help.
Elise

dcov_2013-08-13.png
1366 x 335 - 99K

Best Answer

Answers

  • eflynn90eflynn90 Washington DCPosts: 56Member

    I am running GATK v2.6-5-gba531bd, compiled 2013/07/18.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,122Administrator, GATK Developer admin

    Hi Elise,

    To clarify, the downsampling is applied per sample, so if you have 80 samples you should downsample quite a bit more than to -dcov 200, since that allows up to 200 reads per sample.

    Geraldine Van der Auwera, PhD

  • eflynn90eflynn90 Washington DCPosts: 56Member

    Sorry, I wasn't clear. I am seeing over 300-500 reads per sample. And the regions of very deep pileups that I mentioned are over 1000 reads per sample.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,122Administrator, GATK Developer admin

    I see. That shouldn't happen; downsampling can lead to a somewhat smaller number of reads than the targeted coverage in some conditions, but never more. Can you tell me how you're assessing the depth in the VCF? Also, were your bam files compressed with ReduceReads?

    Geraldine Van der Auwera, PhD

  • eflynn90eflynn90 Washington DCPosts: 56Member

    I'm assessing the depth by using the DP annotation in the sample columns of the VCF. They seem to match with the AD columns as well. Can you see the screenshot that I uploaded with my original post? There were some examples there.

    I thought that the problem might have been that there were multiple bams per sample, and HC was taking 200 reads per bam. However, even my samples with only one bam associated with them have coverage over 500 in some locations.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,122Administrator, GATK Developer admin

    I'm not seeing any attached file in your original post, can you please upload it again?

    Can you also post your command line and console output?

    Geraldine Van der Auwera, PhD

  • eflynn90eflynn90 Washington DCPosts: 56Member

    Can you see the photo now? It's a png.

    60ac05bf347ea3d8e94faacbd3ed14.png
    1366 x 335 - 99K
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,122Administrator, GATK Developer admin

    Yes, I see it now. Can you also post your command line and console output?

    Geraldine Van der Auwera, PhD

  • eflynn90eflynn90 Washington DCPosts: 56Member

    Here they are. There was an error which we have been intermittently getting in the console output (identical runs will produce it one time but not another), but that's an issue for another day. :-)

    txt
    txt
    CommandLine.txt
    1K
    txt
    txt
    ConsoleOutput.txt
    205K
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,122Administrator, GATK Developer admin

    OK, you might have a bug here. Could you isolate a snippet of the bam file that reproduces the downsampling issue and upload it to our FTP server? We'll need to take a closer look at this.

    Geraldine Van der Auwera, PhD

  • eflynn90eflynn90 Washington DCPosts: 56Member

    Yeah, sure. I'll upload that later today.

  • eflynn90eflynn90 Washington DCPosts: 56Member

    Okay, six .bam snippets, a ped file, a vcf intervals file, and the command line argument I used are all uploaded in the folder Downsampling_with_HC on the ftp server. I reproduced the errors myself with those six bam files in their entirety using that VCF as the intervals file, so I think you should be able to reproduce them as well.

    Please let me know if there is anything else that you need, and thanks again for your help.

  • eflynn90eflynn90 Washington DCPosts: 56Member

    Did you find the files on the ftp-server? Any progress on the issue?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,122Administrator, GATK Developer admin

    Ack, my bad, I failed to log these into the tracker. Will do asap. Sorry for the delay.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,122Administrator, GATK Developer admin

    @eflynn90, I found your files but it seems the bams are invalid. Samtools view returns

    [bam_header_read] EOF marker is absent. The input is probably truncated.
    [bam_header_read] invalid BAM binary header (this is not a BAM file).
    [main_samview] fail to read the header from "IND1/IND1.short.bam".
    

    The files must have been damaged in transfer. Can you please re-upload these, but this time zip the whole lot of them into an archive?

    Geraldine Van der Auwera, PhD

  • eflynn90eflynn90 Washington DCPosts: 56Member

    Hi, sorry, my fault! I sent you unheadered files, just with the reads of interest. I'll upload headered files by the end of the day.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,122Administrator, GATK Developer admin

    Ah, no worries. I probably won't get to them anymore today, but I'll process them first thing on Monday.

    Geraldine Van der Auwera, PhD

  • eflynn90eflynn90 Washington DCPosts: 56Member

    Oops, forgot to do this Friday afternoon. It should all be on the ftp server now though. Downsampling_with_HC.tar.gz

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,122Administrator, GATK Developer admin

    It looks like there's a problem with your new files too. I tried to generate index files for the bams since you didn't upload any, but both samtools and Picard flipped out over them. Even Picard ValidateSAMFile errors out on the files. Please validate and test your files fully before uploading them to us.

    Geraldine Van der Auwera, PhD

  • eflynn90eflynn90 Washington DCPosts: 56Member

    Sorry for wasting your time. I checked the new files so they should work, and I included the index files. Downsampling_with_HC.tar.gz

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,122Administrator, GATK Developer admin

    I think your new upload might not have overwritten the previous one, because the archive named "Downsampling_with_HC.tar.gz" still contains broken bams without index files. Can you please try uploading again with a different file name?

    Geraldine Van der Auwera, PhD

  • eflynn90eflynn90 Washington DCPosts: 56Member

    Oh, my. Alright, Downsampling_with_HC_2013-08-27.tar.gz

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,122Administrator, GATK Developer admin

    Hurray, they work :)

    OK, I've been able to replicate your issue, so I'll have the devs look at it more closely. I'll let you know what they find.

    Geraldine Van der Auwera, PhD

  • eflynn90eflynn90 Washington DCPosts: 56Member

    Yayy! Thank you!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,122Administrator, GATK Developer admin

    Hi @eflynn90,

    Sorry it's taken a while, but we have finally found the time to confirm that this is indeed a bug in downsampling that specifically affects HaplotypeCaller processing. We'll let you know when we fix the issue.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.