

Why is HaplotypeCaller dropping half of my reads?

Hi
I have been trying HaplotypeCaller to find SNPs and indels in viral read data (haploid), but am finding that it throws away around half of my reads and I don't understand why. A small proportion (8%) are filtered out as duplicates and 0.05% fail on mapping quality, but I can't account for the majority of the lost reads. I appreciate that GATK wasn't built for viral sequences, but would you have an idea of what could be causing this?
I use the following command after marking duplicates and realigning around indels:
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R Ref.fasta -I realigned_reads.bam --genotyping_mode DISCOVERY -ploidy 1 -bamout reassembled.bam -o rawvariants.vcf
I have also tried the same file with UnifiedGenotyper and I get the result I expect, i.e. most of my reads are retained and I have SNP calls that agree with a VCF constructed in a different program. So I assume the reads are lost as part of the local realignment?

Thanks
Kirstyn


Issue · GitHub, by Geraldine_VdAuwera
Issue #852 · State: closed · Closed by vdauwera

Answers

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie · admin

    Hi Kirstyn,

    There are two features that could account for your read loss:

    1) Downsampling of very high coverage (but that doesn't seem likely if UG doesn't do the same thing)

2) HC's more stringent quality filters: most likely, given your original post. Do you have a plot of the distribution of mapping quality in your data?

  • kirstyn · Member

    Hi

    Thanks for your quick reply! The mapping quality of my data appears to be very good (I've attached a couple of plots).

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie · admin

Ah, those do look decent -- that's not what I was expecting. Can you tell me what makes you think HC is throwing away reads? If it's because of depth annotations, can you please tell me what version you're using, what command line, and post a few records showing the problem?

  • kirstyn · Member

    I am using version 3.3-0 with the command line:
    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I input.bam --genotyping_mode DISCOVERY -ERC BP_RESOLUTION -variant_index_type LINEAR -variant_index_parameter 128000 -ploidy 1 -bamout test1.bam -o test1.vcf
I have also used -ERC GVCF. I've actually confused myself even more now, because I seem to have been wrong about UnifiedGenotyper retaining all my reads: it also seems to drop a lot of reads from the original bam file, but still calls SNPs accurately. I have shown an example below for a SNP that is present in nearly 100% of reads but has not been called by HaplotypeCaller.

    1. GVCF
      TzDg_1643_genome5FULL.seq 6888 . C . . END=6888 GT:DP:GQ:MIN_DP:PL 0:346:0:346:0,0

    2. BP
      TzDg_1643_genome5FULL.seq 6888 . C . . . GT:AD:DP:GQ:PL 0:1,345:346:0:0,0

    3. Unified genotyper
      TzDg_1643_genome5FULL.seq 6888 . C T 10459 . AC=1;AF=1.00;AN=1;DP=249;Dels=0.00;FS=0.000;HaplotypeScore=6.9017;MLEAC=1;MLEAF=1.00;MQ=60.00;MQ0=0;QD=29.85;SOR=0.802 GT:AD:DP:GQ:PL 1:0,249:249:99:10489,0

It seems to me that this position is not being considered as an active region in HC, as the bamout file has no reads in this region. The alternative base T is not even present in the GVCF, but the GQ is zero and both PLs (ref and non-ref) are zero, which shows uncertainty in the call? I think I'm not properly understanding this output, so if you can help to clarify I would greatly appreciate it!

    Many thanks

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie · admin

    Ok, the good news is that I think the "read loss" you're seeing is just from normal downsampling. Over a certain amount of coverage, the tools downsample reads so that they have enough data but not so much as to get unreasonably slowed down. This is by design and not a cause for concern.

    But you're right that that call is problematic. Your interpretation of the GQ and PL is correct: for some reason the program didn't process that region as active and was unable to determine the sample genotype with confidence (hence the zeros).
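    For anyone reading along, the GQ/PL relationship at play here follows the standard VCF convention: GQ is the gap between the lowest (best) and second-lowest Phred-scaled genotype likelihood, capped at 99. A minimal sketch of that convention (not GATK's actual code):

    ```python
    def gq_from_pls(pls):
        """Genotype quality: gap between the best (lowest) and
        second-best Phred-scaled genotype likelihood, capped at 99."""
        ordered = sorted(pls)
        return min(ordered[1] - ordered[0], 99)

    # Kirstyn's haploid HC record has PL = 0,0: REF and ALT are equally
    # likely, so GQ = 0 (no confidence in the genotype).
    print(gq_from_pls([0, 0]))      # 0

    # The UnifiedGenotyper call has PL = 10489,0: GQ is capped at 99.
    print(gq_from_pls([10489, 0]))  # 99
    ```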

    There are a few advanced arguments that can force HC to process a region as active, but since it's a tricky process and the haploid functionality is very new anyway, I'd like to put this through our debugging process in case it's something we can improve without requiring you to fiddle with individual sites. Would you be able to share a snippet of the data with us? Instructions are here: http://www.broadinstitute.org/gatk/guide/article?id=1894

  • kirstyn · Member

    Hi Geraldine
    I have uploaded "kirstyn_hc_snpProblem.tar.gz" to the ftp site.
    Thanks

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie · admin

    Thanks @kirstyn‌, we'll try to get this processed quickly.

  • Sheila · Broad Institute · Member, Broadie, Moderator · admin

    @kirstyn‌

    Hi,

    I am trying to run your files, but I do not have the reference fasta file and the reference index file. Also, the snippet of the bam file you sent is not in bam format. Can you please upload those?

    Thanks,
    Sheila

  • kirstyn · Member

    Hi Sheila
Sorry, I don't know what happened there; I've uploaded the archive again.

  • Sheila · Broad Institute · Member, Broadie, Moderator · admin
    edited December 2014

    @kirstyn‌

    Hi,

    I submitted a bug report. I cannot make any promises, but hopefully this can be taken care of.
    This is a difficult bug because the region is full of SNPs and potentially an SNV.

    -Sheila

  • Sheila · Broad Institute · Member, Broadie, Moderator · admin

    @kirstyn
    Hi,

    Can you please upload your reference file again? It got deleted somehow.

    Thanks,
    Sheila

  • Nardus · Glasgow · Member

    I appear to have run into the same problem – regions with very high numbers of SNPs end up having no reads in the -bamout file, presumably because the region is not being processed.

    As in Kirstyn's post, this results in calls with almost no reads supporting them:
    GT:AD:DP:GQ:PL 0:1,435:436:0:0,0

    The screenshot below shows one such region, with the top track showing the input bam file, while the second track shows the -bamout file from running the following command line:

    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -I 71_all_dedup+realign.bam -R TanzaniaDog_RV2772_KF155002.1.fasta -o Test71.g.vcf -bamout Test1_bamout.bam --genotyping_mode DISCOVERY --emitRefConfidence BP_RESOLUTION --sample_ploidy 1 --annotateNDA --pcr_indel_model AGGRESSIVE --activeRegionOut Test1_activeregions.tab --activityProfileOut Test1_raw_activity_profile.tab

    In this case the raw activity profile seems to pick up at least some of this variation, so is the problem that there can't be too many adjacent active regions? It's not clear to me what changing the maximum active region size will do (apart from slowing things down), but making it arbitrarily large (in this case slightly larger than the actual genome; --activeRegionMaxSize 12000) produces the file shown in the bottom track.

    This fixes the problem for all 3 affected regions in this sample, but will it have other unintended consequences?


    (Screenshot: top track is the input bam file, middle track the default -bamout, bottom track the -bamout produced with --activeRegionMaxSize 12000.)

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie · admin

    I think this is a side effect of our focus on human genetics; the program doesn't expect quite so much divergence from the reference, and so some of the logic that controls how to cut up the problem into manageable chunks is not entirely appropriate for data like yours.

    It should be safe to run systematically with a large max active region size setting. Considering you're dealing with small genomes, the performance hit should be manageable.
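    To make that concrete, here is a sketch of Nardus's command rerun systematically with the larger setting (GATK 3.x flags and the file names from earlier in this thread; the 12000 value assumes the genome is under 12 kb, as in this case):

    ```shell
    # Rerun HaplotypeCaller with the active region allowed to span the
    # whole (small) viral genome, so highly divergent regions are not
    # skipped. File names are the ones used earlier in this thread.
    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
        -R TanzaniaDog_RV2772_KF155002.1.fasta \
        -I 71_all_dedup+realign.bam \
        --genotyping_mode DISCOVERY \
        --emitRefConfidence BP_RESOLUTION \
        --sample_ploidy 1 \
        --activeRegionMaxSize 12000 \
        -bamout Test1_bamout.bam \
        -o Test71.g.vcf
    ```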
