HaplotypeCaller does not filter duplicate reads, why?

k_vlcakk_vlcak Czech RepublicMember

Im running HaplotypeCaller on a server this way:
java -XX:ParallelGCThreads=8 -Xmx80g -jar $GATK/GenomeAnalysisTK.jar -T HaplotypeCaller -I a2tl1_14_final.bam --min_base_quality_score 25 --min_mapping_quality_score 25 -rf DuplicateRead -rf BadMate -rf BadCigar -R JIC_reference/alygenomes.fasta -o a2tl1_14_HC1.g.vcf.gz -ploidy 2 -stand_call_conf 25 -ERC GVCF --pcr_indel_model NONE -nct 8 --max_num_PL_values 350

And I can not figure out why no duplicate reads are being filtered out although they were marked by Picard (with option TAGGING_POLICY=All ) and I also see around 20% of duplicates in corresponding samtools flagstat.

The beginning of the stdout looks like this:

INFO 04:37:54,634 HelpFormatter - -------------------------------------------------------------------------------- INFO 04:37:54,986 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.7-0-gcfedb67, Compiled 2016/12/12 11:21:18 INFO 04:37:54,986 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute INFO 04:37:54,986 HelpFormatter - For support and documentation go to https://software.broadinstitute.org/gatk INFO 04:37:54,986 HelpFormatter - [Tue Nov 07 04:37:54 CET 2017] Executing on Linux 3.16.0-4-amd64 amd64 INFO 04:37:54,986 HelpFormatter - Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 INFO 04:37:54,990 HelpFormatter - Program Args: -T HaplotypeCaller -I h2al1_21_final.bam --min_base_quality_score 25 --min_mapping_quality_score 25 -rf DuplicateRead -rf BadMate -rf BadCigar -R JIC_reference/alygenomes.fasta -o h2al1_21_HC1.g.vcf.gz -ploidy 2 -stand_call_conf 25 -ERC GVCF --pcr_indel_model NONE -nct 8 --max_num_PL_values 350 INFO 04:37:55,002 HelpFormatter - Executing as vlkofly@zigur17.cerit-sc.cz on Linux 3.16.0-4-amd64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27. INFO 04:37:55,003 HelpFormatter - Date/Time: 2017/11/07 04:37:54 INFO 04:37:55,003 HelpFormatter - -------------------------------------------------------------------------------- INFO 04:37:55,003 HelpFormatter - -------------------------------------------------------------------------------- WARN 04:37:55,009 GATKVCFUtils - Creating Tabix index for h2al1_21_HC1.g.vcf.gz, ignoring user-specified index type and parameter INFO 04:37:55,237 GenomeAnalysisEngine - Strictness is SILENT INFO 04:37:56,044 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 500 INFO 04:37:56,051 SAMDataSource$SAMReaders - Initializing SAMRecords in serial WARNING: BAM index file /scratch/vlkofly/job_386162.wagap-pro.cerit-sc.cz/h2al1_21_final.bai is older than BAM /scratch/vlkofly/job_386162.wagap-pro.cerit-sc.cz/h2al1_21_final.bam INFO 04:37:56,221 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.17 INFO 04:37:56,244 HCMappingQualityFilter - Filtering out reads with MAPQ < 25 INFO 04:37:56,289 MicroScheduler - Running the GATK in parallel mode with 8 total threads, 8 CPU thread(s) for each of 1 data thread(s), of 8 processors available on this machine INFO 04:37:57,903 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files INFO 04:37:58,090 GenomeAnalysisEngine - Done preparing for traversal INFO 04:37:58,090 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 04:37:58,091 ProgressMeter - | processed | time | per 1M | | total | remaining INFO 04:37:58,091 ProgressMeter - Location | active regions | elapsed | active regions | completed | runtime | runtime INFO 04:37:58,091 HaplotypeCaller - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output INFO 04:37:58,092 HaplotypeCaller - All sites annotated with PLs forced to true for reference-model confidence output WARN 04:37:58,411 InbreedingCoeff - Annotation will not be calculated. InbreedingCoeff requires at least 10 unrelated samples. INFO 04:37:58,510 HaplotypeCaller - Using global mismapping rate of 45 => -4.5 in log10 likelihood units INFO 04:37:58,511 PairHMM - Performance profiling for PairHMM is disabled because the program is being run with multiple threads (-nct>1) option

And the info lines showing no duplicates removed:

INFO 16:53:05,155 ProgressMeter - Total runtime 130507.06 secs, 2175.12 min, 36.25 hours INFO 16:53:05,155 MicroScheduler - 46705813 reads were filtered out during the traversal out of approximately 149962396 total reads (31.15%) INFO 16:53:05,155 MicroScheduler - -> 0 reads (0.00% of total) failing BadCigarFilter INFO 16:53:05,156 MicroScheduler - -> 13334530 reads (8.89% of total) failing BadMateFilter INFO 16:53:05,156 MicroScheduler - -> 0 reads (0.00% of total) failing DuplicateReadFilter INFO 16:53:05,156 MicroScheduler - -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter INFO 16:53:05,156 MicroScheduler - -> 31823278 reads (21.22% of total) failing HCMappingQualityFilter INFO 16:53:05,156 MicroScheduler - -> 0 reads (0.00% of total) failing MalformedReadFilter INFO 16:53:05,157 MicroScheduler - -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter INFO 16:53:05,157 MicroScheduler - -> 1548005 reads (1.03% of total) failing NotPrimaryAlignmentFilter INFO 16:53:05,157 MicroScheduler - -> 0 reads (0.00% of total) failing UnmappedReadFilter

Best Answers


  • k_vlcakk_vlcak Czech RepublicMember

    with the redundant filters removed the report shows correct percentage of reads failing DuplicateReadFilter.

    The question is whether I can be sure that previous runs of haplotypecaller (with the redundant filters) removed the duplicates and the only problem is the snafu of the report?

    Many thanks,

  • k_vlcakk_vlcak Czech RepublicMember

    Hi, the results are quite different.
    Without the redundant filters 20% more reads are being filtered out and the number of variants found is also different, although sometimes I get more, sometimes less.
    I will submit the bug report soon.

    Issue · Github
    by shlee

    Issue Number
    Last Updated
    Closed By
  • edited November 2017

    @shlee said:
    Hi @k_vlcak,

    HaplotypeCaller already excludes duplicate reads from consideration and so there is no need for you to additionally ask to exclude these with the -rf engine parameter. You can see the list of read filters each GATK tool applies on the tool documentation, e.g. HaplotypeCaller lists these under Additional Information > Read filters and these happen to be:


    That being said, additionally telling the tool to apply a read filter should not make a difference in the run and your duplicate reads should be excluded by the analysis. Let's rule out whether this could be some snafu in the the way the INFO field reports filtered reads for default tool settings versus user-specified parameters.

    Can you let me know if you run with just the additional read filters not on the list above what you get for the INFO read filter metrics? I think the BadMate filter is the only odd duck out so the command would look like:

    java -XX:ParallelGCThreads=8 -Xmx80g -jar $GATK/GenomeAnalysisTK.jar -T HaplotypeCaller -I a2tl1_14_final.bam --min_base_quality_score 25 --min_mapping_quality_score 25 -rf BadMate  -R JIC_reference/alygenomes.fasta -o a2tl1_14_HC1.g.vcf.gz -ploidy 2 -stand_call_conf 25 -ERC GVCF --pcr_indel_model NONE -nct 8 --max_num_PL_values 350


    Helped me a lot. Very useful reference. Thanks a lot for the solution @shlee.

  • shleeshlee CambridgeMember, Broadie, Moderator

    @k_vlcak--great. I'll be on the lookout. If you have the time, can you see if you observe this behavior in GATK4? I can also check this for you once I have your test data from the bug report. The reason you might want to test this is so you can use GATK4 instead with your commands if GATK4 doesn't exhibit the same bug.

  • shleeshlee CambridgeMember, Broadie, Moderator

    Any news on the bug report @k_vlcak? It would be nice to get this to the developers soon.

  • k_vlcakk_vlcak Czech RepublicMember

    Hi @shlee, I have uploaded the bugreport to your server on 28. Nov under name bugreport_redundant_filters_HC_Jakub_Vlcek.tar.gz. Excuse me I have not notify you earlier in here. Anyway thank you for solving this problem. I have rerun haplotypecalling with GATK3.7 without redundant filters and it works alright.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator


    Thanks for letting us know. Glad to see the problem is resolved :smile:


  • shleeshlee CambridgeMember, Broadie, Moderator

    Just seeing this @k_vlcak. Thanks for coming back with test data. Based on @Sheila's response, I think perhaps this discrepancy will not get fixed in GATK3, especially since our developers have moved on to GATK4. Rather, we hope that researchers will be wary and not specify filters redundantly. Thanks again for reporting this and for taking the time to submit a bug report.

Sign In or Register to comment.