Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra

Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

GBS data and duplicates marking

GuillefriisGuillefriis SpainMember
edited May 2016 in Ask the GATK team

Hi, I just read this in the Best Practices guide:

"Duplicate marking should NOT be applied to amplicon sequencing data or other data types where reads start and stop at the same positions by design."

That's somehow what happens with GBS data since the sequenced fragments are obtained by cutting at the same place across the genome with a restriction enzyme for all the samples, have I been wrong when dedupping my data before the genotyping? Many thanks!


Best Answer


  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @Guillefriis,

    When you dedup your data, what percent duplication does the metrics file give you? Is it a high number? I'm not familiar with GBS so could you tell us more? Specifically, could you tell us more about the experimental design, e.g. are the preparations PCR-free, are the RE target sites degenerate, etc. Or better yet, could you provide us an IGV screen shot of a typical alignment site?

    Targeted amplicon sequencing by design give deep stacks of short reads whose start and stop positions become indistinguishable and for which the application of MarkDuplicates, which defines duplicates as having the same start and stop positions in alignment, provide little value.

  • GuillefriisGuillefriis SpainMember

    Hi @shlee, sorry it has taken so long for me to reply. GBS is basically a simplified RAD-seq technique, it has a PCR step in library preparation to add Illumina adapters although followed by a cleaning step (sensu Elshire 2011 where the technique is described, Cornell University staff performed the sequencing thorugh their GBS service), RE is PstI and it isn't degenerated. Unluckily and don't have access now to the metrics file and I have never used IGV but I have duplication levels of around 85% as measured with FastQC. During the haplotype calling around half of the reads are filtered out because of failing DuplicateReadFilter.
    I't not like targeting an amplicon, but basically we analyzed sequences corresponding consistely to the same genomics fraction of the genome across all the samples, as obtained by digesting the DNA. Hope this helps you to help me, thanks a lot!


  • GuillefriisGuillefriis SpainMember

    Thanks a lot @shlee, I have been getting really low depth coverages in my SNPcalling and this is likely one of the reasons. My barcoding is not random, we barcode each individual with a specific sequence that I use for demultiplexing them before mapping, so for the Picard MarkDuplicates step the reads don't carry the barcode anymore. Just a couple of last questions:

    A. By using the -drf DuplicateRead I disable the duplicate filtering even when I've already marked them with PicardTools? It will have no effect or should I rerun the pipeline skipping the dedupping step?

    B. These are the GATK tools I tipically use in my pipeline, do I need to disable the duplicate filtering in all of them?
    -T RealignerTargetCreator
    -T IndelRealigner
    -T HaplotypeCaller
    -T CombineGVCFs
    -T GenotypeGVCFs
    -T SelectVariants

    Thanks one more time, this can be a really break-through in my work.


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    The -drf DuplicareRead command will tell tools to ignore the Sam flags that identify reads as duplicates, so you don't need to do any reprocessing.

    You only need to include it in tools that read bam files, so in your pipeline, up to (including) HaplotypeCaller.
  • GuillefriisGuillefriis SpainMember
    edited June 2016

    Thanks @Geraldine_VdAuwera, I'm trying it at this very moment, is the way of disabling filters new in the actual GATK version? I use GATK 3.3-0 (as available in the specialized server we use in my group) and I'm getting the error "Argument with name 'drf' isn't defined', any ideas which could be the problem? Thanks a lot. The code:

    for bams in $bam_files

    filename=$(basename "$bams" .bam)
    java -jar $gatk \
        -T RealignerTargetCreator \
        -I $bams \
        -R $genome \
        -drf DuplicateRead \
        -o $output/${filename}.intervals \
        --minReadsAtLocus 4


  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited June 2016


    I tested the -drf DuplicateRead option using v3.3-0 and I get the same error for RealignerTargetCreator and for HaplotypeCaller:

    wmd2a-330:blog_indel_realignment_GATKv3.5 shlee$ java -jar /Users/shlee/Desktop/GenomeAnalysisTK-3.3-0-g37228af.jar -T HaplotypeCaller -R ../../../ref/human_g1k_v37_decoy.fasta -I 7156_snippet.bam -o test_20160608_3.3.vcf -drf DuplicateRead -L 10:96000000-97000000
    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR A USER ERROR has occurred (version 3.3-0-g37228af): 
    ##### ERROR
    ##### ERROR This means that one or more arguments or inputs in your command are incorrect.
    ##### ERROR The error message below tells you what is the problem.
    ##### ERROR
    ##### ERROR If the problem is an invalid argument, please check the online documentation guide
    ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ##### ERROR
    ##### ERROR Visit our website and forum for extensive documentation and answers to 
    ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ##### ERROR
    ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ##### ERROR
    ##### ERROR MESSAGE: Argument with name 'drf' isn't defined.
    ##### ERROR ------------------------------------------------------------------------------------------

    I'm guessing the option was added for later GATK releases. Since it's always a good idea to use the latest release of GATK possible, given we are constantly improving the tools and fixing bugs, is it possible for your group to provide an updated version, e.g. to v3.6? This is especially desirable if you are just starting your project given you'd want to stick with the same version per experiment.

    Alternatively, if you still have the BAM prior to duplicate marking, use this instead. If you do not, certain external forums outline ways to remove the duplicate flag.

    On a technical note, IndelRealigner does not use the DuplicateReadFilter although RealignerTargetCreator does. However, because MarkDuplicates (assuming you used this tool to flag duplicates) defines duplicates on the insert's unclipped alignment start and unclipped alignment end, the alignment that is between--whether they contain mismatches or indels--matters not. What this means is that, for your GBS data you should definitely make sure to disable the DuplicateReadFilter. Otherwise RealignerTargetCreator may skip inclusion of target regions with potential indels. The same applies for HaplotypeCaller with potential SNPs and indels. In the end you will miss calling variants and/or increase false positive calls.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    If I recall correctly we added it in 3.4.

  • GuillefriisGuillefriis SpainMember

    Thanks you both @shlee and @Geraldine_VdAuwera, I'm basically working in phylogenomics and population genomics and the variant recovery (and consequently the resolution of my analyses) has improved so much now that I can filtered by depth coverage, which has increased by a factor of ~3, and because I think they are more reliable, as you foresaw @shlee. I had to stick for the moment to the 3.3 version and bams previous to MarkDuplicates ran because there are problems with Java 1.8 implementation in the server and GATK 3.6 can't work under older versions, right? At least I couldn't make it work.
    I found a couple of strange things though: first of all, there are several sites across the variant matrix that show the same heterozygous genotype (let's say R for instance) for all or nearly all the samples, usually consecutively. They really seem to be artifacts, is there any known source in GATK for these kind errors?
    Also, after using SelectVariants tool to filter out monomorphic sites (my reference genome is from a different species and I'm interested only in the variants within my sample set, so I set restrictAllelesTo to MULTIALLELIC as proposed by @Sheila) strangely long indels seems to appear in my dataset, have you dealt with this before by chance? Thanks again, you support is great and really appreaciated !


  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    @Guillefriis Can you tell us more about these consecutive heterozygous R variants and why you think they are artifacts?

    Are you saying after SelectVariants, indels not present in the VCF before are now appearing?

  • BegaliBegali GermanyMember ✭✭



    I have one question as I am working on RADSeq and I do not have knownsite_indel so it is nesseray to do indel-realignment step or not also should I , you should disable the DuplicateReadFilter when running GATK tools with adding -drf DuplicateRead because I call my variants directly after preprocesses steps ( map my reads.sample by BWA with mem for single read to obtain sam .then sort index my bam.file ) , also I can not apply BSQR so skip those will not affect my result at the end ..aim of my project to discover Varaints at the end and link them to phenotypes of plants for multiple samples ..

    Thanks so much

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @Begali,

    You can still perform indel realignment without a known indel sites file. See this tutorial.

    Three input choices are technically feasible in creating a target intervals list: you may provide RealignerTargetCreator (i) one or more VCFs of known indels each passed in via -known, (ii) one or more alignment BAMs each passed in via -I or (iii) both. We recommend the last mode, and we use it in the example command. We use these same input files again in the realignment step.

    In step 2, use RealignerTargetCreator in the alignment-only mode and omit specifying the known sites, e.g.

    java -jar GenomeAnalysisTK.jar \
        -T RealignerTargetCreator \
        -R human_g1k_v37_decoy.fasta \
        -L 10:96000000-97000000 \
        -I 7156_snippet.bam \
        -o 7156_realignertargetcreator.intervals
  • BegaliBegali GermanyMember ✭✭
    edited June 2017

    [email protected]

    thanks so much for your quick reply even if without -L which is not known I am working on plants RADSeq and how about -drf DuplicateRead as I should not use Mark_duplicate however for -drf DuplicateRead should I apply ...

    Thanks in advance

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @Begali,

    For an explanation of standard engine parameters see https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_engine_CommandLineGATK.php.

    You can use MarkDuplicates if you want. The -drf DuplicateRead option allows downstream tools to include the duplicate flagged reads in analyses. If you don't mark duplicates, then there is no need for the -drf DuplicateRead option in downstream tools.

  • evetcevetc Member
    > @Geraldine_VdAuwera said:
    > The -drf DuplicareRead command will tell tools to ignore the Sam flags that identify reads as duplicates, so you don't need to do any reprocessing.
    > You only need to include it in tools that read bam files, so in your pipeline, up to (including) HaplotypeCaller.

    Hello, I also have GBS data so I have skipped the step to MarkDuplicates as I am expecting a lot of duplicates that are not simply do to PCR. If I have not marked the duplicates do i need to run -drf DuplicateRead?

    If I do run it with -drf DuplicateRead like this:
    ~/bin/gatk/gatk --java-options "-Xmx4g" HaplotypeCaller -drf DuplicateRead -R ~/Pararge_aegeria/Pararge_CW/Pararge_aegeria_v2.softmasked.fa -I trimmed-index1-etcGBS_barcode_01.bam -O trimmed-index1-etcGBS_barcode_01.g.cf.gz -ERC GVCF
    I get this error:
    "A USER ERROR has occurred: d is not a recognized option"

    It does not produce an error when I do not run it with -drf DuplicateRead

    I am using GATK4.0

    Thanks for your help!
  • evetcevetc Member
    Hi @bhanuGandham

    Thank you for your response.
    Does this mean I should change it to -DF DuplicateRead?
    What would happen if I simply run it without this?

    Thank you,
  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
    edited June 24

    Hi @evetc

    Here are a list and explanations of read filters you can use with -DF option:https://software.broadinstitute.org/gatk/documentation/tooldocs/current/picard_vcf_MendelianViolations_FindMendelianViolations.php#ReadFilters

    PS: Checkout Terra for end-to-end GATK pipelining solutions and let us know what more pipelines we can add that will make using GATK easier for you! For more details on whether this is the right fit for you checkout our blog page.

Sign In or Register to comment.