Any downstream effect of setting OPTICAL_DUPLICATE_PIXEL_DISTANCE to 2500 for Patterned Flow Cell?

I used to work on NextSeq data, so I didn't need to
set OPTICAL_DUPLICATE_PIXEL_DISTANCE for MarkDuplicates.
Now I have started working on patterned flow cell data, and I was
told that I should set OPTICAL_DUPLICATE_PIXEL_DISTANCE
to 2500.
I did that. The MarkDuplicates metrics file now reports more optical
duplicates and a higher library complexity.
However, there is no difference in the downstream files.
Is this expected? Or is there any way to generate better
results?

Thanks a lot in advance.

Answers

  • xiucz Member

    good question, any more comments will be appreciated.

  • ymc Member

    http://seqanswers.com/forums/showthread.php?t=41057&page=32

    bbmap suggests 12000 for NovaSeq. Does GATK team also agree with
    this value?

  • AdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @ymc That is an interesting question. I just want to check that by "patterned flow cell data" you are referring to HiSeq 4000?

    The information contained in this Tutorial might be helpful.

  • ymc Member

    My data is NovaSeq, so I would like to know whether the value should be increased.

    As to my original question, it does seem that changing the parameter
    can change results according to the following excerpts from the tutorial.
    However, I am not observing any difference in reality...

    "Pair orientation F1R2 is distinct from F2R1 for optical duplicates
    Here we refer you to a five minute video illustrating what happens at the molecular level in a typical sequencing by synthesis run.

    What I would like to highlight is that each strand of an insert has a chance to seed a different cluster. I will also point out, due to sequencing chemistry, F1 and R1 reads typically have better base qualities than F2 and R2 reads.

    Optical duplicate designation requires the same pair orientation.

    Let us work out the implications of this for a paired end, unstranded DNA library. During sequencing, within the flow cell, for a particular insert produced by sample preparation, the strands of the insert are separated and each strand has a chance to seed a different cluster. Let's say for InsertAB, ClusterA and ClusterB and for InsertCD, ClusterC and ClusterD. InsertAB and InsertCD are identical in sequence and length and map to the same loci. It is possible InsertAB and InsertCD are PCR duplicates and also possible they represent original inserts. Each strand is then sequenced in the forward and reverse to give four pieces of information in total for the given insert, e.g. ReadPairA and ReadPairB for InsertAB. The pair orientation of these two pairs are reversed--one cluster will give F1R2 and the other will give F2R1 pair orientation. Both read pairs map exactly to the same loci. Our duplicate marking tools consider ReadPairA and ReadPairB in the same duplicate set for regular duplicates but not for optical duplicates. Optical duplicates require identical pair orientation."
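
The rule quoted above can be sketched in a few lines of Python. This is only an illustration and not Picard's code: the `Cluster` type, the `parse_cluster` helper, and the example read names are hypothetical, and the sketch assumes that two reads from the same duplicate set count as optical duplicates when they share lane, tile and pair orientation and their x and y coordinates each differ by at most OPTICAL_DUPLICATE_PIXEL_DISTANCE.

```python
from typing import NamedTuple


class Cluster(NamedTuple):
    lane: int
    tile: int
    x: int
    y: int
    orientation: str  # "F1R2" or "F2R1"


def parse_cluster(read_name: str, orientation: str) -> Cluster:
    """Hypothetical helper: take lane, tile, x, y from the last four
    ':'-separated fields of an Illumina-style read name."""
    lane, tile, x, y = (int(v) for v in read_name.split(":")[-4:])
    return Cluster(lane, tile, x, y, orientation)


def is_optical_pair(a: Cluster, b: Cluster, pixel_distance: int) -> bool:
    """Assumed rule: same lane and tile, same pair orientation, and both
    coordinate differences within pixel_distance."""
    same_tile = a.lane == b.lane and a.tile == b.tile
    same_orientation = a.orientation == b.orientation
    close = (abs(a.x - b.x) <= pixel_distance
             and abs(a.y - b.y) <= pixel_distance)
    return same_tile and same_orientation and close


# Made-up read names; the two clusters are 1500 pixels apart in x.
a = parse_cluster("A00719:11:H3YJ2DRXX:1:1101:1000:2000", "F1R2")
b = parse_cluster("A00719:11:H3YJ2DRXX:1:1101:2500:2000", "F1R2")
c = parse_cluster("A00719:11:H3YJ2DRXX:1:1101:2500:2000", "F2R1")

print(is_optical_pair(a, b, 100))    # too far apart for the small window
print(is_optical_pair(a, b, 2500))   # inside the patterned-flow-cell window
print(is_optical_pair(a, c, 2500))   # opposite orientation, never optical
```

Under these assumptions, widening the window only moves pairs such as `a` and `b` from the non-optical to the optical category; pairs with opposite orientation stay non-optical at any distance.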

  • AdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Hi @ymc

    I did some more research and found an interesting discussion that describes the distribution of duplicate reads versus pixel size.

    It appears from this information that setting the filter at 2500 removes machine generated artifacts for patterned flow cells. Setting the value lower would probably result in a reduction in detected duplicates. Setting the value higher would probably not result in a difference in detection.

    Another point to consider on the wet lab side is optimizing the loading concentration on the flowcell to reduce the generation of large clusters or re-seeding events.

    Let us know if this information helps.

  • ymc Member

    Dear Adelaide,

    Well, initially I learned about this issue from the "discussion" you just cited.
    That's why I gave it a try. However, while the library_size in the metrics file increased,
    all downstream output remains identical.
    As in my original post, I just want to know whether this is the correct behavior
    or not.

  • AdelaideR Unconfirmed, Member, Broadie, Moderator admin

    I believe that would be the correct behavior if all duplicates were within that distance. What do you think @ymc?

  • AdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Did you see this discussion? @ymc I thought the images of where the duplicates are in relation to each other was helpful.

  • ymc Member

    @AdelaideR said:
    I believe that would be the correct behavior if all duplicates were within that distance. What do you think @ymc?

    Are you saying all the duplicates between distance of 100 and 2500
    are considered as PCR duplicates when I set it to 100? This can only
    happen when all their orientations are the same according to the paragraphs
    I quoted above

  • AdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @ymc

    Please look at the histogram in the original "discussion" I posted. The likelihood of encountering optical duplicates is not directional, but based on distance.

    Here is another article: https://sequencing.qcfail.com/articles/illumina-patterned-flow-cells-generate-duplicated-sequences/
    Lowering your setting to 100 would leave more optical duplicates undetected, so more of the reads counted would be generated by the machine rather than by the true biology.

    As for using GATK with patterned flow cells, the best practice is to set OPTICAL_DUPLICATE_PIXEL_DISTANCE to 2500 unless the technology changes again. (Also, optimize loading concentrations to reduce the production of excessive duplicates in the first place.)

    If you are doing a reference-free assembly or another type of analysis that depends on a more exact number of counts, that would not be related to GATK procedures, then you could try one of the suggestions in the comments on that article. Or you could map them using the script in the second link I provided.

    Think about it this way: if your alternative allele count were inflated by machine duplicates reaching your downstream analysis, which loci would be the most affected? The rare ones with low coverage. Other filters set in your analysis are probably removing these low-frequency, low-coverage sites, so the effect of letting the duplicates through is probably mitigated by those filters.

  • ymc Member

    I just tried 12000 for my NovaSeq data. Library complexity is about 10% higher in the metrics output.
    Downstream output remains identical.
    Does that mean 12000 is better for NovaSeq as bbmap said?

  • AdelaideR Unconfirmed, Member, Broadie, Moderator admin
    edited December 3

    @ymc can you please provide some example data from your library complexity metrics? How are you measuring it? Is there one program in particular?

  • AdelaideR Unconfirmed, Member, Broadie, Moderator admin

    @ymc I reached out to the development team about your request, here is the response:

    "It would be helpful to know their workflow, in particular: 1) the values they've specified for TAGGING_POLICY, REMOVE_DUPLICATES, and REMOVE_SEQUENCING_DUPLICATES; 2) what downstream tools are not showing improved results. It's possible a non-GATK tool might not respect SAM flag 1024, and thus might require duplicates be removed rather than simply tagged.

    Probably the best thing would be to provide us with the complete command they're using to invoke MarkDuplicates, as well as the downstream tool that is not showing improved results."

    If you could please provide this information, that would be helpful.

  • ymc Member

    metrics file from the 12000 run

    htsjdk.samtools.metrics.StringHeader

    MarkDuplicates INPUT=[181127_A00719_0011_AH3YJ2DRXX/1/MS-0720-138--F-TG2.1-A_S1/MS-0720-138--F-TG2.1-A_S1.sorted.bam] OUTPUT=181127_A00719_0011_AH3YJ2DRXX/1/MS-0720-138--F-TG2.1-A_S1/MS-0720-138--F-TG2.1-A_S1.marked.bam METRICS_FILE=181127_A00719_0011_AH3YJ2DRXX/1/MS-0720-138--F-TG2.1-A_S1/MS-0720-138--F-TG2.1-A_S1.metrics OPTICAL_DUPLICATE_PIXEL_DISTANCE=12000 MAX_RECORDS_IN_RAM=5000000 CREATE_INDEX=true MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true ADD_PG_TAG_TO_READS=true REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX= MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false

    htsjdk.samtools.metrics.StringHeader

    Started on: Fri Nov 30 18:47:19 HKT 2018

    METRICS CLASS picard.sam.DuplicationMetrics

    LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED SECONDARY_OR_SUPPLEMENTARY_RDS UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
    181127_A00719_0011_AH3YJ2DRXX 90353 19079462 101526 117765 69144 9646322 1872283 0.5062 12724005

    HISTOGRAM java.lang.Double

    BIN VALUE
    1.0 1.047735
    2.0 1.281637
    3.0 1.333854
    4.0 1.345512
    5.0 1.348114
    6.0 1.348695
    7.0 1.348825
    8.0 1.348854
    9.0 1.34886
    10.0 1.348862

  • ymc Member

    metrics file from my 2500 run:

    htsjdk.samtools.metrics.StringHeader

    MarkDuplicates INPUT=[181127_A00719_0011_AH3YJ2DRXX/1/MS-0720-138--F-TG2.1-A_S1/MS-0720-138--F-TG2.1-A_S1.sorted.bam] OUTPUT=181127_A00719_0011_AH3YJ2DRXX/1/MS-0720-138--F-TG2.1-A_S1/MS-0720-138--F-TG2.1-A_S1.marked.bam METRICS_FILE=181127_A00719_0011_AH3YJ2DRXX/1/MS-0720-138--F-TG2.1-A_S1/MS-0720-138--F-TG2.1-A_S1.metrics OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 MAX_RECORDS_IN_RAM=5000000 CREATE_INDEX=true MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true ADD_PG_TAG_TO_READS=true REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX= MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false

    htsjdk.samtools.metrics.StringHeader

    Started on: Wed Nov 28 12:38:34 HKT 2018

    METRICS CLASS picard.sam.DuplicationMetrics

    LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED SECONDARY_OR_SUPPLEMENTARY_RDS UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
    181127_A00719_0011_AH3YJ2DRXX 90353 19079462 101526 117765 69144 9646322 1502579 0.5062 12491962

    HISTOGRAM java.lang.Double

    BIN VALUE
    1.0 1.036749
    2.0 1.26184
    3.0 1.310711
    4.0 1.321321
    5.0 1.323625
    6.0 1.324125
    7.0 1.324233
    8.0 1.324257
    9.0 1.324262
    10.0 1.324263
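
The two metrics files above differ only in READ_PAIR_OPTICAL_DUPLICATES and ESTIMATED_LIBRARY_SIZE; READ_PAIR_DUPLICATES and PERCENT_DUPLICATION are the same. The following Python sketch shows one way the library size could follow from the other columns. It rests on an assumption about how Picard derives the estimate: that it solves c/x - 1 + exp(-n/x) = 0 for the library size x, with n = READ_PAIRS_EXAMINED - READ_PAIR_OPTICAL_DUPLICATES and c = READ_PAIRS_EXAMINED - READ_PAIR_DUPLICATES. The function name and the bisection details are illustrative only.

```python
import math


def estimate_library_size(read_pairs: int, unique_read_pairs: int) -> int:
    """Solve c/x - 1 + exp(-n/x) = 0 for x by bisection, where
    n = read_pairs and c = unique_read_pairs (assumed model)."""
    n, c = read_pairs, unique_read_pairs
    if not 0 < c < n:
        raise ValueError("need 0 < unique_read_pairs < read_pairs")

    def f(x: float) -> float:
        return c / x - 1 + math.exp(-n / x)

    lo, hi = 1.0, 100.0          # search for x between lo*c and hi*c
    while f(hi * c) > 0:         # widen the bracket until the sign changes
        hi *= 10.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if f(mid * c) > 0:       # f is still positive, so the root is larger
            lo = mid
        else:
            hi = mid
    return int(c * (lo + hi) / 2.0)


# Values copied from the two metrics files above.
READ_PAIRS_EXAMINED = 19079462
READ_PAIR_DUPLICATES = 9646322
runs = {
    2500: {"optical": 1502579, "reported": 12491962},
    12000: {"optical": 1872283, "reported": 12724005},
}

for distance, row in runs.items():
    estimate = estimate_library_size(
        READ_PAIRS_EXAMINED - row["optical"],
        READ_PAIRS_EXAMINED - READ_PAIR_DUPLICATES,
    )
    print(distance, estimate, row["reported"])

change = runs[12000]["reported"] / runs[2500]["reported"] - 1
print(f"relative change in ESTIMATED_LIBRARY_SIZE: {change:.2%}")
```

If the assumption holds, a larger optical count lowers n while c stays fixed, which raises the estimated library size without touching the set of reads marked as duplicates. Between the two files shown here, the reported estimate rises by roughly 2%.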

  • bhanuGandham Member, Administrator, Broadie, Moderator admin

    Hi @ymc

    Without knowing what downstream tool you are using, it's difficult for us to say why you are not seeing improved results. However, MarkDuplicates does not remove duplicates unless you set --REMOVE_DUPLICATES true or --REMOVE_SEQUENCING_DUPLICATES true. If you do not remove duplicates and the downstream tool does not look at the SAM flag set by MarkDuplicates, there will be no effect. Also, setting --TAGGING_POLICY All is a good idea when you want to disambiguate the optical and non-optical duplicates.

    Regards
    Bhanu
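
To check whether a downstream file still carries the duplicate marking, the SAM FLAG field and the optional DT tag can be inspected directly. The sketch below is plain Python over tab-separated SAM text; the two records are made up for illustration and the function names are hypothetical. It relies on the duplicate bit being 0x400 (1024) and on the DT tag holding "SQ" for sequencing (optical) duplicates and "LB" for library duplicates when a tagging policy other than DontTag is used.

```python
from typing import Optional

DUPLICATE_FLAG = 0x400  # 1024, "PCR or optical duplicate" bit of the SAM FLAG


def is_marked_duplicate(sam_line: str) -> bool:
    """True if the duplicate bit is set in the FLAG (second) column."""
    flag = int(sam_line.split("\t")[1])
    return bool(flag & DUPLICATE_FLAG)


def duplicate_type(sam_line: str) -> Optional[str]:
    """Return the value of the DT:Z tag if present, else None."""
    for field in sam_line.rstrip("\n").split("\t")[11:]:
        if field.startswith("DT:Z:"):
            return field[len("DT:Z:"):]
    return None


# Hypothetical records: 1107 = 1024 + 83 (duplicate), 99 has no duplicate bit.
dup = "r1\t1107\tchr1\t100\t60\t4M\t=\t50\t-54\tACGT\tFFFF\tDT:Z:SQ"
uniq = "r2\t99\tchr1\t100\t60\t4M\t=\t150\t54\tACGT\tFFFF"

for record in (dup, uniq):
    print(record.split("\t")[0],
          is_marked_duplicate(record),
          duplicate_type(record))
```

With TAGGING_POLICY=DontTag, as in the commands posted in this thread, no DT tag is written, so optical and non-optical duplicates cannot be told apart from the BAM alone.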

  • ymc Member

    When I say the downstream results are the same, I mean the bam outputs
    from MarkDuplicates are the same.

    @bhanuGandham said:
    Hi @ymc

    Without knowing what downstream tool you are using, it's difficult for us to say why you are not seeing improved results. However, MarkDuplicates does not remove duplicates unless you set --REMOVE_DUPLICATES true or --REMOVE_SEQUENCING_DUPLICATES true. If you do not remove duplicates and the downstream tool does not look at the SAM flag set by MarkDuplicates, there will be no effect. Also, setting --TAGGING_POLICY All is a good idea when you want to disambiguate the optical and non-optical duplicates.

    Regards
    Bhanu

    My downstream is RealignerTargetCreator and IndelRealigner from GATKLite-2.3.9, followed by
    BQSR from GATK4. I compared the bam files from ApplyBQSR and they are completely identical.

    I tried REMOVE_SEQUENCING_DUPLICATES=true and still the same.

    I haven't tried "--TAGGING_POLICY All". Do you think it can help?

  • bhanuGandham Member, Administrator, Broadie, Moderator admin

    Hi @ymc,

    Please apply all the changes that were suggested in previous posts and send us the revised command.

    Regards
    Bhanu

  • bhanuGandham Member, Administrator, Broadie, Moderator admin

    Hi @ymc,

    MarkDuplicates only identifies duplicate reads; it does not filter them out.
    The downstream tools we were referring to are the variant calling tools. Those tools look at the marked duplicates and disregard them in the process. You can find more information on a similar issue in this thread: https://gatkforums.broadinstitute.org/gatk/discussion/10866/duplicates-are-not-filtered-out-by-picard-2-2-1-markduplicates.
    I hope this was helpful.

    Regards
    Bhanu

  • ymc Member

    @bhanuGandham said:
    Hi @ymc,

    MarkDuplicates only identifies duplicate reads; it does not filter them out.
    The downstream tools we were referring to are the variant calling tools. Those tools look at the marked duplicates and disregard them in the process. You can find more information on a similar issue in this thread: https://gatkforums.broadinstitute.org/gatk/discussion/10866/duplicates-are-not-filtered-out-by-picard-2-2-1-markduplicates.
    I hope this was helpful.

    Regards
    Bhanu

    If the bam outputs from ApplyBQSR are completely identical when diffed,
    I don't think it is possible for any downstream tool to produce different results.
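
One reading that fits the identical BAMs and the two metrics files posted earlier (same READ_PAIR_DUPLICATES, different READ_PAIR_OPTICAL_DUPLICATES) is that the pixel distance changes only how duplicates are classified, not which reads are marked. The Python sketch below illustrates that reading with a single made-up duplicate set. It is a simplification and not Picard's algorithm: the best-scoring read is kept, every other read is flagged, and a flagged read is counted as optical if it lies within the pixel distance of the kept read.

```python
from typing import List, Set, Tuple

# (read name, base-quality score, x, y); all reads are assumed to share
# tile, mapping position and pair orientation. Values are made up.
Read = Tuple[str, int, int, int]


def mark_duplicate_set(reads: List[Read],
                       pixel_distance: int) -> Tuple[Set[str], int]:
    """Return (names flagged as duplicate, how many of them count as optical)."""
    best = max(reads, key=lambda r: r[1])          # representative read
    others = [r for r in reads if r is not best]
    flagged = {r[0] for r in others}               # independent of distance
    optical = sum(
        1 for r in others
        if abs(r[2] - best[2]) <= pixel_distance
        and abs(r[3] - best[3]) <= pixel_distance
    )
    return flagged, optical


duplicate_set = [
    ("read_a", 900, 1000, 2000),
    ("read_b", 850, 1040, 2010),    # 40 pixels from read_a
    ("read_c", 800, 2500, 2000),    # 1500 pixels from read_a
    ("read_d", 700, 9000, 9000),    # far away
]

flagged_100, optical_100 = mark_duplicate_set(duplicate_set, 100)
flagged_2500, optical_2500 = mark_duplicate_set(duplicate_set, 2500)

print(sorted(flagged_100), optical_100)
print(sorted(flagged_2500), optical_2500)
print(flagged_100 == flagged_2500)
```

In this sketch the flagged set is the same for both distances, so a BAM written with TAGGING_POLICY=DontTag and no duplicate removal would not change, while the optical count reported in the metrics would.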

  • bhanuGandham Member, Administrator, Broadie, Moderator admin
    edited December 6

    Hi @ymc

    Please apply all the changes that were suggested in previous posts and send us the revised command.
    Also please send us the bam files you are referring to.

    Regards
    Bhanu

  • ymc Member

    @bhanuGandham said:
    Hi @ymc

    Please apply all the changes that were suggested in previous posts and send us the revised command.
    Also please send us the bam files you are referring to.

    Regards
    Bhanu

    How do I send the bams? My bams are several gigabytes each.

    I can re-run with TAGGING_POLICY=All, but it can take several days. Are you
    sure BaseRecalibrator and ApplyBQSR of GATK4 can make use of the DT tag?

  • bhanuGandham Member, Administrator, Broadie, Moderator admin

    Hi @ymc,

    Please use the following link for information on how to provide us your data: https://software.broadinstitute.org/gatk/guide/article?id=1894

    Regards
    Bhanu
