Question on Article#8017, mapping reads to grch38

SystemSystem Administrator admin
This discussion was created from comments split from: (How to) Map reads to a reference with alternate contigs like GRCh38.

Comments

  • Hello,

    I have a question regarding this tutorial. I have several .fastq files. Paired illumina sequencing (two fastq files per sample). I'd like perform analysis based on this tutorial to detect only somatic (variants)

    Everything is fine for me, however what about illumina adapter clip or mark them? I didn't see any reference about this here... Shall I add additional step with MarkIlluminaAdapters?
    Moreover is there any additional step that I have to keep in mind?

    Thank you.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited September 2018

    Hi @Adam_U0,

    I've moved your question to a new thread. We are trying to encourage researchers to post their questions to fresh threads instead of adding to existing long threads.

    however what about illumina adapter clip or mark them? I didn't see any reference about this here... Shall I add additional step with MarkIlluminaAdapters?

    It's always a good idea to see the extent of short inserts, which MarkIlluminaAdapters tabulates. For the somatic tutorials, when I developed the MuTect2/Mutect2 tutorial (originally for Article#9183, which uses GATK3 MuTect2), I performed MarkIlluminaAdapters. In fact, I decided to go so far as to remove all reads with adapter sequences from the dataset before mutation calling. Again, in lieu of clipping adapter sequences out of the data, I chose to remove the adapter-read through pairs entirely. There are strong reasons to do so for somatic mutation calling, given we call low allele fraction variants and we can expect tumor samples to be heterogeneous etc. I would not want to mistake adapter sequence in reads for a somatic mutation.

    If your sequencing facility or tool-chain is not already hard-clipping adaptor sequences out, then I highly recommend that you do so for somatic calling, especially if you care about low allele fraction calls. What you can do is MarkIlluminaAdapters and remove all reads with the XT tag post-marking. If you do not care so much for low AF calls, then removing these reads is less of a concern.

    I'm not aware of any workflow we offer that removes these for you.

    Post edited by shlee on
  • Thank You for comprehensive and quick answer.
    I think that I'll mark and then remove the potential adapter sequences.

    Currently I'm testing the pipeline with data obtained from ENA database (.fastq). My goal is to decipher somatic mutation signatures for glioblastoma (signature 1, 5 and 11 - results here). If my final results will be comparable to those on the cosmic, then I'll know that workflow is correct. I have 4 WES samples, illumina paired reads, each sample - tumor and blood.
    I decided to use the following tutorials to create a complete pipeline:
    1. (How to) Map reads to a reference with alternate contigs like GRCh38 - with postalt
    2. (How to) Call somatic mutations using GATK4 Mutect2 - tumor-control mode (PoN creation)
    3. Decipher signatures using deconstructSigs.
    I hope

    Thank you again and best regards,
    Adam U

  • Dear @shlee ,
    we are looking for somatic variants only too.
    We were following:
    1) https://gatkforums.broadinstitute.org/gatk/discussion/6484
    2) https://gatkforums.broadinstitute.org/gatk/discussion/6483
    in our pre-processing.

    means we did FastqToSam > MarkIlluminaAdapters > SamToFastq > BWA > MergeBamAlignment

    isn't MarkIlluminaAdapters enough for us? I just can not understand "remove all reads with adapter sequences from the dataset before mutation calling". How to perform this technically?

    I was sure that MarkIlluminaAdapters was a safe alternative of trimming the read's ends

    Thank You in advance!

    Sergey

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    You're welcome @Adam_U0 and thank you for sharing an outline of your workflow as well as decontructSigs. I will check it out. Validating somatic calls against COSMIC somatic mutation signatures is a great idea. I just want to mention that GATK's Funcotator (in BETA) allows for annotation using COSMIC. The tool annotates with cancer mutation frequency.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @sergey_ko13,

    Last I checked (and this was a while ago) MarkIlluminaAdapters adds an XT tag that gives the 5' start of the 3' adapter sequence. Checking the code history, there haven't been changes to the how the tool works in 4-5 years. Downstream tools can then process the XT tag, e.g. SamToFastq. You may find my comment in Discussion#6304, which covers different "clipping" approaches, helpful.

    As far as I know, we do not have a tool to remove reads with the XT tag entirely from a data file. For this, if I recall correctly the steps I took back in January-February of 2017, I believe I used basic UNIX/LINUX tools like grep -v.

  • sergey_ko13sergey_ko13 Member
    edited October 2018

    Dear @shlee,

    thank You for clarification! My biggest mistake was to expect MarkIlluminaAdapters to keep adapter seqences away from downstream tests including variant calling.
    So we were working with raw untrimmed reads and now it turns pretty unsafe.

    Following https://gatkforums.broadinstitute.org/gatk/discussion/6483 we left MergeBamAlignemnt CLIP_ADAPTERS=false.
    1) Isnt mutect2 later aware of XT tag or it considers mismatches - adapter vs reference - as a variant?

    As we are working with exome sequencing 8% of targeted genomic intervals are shorter than 150bp (S31285117_Covered.bed says, Agilent SureSelect).
    So we have to deal with DNA fragments that are shorter than the read length and we have to keep that reads (grepping out XT-ed reads will ignore >8% of WES targets).

    2) If we trim (trimmomatic) adapter sequences away from read's end and have a read pair completely covering, let's say, 100bp long DNA fragment - how does GATK4 deal with reverse read being complete complement of a front read end-to-end?
    Does MergeBamAlignemnt CLIP_OVERLAPPING_READS=true just clip both read mates in the middle? Is mutect2 later aware of such softclip?

    With Kind Regards,
    Sergey

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @sergey_ko13,

    I have an issue ticket I opened over a year ago asking for a feature in GATK to convert XT tagged base qualities to two at https://github.com/broadinstitute/gatk/issues/3540. If you have ideas on how best you'd like to manage XT tagged data, then I encourage you to voice your needs in the same issue ticket. As you can see from the description in the ticket, none of our variant/mutation callers handle the XT tag.

    As for your second question:

    2) If we trim (trimmomatic) adapter sequences away from read's end and have a read pair completely covering, let's say, 100bp long DNA fragment - how does GATK4 deal with reverse read being complete complement of a front read end-to-end?

    Picard/GATK has logic built in to reduce the evidence it uses to one read in overlapping pairs. Specifically, MergeBamAlignment (--CLIP_OVERLAPPING_READS default is true) has this logic (unmaps one of the reads in the pair if the overlap is complete), and, if I recall correctly, HaplotypeCaller (and therefore MuTect2/Mutect2, which uses the same assembly machinery) should have such logic (to only use one of the reads in the overlapping pair towards calling).

Sign In or Register to comment.