The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Powered by Vanilla. Made with Bootstrap.
GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.
Register now for the upcoming GATK Best Practices workshop, Feb 20-22 in Leuven, Belgium. Open to all comers! More info and signup at http://bit.ly/2i4mGxz

(How to) Generate an unmapped BAM from FASTQ or aligned BAM

shleeshlee CambridgeMember, Administrator, Broadie, Moderator, Dev Posts: 422 admin
edited July 2016 in Tutorials


image Here we outline how to generate an unmapped BAM (uBAM) from either a FASTQ or aligned BAM file. We use Picard's FastqToSam to convert a FASTQ (Option A) or Picard's RevertSam to convert an aligned BAM (Option B).

Jump to a section on this page

(A) Convert FASTQ to uBAM and add read group information using FastqToSam
(B) Convert aligned BAM to uBAM and discard problematic records using RevertSam

Tools involved

Prerequisites

  • Installed Picard tools

Download example data

Tutorial data reads were originally aligned to the advanced tutorial bundle's human_g1k_v37_decoy.fasta reference and to 10:91,000,000-92,000,000.

Related resources

  • Our dictionary entry on read groups discusses the importance of assigning appropriate read group fields that differentiate samples and factors that contribute to batch effects, e.g. flow cell lane. Be sure your read group fields conform to the recommendations.
  • This post provides an example command for AddOrReplaceReadGroups.
  • This How to is part of a larger workflow and tutorial on (How to) Efficiently map and clean up short read sequence data.
  • To extract reads in a genomic interval from the aligned BAM, use GATK's PrintReads.
  • In the future we will post on how to generate a uBAM from BCL data using IlluminaBasecallsToSAM.

(A) Convert FASTQ to uBAM and add read group information using FastqToSam

Picard's FastqToSam transforms a FASTQ file to an unmapped BAM, requires two read group fields and makes optional specification of other read group fields. In the command below we note which fields are required for GATK Best Practices Workflows. All other read group fields are optional.

java -Xmx8G -jar picard.jar FastqToSam \
    FASTQ=6484_snippet_1.fastq \ #first read file of pair
    FASTQ2=6484_snippet_2.fastq \ #second read file of pair
    OUTPUT=6484_snippet_fastqtosam.bam \
    READ_GROUP_NAME=H0164.2 \ #required; changed from default of A
    SAMPLE_NAME=NA12878 \ #required
    LIBRARY_NAME=Solexa-272222 \ #required 
    PLATFORM_UNIT=H0164ALXX140820.2 \ 
    PLATFORM=illumina \ #recommended
    SEQUENCING_CENTER=BI \ 
    RUN_DATE=2014-08-20T00:00:00-0400

Some details on select parameters:

  • For paired reads, specify each FASTQ file with FASTQ and FASTQ2 for the first read file and the second read file, respectively. Records in each file must be queryname sorted as the tool assumes identical ordering for pairs. The tool automatically strips the /1 and /2 read name suffixes and adds SAM flag values to indicate reads are paired. Do not provide a single interleaved fastq file, as the tool will assume reads are unpaired and the SAM flag values will reflect single ended reads.
  • For single ended reads, specify the input file with FASTQ.
  • QUALITY_FORMAT is detected automatically if unspecified.
  • SORT_ORDER by default is queryname.
  • PLATFORM_UNIT is often in run_barcode.lane format. Include if sample is multiplexed.
  • RUN_DATE is in Iso8601 date format.

Paired reads will have SAM flag values that reflect pairing and the fact that the reads are unmapped as shown in the example read pair below.

Original first read

@H0164ALXX140820:2:1101:10003:49022/1
ACTTTAGAAATTTACTTTTAAGGACTTTTGGTTATGCTGCAGATAAGAAATATTCTTTTTTTCTCCTATGTCAGTATCCCCCATTGAAATGACAATAACCTAATTATAAATAAGAATTAGGCTTTTTTTTGAACAGTTACTAGCCTATAGA
+
-FFFFFJJJJFFAFFJFJJFJJJFJFJFJJJ<<FJJJJFJFJFJJJJ<JAJFJJFJJJJJFJJJAJJJJJJFFJFJFJJFJJFFJJJFJJJFJJFJJFJAJJJJAJFJJJJJFFJJ<<<JFJJAFJAAJJJFFFFFJJJAJJJF<AJFFFJ

Original second read

@H0164ALXX140820:2:1101:10003:49022/2
TGAGGATCACTAGATGGGGGAGGGAGAGAAGAGATGTGGGCTGAAGAACCATCTGTTGGGTAATATGTTTACTGTCAGTGTGATGGAATAGCTGGGACCCCAAGCGTCAGTGTTACACAACTTACATCTGTTGATCGACTGTCTATGACAG
+
AA<FFJJJAJFJFAFJJJJFAJJJJJ7FFJJ<F-FJFJJJFJJFJJFJJF<FJJA<JF-AFJFAJFJJJJJAAAFJJJJJFJJF-FF<7FJJJJJJ-JA<<J<F7-<FJFJJ7AJAF-AFFFJA--J-F######################

After FastqToSam

H0164ALXX140820:2:1101:10003:49022      77      *       0       0       *       *       0       0       ACTTTAGAAATTTACTTTTAAGGACTTTTGGTTATGCTGCAGATAAGAAATATTCTTTTTTTCTCCTATGTCAGTATCCCCCATTGAAATGACAATAACCTAATTATAAATAAGAATTAGGCTTTTTTTTGAACAGTTACTAGCCTATAGA -FFFFFJJJJFFAFFJFJJFJJJFJFJFJJJ<<FJJJJFJFJFJJJJ<JAJFJJFJJJJJFJJJAJJJJJJFFJFJFJJFJJFFJJJFJJJFJJFJJFJAJJJJAJFJJJJJFFJJ<<<JFJJAFJAAJJJFFFFFJJJAJJJF<AJFFFJ RG:Z:H0164.2
H0164ALXX140820:2:1101:10003:49022      141     *       0       0       *       *       0       0       TGAGGATCACTAGATGGGGGAGGGAGAGAAGAGATGTGGGCTGAAGAACCATCTGTTGGGTAATATGTTTACTGTCAGTGTGATGGAATAGCTGGGACCCCAAGCGTCAGTGTTACACAACTTACATCTGTTGATCGACTGTCTATGACAG AA<FFJJJAJFJFAFJJJJFAJJJJJ7FFJJ<F-FJFJJJFJJFJJFJJF<FJJA<JF-AFJFAJFJJJJJAAAFJJJJJFJJF-FF<7FJJJJJJ-JA<<J<F7-<FJFJJ7AJAF-AFFFJA--J-F###################### RG:Z:H0164.2

back to top


(B) Convert aligned BAM to uBAM and discard problematic records using RevertSam

We use Picard's RevertSam to remove alignment information and generate an unmapped BAM (uBAM). For our tutorial file we have to call on some additional parameters that we explain below. This illustrates the need to cater the tool's parameters to each dataset. As such, it is a good idea to test the reversion process on a subset of reads before committing to reverting the entirety of a large BAM. Follow the directions in this How to to create a snippet of aligned reads corresponding to a genomic interval.

We use the following parameters.

java -Xmx8G -jar /path/picard.jar RevertSam \
    I=6484_snippet.bam \
    O=6484_snippet_revertsam.bam \
    SANITIZE=true \ 
    MAX_DISCARD_FRACTION=0.005 \ #informational; does not affect processing
    ATTRIBUTE_TO_CLEAR=XT \
    ATTRIBUTE_TO_CLEAR=XN \
    ATTRIBUTE_TO_CLEAR=AS \ #Picard release of 9/2015 clears AS by default
    ATTRIBUTE_TO_CLEAR=OC \
    ATTRIBUTE_TO_CLEAR=OP \
    SORT_ORDER=queryname \ #default
    RESTORE_ORIGINAL_QUALITIES=true \ #default
    REMOVE_DUPLICATE_INFORMATION=true \ #default
    REMOVE_ALIGNMENT_INFORMATION=true #default

To process large files, also designate a temporary directory.

    TMP_DIR=/path/shlee #sets environmental variable for temporary directory

We invoke or change multiple RevertSam parameters to generate an unmapped BAM

  • We remove nonstandard alignment tags with the ATTRIBUTE_TO_CLEAR option. Standard tags cleared by default are NM, UQ, PG, MD, MQ, SA, MC, and AS tags (AS for Picard releases starting 9/2015). Additionally, the OQ tag is removed by the default RESTORE_ORIGINAL_QUALITIES parameter. Remove all other nonstandard tags by specifying each with the ATTRIBUTE_TO_CLEAR option. For example, we clear the XT tag using this option for our tutorial file so that it is free for use by other tools, e.g. MarkIlluminaAdapters. To list all tags within a BAM, use the command below.

    samtools view input.bam | cut -f 12- | tr '\t' '\n' | cut -d ':' -f 1 | awk '{ if(!x[$1]++) { print }}' 
    

    For the tutorial file, this gives RG, OC, XN, OP and XT tags as well as those removed by default. We remove all of these except the RG tag. See your aligner's documentation and the Sequence Alignment/Map Format Specification for descriptions of tags.

  • Additionally, we invoke the SANITIZE option to remove reads that cause problems for certain tools, e.g. MarkIlluminaAdapters. Downstream tools will have problems with paired reads with missing mates, duplicated records, and records with mismatches in length of bases and qualities. Any paired reads file subset for a genomic interval requires sanitizing to remove reads with lost mates that align outside of the interval.

  • In this command, we've set MAX_DISCARD_FRACTION to a more strict threshold of 0.005 instead of the default 0.01. Whether or not this fraction is reached, the tool informs you of the number and fraction of reads it discards. This parameter asks the tool to additionally inform you of the discarded fraction via an exception as it finishes processing.

    Exception in thread "main" picard.PicardException: Discarded 0.787% which is above MAX_DISCARD_FRACTION of 0.500%  
    

Some comments on options kept at default:

  • SORT_ORDER=queryname
    For paired read files, because each read in a pair has the same query name, sorting results in interleaved reads. This means that reads in a pair are listed consecutively within the same file. We make sure to alter the previous sort order. Coordinate sorted reads result in the aligner incorrectly estimating insert size from blocks of paired reads as they are not randomly distributed.

  • RESTORE_ORIGINAL_QUALITIES=true
    Restoring original base qualities to the QUAL field requires OQ tags listing original qualities. The OQ tag uses the same encoding as the QUAL field, e.g. ASCII Phred-scaled base quality+33 for tutorial data. After restoring the QUAL field, RevertSam removes the tag.

  • REMOVE_ALIGNMENT_INFORMATION=true will remove program group records and alignment flag and tag information. For example, flags reset to unmapped values, e.g. 77 and 141 for paired reads. The parameter also invokes the default ATTRIBUTE_TO_CLEAR parameter which removes standard alignment tags. RevertSam ignores ATTRIBUTE_TO_CLEAR when REMOVE_ALIGNMENT_INFORMATION=false.

Below we show below a read pair before and after RevertSam from the tutorial data. Notice the first listed read in the pair becomes reverse-complemented after RevertSam. This restores how reads are represented when they come off the sequencer--5' to 3' of the read being sequenced.

For 6484_snippet.bam, SANITIZE removes 2,202 out of 279,796 (0.787%) reads, leaving us with 277,594 reads.

Original BAM

H0164ALXX140820:2:1101:10003:23460  83  10  91515318    60  151M    =   91515130    -339    CCCATCCCCTTCCCCTTCCCTTTCCCTTTCCCTTTTCTTTCCTCTTTTAAAGAGACAAGGTCTTGTTCTGTCACCCAGGCTGGAATGCAGTGGTGCAGTCATGGCTCACTGCCGCCTCAGACTTCAGGGCAAAAGCAATCTTTCCAGCTCA :<<=>@AAB@AA@AA>6@@A:>,*@A@<@??@8?9>@==8?:?@?;?:><??@>==9?>8>@:?>>=>;<==>>;>?=?>>=<==>>=>9<=>??>?>;8>?><?<=:>>>;4>=>7=6>=>>=><;=;>===?=>=>>?9>>>>??==== MC:Z:60M91S MD:Z:151    PG:Z:MarkDuplicates RG:Z:H0164.2    NM:i:0  MQ:i:0  OQ:Z:<FJFFJJJJFJJJJJF7JJJ<F--JJJFJJJJ<J<FJFF<JAJJJAJAJFFJJJFJAFJAJJAJJJJJFJJJJJFJJFJJJJFJFJJJJFFJJJJJJJFAJJJFJFJFJJJFFJJJ<J7JJJJFJ<AFAJJJJJFJJJJJAJFJJAFFFFA    UQ:i:0  AS:i:151

H0164ALXX140820:2:1101:10003:23460  163 10  91515130    0   60M91S  =   91515318    339 TCTTTCCTTCCTTCCTTCCTTGCTCCCTCCCTCCCTCCTTTCCTTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTCCCCTCTCCCACCCCTCTCTCCCCCCCTCCCACCC :0;.=;8?7==?794<<;:>769=,<;0:=<0=:9===/,:-==29>;,5,98=599;<=########################################################################################### SA:Z:2,33141573,-,37S69M45S,0,1;    MC:Z:151M   MD:Z:48T4T6 PG:Z:MarkDuplicates RG:Z:H0164.2    NM:i:2  MQ:i:60 OQ:Z:<-<-FA<F<FJF<A7AFAAJ<<AA-FF-AJF-FA<AFF--A-FA7AJA-7-A<F7<<AFF###########################################################################################    UQ:i:49 AS:i:50

After RevertSam

H0164ALXX140820:2:1101:10003:23460  77  *   0   0   *   *   0   0   TGAGCTGGAAAGATTGCTTTTGCCCTGAAGTCTGAGGCGGCAGTGAGCCATGACTGCACCACTGCATTCCAGCCTGGGTGACAGAACAAGACCTTGTCTCTTTAAAAGAGGAAAGAAAAGGGAAAGGGAAAGGGAAGGGGAAGGGGATGGG AFFFFAJJFJAJJJJJFJJJJJAFA<JFJJJJ7J<JJJFFJJJFJFJFJJJAFJJJJJJJFFJJJJFJFJJJJFJJFJJJJJFJJJJJAJJAJFAJFJJJFFJAJAJJJAJ<FFJF<J<JJJJFJJJ--F<JJJ7FJJJJJFJJJJFFJF< RG:Z:H0164.2

H0164ALXX140820:2:1101:10003:23460  141 *   0   0   *   *   0   0   TCTTTCCTTCCTTCCTTCCTTGCTCCCTCCCTCCCTCCTTTCCTTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCTCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCTTCCCCTCTCCCACCCCTCTCTCCCCCCCTCCCACCC <-<-FA<F<FJF<A7AFAAJ<<AA-FF-AJF-FA<AFF--A-FA7AJA-7-A<F7<<AFF########################################################################################### RG:Z:H0164.2

back to top


Post edited by shlee on

Comments

  • aneekaneek Member Posts: 82
    edited April 2016

    Hi,

    How to perform FastqToSam function in picardtools when the fastq files are not interleaved i.e. forward (R1) and reverse (R2) fastq files separately. Picardtools is not taking two FASTQ inputs.

    Error message: "Option 'FASTQ' cannot be specified more than once."

  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 4,443 admin

    @aneek
    Hi,

    If you look at the FastqToSam documentation, you will find your answer.

    -Sheila

  • aneekaneek Member Posts: 82
    edited April 2016

    @Sheila

    Thank you very much. I have to use FASTQ2 for the 2nd read.

    Another thing I would like to ask is, whether there is any need to use FastQC and Trimmomatic to correct the quality of the raw data and remove the bad reads, if I use these steps of making a clean BAM from short read sequence recommended by GATK.

    Also can I use threading option (-t) in all GATK commands to increase the speed of the tasks.

    Thank you.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 4,443 admin

    @aneek
    Hi,

    No, it is best to stick to the Best Practices. There should be no need to correct for anything before doing the Best Practices pre-processing steps.

    You can find which tools accept -nt in the tool documentation.

    -Sheila

  • aneekaneek Member Posts: 82

    @Sheila

    Hi, Thank you very much for all the informations.

  • MUHAMMADSOHAILRAZAMUHAMMADSOHAILRAZA Beijing Institute of Genomics, CASMember Posts: 111
    edited July 2016

    @Sheila

    Hi,
    I cannot download the example data set "tutorial_6484_RevertSam.tar.gz"/"tutorial_6484_FastqToSam.tar.gz" from above links..
    In china, Because we don't have complete access to google, may be due to that reason i cannot download the data from google drive.. could you please make it available somewhere else? Like as you do for "bundle" as an "example" directory..

    Thanks

  • shleeshlee CambridgeMember, Administrator, Broadie, Moderator, Dev Posts: 422 admin
    edited July 2016

    Hi @MUHAMMADSOHAILRAZA,

    I've uploaded the tutorial datasets to the ftp site. You can find the data for this and other tutorials at: ftp://gsapubftp-anonymous@ftp.broadinstitute.org/tutorials/datasets/. Like with the bundle, remember to keep the password field blank.

  • MUHAMMADSOHAILRAZAMUHAMMADSOHAILRAZA Beijing Institute of Genomics, CASMember Posts: 111
  • MUHAMMADSOHAILRAZAMUHAMMADSOHAILRAZA Beijing Institute of Genomics, CASMember Posts: 111

    @shlee

    Hi,
    I am curious Why RG tag is absent for the first read of above "After RevertSam" results section?
    and could you please explain what does 77 and 141 values represent in column 2?

    Thanks

  • shleeshlee CambridgeMember, Administrator, Broadie, Moderator, Dev Posts: 422 admin

    @MUHAMMADSOHAILRAZA,

    You can check yourself why this may be with the command:

    samtools view 6484_snippet_revertsam.bam | grep 'H0164ALXX140820:2:1101:10003:23460'
    

    Then you'll see that both reads should have an RG tag. I've updated the example--thanks for pointing this out.

    For an explanation of SAM flags, I refer you to the following:

  • huangkhuangk NYMember Posts: 15
    edited September 2016

    Hi, I tried to convert my paired-end fastqs to SAM using FastqToSam, but got the error: "In paired mode, read name 1 (HWI-D00377:30:H8EJDADXX:1:2209:8491:93586) does not match read name 2 (HWI-D00377:30:H8EJDADXX:1:2104:20024:30303)". Is there anyway to fix it? Thanks!

    Issue · Github
    by Sheila

    Issue Number
    1302
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    sooheelee
  • shleeshlee CambridgeMember, Administrator, Broadie, Moderator, Dev Posts: 422 admin

    This question has been answered here.

  • aborabor SwitzerlandMember Posts: 9

    Hi,
    I am using the script reported in paragraph "(B) Convert aligned BAM to uBAM and discard problematic records using RevertSam"
    As suggested, I set the SANITIZE option as true (indeed, the process has problems already at the revertsam-->revertmark step when setting it as false).
    However, for some older bam files, I loose as much as half of the reads. Meaning that my interleaved fastq file as well as my remapped bam is much smaller in size and script number. Is there a way to circumvent such a big loss?

    Thanks!

  • shleeshlee CambridgeMember, Administrator, Broadie, Moderator, Dev Posts: 422 admin

    Hi @abor,

    When you say you lose up to half of the reads, do you mean alignment records or unique reads? I ask because an aligned BAM can represent the same read in multiple alignment records and the reversion to uBAM has the effect of paring these down to one unique read.

    Can you also describe further the problem you are running into? It would be helpful to us if you can post the tool error message.

  • aborabor SwitzerlandMember Posts: 9

    Hi, thanks. the err file reports:
    "INFO 2016-11-25 15:41:50 RevertSam Discarded 23263512 out of 61936806 (37.560%) reads in order to sanitize output."

    Consistently, the read count in my original bam file is 61,955,464, in my interleaved fastq is 38673294 and in my final recalibrated bam (after realignment/bqsr) is 38690072

  • shleeshlee CambridgeMember, Administrator, Broadie, Moderator, Dev Posts: 422 admin
    edited November 2016

    Ok, @abor. So I think the question becomes why do you want to save the discarded reads? Consider what property of the discarded reads caused them to be discarded given SANITIZE discards paired reads with missing mates, duplicated records, and records with mismatches in length of bases and qualities. Then consider what impact these reads have on your analyses. E.g. do the remaining reads provide sufficient coverage or do you need all the coverage you can salvage? Do the discarded reads qualitatively differ, e.g. have lower base qualities, than the remaining reads?

    I think if they are qualitatively equivalent, then only the paired reads with missing mates would be worth salvaging. If this is the case, then we'd have to think about how to retain such reads to the exclusion of the other sanitized reads. So you would first RevertSam without sanitizing. Then, I believe you can filter records with mismatches in length of bases and qualities using GATK's MalformedReadFilter's optional parameter --filter_mismatching_base_and_quals. Because your BAM at this point is unaligned, and therefore not coordinate-sorted, you have to run PrintReads in --unsafe mode and discard the bai index that the tool outputs. As for duplicated records, a UNIX sort and uniq should remove these.

    For example, for your unaligned uBAM:

    samtools view ubam.bam | sort | uniq > ubam_uniquify.sam
    

    Will print the unique reads to a SAM format file. You'll have to reattach your header before converted again to BAM format. Then:

    java -jar GenomeAnalysisTK.jar -T PrintReads \
        -I uniquify_ubam.bam \
        -o ubam_filter_mismatching_base_and_quals.bam \
        -R your_reference.fasta \
        --filter_mismatching_base_and_quals \
        --unsafe
    

    Also, take a look at the ValidateSamFile document. I hope this is helpful.

    Post edited by shlee on
  • aborabor SwitzerlandMember Posts: 9
    edited January 17

    Hi @shlee, sorry for the long delay.
    I went through the steps you suggest: on my original BAM file I use RevertSam without SANITIZE, then use sort/uniq and then PrintReads with --filter_mismatching_base_and_quals --unsafe. The number of reads remains exactly the same I had in the original BAM; in addition, running ValidateSamFile MODE=SUMMARY gives me the following error: ERROR:MATE_NOT_FOUND, and the number of reads with this error is exactly the same number of reads that are discarded when I use RevertSam with SANITIZE=true. Turns out that the discarded reads are reads with missing mates (that were, however, regularly mapped in my original BAM), and that I would like to keep, given that they are as much as 15-45%of total reads, for some samples. However, the MarkIlluminaAdapters tool doesn't process the uBam file unless they have been discarded. Do you think there may be a solution for this step to work properly also in the presence of reads with missing pairs?
    Thanks

    Post edited by abor on
  • aborabor SwitzerlandMember Posts: 9
    edited January 17

    Plus, I wonder why this happens. When using Bam files in which the RealignerTargetCreator and BaseRecalibrator steps are performed through the whole genome, I can pass the Bam files again through the pipeline (RevertSam, MarkIlluminaAdapters, SamToFasq, BWA, MarkDuplicates and GATK RealignerTargetCreator/BaseRecalibrator) several times without loosing any reads; instead, I notice that when using the -L option on the bait regions at the RealignerTargetCreator and BaseRecalibrator steps, reads are discarded if the BAMs are subsequently re-passed through RevertSam.
    Of course I don't need to re-run RevertSam several times, but I could need to do it in the future. So the question is whether the -L option at RealignerTargetCreator and BaseRecalibrator steps introduces any tag in the read so that it is then discarded as "missing mate" when passed again through RevertSam. Hope to have been clear enough.
    Thanks again for your help

    Post edited by abor on
  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 11,118 admin
    Removing reads that cause validation errors is what SANITIZE=TRUE is for. If you don't want to remove them you need to disable that option -- but then some other tools may refuse to run on that data. Our supported pipelines all discard unpaired reads as a quality assurance measure.

    When you use -L in the preprocessing steps, do you use it also with IndelRealigner and PrintReads?

    Geraldine Van der Auwera, PhD

  • aborabor SwitzerlandMember Posts: 9
    edited January 17

    Edited post: no, -L was used in RealignerTargetCreator, BaseRecalibrator and PrintReads, but not in IndelRealigner

  • aborabor SwitzerlandMember Posts: 9

    The question is why reads that were previously successfully mapped (they are mapped in the BAM file I use as input for RevertSam) are detected as unpaired reads and discarded

  • shleeshlee CambridgeMember, Administrator, Broadie, Moderator, Dev Posts: 422 admin
    edited January 20

    Hi @abor,

    Since you mention that you are using older BAM files, I assume you don't have access to the original FASTQ files and the missing mates are forever gone. Is this true?

    It's my understanding that aligners such as BWA align each read independently, even ends of mated reads. That GATK tools have built in quality control features that are different from that of aligners is to a user's benefit. At the least, you become aware of the quality of the data you are working with and can come to a decision on how to proceed with the discrepant data.

    In answer to your question,

    Do you think there may be a solution for this step to work properly also in the presence of reads with missing pairs?

    my answer is yes, there is a solution. My understanding is that you wish to (i) process data through MarkIlluminaAdapters as well as (ii) keep the mate not found reads. Roughly, I think you can separate your paired reads and mate not found reads and process them as PE and SE reads separately, respectively. For example, because processing through MarkIlluminaAdapters only makes sense for PE reads (see Tutorial#6483 for why), you would process only the PE BAM through this step and skip this step for the SE BAM. Sorry, it appears MarkIlluminaAdapters has a parameter to process SE reads (MIN_MATCH_BASES_SE).

    To merge the reads together for joint processing for downstream steps, well, here's where you'd have to do some exploration of tools. I think you would use either MergeSamFiles or MergeBamAlignment. Tutorial#6483 gives details on MergeBamAlignment. It will take as input multiple ALIGNED_BAM files, so you can provide it both the aligned PE BAM and aligned SE BAM.

    I hope this is helpful. Let us know what ends up working for you and if you have additional questions.

    Post edited by shlee on
  • dnousomednousome Member Posts: 3

    Hi all,

    I current have some RNA-Seq Tophat aligned files (no original fastq), both the Aligned bam and unmapped reads files and wanted to generate a fastq to align with another aligner (STAR or Salmon).

    Do you think I should try to merge the Aligned and unmapped files and then perform the conversion?

    I have a feeling that might not be correct because it would insert potential reads with no mate. Otherwise I'm a little stuck as to what to do with the unmapped file.

    Thanks for any help!
    Darryl

    Issue · Github
    by Sheila

    Issue Number
    1648
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    sooheelee
  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 11,118 admin
    @abor When you use -L with PrintReads, you're telling GATK to throw away any reads outside of those intervals. Then any reads inside the intervals that had mates outside become mate-not-founds.

    Geraldine Van der Auwera, PhD

  • shleeshlee CambridgeMember, Administrator, Broadie, Moderator, Dev Posts: 422 admin

    Hi Darryl ( @dnousome ),

    Whichever file contains the complete set of read pairs, you can use this particular file. STAR takes FASTQ files and I assume Salmon also takes FASTQ files. If the file you start with is the aligned BAM, you can revert using RevertSam, convert to FASTQ and then align with your aligner of choice. If it is the uBAM, then you can proceed with converting to FASTQ and then align. If you wish to pipe conversion to FASTQ and the alignment step, which allows you to avoid storing a FASTQ file, then checkout the workflow detailed in Tutorial#6483 for BWA alignment. I believe the concepts should still apply to your case. The tutorial then explains how and why you would use MergeBamAlignment.

    I hope this is helpful.

  • dnousomednousome Member Posts: 3

    @shlee

    Most definitely! Thanks for your help.

  • shleeshlee CambridgeMember, Administrator, Broadie, Moderator, Dev Posts: 422 admin

    @abor, I've modified my last answer to you above.

  • shleeshlee CambridgeMember, Administrator, Broadie, Moderator, Dev Posts: 422 admin

    Update to RevertSam

    An issue described by this forum question has been fixed by a Picard repo code change documented here as a pull request and here as the related github issue. The effect of this code change is that when encountering mate-missing reads in a PE data file, RevertSam will now remove all mate information and thereby effectively turn mate-missing records into SE reads.

Sign In or Register to comment.