We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

Why is converting from fastq to uBAM nesessary before preprocessing?

Hi Everyone,

I am brand new to this so please go easy on me. I have just taken over a project where we are going to be doing variant calling on a large number of human samples. I have inherited a number of scripts that are at least a few years old. I decided I wanted to follow the GATK best practices while noting the differences between the them and the scripts I have. I'm currently trying to push a single family (5 individuals) through the pipeline before applying it to all of the other samples I have.

So, first of all, all of my raw data is stored as paired-end reads in fastq format, I have no uBAM files available to me. According to the data pre-processing for variant discovery steps, the "reference implementations expect the read data to be input in unmapped BAM (uBAM) format. Conversion utilities are available to convert from FASTQ to uBAM." So the first thing I did was use FastqToSam to do the conversion yesterday. This is not an insignificant task, I ran each sample for my test family in parallel and it took roughly 5 hours.

I understand the benefit of using uBAM from the get-go (keeping some metadata that is discarded in fastq as described here: https://gatkforums.broadinstitute.org/gatk/discussion/5990/what-is-ubam-and-why-is-it-better-than-fastq-for-storing-unmapped-sequence-data), but I don't see the benefit of doing this conversion if the first step of the alignment is to convert this uBAM back to fastq before running bwa mem and samtools view. The next step would be to use MergeBamAlignments to merge the mapped and unmapped alignments which I guess I couldn't do if I did not do the original fastq->uBAM conversion.

Basically, my question is if the initial conversion from fastq to uBam is necessary or even recommended in this case. I don't see how it could have any added benefit and converting from and to fastq will incur a significant overhead. For what it's worth, the script I inherited simply ran 'bwa mem' on the paired-end reads and piped the output into 'samtools view -bh' to create the aligned BAM file. From here they would move on to the marking of duplicates. If I don't convert to uBAM and then skip the MergeBamAlignments, will that have an impact on me being able to apply the best practices down the line? I want to stick as close to the best practices as I possibly can, but If I can cut out some unnecessary computation time then that would be great.


Best Answers


  • aschoenraschoenr Member

    Thanks for answering.

    So the fact that I'm converting from fastq to uBAM in the first place doesn't affect this at all? I would've thought that a lot of the important metadata would've already been lost since I don't have access to the BAM files straight from illumina. However, what you are saying sounds like it would still be worthwhile and would not be possible without first converting the fastq files to uBAM.

    Again, I am very new to this so I don't exactly have all of the background knowledge. Thanks again for answering.

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    You can map your reads to your reference using bwa. you don't need to have anything from illumina other than your raw reads or raw basecalls. So generating unmapped bam and merging that to the first alignment from bwa is to restore all the information and add metadata on the alignment file. After that just continue with merged aligned file only.

  • aschoenraschoenr Member

    Great, thanks a lot. I guess I was just confused as to what the benefit was going from fastq->uBAM->fastq at the start. Correct me if I'm wrong but the benefit comes with the merging of the mapped and unmapped BAMs after alignment which restores some of the data lost during alignment to our aligned BAM file. I think I'm just restating what you've said, I just want to be sure I understand.


  • aschoenraschoenr Member

    Thanks again! I just came across this page that I am finding very useful (in case anyone else ends up in this thread): https://gatkforums.broadinstitute.org/gatk/discussion/6483/how-to-map-and-clean-up-short-read-sequence-data-efficiently

  • Evan_BennEvan_Benn Member

    Skywarrior, would it not be more straightforward to modify bwa-mem to preserve that information in the first place? FQ -> Bam -> FQ -> Merge is a significant overhead, many hours per genome, yet from this discussion it seems to just be a workaround for some undesired behaviour in bwa-mem. Correct me if I am wrong.

    I have looked at the bwa-mem c source code and I understand the preference for a java team to just create a new java tool, but this seems short sighted in the long run.

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    BWA hardclips reads if there is a significant discordance between the best matching kmer and the read. These hardclips may end up costing you a particular structural variant or a true indel call. Merging unmapped bam and initial alignment restores the hardclips which I know of no solution for that in BWA parameters.

    My workflow goes as follows

    FastQ --> BWA - initial map
    FastQ --> Picard - uBAM
    uBAM + initial map --> Picard - Merge.

    This may seem like an overhead but a necessary one for me.

  • foxDie00foxDie00 Member

    @SkyWarrior said:
    My workflow goes as follows

    FastQ --> BWA - initial map
    FastQ --> Picard - uBAM
    uBAM + initial map --> Picard - Merge.

    I needed it spelled out for me like this. Thank you. The GATK preprocessing instructions are very unclear and made it sound like I needed to align the uBAM with BWA...

  • Evan_BennEvan_Benn Member

    @foxDie00 said:

    made it sound like I needed to align the uBAM with BWA...

    That is how the broad best practice workflows operate.


    ubam input, converted on the fly to fastq and piped to bwa.

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    Still fastq is needed for bwa. On the fly conversion of uBAM to fastq is necessary.
    Some people can get their data in the form of uBAMs out of their sequencing service but some of us still use the good old bcl2fastq. That's why.

  • I am trying to follow the gatk4 wdl script offered by board institute github page to make things easier. yes, it takes a list of ubam files as input.
    I am trying to transfer NA12891 and NA12892 fastq file to two lists of ubam files which like the list of NA12878 ubam files offered by board.
    I was wondering how could we splite the ubam file into a list of ubams like board did?

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    HI @Angry_Panda

    This is a question for the WDL team and they will be able to answer this question, as it is specific to the WDL script you are using. @SChaluvadi , who is our WDL expert will be able to help you out with this.


  • SChaluvadiSChaluvadi Member, Broadie, Moderator admin

    @Angry_Panda Can you please share the name and/or link of the wdl script you are referring to?

  • Hi @SChaluvadi

    I use this wdl script: https://github.com/gatk-workflows/gatk4-data-processing/blob/master/processing-for-variant-discovery-gatk4.wdl
    It work well with the NA12878 small ubam files.

    But I cannot find detailed info about the ubam files, such as its fastq, size, sequence depth, read length etc. the same situation for (https://console.cloud.google.com/storage/browser/broad-public-datasets/NA12878/unmapped/). Does Board have some documents about this?

    I tried to go through this pipeline with 3 samples, NA12878, 12891 and 12892. It seems like I can download fastq file (1 and 2) from 1000 Genomes Project data. and transfer it to ubam with java -Xmx8G -jar picard.jar FastqToSam \ but I dont know how to broke up it into ubam list via RG like board did to make it can suit for this wdl script.

    Also, does board offered some benchmarking dataset for germline-snps-indels pipeline ?


  • SChaluvadiSChaluvadi Member, Broadie, Moderator admin

    You can derive the RG information from the read names and then continue with the FastqtoSam step. Here is a link that shows how to derive RG fields' information from read names as well as a forum post that has further information on how you can add/replace read group using AddOrReplaceReadGroups. This should help you add read groups to your uBAM such that they will be compatible with the WDL script that you are using.

    I am looking into if/how to find more detailed information about the uBAM files that you have pointed to and will get back to you! Can you also elaborate on what you are looking for in way of benchmarking data?

  • Angry_PandaAngry_Panda Member
    edited October 2018

    Hi @SChaluvadi
    Thx very much for your answer.
    I am trying to do it.

    I am wondering whether the list of ubams I listed above: (https://console.cloud.google.com/storage/browser/broad-public-datasets/NA12878/unmapped/) is exactly the one which Board transferred from ftp:/­/­ftp.­sra.­ebi.­ac.­uk/­vol1/­fastq/­ERR194/­ERR194147/­ERR194147_1.­fastq.­gz and ftp:/­/­ftp.­sra.­ebi.­ac.­uk/­vol1/­fastq/­ERR194/­ERR194147/­ERR194147_2.­fastq.­gz. via the same way as u said.

    Btw, I also use Firecloud, I used featured WORKSPACE: help-gatk/Germline-SNPs-Indels-GATK4-hg38 offered by board, original Workspace Owner: [email protected]

    This is also the place I found the NA12878/unmapped ubams. But I confused about the dataset for Joint Discovery: Sample Set: gvcf of NA12878. if I am not wrong:
    1 sample (n ubam files, for subset divided by RG) -> 1 analyze-ready BAM -> 1 gvcf
    m samples ->....-> m gvcf
    m gvcf -> go through joint-calling -> 1VCF and index.

    Looking at this workspcae dataset on firecloud, only 1 sample NA12878, how could it generate multiple gvcf for joint calling?

    For benchmarking, i am looking for NA12878, 12891, 12892 data to went through the whole Germline-SNPs-Indels pipeline. since the wdl perfer ubam files as inputs, I would like to have 3 list of ubams.

  • SChaluvadiSChaluvadi Member, Broadie, Moderator admin

    I was able to find another resource that might help you with your analysis. This link shows how to separate your input BAM/SAM/CRAM by RG by using the -RG (--splitReadGroup). Once you have converted your sample fastq files to the uBAM with FastqtoSam, you can split it up by RG to generate the list of uBAMs!

    The data that you see in the bucket (https://console.cloud.google.com/storage/browser/broad-public-datasets/NA12878/unmapped) is generated by the Broad - the data in the bucket are read group uBAMs that belong to the test sample NA12878.

    For the joint discovery workflow, if you look at the method configuration, you will see that the input is a participant set entity type. This means that there are multiple participants which means multiple gvcfs that go through joint calling to get a single vcf.

    I don't believe that we have any benchmarking data. We use NA12878 in the featured workflows as you have seen already.

    I hope this information helps! Please let me know if you have other follow-up questions.

  • @SChaluvadi , Thanks for your answer and help! I also found the Board offered wdl script to convert fastq/BAM/CRAM to uBAMs.

    But I still have a vital problem.
    I download fastq1 and fastq2 file from: ENA, https://www.ebi.ac.uk/ena/data/view/SAMEA1573618.

    zless to check the fastq file, u will see many reads as follows:

    @ERR194147.11 HSQ1004:134:C0D8DACXX:4:1101:19001:189144/1
    @ERR194147.12 HSQ1004:134:C0D8DACXX:2:2206:12846:29283/1
    [email protected]DDDDD<BBD:A?:?CDEEDCD>@BCCDDCCDC&gt;CCCA
    @ERR194147.13 HSQ1004:134:C0D8DACXX:3:1104:20699:157340/1
    @ERR194147.14 HSQ1004:134:C0D8DACXX:3:1205:17329:12342/1


    If I am right, this fastq file is not sequenced by 1 lane but multiple lanes: C0D8DACXX:4, 3, 2, 1 ect.....
    When I use fastqtosam or the wdl script, I both need to offer:
    each fastq file from the same lane which owns the same readgroup name.
    lane1.fq, lane2.fq, lane3.fq, lane n.fq -----> RG1.ubam, RG2.ubam, RG3.ubam, RGn.ubam.

    Seems like I only have 2 merged fastq files which contain multipe lanes data for 1 sample. What shoud I do now?
    write script to split fastq file into different smaller fastq file according on different C0D8DACXX: n firstly, then run fastqtosam separately?

  • Hi @SChaluvadi , I read some other webpage which also from the webpage u mentioned before.

    It seems like I can ignore the flowcell:lane information which showed in my fastq files read header, just assign it as what I want, such as default A. for the whole big fastq file.

    Do you think this way is right or not? will it cause some problems or reduce its accuracy for the downstream dealing ?

  • @SChaluvadi
    thanks for your reply.

    There is one problem before I can use SpliteReads, it asking for BAM/SAM/CRAM.
    Now my input is fastq1 and fastq2, when i zless it, it contains multiple lanes information. When I try to use FastqToSam, it asking read groups name and corresponding fastq.
    what I did now is:
    merged fastq1 and fastq2 -> FastqToSam -> 1 ubam

    what I am thinking maybe work but not sure if it is right:
    write script to separate merged fastq file based on flowcell:lane infomation on each read header.
    use FastqToSam to each fastq file. -> n ubam files.

  • SChaluvadiSChaluvadi Member, Broadie, Moderator admin

    @Angry_Panda Sorry for the delay! Yes, in your case, where reads from a single sample run on multiple lanes are all in a single fastq file, you must split the fastq file by lane into multiple fastq files. After this step you can use FastqToSam to generate uBAMs (1 uBAM per lane for a single sample). At this point you can add the RGs and later combine your BAM/SAM files using MergeSamFiles.

    Here is a link to a one liner that might help you seprate your fastq file by lane. I think these steps should help you achieve your goal. Please let me know if this works for you!

  • SChaluvadiSChaluvadi Member, Broadie, Moderator admin

    Hi there @Angry_Panda - since we haven't heard from you, this ticket will be closed but you are always able to send us another message if you have any other questions!

  • Hi @SChaluvadi , sorry for my late relpy.

    I was testing the method we talked above. Yes, It worked. Besides, awk, you also can use | grep, to separate the raw fastq files, then use FastqToSam to generate several ubam files based on its flowcell:lane info. I got 4 ubam files, then I continued with the data-processing step, using gatk4 hg38 json and wdl files, originally from: https://github.com/gatk-workflows/gatk4-data-processing.
    After two days, with 16 threads, it showed: workflow successfully completed.

    I found a little bit funny result: my input 4 ubams, each of them sized 33G, my aligned.duplicates-marked.sorted.bam sized around 120G. After, BQSR, the final output bam file sized around 70 G.

    I used exactly the same wdl file I offered above, in json file, just changed input as my own 4 bam files passing as a ubam list, and keep using the ref and known sites resource file offered in above json file. And certainly, I adjust all files path to its absolute path in my virtual machine.

    Do you think this is normal or I did something wrong? I use NA12878.

Sign In or Register to comment.