Service notice: Several of our team members are on vacation so service will be slow through at least July 13th, possibly longer depending on how much backlog accumulates during that time. This means that for a while it may take us more time than usual to answer your questions. Thank you for your patience.

Why is converting from fastq to uBAM nesessary before preprocessing?

Hi Everyone,

I am brand new to this so please go easy on me. I have just taken over a project where we are going to be doing variant calling on a large number of human samples. I have inherited a number of scripts that are at least a few years old. I decided I wanted to follow the GATK best practices while noting the differences between the them and the scripts I have. I'm currently trying to push a single family (5 individuals) through the pipeline before applying it to all of the other samples I have.

So, first of all, all of my raw data is stored as paired-end reads in fastq format, I have no uBAM files available to me. According to the data pre-processing for variant discovery steps, the "reference implementations expect the read data to be input in unmapped BAM (uBAM) format. Conversion utilities are available to convert from FASTQ to uBAM." So the first thing I did was use FastqToSam to do the conversion yesterday. This is not an insignificant task, I ran each sample for my test family in parallel and it took roughly 5 hours.

I understand the benefit of using uBAM from the get-go (keeping some metadata that is discarded in fastq as described here: https://gatkforums.broadinstitute.org/gatk/discussion/5990/what-is-ubam-and-why-is-it-better-than-fastq-for-storing-unmapped-sequence-data), but I don't see the benefit of doing this conversion if the first step of the alignment is to convert this uBAM back to fastq before running bwa mem and samtools view. The next step would be to use MergeBamAlignments to merge the mapped and unmapped alignments which I guess I couldn't do if I did not do the original fastq->uBAM conversion.

Basically, my question is if the initial conversion from fastq to uBam is necessary or even recommended in this case. I don't see how it could have any added benefit and converting from and to fastq will incur a significant overhead. For what it's worth, the script I inherited simply ran 'bwa mem' on the paired-end reads and piped the output into 'samtools view -bh' to create the aligned BAM file. From here they would move on to the marking of duplicates. If I don't convert to uBAM and then skip the MergeBamAlignments, will that have an impact on me being able to apply the best practices down the line? I want to stick as close to the best practices as I possibly can, but If I can cut out some unnecessary computation time then that would be great.

Thanks!

Best Answers

Answers

  • Thanks for answering.

    So the fact that I'm converting from fastq to uBAM in the first place doesn't affect this at all? I would've thought that a lot of the important metadata would've already been lost since I don't have access to the BAM files straight from illumina. However, what you are saying sounds like it would still be worthwhile and would not be possible without first converting the fastq files to uBAM.

    Again, I am very new to this so I don't exactly have all of the background knowledge. Thanks again for answering.

  • SkyWarriorSkyWarrior TurkeyMember

    You can map your reads to your reference using bwa. you don't need to have anything from illumina other than your raw reads or raw basecalls. So generating unmapped bam and merging that to the first alignment from bwa is to restore all the information and add metadata on the alignment file. After that just continue with merged aligned file only.

  • Great, thanks a lot. I guess I was just confused as to what the benefit was going from fastq->uBAM->fastq at the start. Correct me if I'm wrong but the benefit comes with the merging of the mapped and unmapped BAMs after alignment which restores some of the data lost during alignment to our aligned BAM file. I think I'm just restating what you've said, I just want to be sure I understand.

    Thanks!

  • Thanks again! I just came across this page that I am finding very useful (in case anyone else ends up in this thread): https://gatkforums.broadinstitute.org/gatk/discussion/6483/how-to-map-and-clean-up-short-read-sequence-data-efficiently

  • Skywarrior, would it not be more straightforward to modify bwa-mem to preserve that information in the first place? FQ -> Bam -> FQ -> Merge is a significant overhead, many hours per genome, yet from this discussion it seems to just be a workaround for some undesired behaviour in bwa-mem. Correct me if I am wrong.

    I have looked at the bwa-mem c source code and I understand the preference for a java team to just create a new java tool, but this seems short sighted in the long run.

  • SkyWarriorSkyWarrior TurkeyMember

    BWA hardclips reads if there is a significant discordance between the best matching kmer and the read. These hardclips may end up costing you a particular structural variant or a true indel call. Merging unmapped bam and initial alignment restores the hardclips which I know of no solution for that in BWA parameters.

    My workflow goes as follows

    FastQ --> BWA - initial map
    FastQ --> Picard - uBAM
    uBAM + initial map --> Picard - Merge.

    This may seem like an overhead but a necessary one for me.

  • foxDie00foxDie00 Member

    @SkyWarrior said:
    My workflow goes as follows

    FastQ --> BWA - initial map
    FastQ --> Picard - uBAM
    uBAM + initial map --> Picard - Merge.

    I needed it spelled out for me like this. Thank you. The GATK preprocessing instructions are very unclear and made it sound like I needed to align the uBAM with BWA...

  • @foxDie00 said:

    made it sound like I needed to align the uBAM with BWA...

    That is how the broad best practice workflows operate.

    https://github.com/gatk-workflows/gatk4-data-processing/blob/64860758f4f01543e036cc05ef5c30530455fb06/processing-for-variant-discovery-gatk4.wdl#L309

    ubam input, converted on the fly to fastq and piped to bwa.

  • SkyWarriorSkyWarrior TurkeyMember

    Still fastq is needed for bwa. On the fly conversion of uBAM to fastq is necessary.
    Some people can get their data in the form of uBAMs out of their sequencing service but some of us still use the good old bcl2fastq. That's why.

Sign In or Register to comment.