NextSeq 500 Lanes and BaseRecalibrator

digitase (Canberra, Member)

Hi,

I am working with RNA-seq data from the NextSeq 500: 144 samples, with 24 samples multiplexed into an equimolar pool for each of the 6 runs.

The NextSeq flowcell consists of 4 lanes that are supplied from a single reservoir, so the same pool must be sequenced on all 4 lanes. On other platforms such as the HiSeq, the 8 lanes have to be physically loaded separately even if they are sequencing the same pool.

My understanding is that BaseRecalibrator should be run on each lane of data, which I can specify using the PU read group tag. My sequencing provider demultiplexed the raw FASTQ reads by sample but not by lane, so I currently have 144 BAM files, one per sample, each containing the mapped reads (aligned with STAR) for that sample from all 4 lanes of the run. To assign different PU read group tags to reads from different lanes using Picard, I would need to split either the BAM files or the raw FASTQ files by lane, then process the 4 files separately.
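The lane split described above can be sketched in Python. This is a minimal illustration, not the poster's actual pipeline. It assumes the FASTQ headers follow the standard Illumina layout (`@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`) and that the PU value is built as `flowcell.lane.barcode`, a common read group convention. The function names and the example header are hypothetical.

```python
from collections import defaultdict


def parse_illumina_header(header):
    """Return (flowcell, lane, barcode) from an Illumina-style FASTQ header.

    Assumes the layout '@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index'.
    """
    name, _, comment = header.strip().lstrip("@").partition(" ")
    fields = name.split(":")
    if len(fields) < 7:
        raise ValueError(f"unexpected header layout: {header!r}")
    flowcell, lane = fields[2], int(fields[3])
    comment_fields = comment.split(":")
    # The index sequence is the fourth field of the comment, when present.
    barcode = comment_fields[3] if len(comment_fields) >= 4 else ""
    return flowcell, lane, barcode


def platform_unit(flowcell, lane, barcode):
    """Build a PU string as flowcell.lane.barcode (assumed convention)."""
    return f"{flowcell}.{lane}.{barcode}" if barcode else f"{flowcell}.{lane}"


def split_by_lane(fastq_lines):
    """Group 4-line FASTQ records by the lane number found in each header."""
    by_lane = defaultdict(list)
    for record in zip(*[iter(fastq_lines)] * 4):
        _, lane, _ = parse_illumina_header(record[0])
        by_lane[lane].append(record)
    return dict(by_lane)


# Hypothetical header, for illustration only.
example = "@NS500123:45:H0ABCDEFX:3:11101:12345:1050 1:N:0:ATCACG"
print(platform_unit(*parse_illumina_header(example)))
```

Each per-lane group could then be written to its own FASTQ file, aligned, and given its own read group before the files are merged back per sample.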

Should I be treating each of these 4 NextSeq lanes separately like HiSeq lanes, or can I run BaseRecalibrator on all 4 lanes together since the lanes are supplied from one reservoir?

Answers

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi @digitase,

    First, a caveat: we don't work with NextSeq, so my comments are based entirely on theoretical considerations. That said, ideally we'd still prefer to mark the lanes separately, so that we can capture lane-specific events or manufacturing defects. However, that would take a lot of manipulation, so frankly I'm not sure it's worth it. It's up to you, but if it were me I'd probably go ahead and recalibrate the data as is. The risk is that if one lane were significantly more biased in some way than the others, you might not be able to correct for that bias, and/or some of the data that is perfectly fine might be unfairly penalized. I don't think that's very likely, but there is a chance. If that were the case, I think the before/after recalibration plots might look a bit odd, so be sure to check them. I'm happy to take a look if you want to post them here.

  • digitase (Canberra, Member)

    Hi @Geraldine_VdAuwera,

    Thanks for your reply!

    I went ahead with the lane-wise demultiplexing and read grouping just in case there is some lane-specific bias. I'm now working with a BAM file for each sample x lane.

    According to the guide article "How should I pre-process data from multiplexed sequencing and multi-library designs?", the 2nd step re-aggregates the sample x lane files into sample files. The 3rd step then runs base recalibration on the sample file, which I presume recalibrates at the sample x lane level, as delineated by the PU read group tags.

    Is it worth feeding whole runs (all 24 samples, which comprise 4 unique PUs corresponding to the NextSeq lanes) into BaseRecalibrator so that it has more information to build its error model on? I think the GATK convention is to recalibrate at the sample x lane level, but I am not sure we have enough data in each PU group from just one sample. (I think the minimum is 100M base pairs, possibly counting only mapped bases, with 1B bp recommended. Our minimum is around 300M bp.)

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @digitase
    Hi,

    Yes, you can run BQSR on all the reads from one lane. As long as your PU field distinguishes the lanes, you are all set to input all the samples at once into BaseRecalibrator.

    -Sheila

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    @digitase To elaborate on @Sheila's answer, you can do it in the sense that it's technically feasible, but frankly we can't guarantee it's worth it. We don't do it that way in our own production work -- but of course we use different instrumentation and I'm not sure how the sequence quantities per lane relate. I would say run it per sample on a subset of your samples, and see if the plots look reasonable. If they do, you should be fine. Generally if there's anything problematic the program will complain or the plots will look bad.

  • digitase (Canberra, Member)

    @Sheila @Geraldine_VdAuwera

    Thank you both for your input. I ran the recalibration on 12 test samples from two of my runs, and it appears that the PU groups were separated out correctly. To do this, I supplied all 12 indel-realigned sample BAM files to BaseRecalibrator, using 4 threads (-nct option). When I do the same with all of my samples, will memory usage and/or runtime scale linearly with the number of input samples?

    I'm not entirely sure how to interpret the AnalyzeCovariates plots. It seems the plotting format has changed since http://gatkforums.broadinstitute.org/gatk/discussion/44/base-quality-score-recalibration-bqsr was written. I am a bit concerned that the only quality score that appears in the Base Insertion and Base Deletion plots is 45.

    Does this have something to do with how MAPQ is assigned by the STAR aligner? The filter applied at the SplitNCigarReads stage was -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60. The plots PDF can be found at https://drive.google.com/open?id=0B4t9UUDOrtFnM1hVcXJ4UjRJRUk
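For readers unfamiliar with the filter mentioned in the last post, here is a small Python sketch of the remapping that `-RMQF 255 -RMQT 60` asks for. It is an illustration of the idea only, not GATK's code. The function name and the sample MAPQ list are made up, and the note that STAR reports 255 for uniquely mapped reads is stated as background, assuming default STAR settings.

```python
def reassign_one_mapping_quality(mapq, from_value=255, to_value=60):
    """Return to_value when mapq equals from_value; otherwise return mapq unchanged."""
    return to_value if mapq == from_value else mapq


# Hypothetical MAPQ values of the kind STAR emits (255 = uniquely mapped, by default).
star_style_mapqs = [255, 3, 1, 0, 255]
print([reassign_one_mapping_quality(q) for q in star_style_mapqs])
# [60, 3, 1, 0, 60]
```

The sketch touches only the per-read mapping quality. Whether that has any bearing on the Base Insertion and Base Deletion plots is the open question in the post.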
