BQSR for WES data generated by different exome-capture platforms

Hi everyone,

We have over a thousand WES samples generated with two different exome-capture platforms. Samples were multiplexed and sequenced on Illumina HiSeq, with each lane containing 3-10 samples. Since most samples do not have enough data to run BQSR individually, we plan to estimate the model parameters on one whole lane and then apply them separately to each sample. Because two exome-capture platforms were used, we are thinking of specifying two interval files simultaneously (namely, -L interval.kit1 -L interval.kit2) during RealignerTargetCreator, BaseRecalibrator and HaplotypeCaller.
However, we are unsure whether the union or the intersection of the two interval files should be used in our case. Although our sample size is large and we may get useful information from regions unique to each exome-capture platform (as discussed in http://gatkforums.broadinstitute.org/gatk/discussion/4945/joint-genotyping-different-caputre-kits#), would using the union of the interval files bring in off-target sequences and distort the results of BaseRecalibrator? (https://software.broadinstitute.org/gatk/events/slides/1504/GATKwr7-X-2-WGS_vs_WEx.pdf)
Or, in our case, is it better to run the analysis in two separate batches (one per exome-capture platform) to generate gVCF files, and then perform joint genotyping and VQSR on all samples together? For your information, the data will eventually be analyzed jointly with data from a separate WGS analysis.

Your kind advice is very much appreciated. Thanks in advance!

Comments

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    @genescha Hi there, apologies for the late response.

    We routinely multiplex more samples than that per lane ourselves, yet still perform BQSR per read group. Are you sure that you don't have enough data? Have you tried it and encountered an error? I think it should work. It would be better to do it that way (per read group) than to recalibrate multiple libraries together that were generated separately.

    Later on when you do the joint variant calling on the samples, I would advise that you run HaplotypeCaller per sample with the interval file corresponding to how that sample was produced, then run GenotypeGVCFs on all the resulting GVCFs together with the intersection of the interval files (which you can achieve using the -isr argument, see CommandLineGATK engine documentation). This will allow you to produce calls only at sites where you expect to have data in all samples, as the remaining sites will not be useable for comparative analysis across the cohort.
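As an illustration of what taking the intersection of the two kits' interval files means, here is a minimal Python sketch. It is not GATK's implementation; the coordinates and the function name `intersect_intervals` are made up for the example, and each input list is assumed to hold non-overlapping, 1-based inclusive intervals.

```python
from collections import defaultdict


def intersect_intervals(kit1, kit2):
    """Return the regions covered by BOTH interval lists.

    Each interval is a (chrom, start, end) tuple, 1-based and inclusive.
    Hypothetical helper for illustration only.
    """
    # Index the second kit by chromosome so we only compare like with like.
    by_chrom = defaultdict(list)
    for chrom, start, end in kit2:
        by_chrom[chrom].append((start, end))

    shared = []
    for chrom, s1, e1 in kit1:
        for s2, e2 in by_chrom[chrom]:
            start, end = max(s1, s2), min(e1, e2)
            if start <= end:  # the two targets overlap
                shared.append((chrom, start, end))
    return sorted(shared)


# Made-up targets for two capture kits.
kit1 = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 80)]
kit2 = [("chr1", 150, 250), ("chr2", 10, 40), ("chr1", 590, 700)]

shared_targets = intersect_intervals(kit1, kit2)
print(shared_targets)
```

Only positions targeted by both kits survive, which is why joint calls restricted this way are comparable across the whole cohort.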

  • genescha (Japan; Member)

    @Geraldine_VdAuwera Hi, thank you so much for kind advice!

    As for BQSR per read group, I understand now: BQSR requires 100M "bases" per read group. Since we have around 40M reads per sample on average, each 101 bp long, we do have enough data to perform BQSR per read group, am I right?

    For the second part, I am afraid that I am still not so clear about it. Is it 1) or 2)?

    1) Specify two interval files simultaneously (namely, -L interval.kit1 -L interval.kit2) during RealignerTargetCreator and BaseRecalibrator; but run HaplotypeCaller per sample with the interval file corresponding to how that sample was produced; and then run GenotypeGVCFs on all the resulting GVCFs together using the -isr argument

    2) Run RealignerTargetCreator, BaseRecalibrator and HaplotypeCaller per sample with the interval file corresponding to how that sample was produced; and then run GenotypeGVCFs on all the resulting GVCFs together using the -isr argument

    Thanks a lot for your time and advice!
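The data-volume estimate in the post above can be checked with a few lines of Python. This is a rough sketch using only the figures quoted in the thread (about 40M reads of 101 bp per sample, 100M bases needed per read group); the assumption that each sample corresponds to a single read group is mine, and the count is of sequenced bases, not on-target bases.

```python
# Figures quoted in the thread.
READS_PER_SAMPLE = 40_000_000      # average reads per sample
READ_LENGTH_BP = 101               # bases per read
MIN_BASES_PER_READ_GROUP = 100_000_000


def bases_per_read_group(reads, read_length, n_read_groups=1):
    """Sequenced bases per read group, assuming reads are split evenly.

    n_read_groups=1 is an assumption (one read group per sample).
    """
    return reads * read_length // n_read_groups


bases = bases_per_read_group(READS_PER_SAMPLE, READ_LENGTH_BP)
enough = bases >= MIN_BASES_PER_READ_GROUP
print(f"{bases:,} bases per read group; enough for BQSR: {enough}")
```

If a sample were spread over several read groups, passing a larger `n_read_groups` shows how much margin remains.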

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Yes, you have enough data per read group for BQSR. For the second part, the second option is correct (2).
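For readers who want to see option 2 laid out, the Python sketch below assembles GATK 3-style command lines for the steps named in the thread. It only builds and prints the commands; it does not run GATK. All file names (`ref.fasta`, `dbsnp.vcf`, the BAM and output names) and both helper functions are hypothetical, and intermediate steps not discussed in the thread (for example applying the realignment and recalibration to produce the BAM that HaplotypeCaller reads) are left out.

```python
GATK = ["java", "-jar", "GenomeAnalysisTK.jar"]  # assumed GATK 3 jar name


def per_sample_commands(sample, bam, recal_bam, kit_intervals,
                        ref="ref.fasta", known_sites="dbsnp.vcf"):
    """Per-sample steps, each restricted to that sample's own capture kit."""
    common = GATK + ["-R", ref, "-L", kit_intervals]
    return [
        common + ["-T", "RealignerTargetCreator", "-I", bam,
                  "-o", f"{sample}.targets.intervals"],
        common + ["-T", "BaseRecalibrator", "-I", bam,
                  "-knownSites", known_sites, "-o", f"{sample}.recal.table"],
        # HaplotypeCaller reads the processed BAM and emits a per-sample GVCF.
        common + ["-T", "HaplotypeCaller", "-I", recal_bam,
                  "--emitRefConfidence", "GVCF", "-o", f"{sample}.g.vcf"],
    ]


def joint_command(gvcfs, interval_files, ref="ref.fasta", out="cohort.vcf"):
    """Joint genotyping over the intersection of all kits' intervals."""
    cmd = GATK + ["-T", "GenotypeGVCFs", "-R", ref]
    for gvcf in gvcfs:
        cmd += ["-V", gvcf]
    for intervals in interval_files:
        cmd += ["-L", intervals]
    return cmd + ["-isr", "INTERSECTION", "-o", out]


cmds = per_sample_commands("s1", "s1.bam", "s1.recal.bam", "interval.kit1")
joint = joint_command(["s1.g.vcf", "s2.g.vcf"],
                      ["interval.kit1", "interval.kit2"])
for cmd in cmds + [joint]:
    print(" ".join(cmd))
```

The point of the layout is that every per-sample command carries exactly one `-L` (the kit that produced that sample), while only the joint step lists both kits and combines them with `-isr INTERSECTION`.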

  • genescha (Japan; Member)

    Thanks a lot!
