Coverage after Base Quality Recalibration: enough for variant calling?

sp580sp580 GermanyMember

Dear GATK community
I used Picardtools' CollectWgsMetrics to check the mean, median and sd of coverage for each of my samples (see attached figure).

For that I used base-quality-recalibrated BAMs and let them run through CollectWgsMetrics with default settings (incl. min mapping quality 20 and min base quality 20).

I was expecting to get coverage close to 30, as this is what we requested, however for many samples the coverage is much lower than that.

My general question is the same as in the title, but my specific questions are:

1/ am I using the right tool to check the coverage?
2/ did I run CollectWgsMetrics too stringently?
3/ should I evaluate coverage before base quality recalibration, and consider that as confirmation that the requested coverage was met?
4/ shoud we increase coverage for some of the samples by sequencing them again?

Any help and suggestions on how to go about this are greatly welcome.

Thanks in advace!

Best Answer

Answers

  • sp580sp580 GermanyMember

    Dear Sheila,
    sorry for the late response; I had to do more computations to answer your questions and it took a while.

    1) Did you compare the coverages before and after BQSR? Did BQSR have a major impact?

    I did now and apparently there is no major impact on coverage, as shown in the graph and the table below.

                                        mean_of_MEAN_COVERAGE median_of_MEAN_COVERAGE
    BAM_after_BQSR_Metrics_NoFilters                 23.89238                23.18982
    BAM_after_BQSR_Metrics_WithFilters               22.22259                21.60444
    BAM_before_BQSR_Metrics_NoFilters                23.89297                23.19029
    BAM_before_BQSR_Metrics_WithFilters              22.24182                21.62669
    

    2) What was the command you ran? Can you check how much of a difference there is if you lower the quality thresholds?

    java -jar path/to/picard.jar CollectWgsMetrics \
        I=path/to/sample.merged.sorted.dedup.bam \
        O=path/to/out_dir/sample.merged.sorted.dedup.CollectWgsMetrics.txt \
        R=path/to/Mus_musculus.GRCm38.dna.primary_assembly.fa \
        Q=val \ # where val equals 20 or 0 as min quality
        MQ=val # same as above
    

    If I lower the quality thresholds to zero, there 1-2x improvement in coverage vs min (base,map)qual 20.

    Fortunatelly, there is still some budget to add more coverage to the lower samples.

    Cheers!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @sp580 In our production pipeline we look at metrics at various stages, including before alignment, right after alignment and after pre-processing (before calling). In addition to mean coverage overall we also look at distribution of coverage, because sometimes the mean only looks fine because of some areas of deep accumulation, but there are some insufficiently covered areas elsewhere. A common metric for that is the 80/20 metric: given a list of target areas, do we have at least 20x coverage at at least 80 percent of targets? You can adapt this for target coverage of course; that's just the basic logic.

    FWIW if you have the budget, I would recommend erring on the side of getting more sequence, as it will typically make your life easier downstream when it comes to filtering and interpretation.

    I hope this is helpful; unfortunately we can't provide more guidance than that on this topic -- not that we don't want to, but we don't currently have the bandwidth to collate the relevant information. We may be able to do so in the future though, since there seems to be a lot of demand for this.

Sign In or Register to comment.