Holiday Notice:
The Frontline Support team will be slow to respond December 17-18 due to an institute-wide retreat and offline December 22- January 1, while the institute is closed. Thank you for your patience during these next few weeks. Happy Holidays!

Interval list in CollectWgsMetrics has huge effect on output

Hello,

I'm getting VERY different results based on whether or not I include an interval list in CollectWgsMetrics.

I initially ran one of my analysis ready bams through CollectWgsMetrics with the following command:

java -Xms2000m -jar ${picard_jar} \
        CollectWgsMetrics \
        INPUT=${input_bam} \
        VALIDATION_STRINGENCY=SILENT \
        REFERENCE_SEQUENCE=${ref_fasta} \
        INCLUDE_BQ_HISTOGRAM=true \
        INTERVALS=${wgs_coverage_interval_list} \
        OUTPUT=${metrics_filename} \
        USE_FAST_ALGORITHM=true \
        READ_LENGTH=150

where ref_fasta and wgs_coverage_interval_list are files taken from the google cloud paths ‎gs://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta‎ and ‎gs://broad-references/hg38/v0/wgs_calling_regions.hg38.interval_list‎

I got pretty abysmal results from the above command, with a mean coverage of 0.01746. This is largely because a huge proportion of reads are being filtered out because of poor mapping quality (63%) and duplication (34%).

I then ran the same command, without the INTERVALS=${wgs_coverage_interval_list} line, and my results dramatically improved to a mean coverage of 17.97; the poor mapping quality and duplication rates shrank considerably.

Now, this seems to suggest that the majority of my reads map to crappy, between-interval locations with high quality, but I highly doubt this to be the case. Moreover, the extremely low coverage isn't unique to this particular sample, as it has happened to every other sample I've tested on.

Any reason why there might be such a high discrepancy between these two commands?

For reference, I attached the summary metrics and logs for both commands.

Thank you!

Best Answer

  • Accepted Answer

    I finally figured it out! The issue was not the interval list, but rather the "USE_FAST_ALGORITHM=true" option. All of my coverage issues were fixed by turning it off. For some reason, setting USE_FAST_ALGORITHM to true when no interval list was applied works without a problem.

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @lrao
    Hi,

    Can you try running with just one entire contig as the interval? It seems like the tool performs better with whole contigs as intervals rather than small intervals.

    Thanks,
    Sheila

  • lraolrao Member
    edited July 16

    I'm not sure I understand what you mean by "one entire contig as the interval". Are you suggesting that I should make each chromosome one interval, like the wgs_coverage_regions.hg38.interval_list file from this gatk bundle? Or do you mean my interval file should just be one interval?

    And sorry to go off-topic, but which of those 4 interval lists from the above link would you recommend for Mutect2? I've been using wgs_calling_regions.hg38.interval_list, but now I'm a little less sure of my choice now that I'm seeing strange behavior on CollectWgsMetrics.

    Thanks!

  • lraolrao Member

    Woops, forgot to tag you back @Sheila.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @lrao
    Hi,

    Are you suggesting that I should make each chromosome one interval

    Yes, just include one interval with -L. For example, -L chr1.

    And sorry to go off-topic, but which of those 4 interval lists from the above link would you recommend for Mutect2?

    This thread should help.

    -Sheila

  • lraolrao Member
    Accepted Answer

    I finally figured it out! The issue was not the interval list, but rather the "USE_FAST_ALGORITHM=true" option. All of my coverage issues were fixed by turning it off. For some reason, setting USE_FAST_ALGORITHM to true when no interval list was applied works without a problem.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @lrao
    Hi,

    Thank you for posting you solution.

    -Sheila

  • MaguelonneMaguelonne ParisMember

    @Sheila

    -"Are you suggesting that I should make each chromosome one interval"
    -"Yes, just include one interval with -L. For example, -L chr1."

    I am interested in running CollectWgsMetrics chromosome by chromosome. You suggested to use the "-L" command but does the "-L" command work for picard? I saw, in picard documentation, that I could use the "--INTERVALS" command but the format for the interval.list (described here : https://software.broadinstitute.org/gatk/documentation/article?id=11009) requires to put start and end positions while I would just want to put "chr1", .... "chrX".

    Can you help me with that issue?

  • bhanuGandhambhanuGandham Member, Administrator, Broadie, Moderator admin

    Hi @Maguelonne

    First I apologize for the confusion.
    --INTERVALS in CollectWgsMetrics/Picard tools is different from the -L / --intervals in the GATK tools. They have different formats.
    -INTERVALS needs interval list file that contains the positions to restrict the assessment. -L does not work for CollectWgsMetrics.
    I hope this helps.

    Regards
    Bhanu

Sign In or Register to comment.