Version highlights for GATK version 2.3

Geraldine_VdAuwera Posts: 6,643 Administrator, GATK Developer admin
edited January 2013 in Announcements

Overview

Release version 2.3 is the last before the winter holidays, so we've done our best not to put in anything that will break easily. Which is not to say there's nothing important - this release contains a truckload of feature tweaks and bug fixes (see the release notes in the next tab for the full list). And we do have one major new feature for you: a brand-spanking-new downsampler to replace the old one.

Feature improvement highlights

- Sanity check for mis-encoded quality scores

It has recently come to our attention that some datasets are not encoded in the standard format (Q0 == ASCII 33 according to the SAM specification, whereas Illumina encoding starts at ASCII 64). This is a problem because the GATK assumes that it can use the quality scores as they are. If they are in fact encoded using a different scale, our tools will make an incorrect estimation of the quality of your data, and your analysis results will be off. To prevent this from happening, we've added a sanity check of the quality score encodings that will abort the program run if they are not standard. If this happens to you, you'll need to run again with the flag --fix_misencoded_quality_scores (-fixMisencodedQuals). What will happen is that the engine will simply subtract 31 from every quality score as it is read in, and proceed with the corrected values. Output files will include the correct scores where applicable.
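
To make the fix concrete, here's a quick Python sketch of the same conversion - purely illustrative, not the actual engine code:

    # Illustrative sketch (not GATK code): shift a Phred+64 ("Illumina 1.3+") quality
    # string down to the standard Phred+33 (Sanger) encoding by subtracting 31 from
    # each character, which is what the engine does when -fixMisencodedQuals is set.

    def fix_misencoded_quals(qual_string):
        return "".join(chr(ord(c) - 31) for c in qual_string)

    # Example: 'h' (ASCII 104) is Q40 in Phred+64; it becomes 'I' (ASCII 73), i.e. Q40 in Phred+33.
    assert fix_misencoded_quals("h") == "I"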

- Overall GATK performance improvement

Good news on the performance front: we eliminated a bottleneck in the GATK engine that was inflating the runtime of many tools by as much as 10x, depending on the exact details of the data being fed into the GATK. The problem was caused by the internal timing code invoking expensive system timing resources far too often. Imagine you looked at your watch every two seconds -- it would take you ages to get anything done, right? Anyway, if you see your tools running unusually quickly, don't panic! This may be the reason, and it's a good thing.
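
Purely as an illustration of the principle (this is not the engine code), here's a small Python sketch of the difference between checking the clock on every iteration and only sampling it every so often:

    import time

    # Illustration only: querying a high-resolution clock on every iteration adds
    # real overhead, while sampling it every N iterations keeps the timing info
    # useful at a fraction of the cost.

    def timed_loop(items, work, sample_every=10000):
        start = time.perf_counter()
        elapsed = 0.0
        for i, item in enumerate(items, 1):
            work(item)
            if i % sample_every == 0:   # check the "watch" only occasionally
                elapsed = time.perf_counter() - start
        return elapsed

    # Usage: timed_loop(range(1000000), lambda x: x * x) runs with negligible timing overhead.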

- Co-reducing BAMs with ReduceReads (Full version only)

You can now co-reduce separate BAM files by passing them in with multiple -I arguments or as an input list. The motivation for this is that samples you plan to analyze together (e.g. tumor-normal pairs or related cohorts) should be reduced together, so that if a disagreement is triggered at a locus for one sample, that locus will remain unreduced in all samples. You will therefore conserve the full depth of information for later analysis of that locus.

Downsampling, overhauled

The downsampler is the component of the GATK engine that handles downsampling, i.e. the process of removing a subset of reads from a pileup. The goal of this process is to speed up execution of the desired analysis, particularly in genome regions that are covered by excessive read depth.

In this release, we have replaced the old downsampler with a brand new one that extends some options and performs much better overall.

- Downsampling to coverage for read walkers

The GATK offers two different options for downsampling (contrasted in the sketch right after this list):

  • --downsample_to_coverage (-dcov) enables you to set the maximum amount of coverage to keep at any position
  • --downsample_to_fraction (-dfrac) enables you to remove a proportional amount of the reads at any position (e.g. take out half of all the reads)
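
Here's a quick Python sketch contrasting the two options - this is just to illustrate the semantics, not how the engine actually implements them:

    import random

    # Illustrative only: -dcov caps the number of reads kept at a position,
    # -dfrac keeps a proportion of them.

    def downsample_to_coverage(reads, max_coverage):
        """Keep at most max_coverage reads, chosen at random without replacement."""
        if len(reads) <= max_coverage:
            return list(reads)
        return random.sample(reads, max_coverage)

    def downsample_to_fraction(reads, fraction):
        """Keep each read independently with probability `fraction`."""
        return [r for r in reads if random.random() < fraction]

    pileup = ["read%d" % i for i in range(500)]
    print(len(downsample_to_coverage(pileup, 100)))   # always 100
    print(len(downsample_to_fraction(pileup, 0.5)))   # about 250, varies from run to run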

Until now, it was not possible to use the --downsample_to_coverage (-dcov) option with read walkers; you were limited to using --downsample_to_fraction (-dfrac). In the new release, you will be able to downsample to coverage for read walkers.

However, please note that the process is a little different. The normal way of downsampling to coverage (e.g. for locus walkers) involves downsampling over the entire pileup of reads in a single pass. For technical reasons, it is still not possible to do that exact process for read walkers; instead, the read-walker-compatible way of doing it involves downsampling within subsets of reads that are all aligned at the same starting position. This different mode of operation means you shouldn't use the same range of values; where you would use -dcov 100 for a locus walker, you may need to use -dcov 10 for a read walker. And these are general estimates - your mileage may vary depending on your dataset, so we recommend testing before applying on a large scale.
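
To illustrate the idea (again, this is a sketch, not the engine's actual implementation), downsampling for a read walker looks roughly like capping each group of reads that share a start position:

    from collections import defaultdict
    import random

    # Rough sketch of the read-walker flavour of -dcov: reads are grouped by their
    # alignment start position and each group is capped independently, instead of
    # capping the whole pileup at a locus.

    def downsample_by_start_position(reads, max_per_start):
        """reads: iterable of (start_position, read_name) tuples."""
        groups = defaultdict(list)
        for start, name in reads:
            groups[start].append(name)
        kept = []
        for start, names in groups.items():
            if len(names) > max_per_start:
                names = random.sample(names, max_per_start)
            kept.extend((start, n) for n in names)
        return kept

    reads = [(100, "a%d" % i) for i in range(50)] + [(105, "b%d" % i) for i in range(3)]
    print(len(downsample_by_start_position(reads, 10)))   # 13: ten kept at position 100, all three at 105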

- No more downsampling bias!

One important property of the downsampling process is that it should be as random as possible to avoid introducing biases into the selection of reads that will be kept for analysis. Unfortunately our old downsampler - specifically, the part of the downsampler that performed the downsampling to coverage - suffered from some biases. The most egregious problem was that as it walked through the data, it tended to privilege more recently encountered reads and displace "older" ones. The new downsampler no longer suffers from these biases.
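
For the curious: a standard way to pick k reads from a stream without this kind of positional bias is reservoir sampling, sketched below in Python. We're showing it as an illustration of the principle, not as the exact algorithm used in the new downsampler.

    import random

    # Reservoir sampling: select k items from a stream so that every item has the
    # same probability of being kept, no matter where it appears in the stream.

    def reservoir_sample(stream, k):
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)
            else:
                j = random.randint(0, i)   # random index into everything seen so far
                if j < k:
                    reservoir[j] = item    # replace an earlier pick with equal probability
        return reservoir

    # Every read in a 10,000-read pileup has the same 1% chance of surviving:
    print(len(reservoir_sample(range(10000), 100)))   # always 100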

- More systematic testing

The old downsampler was embedded in the engine code in a way that made it hard to test systematically. So when we implemented the new downsampler, we reorganized the code to make it a standalone engine component - the equivalent of promoting it from the cubicle farm to its own corner office. This has allowed us to cover it much better with systematic tests, so we have a better assessment of whether it's working properly.
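
As a hypothetical example of what systematic tests can look like for a standalone downsampler component, here are a few simple property checks in Python (illustrative only - this is not the GATK test suite):

    import random

    def downsample_to_coverage(reads, max_coverage):
        if len(reads) <= max_coverage:
            return list(reads)
        return random.sample(reads, max_coverage)

    def test_downsampler_properties():
        reads = ["read%d" % i for i in range(1000)]
        kept = downsample_to_coverage(reads, 100)
        assert len(kept) == 100                  # never exceeds the target coverage
        assert set(kept) <= set(reads)           # only returns reads it was given
        assert downsample_to_coverage(reads[:50], 100) == reads[:50]   # no-op below the cap

    test_downsampler_properties()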

- Option to revert to the old downsampler

The new downsampler is enabled by default and we are confident that it works much better than the old one. BUT as with all brand-spanking-new features, early adopters may run into unexpected rough patches. So we're providing a way to disable it and use the old one, which is still in the box for now: just add -use_legacy_downsampler to your command line. Obviously if you use this AND -dcov with a read walker, you'll get an error, since the old downsampler can't downsample to coverage for read walkers.


Geraldine Van der Auwera, PhD

Comments

  • ymw Posts: 9 Member

    When I use BaseRecalibrator in version 2.3, I get an error message as below: ERROR MESSAGE: SAM/BAM file SAMFileReader{bamboo120Grealnrmdupl.bam} appears to be using the wrong encoding for quality scores: we encountered an extremely high quality score of 65; please see the GATK --help documentation for options related to this error. But it is OK if I use version 2.1.

    bamboo120Grealnrmdupl.bam is an output file produced after I used BWA for mapping, "RealignerTargetCreator" in GATK v6.1 to do local realignment, and Picard to remove duplicated reads.

    I will appreciate any suggestions to solve this problem. Thanks,

    Chih-Ming

  • ymw Posts: 9 Member

    I just found out the solution after reading the article above. Sorry, I should have read through the whole article before posting the question. For whoever encounters the problem, adding the flag --fix_misencoded_quality_scores (-fixMisencodedQuals) can solve it. Chih-Ming

  • Dethecor Posts: 1 Member

    Hi all,

    I was wondering if you could integrate a feature to check the quality encoding and apply -fixMisencodedQuals accordingly. I have many samples with different quality encodings and process them in an automated pipeline, so I would prefer not to have to set this flag manually for all those that produce errors.

    Cheers, Paul

  • ebanks Posts: 684 GATK Developer mod

    Sorry, no @Dethecor. We don't want to change encodings unless the user really wants us to, so we'd prefer not to do it by default. If you have many BAMs with problematic encodings, then it suggests that you should really be fixing them up front in your pipeline and not relying on downstream fixes.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • flescai Posts: 53 Member ✭✭

    I am running into a similar problem, and I understand you cannot handle these things.

    ##### ERROR MESSAGE: Bad input: while fixing mis-encoded base qualities we encountered a read that was correctly encoded; we cannot handle such a mixture of reads so unfortunately the BAM must be fixed with some other tool

    I would highly appreciate it, though, if you could suggest which tool could be used to fix the ASCII encoding. Can it be done at the .bam level, or should it be corrected at the .fastq level?

    thanks very much for any suggestions!

    Francesco

  • Geraldine_VdAuwera Posts: 6,643 Administrator, GATK Developer admin

    @flescai, it would probably be best to fix the encodings in the original fastq files. I don't know of a tool to do this, but there may be some existing scripts out there that do. Perhaps someone from our user community will pipe up with a suggestion; otherwise try asking on SeqAnswers or BioStars, since those communities are larger.

    Geraldine Van der Auwera, PhD

  • Kurt Posts: 166 Member ✭✭✭

    I didn't write this (I think I might have gotten it through SeqAnswers a couple of years ago), but this is some Perl code that will convert Illumina 1.3+ (ASCII-64) quality score encoding to Sanger encoding in FASTQ files. If I can find where I ended up getting this from, I'll post the link. As Geraldine said, SeqAnswers or BioStars would be the place to go if the below doesn't work for you.

    # Convert FASTQ quality strings from Illumina 1.3+ (Phred+64) to Sanger (Phred+33)
    # by subtracting 31 from each quality character. Reads FASTQ on STDIN, writes to STDOUT.
    use strict;
    use warnings;

    while ( my $head1 = <> ) {    # @ header line
        my $seq   = <>;           # sequence line
        my $head2 = <>;           # "+" separator line
        my $qual  = <>;           # quality line (Phred+64 encoded)
        print $head1 . $seq . $head2;
        chomp $qual;
        # shift every quality character down by 31 (ASCII 64 baseline -> ASCII 33)
        print map { chr( ord($_) - 31 ) } split( //, $qual );
        print "\n";
    }