To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

Why does the recalibration table miss most of the machine cycles?

I am recalibrating the base qualities, using all the standard covariates, including machine cycle. The reads come from a 454 machine, and they have an average length of around 300 bases. Curiously, the recalibrator only issues an error if I set the --maximum_cycle_value to less than 27. And even if I set the --maximum_cycle_value to the true maximum length (1601), the RecalTable2 does not record cycles longer than 27. In DEBUG level, I get a few messages that seem to be related, but that I don't know how to interpret:

DEBUG 11:45:23,986 ReadCovariates - Keys cache miss for length 39 cache size 1 DEBUG 11:45:23,988 ReadCovariates - Keys cache miss for length 38 cache size 2 DEBUG 11:45:23,995 ReadCovariates - Keys cache miss for length 41 cache size 3 DEBUG 11:45:23,999 ReadCovariates - Keys cache miss for length 36 cache size 4 DEBUG 11:45:24,001 ReadCovariates - Keys cache miss for length 31 cache size 5 DEBUG 11:45:24,004 ReadCovariates - Keys cache miss for length 43 cache size 6 DEBUG 11:45:24,010 ReadCovariates - Keys cache miss for length 26 cache size 7 DEBUG 11:45:24,015 ReadCovariates - Keys cache miss for length 40 cache size 8 DEBUG 11:45:24,026 ReadCovariates - Keys cache miss for length 30 cache size 9 DEBUG 11:45:24,037 ReadCovariates - Keys cache miss for length 37 cache size 10 DEBUG 11:45:24,079 ReadCovariates - Keys cache miss for length 33 cache size 11 DEBUG 11:45:24,086 ReadCovariates - Keys cache miss for length 44 cache size 12 DEBUG 11:45:24,152 ReadCovariates - Keys cache miss for length 35 cache size 13 DEBUG 11:45:24,606 ReadCovariates - Keys cache miss for length 32 cache size 14 DEBUG 11:45:24,654 ReadCovariates - Keys cache miss for length 28 cache size 15 DEBUG 11:45:24,683 ReadCovariates - Keys cache miss for length 34 cache size 16 DEBUG 11:45:24,716 ReadCovariates - Keys cache miss for length 45 cache size 17 DEBUG 11:45:24,963 ReadCovariates - Keys cache miss for length 27 cache size 18 DEBUG 11:45:25,033 ReadCovariates - Keys cache miss for length 29 cache size 19 DEBUG 11:45:27,401 ReadCovariates - Keys cache miss for length 25 cache size 20 DEBUG 11:45:28,445 ReadCovariates - Keys cache miss for length 46 cache size 21 DEBUG 11:45:29,426 ReadCovariates - Keys cache miss for length 23 cache size 22 DEBUG 11:45:31,228 ReadCovariates - Keys cache miss for length 47 cache size 23

The fact is that after recalibration the low-qualitity tails get higher qualities, and I suspect that the recalibration does not work well for late cycles. I'm using GATK v3.2-2-gec30cee. This is what my command looks like:

java -Xmx4g -jar GenomeAnalysisTK.jar -T BaseRecalibrator -I realigned.bam -R ref.fa -o recal_data.table --maximum_cycle_value 1601 --run_without_dbsnp_potentially_ruining_quality -l DEBUG

One particularity of the bam file is that it's not huge. There are only about 500000 reads. Recalibration may not be accurate, but still, I wonder if it is using all the information available. There are hundreds of thousands of reads longer than 100 bases. Can anybody explain what's happening? Thanks.

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Not sure on the machine cycle (will have to check a couple of things) but before we go any further, I notice you're using --run_without_dbsnp_potentially_ruining_quality which is extremely bad. Are you aware of that?

  • Yes, I'm aware of that. I'm testing an idea. The reads are 454 mate pairs. The linker sequence between the mates is known and it is more or less randomly distributed along the reads. So, I am trying to use the linker sequence to calibrate the base qualities. Virtually all mismatches between the reads and the linker should be sequencing errors.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @ignasiof‌

    Hi,

    What you are trying to do is not in our best practices, so we cannot offer proper advice.

    The linker sequences are errors because of what they are. They do not indicate sequencing errors, so they are useless for recalibration. And, recalibration without known sites is extremely bad because everything including real variation will be counted as an error.

    Please try to run this according to our best practices to check if the error still occurs. If it does, then we can go from there.

    Thanks,
    Sheila

Sign In or Register to comment.