Bug Bulletin: we have identified a bug that affects indexing when producing gzipped VCFs. This will be fixed in the upcoming 3.2 release; in the meantime you need to reindex gzipped VCFs using Tabix.

ReduceReads format specifications

CarneiroCarneiro Posts: 271Administrator, GSA Member admin

What is a synthetic read?

When running reduce reads, the algorithm will find regions of low variation in the genome and compress them together. To represent this compressed region, we use a synthetic read that carries all the information necessary to downstream tools to perform likelihood calculations over the reduced data.

They are called Synthetic because they are not read by a sequencer, these reads are automatically generated by the GATK and can be extremely long. In a synthetic read, each base will represent the consensus base for that genomic location. Each base will have it's consensus quality score represented in the equivalent offset in the quality score string.

Consensus Bases

ReduceReads has several filtering parameters for consensus regions. Consensus is created based on base qualities, mapping qualities and other adjustable parameters from the command line. All filters are described in the technical documentation of reduce reads.

Consensus Quality Scores

The consensus quality score of a consensus base is essentially the mean of all bases that passed all the filters and represent an observation of that base. It is represented in the quality score field of the SAM format.

image

n is the number of bases that contributed to the consensus base and q_i is the corresponding quality score of each base.

Insertion quality scores and Deletion quality scores (generated by BQSR) will undergo the same process and will be represented the same way.

Mapping Quality

The mapping quality of a synthetic read is a value representative of the mapping qualities of all the reads that contributed to it. This is an average of the root mean square of the mapping quality of all reads that contributed to the bases of the synthetic read. It is represented in the mapping quality score field of the SAM format.

image

where n is the number of reads and x_i is the mapping quality of each read.

Original Alignments

A synthetic read may come with up to two extra tags representing its original alignment information. Due to many filters in ReduceReads, reads are hard-clipped to the are of interest. These hard-clips are always represented in the cigar string with the H element and the length of the clipping in genomic coordinates. Sometimes hard clipping will make it impossible to retrieve what was the original alignment start / end of a read. In those cases, the read will contain extra tags with integer values representing their original alignment start or end.

Here are the two integer tags:

  • OP -- original alignment start
  • OE -- original alignment end

For all other reads, where this can still be obtained through the cigar string (i.e. using getAlignmentStart() or getUnclippedStart()), these tags are not created.

The RR Tag

the RR tag is a tag that holds the observed depth (after filters) of every base that contributed to a reduce read. That means all bases that passed the mapping and base quality filters, and had the same observation as the one in the reduced read.

The RR tag carries an array of bytes and for increased compression, it works like this: the first number represents the depth of the first base in the reduced read. all subsequent numbers will represent the offset depth from the first base. Therefore, to calculate the depth of base "i" using the RR array, one must use :

RR[0] + RR[i]

but make sure i > 0. Here is the code we use to return the depth of the i'th base:

return (i==0) ? firstCount : (byte) Math.min(firstCount + offsetCount, Byte.MAX_VALUE);


Using Synthetic Reads with GATK tools

The GATK is 100% compatible with synthetic reads. You can use Reduced BAM files in combination with non-reduced BAM files in any GATK analysis tools and it will work seamlessly.

Programming in the GATK

If you are programming using the GATK framework, the GATKSAMRecord class carries all the necessary functionality to use synthetic reads transparently with methods like:

  • public final byte getReducedCount(final int i)
  • public int getOriginalAlignmentStart()
  • public int getOriginalAlignmentEnd()
  • public boolean isReducedRead()
meanrms.jpg
271 x 102 - 4K
meanqscore.png
87 x 72 - 2K

Comments

  • blueskypyblueskypy Posts: 182Member

    does GATK has a function to merge reduced reads of different lanes?

  • CarneiroCarneiro Posts: 271Administrator, GSA Member admin

    I'm gonna say this with enormous fear in my head : Just re-reduce the files together (assuming they have the same samples).

    I've never done this. Theoretically it's supposed to work, but no guarantees.

  • AdminTimAdminTim LondonPosts: 19Member
    edited November 2013

    Hi,

    Does ReduceReads lose haplotype information by creating genome region consensus sequences? If I want to use HaplotypeCaller and also read backed phasing will I get the same results with reduced BAM files as with non-reduced?

    Thanks

    Tim

    Edit: I noticed a post here http://gatkforums.broadinstitute.org/discussion/2635/haplotypecaller-and-reduced-reads about HC being ok with reduced BAMs so that's cool. What about the RBP? Thanks.

    Edit: Aha - I also found this post https://gatk.vanillaforums.com/discussion/2425/reducereads-multisample-or-single that says "RBP will perform the same if you reduce them individually or all together as the read information per sample is the same." So that's good too.

    Would be great if someone could quickly confirm these? Thanks.

    Post edited by AdminTim on
  • ebanksebanks Posts: 671GSA Member mod

    Hi Tim, That post about RR and RBP is from an older version of the GATK. Because of "het"compression (introduced since then) in RR, RBP will not work from reduced bams. I would recommend that you not reduce your bams if this is important to you.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • AdminTimAdminTim LondonPosts: 19Member

    Hi Eric, phew - thanks for that post - I was just about to start. I presume HaplotypeCaller on reduced followed by RBP on non-reduced is ok?

  • AdminTimAdminTim LondonPosts: 19Member
    edited November 2013

    Seems a shame that the het compression breaks the RBP. How come HapCaller is not affected since as I understand it this also tries to re-create local haplotypes?

    Does this option from RR prevent het compression, e.g. with an empty VCF file?

    -known / --known_sites_for_polyploid_reduction ( List[RodBinding[VariantContext]] with default value [] ) Input VCF file(s) with known SNPs. Any number of VCF files representing known SNPs to be used for the polyploid-based reduction. Could be e.g. dbSNP and/or official 1000 Genomes SNP calls. Non-SNP variants in these files will be ignored. If provided, the polyploid ("het") compression will work only when a single SNP from the known set is present in a consensus window (otherwise there will be no reduction); if not provided then polyploid compression will be triggered anywhere there is a single SNP present in a consensus window. -known binds reference ordered data. This argument supports ROD files of the following types: BCF2, VCF, VCF3

    ... where it says 'no reduction' is that none at all or just no het compression? Is there any more info on the het compression?

    Post edited by AdminTim on
  • ebanksebanks Posts: 671GSA Member mod

    Actually, using an empty VCF with the known option should work to disable het compression. However, you may want to run your pipeline on a single sample first just to confirm that everything is working as expected. Also, I should point out that we are currently working on an alternative to using Reduce Reads with the Haplotype Caller. RR + HC is definitely not the future...

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • AdminTimAdminTim LondonPosts: 19Member

    Ok thanks for the quick help, much appreciated. I think I'll just take the time hit and not use RR for either.

Sign In or Register to comment.