Geraldine_VdAuwera
The primary goal of the GATK is to provide a suite of small data access patterns that can easily be parallelized and otherwise externally managed. As such, rather than asking walker authors how to iterate over a data stream, the GATK asks them how the data should be presented.
Walk over the data set one location (single-base locus) at a time, presenting all overlapping reads, reference bases, and reference-ordered data.
The @By attribute can be used to control whether locus walkers see all loci or just covered loci. To switch between viewing all loci and covered loci, apply one of the following attributes:
@By(DataSource.REFERENCE)
@By(DataSource.READS)
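To make the difference between the two settings concrete, here is a minimal, self-contained sketch. Note the assumptions: the `By` annotation and `DataSource` enum below are local stand-ins declared inside the example, not the GATK classes, and `ByDemo`, `visitedLoci`, `AllLociWalker` and `CoveredLociWalker` are hypothetical names. The real engine's traversal is more involved; this only shows which loci each setting would present.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.util.ArrayList;
import java.util.List;

class ByDemo {
    // Local stand-ins for illustration only; these are NOT the GATK classes.
    enum DataSource { REFERENCE, READS }

    @Retention(RetentionPolicy.RUNTIME)
    @interface By { DataSource value(); }

    @By(DataSource.REFERENCE)
    static class AllLociWalker { }      // hypothetical walker: sees every locus

    @By(DataSource.READS)
    static class CoveredLociWalker { }  // hypothetical walker: sees covered loci only

    /**
     * Returns the loci in [start, end] that a walker of the given type would see.
     * Each read is an int[]{firstBase, lastBase}, both inclusive.
     */
    static List<Integer> visitedLoci(Class<?> walkerType, int start, int end, List<int[]> reads) {
        By by = walkerType.getAnnotation(By.class);
        if (by == null) {
            throw new IllegalArgumentException("walker type has no @By annotation");
        }
        boolean allLoci = by.value() == DataSource.REFERENCE;
        List<Integer> loci = new ArrayList<>();
        for (int pos = start; pos <= end; pos++) {
            boolean covered = false;
            for (int[] read : reads) {
                if (read[0] <= pos && pos <= read[1]) {
                    covered = true;
                    break;
                }
            }
            if (allLoci || covered) {
                loci.add(pos);
            }
        }
        return loci;
    }

    public static void main(String[] args) {
        List<int[]> reads = List.of(new int[]{3, 5}, new int[]{4, 6});
        System.out.println(visitedLoci(AllLociWalker.class, 1, 8, reads));
        System.out.println(visitedLoci(CoveredLociWalker.class, 1, 8, reads));
    }
}
```

With two reads spanning bases 3-5 and 4-6 on an interval of 1-8, the first walker is shown all eight loci, while the second is shown only loci 3 through 6.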
By default, the following filters are automatically added to every locus walker.
These walkers walk over the data set one location at a time, but only those locations covered by reference-ordered data (ROD). They are essentially a special case of locus walkers. ROD walkers are read-free traversals that operate over reference-ordered data and the reference genome at sites where there is ROD information. They are geared for high-performance traversal of many RODs and the reference, as in VariantEval and CallSetConcordance. Programmatically they are nearly identical to RefWalker<M,T> traversals, with the following few quirks.
RODWalkers are only called at sites where there is at least one non-interval ROD bound. For example, if you are exploring dbSNP and some GELI call set, the map function of a RODWalker will be invoked at all sites where there is a dbSNP record or a GELI record.
Because of this skipping, RODWalkers receive a context object that provides the number of reference bases skipped between map calls:
nSites += context.getSkippedBases() + 1; // the skipped bases plus the current location
To get the final count of skipped bases at the end of an interval (or chromosome), the map function is called one last time with null ReferenceContext and RefMetaDataTracker objects. The alignment context can be accessed to get the bases skipped between the last (and final) ROD and the end of the current interval.
ROD walkers inherit the same filters as locus walkers:
Changing to a RODWalker is very easy -- here's the new top of VariantEval, changing the system to a RodWalker from its old RefWalker state:
//public class VariantEvalWalker extends RefWalker<Integer, Integer> {
public class VariantEvalWalker extends RodWalker<Integer, Integer> {
The map function must now capture the number of skipped bases and protect itself from the final interval map calls:
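public Integer map(RefMetaDataTracker tracker, ReferenceContext ref, AlignmentContext context) {
    // bases skipped since the previous map call (or up to the interval end on the final call)
    nMappedSites += context.getSkippedBases();
    if ( ref == null ) { // we are seeing the last site
        return 0;
    }
    nMappedSites++; // count the current site itself
    // ... the rest of the map body follows here
}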
That's all there is to it!
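As a sanity check on the counting pattern, here is a small, self-contained simulation. It is only a sketch under stated assumptions: `SkippedBasesDemo`, its `Context` class and `countSites` are hypothetical stand-ins (not GATK classes), the ROD positions are assumed to be sorted, unique and inside the interval, and the "engine" is reduced to a loop. The point is that skipped bases plus visited sites add up to the full interval length.

```java
class SkippedBasesDemo {
    /** Hypothetical stand-in for the context passed to map; it only carries the skipped-base count. */
    static final class Context {
        private final long skippedBases;
        Context(long skippedBases) { this.skippedBases = skippedBases; }
        long getSkippedBases() { return skippedBases; }
    }

    private long nSites = 0;

    /** Mirrors the counting pattern from the text; site is null on the final call of an interval. */
    void map(Integer site, Context context) {
        nSites += context.getSkippedBases();
        if (site == null) {
            return; // final call: only the trailing skipped bases are counted
        }
        nSites++;   // count the current site itself
    }

    /** Simulates a traversal of [intervalStart, intervalEnd] that stops only at ROD positions. */
    static long countSites(int intervalStart, int intervalEnd, int[] sortedRodPositions) {
        SkippedBasesDemo walker = new SkippedBasesDemo();
        int lastVisited = intervalStart - 1;
        for (int pos : sortedRodPositions) {
            walker.map(pos, new Context(pos - lastVisited - 1));
            lastVisited = pos;
        }
        // One last call with a null site reports the bases after the final ROD.
        walker.map(null, new Context(intervalEnd - lastVisited));
        return walker.nSites;
    }

    public static void main(String[] args) {
        // Three ROD sites on a 100-base interval.
        System.out.println(countSites(1, 100, new int[]{10, 11, 57}));
    }
}
```

For the interval 1-100 with RODs at 10, 11 and 57, the walker is called four times (three sites plus the final call) and still ends up with a site count of 100.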
A ROD walker can be very efficient compared to a RefWalker in the situation where you have sparse RODs. Here is a comparison of ROD vs. Ref walker implementation of VariantEval:
| Input RODs | RODWalker | RefWalker |
|---|---|---|
| dbSNP and 1KG Pilot 2 SNP calls on chr1 | 164u (s) | 768u (s) |
| Just 1KG Pilot 2 SNP calls on chr1 | 54u (s) | 666u (s) |
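The reason sparse RODs help can be seen by counting map invocations. The sketch below is illustrative only: `MapCallCountDemo` and its methods are hypothetical names, each track is modelled as a plain array of positions assumed to lie inside the interval, and the timings in the table depend on much more than the call count.

```java
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

class MapCallCountDemo {
    /** Sites where at least one track has a record: the union of all track positions. */
    static SortedSet<Integer> sitesWithAnyRecord(List<int[]> tracks) {
        SortedSet<Integer> sites = new TreeSet<>();
        for (int[] track : tracks) {
            for (int pos : track) {
                sites.add(pos);
            }
        }
        return sites;
    }

    /** A reference-style traversal calls map once per base in [start, end]. */
    static long refStyleCalls(int start, int end) {
        return (long) end - start + 1;
    }

    /** A ROD-style traversal calls map once per site with a record, plus one final call per interval. */
    static long rodStyleCalls(List<int[]> tracks) {
        return sitesWithAnyRecord(tracks).size() + 1L;
    }

    public static void main(String[] args) {
        int[] trackA = {100, 250, 900}; // hypothetical positions for one ROD track
        int[] trackB = {250, 400};      // hypothetical positions for a second ROD track
        System.out.println("ref-style calls: " + refStyleCalls(1, 1000));
        System.out.println("rod-style calls: " + rodStyleCalls(List.of(trackA, trackB)));
    }
}
```

Here the two tracks share one position, so the ROD-style traversal makes 5 calls (4 distinct sites plus the final call) against 1000 for the reference-style traversal over the same interval.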
Read walkers walk over the data set one read at a time, presenting all overlapping reference bases and reference-ordered data.
By default, the following filters are automatically added to every read walker.
Read pair walkers walk over a queryname-sorted BAM, presenting each read together with its mate. No reference bases or reference-ordered data are presented.
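Queryname sorting matters because it places mates next to each other, so pairs can be formed in a single pass. The following is a minimal sketch of that idea, not the GATK implementation: `ReadPairDemo`, `Read` and `pairMates` are hypothetical names, and the sketch assumes at most two records per read name and simply skips reads with no adjacent mate.

```java
import java.util.ArrayList;
import java.util.List;

class ReadPairDemo {
    /** Hypothetical minimal read: a query name and an alignment start. */
    static final class Read {
        final String name;
        final int start;
        Read(String name, int start) {
            this.name = name;
            this.start = start;
        }
    }

    /** Pairs adjacent reads that share a name; the input must already be sorted by read name. */
    static List<Read[]> pairMates(List<Read> sortedByName) {
        List<Read[]> pairs = new ArrayList<>();
        int i = 0;
        while (i < sortedByName.size()) {
            Read current = sortedByName.get(i);
            boolean hasMate = i + 1 < sortedByName.size()
                    && sortedByName.get(i + 1).name.equals(current.name);
            if (hasMate) {
                pairs.add(new Read[]{current, sortedByName.get(i + 1)});
                i += 2;
            } else {
                i += 1; // no adjacent mate; this sketch skips the read
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<Read> reads = List.of(
                new Read("a", 10), new Read("a", 200),
                new Read("b", 50),
                new Read("c", 70), new Read("c", 310));
        for (Read[] pair : pairMates(reads)) {
            System.out.println(pair[0].name + ": " + pair[0].start + " / " + pair[1].start);
        }
    }
}
```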
By default, the following filters are automatically added to every read pair walker.
Duplicate walkers walk over a read and all its marked duplicates. No reference bases or reference-ordered data are presented.
By default, the following filters are automatically added to every duplicate walker.
Geraldine Van der Auwera, PhD