Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

Managing user inputs

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,909Administrator, GATK Developer admin
edited October 2012 in Developer Zone

1. Naming walkers

Users identify which GATK walker to run by specifying a walker name via the --analysis_type command-line argument. By default, the GATK will derive the walker name from a walker by taking the name of the walker class and removing packaging information from the start of the name, and removing the trailing text Walker from the end of the name, if it exists. For example, the GATK would, by default, assign the name PrintReads to the walker class org.broadinstitute.sting.gatk.walkers.PrintReadsWalker. To override the default walker name, annotate your walker class with @WalkerName("<my name>").

2. Requiring / allowing primary inputs

Walkers can flag exactly which primary data sources are allowed and required for a given walker. Reads, the reference, and reference-ordered data are currently considered primary data sources. Different traversal types have different default requirements for reads and reference, but currently no traversal types require reference-ordered data by default. You can add requirements to your walker with the @Requires / @Allows annotations as follows:

@Requires(DataSource.READS)
@Requires({DataSource.READS,DataSource.REFERENCE})
@Requires(value={DataSource.READS,DataSource.REFERENCE})
@Requires(value=DataSource.REFERENCE})

By default, all parameters are allowed unless you lock them down with the @Allows attribute. The command:

@Allows(value={DataSource.READS,DataSource.REFERENCE})

will only allow the reads and the reference. Any other primary data sources will cause the system to exit with an error.

Note that as of August 2011, the GATK no longer supports RMD the @Requires and @Allows syntax, as these have moved to the standard @Argument system.

3. Command-line argument tagging

Any command-line argument can be tagged with a comma-separated list of freeform tags.

The syntax for tags is as follows:

-<argument>:<tag1>,<tag2>,<tag3> <argument value>

for example:

-I:tumor <my tumor data>.bam
-eval,VCF yri.trio.chr1.vcf

There is currently no mechanism in the GATK to validate either the number of tags supplied or the content of those tags.

Tags can be accessed from within a walker by calling getToolkit().getTags(argumentValue), where argumentValue is the parsed contents of the command-line argument to inspect.

Applications

The GATK currently has comprehensive support for tags on two built-in argument types:

  • -I,--input_file <input_file>

    Input BAM files and BAM file lists can be tagged with any type. When a BAM file list is tagged, the tag is applied to each listed BAM file.

From within a walker, use the following code to access the supplied tag or tags:

getToolkit().getReaderIDForRead(read).getTags();
  • Input RODs, e.g. `-V ' or '-eval '

    Tags are used to specify ROD name and ROD type. There is currently no support for adding additional tags. See the ROD system documentation for more details.

4. Adding additional command-line arguments

Users can create command-line arguments for walkers by creating public member variables annotated with @Argument in the walker. The @Argument annotation takes a number of differentparameters:

  • fullName

    The full name of this argument. Defaults to the toLowerCase()’d member name. When specifying fullName on the command line, prefix with a double dash (--).

  • shortName

    The alternate, short name for this argument. Defaults to the first letter of the member name. When specifying shortName on the command line, prefix with a single dash (-).

  • doc

    Documentation for this argument. Will appear in help output when a user either requests help with the –-help (-h) argument or when a user specifies an invalid set of arguments. Documentation is the only argument that is always required.

  • required

    Whether the argument is required when used with this walker. Default is required = true.

  • exclusiveOf

    Specifies that this argument is mutually exclusive of another argument in the same walker. Defaults to not mutually exclusive of any other arguments.

  • validation

    Specifies a regular expression used to validate the contents of the command-line argument. If the text provided by the user does not match this regex, the GATK will abort with an error.

By default, all command-line arguments will appear in the help system. To prevent new and debugging arguments from appearing in the help system, you can add the @Hidden tag below the @Argument annotation, hiding it from the help system but allowing users to supply it on the command-line. Please use this functionality sparingly to avoid walkers with hidden command-line options that are required for production use.

Passing Command-Line Arguments

Arguments can be passed to the walker using either the full name or the short name. If passing arguments using the full name, the syntax is −−<arg full name> <value>.

--myint 6

If passing arguments using the short name, the syntax is -<arg short name> <value>. Note that there is a space between the short name and the value:

-m 6

Boolean (class) and boolean (primitive) arguments are a special in that they require no argument. The presence of a boolean indicates true, and its absence indicates false. The following example sets a flag to true.

-B

Supplemental command-line argument annotations

Two additional annotations can influence the behavior of command-line arguments.

  • @Hidden

    Adding this annotation to an @Argument tells the help system to avoid displaying any evidence that this argument exists. This can be used to add additional debugging arguments that aren't suitable for mass consumption.

  • @Deprecated

    Forces the GATK to throw an exception if this argument is supplied on the command-line. This can be used to supply extra documentation to the user as command-line parameters change for walkers that are in flux.

Examples

Create an required int parameter with full name –myint, short name -m. Pass this argument by adding –myint 6 or -m 6 to the command line.

import org.broadinstitute.sting.utils.cmdLine.Argument;
public class HelloWalker extends ReadWalker<Integer,Long> {
    @Argument(doc="my integer")
    public int myInt;

Create an optional float parameter with full name –myFloatingPointArgument, short name -m. Pass this argument by adding –myFloatingPointArgument 2.71 or -m 2.71.

import org.broadinstitute.sting.utils.cmdLine.Argument;
public class HelloWalker extends ReadWalker<Integer,Long> {
    @Argument(fullName="myFloatingPointArgument",doc="a floating point argument",required=false)
    public float myFloat;

The GATK will parse the argument differently depending on the type of the public member variable’s type. Many different argument types are supported, including primitives and their wrappers, arrays, typed and untyped collections, and any type with a String constructor. When the GATK cannot completely infer the type (such as in the case of untyped collections), it will assume that the argument is a String. GATK is aware of concrete implementations of some interfaces and abstract classes. If the argument’s member variable is of type List or Set, the GATK will fill the member variable with a concrete ArrayList or TreeSet, respectively. Maps are not currently supported.

5. Additional argument types: @Input, @Output

Besides @Argument, the GATK provides two additional types for command-line arguments: @Input and @Output. These two inputs are very similar to @Argument but act as flags to indicate dataflow to Queue, our pipeline management software.

  • The @Input tag indicates that the contents of the tagged field represents a file that will be read by the walker.

  • The @Output tag indicates that the contents of the tagged field represents a file that will be written by the walker, for consumption by downstream walkers.

We're still determining the best way to model walker dependencies in our pipeline. As we determine best practices, we'll post them here.

6. Getting access to Reference Ordered Data (RMD) with @Input and RodBinding

As of August 2011, the GATK now provides a clean mechanism for creating walker @Input arguments and using these arguments to access Reference Meta Data provided by the RefMetaDataTracker in the map() call. This mechanism is preferred to the old implicit string-based mechanism, which has been retired.

At a very high level, the new RodBindings provide a handle for a walker to obtain the Feature records from Tribble from a map() call, specific to a command line binding provided by the user. This can be as simple as a single ROD file argument|one-to-one binding between a command line argument and a track, or as complex as an argument argument accepting multiple command line arguments, each with a specific name. The RodBindings are generic and type specific, so you can require users to provide files that emit VariantContexts, BedTables, etc, or simply the root type Feature from Tribble. Critically, the RodBindings interact nicely with the GATKDocs system, so you can provide summary and detailed documentation for each RodBinding accepted by your walker.

A single ROD file argument

Suppose you have a walker that uses a single track of VariantContexts, such as SelectVariants, in its calculation. You declare a standard GATK-style @Input argument in the walker, of type RodBinding<VariantContext>:

@Input(fullName="variant", shortName = "V", doc="Select variants from this VCF file", required=true)
public RodBinding<VariantContext> variants;

This will require the user to provide a command line option --variant:vcf my.vcf to your walker. To get access to your variants, in the map() function you provide the variants variable to the tracker, as in:

Collection<VariantContext> vcs = tracker.getValues(variants, context.getLocation());

which returns all of the VariantContexts in variants that start at context.getLocation(). See RefMetaDataTracker in the javadocs to see the full range of getter routines.

Note that, as with regular tribble tracks, you have to provide the Tribble type of the file as a tag to the argument (:vcf). The system now checks up front that the corresponding Tribble codec produces Features that are type-compatible with the type of the RodBinding<T>.

RodBindings are generic

The RodBinding class is generic, parameterized as RodBinding<T extends Feature>. This T class describes the type of the Feature required by the walker. The best practice for declaring a RodBinding is to choose the most general Feature type that will allow your walker to work. For example, if all you really care about is whether a Feature overlaps the site in map, you can use Feature itself, which supports this, and will allow any Tribble type to be provided, using a RodBinding<Feature>. If you are manipulating VariantContexts, you should declare a RodBinding<VariantContext>, which will restrict automatically the user to providing Tribble types that can create a object consistent with the VariantContext class (a VariantContext itself or subclass).

Note that in multi-argument RodBindings, as List<RodBinding<T>> arg, the system will require all files provided here to provide an object of type T. So List<RodBinding<VariantContext>> arg requires all -arg command line arguments to bind to files that produce VariantContexts.

An argument that can be provided any number of times

The RodBinding system supports the standard @Argument style of allowing a vararg argument by wrapping it in a Java collection. For example, if you want to allow users to provide any number of comp tracks to your walker, simply declare a List<RodBinding<VariantContext>> field:

@Input(fullName="comp", shortName = "comp", doc="Comparison variants from this VCF file", required=true)
public List<RodBinding<VariantContext>> comps;

With this declaration, your walker will accept any number of -comp arguments, as in:

-comp:vcf 1.vcf -comp:vcf 2.vcf -comp:vcf 3.vcf

For such a command line, the comps field would be initialized to the List with three RodBindings, the first binding to 1.vcf, the second to 2.vcf and finally the third to 3.vcf.

Because this is a required argument, at least one -comp must be provided. Vararg @Input RodBindings can be optional, but you should follow proper varargs style to get the best results.

Proper handling of optional arguments

If you want to make a RodBinding optional, you first need to tell the @Input argument that its options (required=false):

@Input(fullName="discordance", required=false)
private RodBinding<VariantContext> discordanceTrack;

The GATK automagically sets this field to the value of the special static constructor method makeUnbound(Class c) to create a special "unbound" RodBinding here. This unbound object is type safe, can be safely passed to the RefMetaDataTracker get methods, and is guaranteed to never return any values. It also returns false when the isBound() method is called.

An example usage of isBound is to conditionally add header lines, as in:

if ( mask.isBound() ) {
    hInfo.add(new VCFFilterHeaderLine(MASK_NAME, "Overlaps a user-input mask"));
}

The case for vararg style RodBindings is slightly different. If you want, as above, users to be able to omit the -comp track entirely, you should initialize the value of the collection to the appropriate emptyList/emptySet in Collections:

@Input(fullName="comp", shortName = "comp", doc="Comparison variants from this VCF file", required=false)
public List<RodBinding<VariantContext>> comps = Collections.emptyList();

which will ensure that comps.isEmpty() is true when no -comp is provided.

Implicit and explicit names for RodBindings

@Input(fullName="variant", shortName = "V", doc="Select variants from this VCF file", required=true)
public RodBinding<VariantContext> variants;

By default, the getName() method in RodBinding returns the fullName of the @Input. This can be overloaded on the command-line by providing not one but two tags. The first tag is interpreted as the name for the binding, and the second as the type. As in:

-variant:vcf foo.vcf     => getName() == "variant"
-variant:foo,vcf foo.vcf => getName() == "foo"

This capability is useful when users need to provide more meaningful names for arguments, especially with variable arguments. For example, in VariantEval, there's a List<RodBinding<VariantContext>> comps, which may be dbsnp, hapmap, etc. This would be declared as:

@Input(fullName="comp", shortName = "comp", doc="Comparison variants from this VCF file", required=true)
public List<RodBinding<VariantContext>> comps;

where a normal command line usage would look like:

-comp:hapmap,vcf hapmap.vcf -comp:omni,vcf omni.vcf -comp:1000g,vcf 1000g.vcf

In the code, you might have a loop that looks like:

for ( final RodBinding comp : comps )
    for ( final VariantContext vc : tracker.getValues(comp, context.getLocation())
        out.printf("%s has a binding at %s%n", comp.getName(), getToolkit().getGenomeLocParser.createGenomeLoc(vc)); 

which would print out lines that included things like:

hapmap has a binding at 1:10
omni has a binding at 1:20
hapmap has a binding at 1:30
1000g has a binding at 1:30

This last example begs the question -- what happens with getName() when explicit names are not provided? The system goes out of its way to provide reasonable names for the variables:

  • The first occurrence is named for the fullName, where comp

  • Subsequent occurrences are postfixed with an integer count, starting at 2, so comp2, comp3, etc.

In the above example, the command line

-comp:vcf hapmap.vcf -comp:vcf omni.vcf -comp:vcf 1000g.vcf

would emit

comp has a binding at 1:10
comp2 has a binding at 1:20
comp has a binding at 1:30
comp3 has a binding at 1:30

Dynamic type resolution

The new RodBinding system supports a simple form of dynamic type resolution. If the input filetype can be specially associated with a single Tribble type (as VCF can), then you can omit the type entirely from the the command-line binding of a RodBinding!

So whereas a full command line would look like:

-comp:hapmap,vcf hapmap.vcf -comp:omni,vcf omni.vcf -comp:1000g,vcf 1000g.vcf

because these are VCF files they could technically be provided as:

-comp:hapmap hapmap.vcf -comp:omni omni.vcf -comp:1000g 1000g.vcf

If you don't care about naming, you can now say:

-comp hapmap.vcf -comp omni.vcf -comp 1000g.vcf

Best practice for documenting a RodBinding

The best practice is simple: use a javadoc style comment above the @Input annotation, with the standard first line summary and subsequent detailed discussion of the meaning of the argument. These are then picked up by the GATKdocs system and added to the standard walker docs, following the standard structure of GATKDocs @Argument docs. Below is a best practice documentation example from SelectVariants, which accepts a required variant track and two optional discordance and concordance tracks.

public class SelectVariants extends RodWalker<Integer, Integer> {
   /**
     * Variants from this file are sent through the filtering and modifying routines as directed
     * by the arguments to SelectVariants, and finally are emitted.
     */
    @Input(fullName="variant", shortName = "V", doc="Select variants from this VCF file", required=true)
    public RodBinding<VariantContext> variants;

    /**
     * A site is considered discordant if there exists some sample in eval that has a non-reference genotype
     * and either the site isn't present in this track, the sample isn't present in this track,
     * or the sample is called reference in this track.
     */
    @Input(fullName="discordance", shortName = "disc", doc="Output variants that were not called in this Feature comparison track", required=false)
    private RodBinding<VariantContext> discordanceTrack;

    /**
     * A site is considered concordant if (1) we are not looking for specific samples and there is a variant called
     * in both variants and concordance tracks or (2) every sample present in eval is present in the concordance
     * track and they have the sample genotype call.
     */
    @Input(fullName="concordance", shortName = "conc", doc="Output variants that were also called in this Feature comparison track", required=false)
    private RodBinding<VariantContext> concordanceTrack;
}

Note how much better the above version is compared to the old pre-Rodbinding syntax (code below). Below you have a required argument variant that doesn't show up as a formal argument in the GATK, different from the conceptually similar @Arguments for discordanceRodName and concordanceRodName, which have no type restrictions. There's no place to document the variant argument as well, so the system is effectively blind to this essential argument.

@Requires(value={},referenceMetaData=@RMD(name="variant", type=VariantContext.class))
public class SelectVariants extends RodWalker<Integer, Integer> {
    @Argument(fullName="discordance", shortName =  "disc", doc="Output variants that were not called on a ROD comparison track. Use -disc ROD_NAME", required=false)
    private String discordanceRodName = "";

    @Argument(fullName="concordance", shortName =  "conc", doc="Output variants that were also called on a ROD comparison track. Use -conc ROD_NAME", required=false)
    private String concordanceRodName = "";
}

RodBinding examples

In these examples, we have declared two RodBindings in the Walker

@Input(fullName="mask", doc="Input ROD mask", required=false)
public RodBinding<Feature> mask = RodBinding.makeUnbound(Feature.class);

@Input(fullName="comp", doc="Comparison track", required=false)
public List<RodBinding<VariantContext>> comps = new ArrayList<VariantContext>();
  • Get the first value

    Feature f = tracker.getFirstValue(mask)

  • Get all of the values at a location

    Collection<Feature> fs = tracker.getValues(mask, thisGenomeLoc)

  • Get all of the features here, regardless of track

    Collection<Feature> fs = tracker.getValues(Feature.class)

  • Determining if an optional RodBinding was provided . if ( mask.isBound() ) // writes out the mask header line, if one was provided hInfo.add(new VCFFilterHeaderLine(MASK_NAME, "Overlaps a user-input mask"));

    if ( ! comps.isEmpty() ) logger.info("At least one comp was provided")

Example usage in Queue scripts

In QScripts when you need to tag a file use the class TaggedFile which extends from java.io.File.

Example in the QScript on the Command Line
Untagged VCF myWalker.variant = new File("my.vcf") -V my.vcf
Tagged VCF myWalker.variant = new TaggedFile("my.vcf", "VCF") -V:VCF my.vcf
Tagged VCF myWalker.variant = new TaggedFile("my.vcf", "VCF,custom=value") -V:VCF,custom=value my.vcf
Labeling a tumor myWalker.input_file :+= new TaggedFile("mytumor.bam", "tumor") -I:tumor mytumor.bam

Notes

No longer need to (or can) use @Requires and @Allows for ROD data. This system is now retired.

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Comments

  • curtishcurtish Posts: 1Member

    I'm using the HaplotypeCaller (GATK 2.5-2-gf57256b) on a project for the first time, and tried to use tags on --input_file:tag bam lists to label my 3 experimental conditions (control, ill, verysick). However, I don't see any evidence of these tags making it through to the resulting VCF where I could then use them in down-stream analyses. Are these tags only available for custom code inside the caller? Have I mis-understood their meaning? or is there something I have to turn on to get them into the VCF? I've searched the FAQ/gatk wiki, etc, but "tag" is a difficult term to search on...

    # tag bam file lists going into HC based on experimental condition (~200 bam/condition)
    java -Xmx22g -Djava.io.tmpdir=/scratch/user/tmp \
      -jar GenomeAnalysisTK-2.5-2-gf57256b/GenomeAnalysisTK.jar \
      -T HaplotypeCaller -R .../ucsc.hg19.fa --dbsnp .../gatk_bundle/2.5/hg19/dbsnp_137.hg19.vcf \
      --intervals probe_regions.bed --interval_padding 0 \
      -o gatk2.5-2-all_intervals_list-raw_snps_indels.vcf \
      --input_file:healthy bams-gatk-reduced-healthy.list \
      --input_file:ill bams-gatk-reduced-ill.list \
      --input_file:verysick bams-gatk-reduced-verysick.list \
      --input_file:hapmap bams-gatk-reduced-hapmap.list 
    

    but when I look at the headers in the VCF, I see only

    grep -n healthy gatk2.5-2-all_intervals_list-raw_snps_indels.vcf
    gatk2.5-2-all_intervals_list-raw_snps_indels.vcf:8:##HaplotypeCaller="analysis_type=HaplotypeCaller input_file=[bams-gatk-reduced-healthy.list, bams-gatk-reduced-ill.list, bams-gatk-reduced-verysick.list, bams-gatk-reduced-hapmap.list] ... tag=NA ...
    

    and using grep, I can find no other occurrence of that tag "healthy" in my VCF, aside from that filenames themselves in the input_file list.

    Thanks, Curtis

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,909Administrator, GATK Developer admin

    Hi Curtis,

    Not all tools in the GATK are set up to use the tags, and currently the HaplotypeCaller makes no use of them. We mostly use the tags for variant evaluation and manipulation downstream of calling, not so much at the bam level. But I agree the functionality you describe would be convenient; we don't have the resources to implement this right now, but if someone wanted to contribute a patch we'd be happy to look at it.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.