We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

How are a list of read groups excluded from discovery and genotyping

jfarrelljfarrell Member ✭✭

If the insert size distribution is not good for a list of read groups, what is the best way to drop them from the discovery and genotyping steps?

Best Answer


  • jfarrelljfarrell Member ✭✭

    Unfortunately, a subset of the bam files to be analyzed do not have group IDs that are unique across the bam files. Each bam file from one of the sequencing centers has RG simply assigned 0, 0.1, and 0.2. Libraries ids are unique though.

    To address the non-unique read situation, it would be preferable to supply a bam file name plus the read group id for exclusion or something similar to the SAMPLE LIBRARY READGROUP list above. Would setting a very large ISIZE be a workaround for this in the discovery step? Any other suggestions?



  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    edited January 2014

    @bhandsaker will correct me if I'm wrong, but I believe you could use the GATK engine's -rf ReadGroupBlackList filter to exclude read groups by library tag, if they're unique.

  • bhandsakerbhandsaker Member, Broadie ✭✭✭✭

    Doesn't -rf have the same problem with uniqueness of read group IDs ?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    It shouldn't because ReadGroupBlackList can filter on specific RG tags, e.g. on SM, LB, PU etc. regardless of ID.

  • bhandsakerbhandsaker Member, Broadie ✭✭✭✭

    Then this should work, but you would have to also use "-disableGATKTraversal false" (the default is true).
    Genome STRiP doesn't use the normal GATK traversal engine by default.

    I also think you would have to modify SVQScript.q to achieve this (in the java interface, we explicitly pass boolean parameter values so that you can have a boolean with a default of true and still override it on the command line, but we currently haven't implemented this for the Queue/Scala interfaces). The advantage of doing this is that you could use any of the GATK read filters.

    As @jfarrell suggested, another option would be to use the isize override and supply a really big size (for example something larger than -maximumSize).

  • jfarrelljfarrell Member ✭✭

    Does ReadGroupBlackList refer to a file that contains the tags (one per line) or is it a list of tags?

    So would the filter be specified as:
    -rf LB:bad_library1 \
    -rf LB:bad_library2 \

    Or like -rf bad_libraries.list

    where bad_libaries.list contains

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Genome STRiP doesn't use the normal GATK traversal engine by default.

    Ah, I was wondering about that. Good to know, thanks Bob.

    Ultimately the best solution for @jfarrell, IMHO, would be to fix the RG IDs (and yell at the provider for their poor data management policies) to avoid any unpleasantness down the road. Otherwise it's going to be workarounds all the way, which is just asking for trouble.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    I realize now the usage of ReadGroupBlackList is really unclear in the docs, will try to fix that. Basically you need to switch on the filtering by passing -rf ReadGroupBlackList, then you actually pass in the parameters using the --read_group_black_list engine argument, which can take either a single LB:bad_library1 value or a .txt file containing a list, one per line.

    So you end up with either

    -rf ReadGroupBlackList \
    --read_group_black_list LB:bad_library1 \
    --read_group_black_list LB:bad_library2


    -rf ReadGroupBlackList \
    --read_group_black_list blacklist.txt

    This is an old implementation and kind of awkward so I'll put it on my to-do list of improvements, but it'll be pretty low priority. In the meantime this should work, let me know if you have any trouble.

  • jfarrelljfarrell Member ✭✭

    Fixing the RG ids sounds like the way to go for the long term. Does UnifiedGenotyper also expect unique RG ids?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    I don't think so -- UG aggregates data by sample (SM tags) and I don't think it looks at ID at all.

  • jfarrelljfarrell Member ✭✭

    What is the trade off for turning GATKTraversal on or off for GenomeSTRiP? Is it just faster to have it off (but at the cost of fewer options)?

  • bhandsakerbhandsaker Member, Broadie ✭✭✭✭

    It is mostly more scalable to big data sets and lots of bam files. LIkely also faster, but scalability is the main point.
    I haven't used GATK traversal mode recently, but in the past we had trouble scaling to process large data sets (like calling all of 1000 Genomes together: 2500 bam files x 4x or about 10000x).

    In non-GATK mode, we can process very large data sets with a 4G java heap. For most of the code, memory usage is independent of input size. There are a couple of exceptions, like the read pair clustering in discovery, which depends on the data (but not directly on the input size). But if you just have 64 bam files, it shouldn't matter either way.

    The other possible downside is that we don't usually run in GATK traversal mode, so it is possible you might run into bugs specific to that code path (but if so we'll try to fix them).

Sign In or Register to comment.