To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

When does IndelRealigner discard reads?

I'm using IndelRealigner on version VN:3.4-46-gbc02625, command line field is:

CL:knownAlleles=[] targetIntervals=/data2/processed/dreamchallenge_set1/synthetic.challenge.set1.tumor.v2/tmp/synthetic.challenge.set1.tumor.v2.target.intervals.list LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null

The (sambamba-produced) flagstat file is very different before and after - about ~18M reads are gone. We don't normally see this, but we also haven't run the DREAM data before. In what situations can IndelRealigner discard reads?

The documentation states that downsampling is not done by this tool by default. I notice the logs of the RealignerTargetCreator show some filters failing and downsampling to 1000 coverage. My assumption is that this should not affect the final result, as it is only to produce the regions to clean.

(I'm aware that in many cases indel re-alignment is no longer recommended).

Tagged:

Issue · Github
by Sheila

Issue Number
1261
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
vdauwera

Best Answer

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi @sambrightman,

    The tools itself doesn't discard any reads, but before the data is provided to the tool, it does get filtered through some read filters by default (MalformedReadFilter and BadCigarFilter as documented here) to prevent the tool from crashing on reads that are defective (and therefore unusable). If that's what's causing your reads to be filtered out, you'll find it indicated in the filtering summary toward the end of the output log.

  • sambrightmansambrightman AmsterdamMember

    All of the filter lines (only those two filters mentioned) say 0 reads failing. The target creator shows some BadMateFilter, DuplicateReadFilter, FailsVendorQualityCheckFilter, MappingQualityZeroFilter but I do expect this to - if anything - reduce the targets. Is that correct?

    What else could go wrong?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
    The filters applied by RTC don't remove any reads since it doesn't write out a BAM, only creates targets, as you surmise. The IndelRealigner logs are what matters here. If they're not showing any reads filtered, then that's not what's happening. And frankly I can't think of another reason you'd be losing reads, unless you're scattering by contig and somehow failing to retrieve one of the contigs' data.
  • sambrightmansambrightman AmsterdamMember

    I only see LocusScatterFunction in the logs. What if a different ref genome was used vs the mapper? I would still expect to see errors I think?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    LocusScatterFunction might be an important clue -- are you running this through Queue or some other pipelining tool? Can you tell if reads are being drop in a particular region, or if they're distributed across the genome?

    If a different reference genome was used (manifested as differences in the sequence dictionary) then GATK would refuse to run at all.

  • sambrightmansambrightman AmsterdamMember
    edited September 2016

    Running through Queue. Trying now with what should be the exact ref genome from the BAM header. Previously version was the same and no complaints from the tool, but thought it worth ruling out. Anything you would suggest for quickly locating the missing reads? Comparing pileups or read counts maybe?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Can you post your Queue command? Was it running on a set of intervals? Maybe the unmapped reads got left out -- there's some special-casing to treat "unmapped" as an honorary contig. You can try searching the bam for reads that have the unmapped sam flag set. That would be the fastest way to test whether my hypothesis is right. If that doesn't work, try DepthOfCoverage, or potentially DiffObjects.

  • sambrightmansambrightman AmsterdamMember

    Both the pre- and post-realign flagstat show % mapped reads - shouldn't the post- version show all mapped if they are dropped?

    The command is:

    java -Xmx4G -Djava.io.tmpdir=/data2/processed/dreamchallenge_set1/synthetic.challenge.set1.tumor.v2/tmp -jar /data/tools/gatk_v3.4-46/Queue.jar -R /data/refgenomes/Homo_sapiens.GRCh37.GATK.illumina/Homo_sapiens.GRCh37.GATK.illumina.fasta -S /data2/data/repos/pipeline-test/QScripts/IndelRealignment.scala -jobQueue all.q \
    -nt 6 -mem 30 -nsc 25 -jobNative " -P hartwig -pe threaded 6 -q all.q -l h_rt=256:0:0" -known /data/dbs/GATK_bundle_v2.8/1000G_phase1.indels.b37.vcf -known /data/dbs/GATK_bundle_v2.8/Mills_and_1000G_gold_standard.indels.b37.vcf \
    -run -I /data2/processed/dreamchallenge_set1/synthetic.challenge.set1.tumor.v2/mapping/synthetic.challenge.set1.tumor.v2.bam -jobRunner GridEngine

    Flagstats:

    [sam@hmf_crunch001 synthetic.challenge.set1.tumor.v2]$ cat ./mapping/synthetic.challenge.set1.tumor.v2.flagstat
    915773200 + 146446776 in total (QC-passed reads + QC-failed reads)
    0 + 0 secondary
    0 + 0 supplementary
    61555124 + 12827792 duplicates
    877109260 + 105964051 mapped (95.78%:72.36%)
    915773200 + 146446776 paired in sequencing
    457886600 + 73223388 read1
    457886600 + 73223388 read2
    858807276 + 98873946 properly paired (93.78%:67.52%)
    864515654 + 100122374 with itself and mate mapped
    12593606 + 5841677 singletons (1.38%:3.99%)
    4546278 + 1133696 with mate mapped to a different chr
    3327598 + 593299 with mate mapped to a different chr (mapQ>=5)

    [sam@hmf_crunch001 synthetic.challenge.set1.tumor.v2]$ cat ./mapping/synthetic.challenge.set1.tumor.v2.realigned.flagstat
    899798136 + 144362646 in total (QC-passed reads + QC-failed reads)
    0 + 0 secondary
    0 + 0 supplementary
    57335128 + 12166920 duplicates
    861997850 + 104146585 mapped (95.80%:72.14%)
    899798136 + 144362646 paired in sequencing
    449896128 + 72180988 read1
    449902008 + 72181658 read2
    844798250 + 97355532 properly paired (93.89%:67.44%)
    850267898 + 98571572 with itself and mate mapped
    11729952 + 5575013 singletons (1.30%:3.86%)
    4391434 + 1108924 with mate mapped to a different chr
    3271810 + 586358 with mate mapped to a different chr (mapQ>=5)

  • sambrightmansambrightman AmsterdamMember
    edited September 2016

    It looks like this was caused by the reference genome difference. The BAM referenced some decoys that were not present in our standard reference genome. IndelRealigner did not refuse to run, and from what I can tell also did not complain - reads were silently dropped. I doubt this is intended behaviour? Maybe I missed the particular message.

    Issue · Github
    by Sheila

    Issue Number
    1295
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    sooheelee
  • shleeshlee CambridgeMember, Broadie, Moderator

    @sambrightman,

    It's hard to decipher what alignments are dropped with your samtools flagstat command. Why don't you use samtools idxstats instead to see for each before and after BAM (i) the list of contigs, (ii) the number of alignments per contig, and (iii) the number of unmapped reads (last output line). Then we can narrow down if IndelRealigner is dropping contigs as you say or if alignments are MIA as you initially proposed.

    Decoys are called decoys for a reason--to siphon off reads that would otherwise cause the aligner to use disproportionate compute trying to find the best alignment match for reads that are uninteresting, e.g. common cloning plasmid backbone sequences or that occur commonly but are not represented in the reference. For such reads, it would make sense not to carry forward alignments on decoys in downstream processes. They are not of interest to typical research aims. That alignments are dropped based on a refined sequence dictionary that drops the decoy contigs, well, this type of feature makes a lot of sense. However, we'll have to look into what is the intended behavior for IndelRealigner to see if what you see (no error/no warn) is intended. To figure out whether what you see is intended, I need some information from you.

    Can you check your BAM header for the lines that start with @SQ? E.g. look like the following:

    @SQ SN:chrM LN:16571
    @SQ SN:chr1 LN:249250621
    @SQ SN:chr2 LN:243199373
    

    Does your input BAM have such lines? If these are present in your BAM, and they mismatch that of the provided reference dictionary, then our tools should error. If this type of reference dictionary information is absent from your BAM, then our tools are meant to use the reference (dictionary) you provide in your command with a warning, e.g. Track doesn't have a sequence dictionary built in, skipping dictionary validation.

  • shleeshlee CambridgeMember, Broadie, Moderator

    @sambrightman,

    One more thing to check. When you run Queue, you may be specifying GATK args within the script that don't show up in the Queue command. One of these may be telling GATK to ignore dictionary validation errors, e.g. -U ALLOW_SEQ_DICT_INCOMPATIBILITY or -U ALL. Do you see these in any of the actual commands run by Queue? I'm told these should be in the log named "FunctionEdge" or something similar.

  • sambrightmansambrightman AmsterdamMember
    edited September 2016

    I do not see any references to -U parameters in the FunctionEdge lines of the log that state the command line.

    samtools idxstats confirms what I said above - the decoy regions have been dropped and the rest are the same. The BAM header contains these, the reference genome does not. What error message should I look for? grep -ie error -e warn -e fail -e dict on the log file shows only lines with 0 Fail.

  • shleeshlee CambridgeMember, Broadie, Moderator
    edited September 2016

    @sambrightman,

    You're saying that the BAM header contains @SQ lines that represent more contigs than your reference genome. As Geraldine says above, this will definitely error our tools:

    If a different reference genome was used (manifested as differences in the sequence dictionary) then GATK would refuse to run at all.

    I've confirmed such a mismatch should error for both RealignerTargetCreator and IndelRealigner for your version of GATK, GenomeAnalysisTK-3.4-46.jar using some of my own data. I am NOT using Queue, but rather using the tools with simple commands. The error messages should look like the following messages.

    [1] For RelalignerTargetCreator

    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR A USER ERROR has occurred (version 3.4-46-gbc02625): 
    ##### ERROR
    ##### ERROR This means that one or more arguments or inputs in your command are incorrect.
    ##### ERROR The error message below tells you what is the problem.
    ##### ERROR
    ##### ERROR If the problem is an invalid argument, please check the online documentation guide
    ##### ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ##### ERROR
    ##### ERROR Visit our website and forum for extensive documentation and answers to 
    ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ##### ERROR
    ##### ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ##### ERROR
    ##### ERROR MESSAGE: Badly formed genome loc: Contig chr19_KI270866v1_alt given as location, but this contig isn't present in the Fasta sequence dictionary
    ##### ERROR ------------------------------------------------------------------------------------------
    

    [2] For IndelRealigner

    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR stack trace 
    org.broadinstitute.gatk.utils.exceptions.ReviewedGATKException: BUG: requested unknown contig=chr19_KI270866v1_alt index=-1
        at org.broadinstitute.gatk.utils.MRUCachingSAMSequenceDictionary.updateCache(MRUCachingSAMSequenceDictionary.java:178)
        at org.broadinstitute.gatk.utils.MRUCachingSAMSequenceDictionary.getSequence(MRUCachingSAMSequenceDictionary.java:109)
        at org.broadinstitute.gatk.utils.GenomeLocParser.validateGenomeLoc(GenomeLocParser.java:284)
        at org.broadinstitute.gatk.utils.GenomeLocParser.createGenomeLoc(GenomeLocParser.java:239)
        at org.broadinstitute.gatk.utils.GenomeLocParser.createGenomeLoc(GenomeLocParser.java:449)
        at org.broadinstitute.gatk.engine.datasources.providers.ReadReferenceView.getReferenceContext(ReadReferenceView.java:98)
        at org.broadinstitute.gatk.engine.traversals.TraverseReadsNano$2.next(TraverseReadsNano.java:139)
        at org.broadinstitute.gatk.engine.traversals.TraverseReadsNano$2.next(TraverseReadsNano.java:128)
        at org.broadinstitute.gatk.engine.traversals.TraverseReadsNano.aggregateMapData(TraverseReadsNano.java:119)
        at org.broadinstitute.gatk.engine.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:101)
        at org.broadinstitute.gatk.engine.traversals.TraverseReadsNano.traverse(TraverseReadsNano.java:56)
        at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:108)
        at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:315)
        at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(CommandLineExecutable.java:121)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
        at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
        at org.broadinstitute.gatk.engine.CommandLineGATK.main(CommandLineGATK.java:106)
    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR A GATK RUNTIME ERROR has occurred (version 3.4-46-gbc02625):
    ##### ERROR
    ##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
    ##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
    ##### ERROR Visit our website and forum for extensive documentation and answers to 
    ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ##### ERROR
    ##### ERROR MESSAGE: BUG: requested unknown contig=chr19_KI270866v1_alt index=-1
    ##### ERROR ------------------------------------------------------------------------------------------
    

    Given we can rule out the tools themselves as the source of the oversight, my guess is that something is up with Queue.

    By any chance, do any of your contigs contain colons : in their names, e.g. as is the case in GRCh38? I ask because specifying intervals of contigs with colons in their names requires special handling for GATK versions prior to v3.6.

  • shleeshlee CambridgeMember, Broadie, Moderator

    Btw, do you know about WDL + Cromwell? This is a new approach to parallelization for running pipelines/workflows. There is an example script here.

  • sambrightmansambrightman AmsterdamMember

    No, there are no hyphen characters in the contig name (there is one hyphen in one UR tag). There are dots in some of them and an underscore in one:

    $ /data/common/tools/samtools_v1.2/samtools view -H /data/dream/set1/downloads/06683320-dbd9-464e-b6f2-143e715b2981/synthetic.challenge.set1.tumor.bam | cut -f2 | grep ^SN
    SN:1
    SN:2
    SN:3
    SN:4
    SN:5
    SN:6
    SN:7
    SN:8
    SN:9
    SN:10
    SN:11
    SN:12
    SN:13
    SN:14
    SN:15
    SN:16
    SN:17
    SN:18
    SN:19
    SN:20
    SN:21
    SN:22
    SN:X
    SN:Y
    SN:MT
    SN:GL000207.1
    SN:GL000226.1
    SN:GL000229.1
    SN:GL000231.1
    SN:GL000210.1
    SN:GL000239.1
    SN:GL000235.1
    SN:GL000201.1
    SN:GL000247.1
    SN:GL000245.1
    SN:GL000197.1
    SN:GL000203.1
    SN:GL000246.1
    SN:GL000249.1
    SN:GL000196.1
    SN:GL000248.1
    SN:GL000244.1
    SN:GL000238.1
    SN:GL000202.1
    SN:GL000234.1
    SN:GL000232.1
    SN:GL000206.1
    SN:GL000240.1
    SN:GL000236.1
    SN:GL000241.1
    SN:GL000243.1
    SN:GL000242.1
    SN:GL000230.1
    SN:GL000237.1
    SN:GL000233.1
    SN:GL000204.1
    SN:GL000198.1
    SN:GL000208.1
    SN:GL000191.1
    SN:GL000227.1
    SN:GL000228.1
    SN:GL000214.1
    SN:GL000221.1
    SN:GL000209.1
    SN:GL000218.1
    SN:GL000220.1
    SN:GL000213.1
    SN:GL000211.1
    SN:GL000199.1
    SN:GL000217.1
    SN:GL000216.1
    SN:GL000215.1
    SN:GL000205.1
    SN:GL000219.1
    SN:GL000224.1
    SN:GL000223.1
    SN:GL000195.1
    SN:GL000212.1
    SN:GL000222.1
    SN:GL000200.1
    SN:GL000193.1
    SN:GL000194.1
    SN:GL000225.1
    SN:GL000192.1
    SN:NC_007605
    

    Taking the command from the logs and running it directly (without Queue/scatter intervals):

    java -jar /data/common/tools/gatk_v3.4.46/GenomeAnalysisTK.jar -T RealignerTargetCreator -I /data/dream/set1/downloads/ae16ceb3-ce31-4648-840c-66f3c5d180a6/synthetic.challenge.set1.normal.bam -R /data/common/refgenomes/Homo_sapiens.GRCh37.GATK.illumina/Homo_sapiens.GRCh37.GATK.illumina.fasta -nt 6 -o /tmp/synthetic.challenge.set1.normal.v2.target.intervals.list -known /data/common/dbs/GATK_bundle_v2.8/1000G_phase1.indels.b37.vcf -known /data/common/dbs/GATK_bundle_v2.8/Mills_and_1000G_gold_standard.indels.b37.vcf
    

    does not show any errors at startup (when I'd hope this would be found). I'm letting it run to see if it causes a problem when the reads that map to those contigs are reached.

  • sambrightmansambrightman AmsterdamMember

    Indeed, eventually it reaches a read with a contig not in the FASTA:

    ##### ERROR MESSAGE: Badly formed genome loc: Contig GL000207.1 given as location, but this contig isn't present in the Fasta sequence dictionary
    

    I can't really see a way to recreate the interval list that LocusScatterFunction would do from the command line, and I've lost the interval files.

  • shleeshlee CambridgeMember, Broadie, Moderator
    edited September 2016

    Now that you are aware that this is an oversight in Queue, you have two choices.

    1. Switch to a reference that represents these dropped contigs
    2. Choose to drop the reads in these contigs

    Since it is apparent this issue you've astutely observed is a bug in Queue, would you mind filing out a bug report using the instructions here? I'm unclear what is the priority in fixing bugs in Queue, but someone in the know should let you know.

    You can find the full hg37 reference here.

    Post edited by shlee on
  • sambrightmansambrightman AmsterdamMember

    I've already switched to the correct reference, that is not a problem anymore.

    Is there a simple way to re-generate the interval list that the scatter created to help track down the issue?

    Issue · Github
    by Sheila

    Issue Number
    1317
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
    No, there's no simple way to do this. You would have to reconstruct them based on the function edge commands in the logs, I think.
  • sambrightmansambrightman AmsterdamMember

    I'm unable to connect to the FTP server due to it being at the maximum number of users (this has been the case for several hours).

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @sambrightman
    Hi,

    Any luck?

    We have heard many reports of this being the case, and we always tell users to try again later. Hopefully the new system will be up and running soon so this won't be an issue anymore.

    -Sheila

  • sambrightmansambrightman AmsterdamMember

    It took several days to get a connection but I've put gatkbug.tar.gz on the FTP server. The standard FASTA should be enough to reproduce. Likewise, issue should not be specific to these BAMs (just any with extra contigs), so not gone to the trouble of uploading BAMs.

    Issue · Github
    by Sheila

    Issue Number
    1362
    State
    open
    Last Updated
    Assignee
    Array
    Milestone
    Array
  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @sambrightman
    Hi,

    Great. Thanks. I will have a look soon.

    -Sheila

Sign In or Register to comment.