Holiday Notice:
The Frontline Support team will be offline February 18 for President's Day but will be back February 19th. Thank you for your patience as we get to all of your questions!

Incomplete output in ApplyRecalibration without generating an ERROR message

FerFer AustriaMember

Dear GATK team,
I've been facing a problem involving incomplete output vcf files while running VariantRecalibrator followed by ApplyRecalibration. However, this misbehavior didn't produce an ERROR message from any of those commands. Instead, I caught it only once I've tried SelectVariants downstream in my pipeline, but the ERROR messages where of two kinds: "track variant is out of coordinate order..." or "The provided VCF file is malformed at approximately line number...". (More details below).

My pipeline consists of three commands:
1) VariantRecalibrator. Doesn't report any ERROR and the recal file seems OK.

java -Djava.io.tmpdir=$mytmp -Xmx232g -jar GenomeAnalysisTK.jar -R $ref -T VariantRecalibrator -input 1245g.vcf -recalFile 1245g.$SNPs.recal -tranchesFile 1245g.$SNPs.tranches -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP -an QD -an InbreedingCoeff -tranche 100.0 -tranche 99.5 -tranche 99.0 -tranche 90.0 -mode SNP -resource:$SNPs,known=true,training=true,truth=true,prior=15 $SNPs.vcf -rscriptFile 1245g.$SNPs.recalibrate_SNP.R -nt 40

2) ApplyRecalibration. Doesn't report any ERROR, but the vcf file is smaller than expected.

java -Djava.io.tmpdir=$mytmp -Xmx232g -jar GenomeAnalysisTK.jar -R $ref -T ApplyRecalibration -input 1245g.vcf --ts_filter_level 99 -tranchesFile 1245g.$SNPs.tranches -recalFile 1245g.$SNPs.recal -mode SNP -o 1245g.$SNPs.vcf -nt 40

3) SelectVariants. Aborts after producing an ERROR.

java -Djava.io.tmpdir=$mytmp -Xmx232g -jar GenomeAnalysisTK.jar -R $ref -T SelectVariants -V 1245g.$SNPs.vcf -selectType SNP -restrictAllelesTo BIALLELIC -o 1245g.$SNPs.BIALLELIC.vcf -nt 40

Examples of ERROR messages from SelectVariants:

##### ERROR MESSAGE: The provided VCF file is malformed at approximately line number 122: 0/0:7,0:7:15:0,15,225 is not a valid start position in the VCF format, for input source: 1245g.$SNPs.vcf

or

##### ERROR MESSAGE: LocationAwareSeekableRODIterator: track variant is out of coordinate order on contig Chr2:14660901 compared to Chr2:14661030

Few important NOTES:

  • GATK version 3.4-45.
  • $SNPs refers to the slightly different sets of training SNPs that I'm testing. For some sets, sometimes it runs all the way through without visible errors, but sometimes it doesn't .
  • The two main variants of ERROR messages can occur even for the same $SNPs training set, in independent runs.
  • I'm dealing with ~12M variants from 1245 individuals. However, worth to mention is that this same pipeline didn't give me any troubles when dealing with ~6M variants from a subset of 242 individuals.
  • Attached is a file with the last two lines of file 1245g.$SNPs.vcf in a failed run: you'll notice that it insists in writing the beginning of a new line in an unfinished one.
  • Before posting this message I even tried a run without multi-threading and the problem persisted.

Regards.

Comments

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Fer
    Hi,

    Can you try running ValidateVariants at each stage of your workflow, starting with the input to VariantRecalibrator? https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_ValidateVariants.php

    Thanks,
    Sheila

  • FerFer AustriaMember

    Hi,
    I did it now. It seems that the input for VariantRecalibrator is ok:
    Successfully validated the input file. Checked 18927461 records with no failures.

    Then, I ran my pipeline (VariantRecalibrator + ApplyRecalibration) for three slightly different training SNPs sets. They are simply subsets of each other: 191k, 181k and 178k SNPs.

    ValidateVariants runs successfully for the the first two VCF files generated by ApplyRecalibration (191k and 181k sets).
    However, the third VCF is malformed:

    ##### ERROR MESSAGE: File 1163g.178k.15..6.95.vcf fails strict validation: The provided VCF file is malformed at approximately line number 4212366: there are 957 genotypes while the header requires that 1163 genotypes be present for all records at Chr1:28179796

    That line is an INDEL, therefore it is absent in the recal file generated by VariantRecalibrator. And all recal files seemed ok as judged by the total number of lines.

    What do you think could be the source of the problem?

    Regards

  • ebirnebirn ViennaMember
    edited November 2015

    Hi,
    I tried to help @Fer debugging the issue, but as of yet we were unsuccessful.
    My background is IT, not biology, but please take a detailed look at the attached sample file. I believe there is something wrong with writing the file output:
    Search for the markers "Chr5" in the file. My understanding is that they should always be at the beginning of the line. However in the attached sample it looks like some lines were garbled.
    Regards, E

    [edit:] clarified, that the "Chr5" marker should be at the beginning of the line (not file)

    Post edited by ebirn on
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Fer @ebirn
    Hi,

    Can you please post the VCF record at position Chr1:28179796?

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @Fer and @ebirn Your comment that

    some lines were garbled.

    suggests that something went wrong during the file creation. Did you try re-running the command that produced that file?

  • ebirnebirn ViennaMember

    Hi,

    @Geraldine_VdAuwera let me state this more precise:
    Yes, we did run it several times, the error is not deterministic. i.e. sometimes it goes wrong, sometimes it works. Also yes, we agree that the problem ist the step producing the file (in the step ApplyRecalibration if I'm not mistaken).

    When I noticed the problem, my first assumption is that the threads writing to the file have a synchronization problem and ist basically starts writing the next line, even though the previous one is not done yet.
    To me this looks like a thread synchronization problem. However, the problem still happens, when omitting the (-nt data threads) option - assuming it only uses one data thread in this case.

    Does that make any sense?
    What can we do to diagnose this problem further, narrowing down the root cause?

    -Erich

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @ebirn
    Hi Erich,

    You can try running ApplyRecalibration to each chromosome and see if it fails on any one particular chromosome.

    -Sheila

  • FerFer AustriaMember

    Hi,
    @Sheila : attached is a file that starts at record Chr1:28179796, and goes until line 4212366, just in case ("The provided VCF file is malformed at approximately line number 4212366...").

    Additionally, I've re-run ApplyRecalibration in Chr1 (the one that supposedly failed) and I got it done without errors.

  • FerFer AustriaMember

    Hi again,
    Little update. I've re-read the previous message addressed to @ebirn (by the way, he is our HPC specialist) more carefully, and run ApplyRecalibration on EACH chromosome separately. All finished successfully as confirmed by ValidateVariants.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Glad to hear it worked! This sounds like it might have been a transient system error -- but let me know if it happens again.

Sign In or Register to comment.