Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Task JointGenotyping.ImportGVCFs failed with message 10: failed to copy files

Task JointGenotyping.ImportGVCFs failed on joint discovery of 34 exome samples:

When I look into the stderr.log file:
08:33:54.012 INFO GenomicsDBImport - Starting batch input file preload
08:33:59.013 INFO GenomicsDBImport - Finished batch preload
08:33:59.013 INFO GenomicsDBImport - Importing batch 1 with 34 samples
terminate called after throwing an instance of 'VCF2TileDBException'
what(): VCF2TileDBException : Incorrect cell order found - cells must be in column major order. Previous cell: [ 20, 664149855 ] current cell: [ 20, 664149855 ].
The most likely cause is unexpected data in the input file:
(a) A VCF file has two lines with the same genomic position
(b) An unsorted CSV file
(c) Malformed VCF file (or malformed index)
See point 2 at: https://github.com/Intel-HLS/GenomicsDB/wiki/Importing-VCF-data-into-GenomicsDB#organizing-your-data

GVCFs were called with haplotypecaller-gvcf-gatk4 (all done with workspace: Germline-SNPs-Indels-GATK4-hg38)

What can I do to fix this error?

Best Answer

Answers

  • klawagklawag Member

    Hi @shlee,

    I have run bcftools norm on all genomic vcf exome samples and indexed using gatk IndexFeatureFile and this solved my problem. Now the method joint-discovery-gatk4 from the workspace Germline-SNPs-Indels-GATK4-hg38 finished without errors.

    I used all methods of "GATK Best Practices for Germline SNPs Indels" to generate these data (Data Pre-processing, HaplotypeCallerGVCF, and Joint Discovery) so I am wondering if bcftools norm should be incorporated in this workflow after calling GVCF?

    I do not know if these calls came from the same sample or from multiple samples, how could I check this?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @klawag
    Hi,

    I just asked the team if this will be fixed in later versions.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @klawag
    Hi again,

    It seems this will not be fixed.

    Reasons from the developer:
    (a) GATK-4 tools generally don't produce VCFs/gVCFs that require this fix - it is strange that you ran into this issue
    (b) It's a one time fix for non-normalized VCFs
    (c) The fix is already in bcftools and we would simply be duplicating functionality/copying code

    -Sheila

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    @klawag, you can grep '20\t664149855' each GVCF to see if any sample gives two records in a row for the site. If so, then check to see if this region is at an interval boundary in the intervals list(s) you are using, if any.

  • Evan_BennEvan_Benn Member

    Hi Shlee,
    I seem to be getting the same error. Do you mind explaining the reasoning in your last post ' is at an interval boundary'. I did not quite understand that. Does Would HaplotypeCaller produce this duplicate line in some case (based on the intervals)?

    I have attached my bam file header, gvcf header (from haplotypecaller), and a stderr from ImportGVCFs.

    This is the fastq reads: https://console.cloud.google.com/storage/browser/genomics-public-data/gatk-examples/example1/NA19913/

    I was using https://github.com/gatk-workflows/gatk4-data-processing and https://github.com/gatk-workflows/gatk4-germline-snps-indels .

    Issue · Github
    by shlee

    Issue Number
    3076
    State
    closed
    Last Updated
    Assignee
    Array
    Closed By
    sooheelee
  • Evan_BennEvan_Benn Member
    edited May 2018

    These are the intervals: https://console.cloud.google.com/storage/browser/genomics-public-data/resources/broad/hg38/v0/scattered_calling_intervals/

    And this is the result of the grep you described:

    chr1    58999897    .   T   <NON_REF>   .   .   END=58999902    GT:DP:GQ:MIN_DP:PL0/0:31:90:31:0,90,1350
    chr1    58999900    .   T   <NON_REF>   .   .   END=58999902    GT:DP:GQ:MIN_DP:PL0/0:31:90:31:0,90,1350
    chr6    158999902   .   G   <NON_REF>   .   .   END=158999902   GT:DP:GQ:MIN_DP:PL0/0:8:24:8:0,24,278
    
    Post edited by shlee on
  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭
    edited May 2018

    Hi @Evan_Benn,

    I think this is a bug and would you mind submitting a bug report that allows us to recapitulate this? The expected behavior is for your first GVCF block to END=58999900. @Sheila can tell us the current protocol for bug report submission, given your data is in the cloud. I believe all that is needed is that you point us to the files to recapitulate the bug and for you to share your bucket with us.

    Let me now address your questions. In general, we can only speak hypothetically on best-guess possibilities if we cannot see the full picture, i.e. know the version of the toolkit being used and see what is going on in the actual data. So thanks for sharing all these files.

    Section 4 of this article explains how HaplotypeCaller can expand active regions around a putative variant site up to that defined by --max-assembly-region-size, the default of which is 300 bases in GATK4.

    However, it seems what is happening for you pertains to only the first interval list's last interval:

    chr1    29553836    58999999
    

    The ERC calls you post give two overlapping GVCF blocks, both ending at END=58999902, that pertain to this interval, i.e. the same shard run. HaplotypeCaller sees two variants, 3 bp apart, but mistakes the GVCF block boundaries. That is why you get the error:

      what():  VCF2TileDBException : Incorrect cell order found - cells must be in column major order. Previous cell: [ 0, 58999902 ] current cell: [ 0, 58999902 ].
    The most likely cause is unexpected data in the input file:
    (a) A VCF file has two lines with the same genomic position
    (b) An unsorted CSV file
    (c) Malformed VCF file (or malformed index)
    See point 2 at: https://github.com/Intel-HLS/GenomicsDB/wiki/Importing-VCF-data-into-GenomicsDB#organizing-your-data
    

    when running GenomicsDBImport. From the stdout, we see you are using GenomicsDBImport from GATK4.0.2.1. From the GVCF header, we see Version=4.0.2.1-SNAPSHOT for the version of HaplotypeCaller you use. I do not see any changes to HaplotypeCaller since this version that may address this bug. So we'd be grateful if you can share your data to help us recapitulate it so we can fix it.

  • Evan_BennEvan_Benn Member

    Thanks Shlee for the quick and detailed response. To clarify I am using the 'best practice pipeline' from github, but I am running it on my own infrastructure, with some modifications to make that work. Should I post a bug to the GATK github, or the best practice pipeline github?

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Thanks for clarifying you are not running the pipeline as is @Evan_Benn.

    Is there something in your changes that you suspect is causing this seeming bug, e.g. did you change the HaplotypeCaller command itself? We'll have to get the details of how you are running things, e.g. the alternative infrastructure you mention. At the least, this will help us identify a pattern if other researchers use the same approach and get the same error but other approaches do not encounter the error. Knowing how you run things will also help us assess whether looking into it is within the scope of our support. It would be best if you could recapitulate the error using the recommended infrastructure and unmodified scientific workflow. Better yet, can you recapitulate it in the simplest use case, on the commandline for the given interval on your data? This type of evidence will receive immediate attention.

    If you describe everything in this thread, within a post, then it is easy for us to create a GitHub issue ticket containing what you've written, directly from this page and to post the issue ticket to the appropriate repository. Then, other researchers can easily find the relevant information and this will help their research. Also, we'd like for our developers to be able to focus on their development work and only look into bugs if we (those of us on the Communications team) have confirmed that we can recapitulate them. I do appreciate your offer to post the bug to GATK GitHub or the best practice pipeline GitHub directly. @Sheila is actually looking into evolving our bug processing protocol, so I wanted to make sure we adhere to her preferences.

  • Evan_BennEvan_Benn Member

    Thanks Shlee, I do not think any of my changes would modify the output of the tools. I will attempt to de-cromwell the runs, and post a new thread with a single shell script that reproduces the bug.

  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Thanks @Evan_Benn for looking into this. It's alright to post to this same thread or a new thread. Your choice.

Sign In or Register to comment.