Hi GATK Users,

Happy Thanksgiving!
Our staff will be observing the holiday and will be unavailable from 22nd to 25th November. This will cause a delay in reaching out to you and answering your questions immediately. Rest assured we will get back to it on Monday November 26th. We are grateful for your support and patience.
Have a great holiday everyone!!!

Regards
GATK Staff

GATK4 beta2 GenotypeGVCFs produces VCF with no records, just a header

WimSWimS Member
edited July 2017 in Ask the GATK team

Hi,

GATK4 beta2 GenotypeGVCFs produces a VCF with no records on my test data. The file does contain a valid VCF header.

The commands that I used for GenomicsDBImport and GenotypeGVCFs are below. Both were run on the GATK4 beta2 jar.

gatk-launch --javaOptions '-Xms500m -Xmx28665m -XX:+UseSerialGC -Djava.io.tmpdir=/data/run/tmpKVSMhn' GenomicsDBImport -V DA_123_01.vcf.gz -V DA_123_02.vcf.gz -V DA_123_03.vcf.gz -V DA_123_04.vcf.gz -V DA_123_05.vcf.gz -V DA_123_06.vcf.gz -V DA_123_07.vcf.gz -V DA_123_08.vcf.gz --genomicsDBWorkspace DEV_1066_Chr_01 --intervals Chr_01
gatk-launch --javaOptions '-Xms500m -Xmx28665m -XX:+UseSerialGC -Djava.io.tmpdir=/data/run/tmpKVSMhn' GenotypeGVCFs -R ./my_species.fa -V gendb://DEV_1066_Chr_01 -G StandardAnnotation -newQual -O DEV_1066_Chr_01.vcf.gz

The output of both tools does not show anything strang. The GenotypeGVCFs tool finishes with these statements:

17:07:24.554 INFO  ProgressMeter -      Chr_01:38007262             12.4              17243000        1394831.5
17:07:34.561 INFO  ProgressMeter -      Chr_01:38743870             12.5              17590000        1403959.7
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),500.6120602850972,Cpu time(s),450.9598250043053
17:07:37.965 INFO  ProgressMeter -      Chr_01:39060683             12.6              17742934        1409782.4
17:07:37.965 INFO  ProgressMeter - Traversal complete. Processed 17742934 total variants in 12.6 minutes.
17:07:37.975 WARN  IntelDeflaterFactory - IntelDeflater is not supported, using Java.util.zip.Deflater
17:07:37.976 INFO  GenotypeGVCFs - Shutting down engine
[July 20, 2017 5:07:37 PM CEST] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 12.61 minutes.
Runtime.totalMemory()=506986496

The gVCF files were produced with GATK4 beta1. Is there a way that this issue can be solved or that I can further debug this issue?

Thank you.

Best Answer

Answers

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    Hi @WimS,

    You are saying the VCF from GenotypeGVCFs contains no records but does have a valid VCF header. The questions to answer are

    [1] If we use GATK3.7, does this dataset produce calls after GenotypeGVCFs?
    [2] Does the GenomicsDBImport step produce an empty database?
    [3] Are the tools handling .vcf.gz files correctly. I ask this last question because if we look at Geraldine's example commands on this page, we see she uses .vcf files as both input and output and not .vcf.gz. There are some differences in how Picard handles either file type so I'm just wondering.

  • WimSWimS Member

    Hi @shlee . Thank you for the trouble shooting help.

    [1] I think I can't use GATK3.7 because we don't have a license.
    [2] Both importing vcf and vcf.gz produces a database directory of ca 690M. See the file import_vcf.file_sizes.txt in the attached zip.
    [3] I tested importing and exporting plain vcf and vcf.gz files. Both gave the same result: a vcf or a vcf.gz file with just a header.

    [4] I ran validatevcf on the plain vcf files. These show the gvcf files are valid.

    In the attached zip file you can find the output logs of GenomicsDBImport and GenotypeGVCFs for importing and exporting vcf and vcf.gz files.

    Also the output vcf and vcf.gz file is included. And the validatevcf logs are included. Last thing that is included is the directory size for the databases and the file sizes in those databases.

    I hopes this information is useful for giving further advice on how to fix or debug this issue.

    The location of the GATK jar in the logs might look confusing gatk4-4.0b1-0/gatk-package-4.beta.2-local.jar
    This is because I created the gVCF files with GATK4 beta1 using bcbio. Then I ran into the null pointer exception, for which I needed gatk4 beta2. The GATK4 beta2 launch script and GATK4 beta2 jar I put in the gatk4-4.0b1-0 directory of bcbio.

    Thank you.

    Issue · Github
    by shlee

    Issue Number
    2323
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    chandrans
  • shleeshlee CambridgeMember, Broadie, Moderator admin

    @WimS,

    I'm going to consult the developers. In the meanwhile, would you mind sharing snippets of the original input GVCFs so we can recapitulate the error on our end? Instruction for a bug report are in Article#1894.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @WimS It's okay for you to use 3.7 without a license if we ask you to do so for debugging purposes.

    I'm a little concerned to see this line in the logs:

    11:14:13.646 WARN  IntelGKLUtils - Error starting process to check for AVX support : grep -i avx /proc/cpuinfo
    

    Will ask our Intel collaborators if they know what might be going on there.

    Aside from that nothing jumps out at me. Would you be able to share a snippet of data that reproduces the error so we can troubleshoot this locally? Instructions are here: https://software.broadinstitute.org/gatk/guide/article?id=1894

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Oh and our consulting engineer suggests running this command to extract the contents of the db:

    ./gatk-launch SelectVariants -V gendb://DEV_1066_Chr_01 -O combined.g.vcf -R <your reference fasta>
    
  • WimSWimS Member
    edited July 2017

    Hi @slee and @Geraldine_VdAuwera . Thank you for the advice.

    I checked a couple of things:
    [1] Running the GenomicsDBImport and GenotypeGVCFs GATK4 tools on a more modern machine with AVX support did not change the outcome, still VCF file with just a header. So I don't think lack of AVX support is causing the issue.

    [2] Running GATK4 SelectVariants on the GenomicsDB produced a multi-sample VCF for the 8 samples and the Chr_01 chromosome. All genotypes are missing (./.) in this output. I am not sure if this is expected. Two example records from the SelectVariants output:

    Chr_01  321     .       G       <NON_REF>       .       .       .       GT:DP:GQ:MIN_DP:PL      ./.:16:45:16:0,45,675   ./.:10:27:9:0,27,261    ./.:13:30:13:0,30,450   ./.:17:36:15:0,36,540   ./.:16:42:15
    :0,42,630   ./.:11:30:11:0,30,450   ./.:4:12:4:0,12,100     ./.:15:42:14:0,42,415
    Chr_01  322     .       G       T,<NON_REF>     .       .       BaseQRankSum=-3.010e-01;ClippingRankSum=0.00;DP=95;ExcessHet=3.01;MQ0=0;MQRankSum=0.457;RAW_MQ=148981.00;ReadPosRankSum=0.619   GT:AD:DP:GQ:
    MCL ./.:6,7,0:13:99:0.00,0.00       ./.:0,9,0:9:27:0.00,0.00        ./.:3,5,0:8:99:0.00,0.00        ./.:0,7,0:7:21:0.00,0.00        ./.:8,7,0:15:99:0.00,0.00       ./.:4,7,0:11:99:0.00,0.00       ./.:0,3,
    0:3:9:0.00,0.00 ./.:5,9,0:14:99:0.00,0.00
    

    [3] GATK3.7 GenotypeGVCFs does produce a multi-sample VCF file with variant records and sample genotypes.
    I do notice 1 strange thing which I don't know if it is related to this issue. It looks like 1 sample was counted twice for the AN and AC attribute. This happens for many variants. For example 1 variant shows:

    Chr_01  45  .   T   C   107.06  .   AC=3;AF=0.125;AN=18;BaseQRankSum=-1.020e+00;ClippingRankSum=0.00;DP=95;ExcessHet=4.8226;FS=0.000;MLEAC=5;MLEAF=0.278;MQ=47.14;MQ0=0;MQRankSum=-9.670e-01;QD=3.24;ReadPosRankSum=-2.125e+00;SOR=0.465    GT:AD:DP:GQ:MCL:MMQ:PL  
    0/0:10,0:10:30:.:.:0,30,362
    0/0:11,1:12:4:0,0,0:40,27,0:0,4,425
    0/0:11,0:11:0:.:.:0,0,330
    0/0:11,1:12:4:0,0,0:40,27,0:0,4,446
    0/1:8,2:10:19:0,0,0:37,31,0:19,0,277
    0/0:9,0:9:0:.:.:0,0,262 
    0/0:6,0:6:0:.:.:0,0,140 
    0/1:9,4:13:99:0,0,0:40,31,0:105,0,329
    

    AN = 18 while there are 8 samples in the VCF (max AN should be 8*2 = 16)
    AC = 3 while there are only 2 samples with a 0/1 genotype and 6 with 0/0 genotype, AC should be 2.
    Selecting all samples with bcftools view calculates the corrrect AC and AN value for the variant records.

    [4] I am re-running the entire analysis including gVCF file creation with GATK4 beta2 to see if the issue reproduces with GATK4 beta2. For this I am using public data so that I can share all results in a bug report.

    Post edited by WimS on
  • WimSWimS Member
    edited July 2017

    Hi @slee and @Geraldine_VdAuwera. I could reproduce the issue using GATK4 beta2 on a small set of public data .

    I am ready to send you the data so you can hopefully reproduce and solve the issue. Using the guidelines from here
    https://software.broadinstitute.org/gatk/documentation/article.php?id=1894

    I can't connect to your ftp.broadinstitute.org. Can you confirm your FTP server is up and running? Then I known it's some kind of network issue at my side.

    Thank you.

    Post edited by WimS on
  • sleeslee Member, Broadie, Dev ✭✭

    @WimS I think you tagged me by mistake---tagging @shlee back in!

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    Yes @WimS, the server is up and running. Other members have successfully submitted bug reports.

  • WimSWimS Member

    Hi @shlee . Upload the to ftp worked. The zip file is named GATK4.beta2_GenotypeGVCFs_output_header_only.zip

    It includes

    • 3 small input gVCF files
    • the database directory created by GenomicsDBImport
    • the header only output vcf file created by GenotypeGVCFs
    • text file with the commands used for GenomicsDBImport and GenotypeGVCFs
    • log files containing the output of GenomicsDBImport and GenotypeGVCFs
    • the reference genome used (fasta, .fai, and .dict )

    I hope this is enough and useful for diagnosing and fixing the issue.

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    Great @WimS, @Geraldine_VdAuwera will get back to you after she gets to your bug report. This might be next week.

  • WimSWimS Member
    edited July 2017

    Ok next week is fine. Thank you for diagnosing this issue. I look forward to your analysis/fix of the issue.

  • WimSWimS Member
    edited August 2017
    Hi @Geraldine_VdAuwera . I am wondering if you already had a chance to look at the data / bug report. If you need more information or data from our side please let us know. Thank you.
  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @WimS
    Hi,

    We are looking into it and will get back to you soon.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @WimS
    Hi,

    We had some difficulties replicating your issue. However, there was another case like yours reported here. You can keep track of the bug in that thread.

    -Sheila

  • WimSWimS Member
    edited August 2017 Accepted Answer

    Updating the expected number of attributes for MCL and MMQ to R in all gVCF files solved the issue as recommended in the github issue. https://gatkforums.broadinstitute.org/gatk/discussion/comment/41338#Comment_41338

    Old header lines

    ##FORMAT=<ID=MCL,Number=A,Type=Float,Description="number of soft- and hard- clipped bases">
    ##FORMAT=<ID=MMQ,Number=A,Type=Float,Description="median mapping quality">
    

    New header lines that fix the issue.

    ##FORMAT=<ID=MCL,Number=R,Type=Float,Description="number of soft- and hard- clipped bases">
    ##FORMAT=<ID=MMQ,Number=R,Type=Float,Description="median mapping quality">
    

    After this pull request if merged the next GATK release probably won't have this issue.
    https://github.com/broadinstitute/gatk/pull/3490

Sign In or Register to comment.