Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Attention:
We will be out of the office on November 11th and 13th 2019, due to the U.S. holiday(Veteran's day) and due to a team event(Nov 13th). We will return to monitoring the GATK forum on November 12th and 14th respectively. Thank you for your patience.

GenotypeGVCFs exits before Traversal is complete

fangli08fangli08 CHOPMember

Hi,

I am running gatk-4.1.0.0 in CentOS 6.3. My java version is jdk1.8.0_152

I split the human genome into 40 intervals, each of which is < 100 Mb. The boundaries of the intervals are gap regions of the ref genome. The variants were called separately for each interval.

I followed the GATK best practice for germline SNP/Indels. First I used HaplotypeCaller to generate GVCF files for each sample. And then I used GenomicsDBImport to import the GVCF files to the genomicsdb.
The GenomicsDBImport was successful because I can see the following information:

20:24:25.849 INFO  ProgressMeter - Traversal complete. Processed 10 total batches in 2633.3 minutes.
20:24:25.849 INFO  GenomicsDBImport - Import of all batches to GenomicsDB completed!

While I am running GenotypeGVCFs, it always exits before the traversal is complete. I tried increasing the memory to 48G (-Xmx48g -Xms48g) but didn't help.

For example, I can see the following information:

21:06:38.038 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/public/users/xieshangqian/fangli/software/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
21:07:10.501 INFO  GenotypeGVCFs - ------------------------------------------------------------
21:07:10.501 INFO  GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.1.0.0
21:07:10.501 INFO  GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
21:07:10.502 INFO  GenotypeGVCFs - Executing as [email protected] on Linux v2.6.32-279.14.1.el6.x86_64 amd64
21:07:10.502 INFO  GenotypeGVCFs - Java runtime: Java HotSpot(TM) 64-Bit Server VM v1.8.0_152-b16
21:07:10.502 INFO  GenotypeGVCFs - Start Date/Time: June 18, 2019 9:06:37 PM CST
21:07:10.502 INFO  GenotypeGVCFs - ------------------------------------------------------------
21:07:10.502 INFO  GenotypeGVCFs - ------------------------------------------------------------
21:07:10.503 INFO  GenotypeGVCFs - HTSJDK Version: 2.18.2
21:07:10.503 INFO  GenotypeGVCFs - Picard Version: 2.18.25
21:07:10.503 INFO  GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
21:07:10.503 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
21:07:10.503 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
21:07:10.503 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
21:07:10.503 INFO  GenotypeGVCFs - Deflater: IntelDeflater
21:07:10.503 INFO  GenotypeGVCFs - Inflater: IntelInflater
21:07:10.503 INFO  GenotypeGVCFs - GCS max retries/reopens: 20
21:07:10.503 INFO  GenotypeGVCFs - Requester pays: disabled
21:07:10.503 INFO  GenotypeGVCFs - Initializing engine
WARNING: No valid combination operation found for INFO field DB - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field DB - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
21:07:28.713 INFO  GenotypeGVCFs - Done initializing engine
21:07:28.768 INFO  ProgressMeter - Starting traversal
21:07:28.769 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
WARNING: No valid combination operation found for INFO field DB - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field DS - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
WARNING: No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
21:13:19.041 INFO  ProgressMeter -           X:76704691              5.8                  1000            171.3
21:13:52.734 INFO  ProgressMeter -           X:76705691              6.4                  2000            312.5
21:14:42.171 INFO  ProgressMeter -           X:76706691              7.2                  3000            415.3
21:15:31.340 INFO  ProgressMeter -           X:76707691              8.0                  4000            497.3
21:16:28.727 INFO  ProgressMeter -           X:76708691              9.0                  5000            555.6
21:16:47.125 INFO  ProgressMeter -           X:76709691              9.3                  6000            644.7
21:17:31.225 INFO  ProgressMeter -           X:76711691             10.0                  8000            796.7
21:18:18.273 INFO  ProgressMeter -           X:76712691             10.8                  9000            831.4
21:18:28.852 INFO  ProgressMeter -           X:76713691             11.0                 10000            909.0
21:18:42.161 INFO  ProgressMeter -           X:76714691             11.2                 11000            980.1
21:18:53.288 INFO  ProgressMeter -           X:76717691             11.4                 14000           1227.1
21:19:29.089 INFO  ProgressMeter -           X:76721691             12.0                 18000           1499.3
21:19:53.928 INFO  ProgressMeter -           X:76727691             12.4                 24000           1932.5
21:20:37.732 INFO  ProgressMeter -           X:76739691             13.1                 36000           2737.8
21:20:55.007 INFO  ProgressMeter -           X:76750691             13.4                 47000           3497.7
21:22:40.732 INFO  ProgressMeter -           X:76774691             15.2                 71000           4671.2

......
......


09:06:20.738 INFO  ProgressMeter -           X:90915023            718.9              14204000          19758.9
09:06:35.564 INFO  ProgressMeter -           X:90918023            719.1              14207000          19756.3
09:06:51.435 INFO  ProgressMeter -           X:90920023            719.4              14209000          19751.8
09:07:03.766 INFO  ProgressMeter -           X:90921023            719.6              14210000          19747.5
09:08:05.525 INFO  ProgressMeter -           X:90932023            720.6              14221000          19734.6
09:08:18.631 INFO  ProgressMeter -           X:90949023            720.8              14238000          19752.2
09:08:28.639 INFO  ProgressMeter -           X:90955023            721.0              14244000          19756.0
Using GATK jar /public/users/xieshangqian/fangli/software/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx48g -Xms48g -jar /public/users/xieshangqian/fangli/software/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar GenotypeGVCFs -R /public/users/xieshangqian/fangli/db/hg19/hs37d5.fa -V gendb:///public/users/xieshangqian/fangli/analysis_ngs/all_gatk_vcf/joint_genotyping/genomicsdb_reg20 -O /public/users/xieshangqian/fangli/analysis_ngs/all_gatk_vcf/joint_genotyping/NGS.reg20.raw.vcf.gz

The last position it processed is X:90955023, however, the interval is X:76653693-155270560, which means a large portion of variants were not processed. I can't see these variants in the vcf.gz file, either.

If I run the same command again, GenotypeGVCFs may exit in a different position. For example, I ran it again and got the following information:

......

04:08:36.664 INFO  ProgressMeter -           X:89888970            809.8              13178000          16273.5
04:08:51.137 INFO  ProgressMeter -           X:89903970            810.0              13193000          16287.1
04:09:03.381 INFO  ProgressMeter -           X:89912970            810.2              13202000          16294.2
04:09:49.453 INFO  ProgressMeter -           X:89924970            811.0              13214000          16293.5
04:10:02.710 INFO  ProgressMeter -           X:89927970            811.2              13217000          16292.8
04:10:12.716 INFO  ProgressMeter -           X:89936970            811.4              13226000          16300.5
04:10:28.162 INFO  ProgressMeter -           X:89951970            811.6              13241000          16313.8
Using GATK jar /public/users/xieshangqian/fangli/software/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx4g -Xms4g -jar /public/users/xieshangqian/fangli/software/gatk-4.1.0.0/gatk-package-4.1.0.0-local.jar GenotypeGVCFs -R /public/users/xieshangqian/fangli/db/hg19/hs37d5.fa -V gendb:///public/users/xieshangqian/fangli/analysis_ngs/all_gatk_vcf/joint_genotyping/genomicsdb_reg20 -O /public/users/xieshangqian/fangli/analysis_ngs/all_gatk_vcf/joint_genotyping/NGS.reg20.raw.vcf.gz


The last position is X:89951970, which is different from the previous one. The vcf.gz file size is also different.

For a few intervals, the GenotypeGVCFs step succeeded, and I can see the following information:

   ProgressMeter - Traversal complete. Processed 24096952 total variants in 1819.9 minutes.

My command is:

###  GenomicsDBImport
/public/users/xieshangqian/fangli/software/gatk-4.1.0.0/gatk --java-options "-Xmx4g -Xms4g" GenomicsDBImport -L X:76653693-155270560 --genomicsdb-workspace-path /public/users/xieshangqian/fangli/analysis_ngs/all_gatk_vcf/joint_genotyping/genomicsdb_reg20 --batch-size 10  -V sample1.gatk_raw.g.vcf.gz -V sample2.gatk_raw.g.vcf.gz -V sample3.gatk_raw.g.vcf.gz ...


###  GenotypeGVCFs
/public/users/xieshangqian/fangli/software/gatk-4.1.0.0/gatk --java-options "-Xmx48g -Xms48g" GenotypeGVCFs -R /public/users/xieshangqian/fangli/db/hg19/hs37d5.fa -V gendb:///public/users/xieshangqian/fangli/analysis_ngs/all_gatk_vcf/joint_genotyping/genomicsdb_reg20  -O /public/users/xieshangqian/fangli/analysis_ngs/all_gatk_vcf/joint_genotyping/NGS.reg20.raw.vcf.gz

Looking forward to your help!

Thank you!
Li

Answers

  • fangli08fangli08 CHOPMember
    edited June 19

    To avoid possible confusion, let me explain a little bit more about my analysis.

    There are 100 samples and 40 intervals. I generated a GVCF file of each sample and each interval using HaplotypeCaller. Thus, in total, it's 4000 GVCF files.

    For each interval, I ran GenomicsDBImport to import the GVCF files of the 100 samples. So I have 40 folders, each of which is a genomicsdb of one interval. And then I ran GenotypeGVCFs for each genomicsdb.

    Best,
    Li

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @fangli08

    That is weird, let me check with the dev team and get back to you.

  • fangli08fangli08 CHOPMember

    Thank you! By the way, is there any way to examine if the GVCF files are not corrupted?

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
    edited June 24

    Hi @fangli08

    Can you please confirm that this the last line in stdout "09:08:28.639 INFO ProgressMeter - X:90955023 721.0 14244000 19756.0"
    What is the last record(locus) in the output file from HaplotypeCaller?
    Is GenotypeGVCFs generating an output file at all?

    PS: Checkout Terra for end-to-end GATK pipelining solutions and let us know what more pipelines we can add that will make using GATK easier for you! For more details on whether this is the right fit for you checkout our blog page.

    Post edited by bhanuGandham on
  • fangli08fangli08 CHOPMember

    Hi,

    If I unzip this file, I have this error:

    $ gunzip -c NGS.reg20.raw.vcf.gz > NGS.reg20.raw.vcf   
    
    gzip: NGS.reg20.raw.vcf.gz: unexpected end of file
    

    But I can still see the last line of the NGS.reg20.raw.vcf :

    X   90955368    .   C   T   1394.58 .   AC=4;AF=0.020;AN=200;BaseQRankSum=1.24;DP=2408;ExcessHet=0.0660;FS=2.901;InbreedingCoeff=0.4793;MLEAC=4;MLEAF=0.020;MQ=60.00;MQRankSum=0.00;QD=16.60;ReadPosRankSum=0.639;SOR=0.841 GT:AD:DP:GQ:PL  0/0:35,0:35:96:0,96,1440    0/0:33,0:33:90:0,90,1350    0/0:32,0:32:54:0,54,1150    0/0:29,0:29:87:0,87,1039    0/0:37,0:37:99:0,101,1314   0/0:13,0:13:39:0,39,473 0/0:24,0:24:72:0,72,915 0/0:13,0:13:36:0,36,540 0/0:29,0:29:87:0,87,1120    0/0:34,0:34:82:0,82,1277    0/0:30,0:30:84:0,84,1260    0/0:32,0:32:66:0,66,1146    0/0:20,0:20:57:0,57,855 0/0:12,0:12:27:0,27,405 0/0:32,0:32:96:0,96,1236    0/0:21,0:21:60:0,60,900 0/0:25,0:25:72:0,72,954 0/1:14,16:30:99:495,0,473   0/0:19,0:19:54:0,54,810 0/0:33,0:33:99:0,99,1090    0/0:14,0:14:42:0,42,559 0/0:33,0:33:96:0,96,1440    0/0:16,0:16:29:0,29,523 0/0:15,0:15:45:0,45,525 0/0:33,0:33:90:0,90,1244    0/0:28,0:28:81:0,81,945 0/0:14,0:14:42:0,42,479 0/0:30,0:30:84:0,84,1260    0/0:16,0:16:48:0,48,597 0/0:33,0:33:99:0,99,1158    0/0:24,0:24:69:0,69,1035    0/0:14,0:14:36:0,36,540 0/0:16,0:16:48:0,48,606 0/0:31,0:31:90:0,90,1350    0/0:41,0:41:99:0,103,1518   0/0:25,0:25:72:0,72,946 0/0:20,0:20:60:0,60,745 0/0:16,0:16:48:0,48,615 0/0:11,0:11:33:0,33,430 0/0:24,0:24:72:0,72,934 0/0:21,0:21:60:0,60,900 0/0:20,0:20:43:0,43,745 0/0:30,0:30:90:0,90,999 0/0:14,0:14:24:0,24,486 0/0:40,0:40:99:0,102,1182   0/0:28,0:28:84:0,84,1103    0/0:34,0:34:99:0,99,1336    0/0:26,0:26:72:0,72,1080    0/0:33,0:33:99:0,99,1403    0/0:20,0:20:60:0,60,703 0/0:22,0:22:66:0,66,742 0/0:20,0:20:57:0,57,855 0/0:27,0:27:81:0,81,1017    0/0:12,0:12:30:0,30,450 0/0:32,0:32:90:0,90,1350    0/0:37,0:37:99:0,105,1312   0/0:24,0:24:72:0,72,893 0/0:13,0:13:36:0,36,540 0/0:30,0:30:81:0,81,1215    0/0:13,0:13:39:0,39,447 0/0:33,0:33:99:0,99,1149    1/1:0,15:15:45:599,45,0 0/0:30,0:30:90:0,90,1150    0/0:20,0:20:60:0,60,780 0/0:31,0:31:90:0,90,1091    0/0:13,0:13:39:0,39,443 0/0:10,0:10:30:0,30,376 0/1:27,12:39:99:363,0,881   0/0:26,0:26:75:0,75,846 0/0:15,0:15:42:0,42,630 0/0:24,0:24:72:0,72,1006    0/0:12,0:12:36:0,36,389 0/0:11,0:11:33:0,33,405 0/0:29,0:29:81:0,81,1215    0/0:4,0:4:12:0,12,134   0/0:17,0:17:51:0,51,673 0/0:11,0:11:33:0,33,427 0/0:23,0:23:66:0,66,859 0/0:17,0:17:45:0,45,675 0/0:16,0:16:20:0,20,595 0/0:27,0:27:81:0,81,921 0/0:34,0:34:99:0,99,1232    0/0:34,0:34:99:0,102,1117   0/0:33,0:33:99:0,99,1207    0/0:27,0:27:60:0,60,913 0/0:19,0:19:54:0,54,810 0/0:24,0:24:72:0,72,814 0/0:26,0:26:78:0,78,1094    0/0:31,0:31:93:0,93,1231    0/0:33,0:33:99:0,99,1125    0/0:37,0:37:99:0,108,1383   0/0:33,0:33:99:0,99,1115    0/0:17,0:17:51:0,51,617 0/0:11,0:11:33:0,33,385 0/0:28,0:28:81:0,81,1215    0/0:20,0:20:60:0,60,650 0/0:46,0:46:99:0,120,1800   0/0:16,0:16:45:0,45,675 0/0:15,0:15:39:0,39,585 0/0:18,0:18:54:0,54,603
    X   90955406    .   TA  T,TAA   1666.97 .   AC=18,13;AF=0.090,0.065;AN=200;BaseQRankSum=-1.080e-01;DP=2289;ExcessHet=1.6884;FS=0.000;InbreedingCoeff=0.0497;MLEAC=16,12;MLEAF=0.080,0.060;MQ=58.34;MQRankSum=0.474;QD=2.85;ReadPosRankSum=0.00;SOR=0.672    GT:AD:DP:GQ:PGT:PID:PL:PS   0/1:19,3,0:22:13:.:.:13,0,420,70,429,498    0/0:30,0,0:30:47:.:.:0,47,1066,47,1066,1066 0/0:29,0,0:29:78:.:.:0,78,1170,78,1170,1170 0/1:14,4,2:23:68:.:.:68,0,318,84,274,447    0/2:2
    
  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @fangli08

    Is GenotypeGVCFs generating an output file at all?
    Can you validate you raw vcf and gvcf by running ValidateVariants . Looks like your input raw vcf or gvcf generated by Haplotypecaller may have an issue.
    Also please post the exact Haplotypecaller command you used.

  • fangli08fangli08 CHOPMember
    edited June 26

    Hi,
    Both Haplotypecaller and GenotypeGVCFs generated output files. I have checked all the GVCF files generated by Haplotypecaller. They all look good. The last variant of each GVCF file generated by Haplotypecaller is > X:155270000
    The raw vcf file generated by GenotypeGVCFs is corrupted. As I described above, the last variant is X 90955406.

    My command for Haplotypecaller is:


    time /public/users/xieshangqian/fangli/software/gatk-4.1.0.0/gatk --java-options "-Xmx4g" HaplotypeCaller --input /public/users/xieshangqian/fangli/analysis/batch8/NGS0100/NGS0100.bwamem.sorted.merged.markdup.BQSR.cram --output /public/users/xieshangqian/fangli/analysis/batch8/NGS0100/gatk_gvcf_files/NGS0100.bwamem.sorted.merged.markdup.BQSR.cram.reg07.gatk_raw.g.vcf --reference /public/users/xieshangqian/fangli/db/hg19/hs37d5.fa --dbsnp /public/users/xieshangqian/fangli/db/dbSNP/build_152/dbSNP.build_152.normal_name.vcf.gz --intervals X:76653693-155270560 --emit-ref-confidence GVCF time /public/software/bcbio-xiao/anaconda/bin/bgzip /public/users/xieshangqian/fangli/analysis/batch8/NGS0100/gatk_gvcf_files/NGS0100.bwamem.sorted.merged.markdup.BQSR.cram.reg07.gatk_raw.g.vcf time /public/software/bcbio-xiao/anaconda/bin/tabix /public/users/xieshangqian/fangli/analysis/batch8/NGS0100/gatk_gvcf_files/NGS0100.bwamem.sorted.merged.markdup.BQSR.cram.reg07.gatk_raw.g.vcf.gz

    Best,
    Li

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @fangli08

    I checked with the dev team and this is what they had to say:

    This is likely a memory problem, but not in the java code. GenomicsDBImport and GenotypeGVCFs interface with 3rd-party C code (GenomicsDB) that has its own memory allocation. We have a few immediate suggestions for helping with this problem:
    1) Run on a machine with a lot of memory. If running on a cluster, make sure that your job has sufficient memory allocated to it.
    2) Do not allocate excessive memory to the java code (that takes available memory from the C code).
    3) Try running with
    --batch-size 50
    . This will use a bit more memory during GenomicsDBImport, but may decrease the memory needed during GenotypeGVCFs.

    Also: It would be helpful if the user could supply the VCFs that caused this problem (i.e. the VCFS corresponding to X:76653693-155270560). We can pass those on to the developers and get them to make the software more robust and improve error messages.
    Please follow instructions provided here for sharing the bug report: https://software.broadinstitute.org/gatk/guide/article?id=1894

Sign In or Register to comment.