Using GenomicsDBImport with many samples and a few thousand intervals: can I restart the run?

Hernan (Melbourne, Member)

Hello,
I'm running the GATK4 pipeline to call SNPs in a capture-seq dataset of 500 individuals and 9,500 genomic intervals. I successfully produced GVCF files with HaplotypeCaller. Before combining them with GenomicsDBImport, I first tested 10 intervals with 50 individuals. That test completed in a few minutes, so I was hoping the entire run with 500 individuals and 9,500 genomic intervals would take 2-3 days. I'm using the following command:

${GATKLoc} --java-options "-Xmx80g -Xms8g" \
    GenomicsDBImport \
    --genomicsdb-workspace-path ${SAMPLES}_database \
    --consolidate \
    --batch-size 88 \
    -L $INTERVAL \
    --sample-name-map ${SAMPLES}_list.sample_map \
    --tmp-dir=${TMPdir}

I was wrong. It has been a week and it is still not finished.

I now realise I should probably have split the work into batches, each covering a subset of the intervals.
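For reference, the kind of batching I have in mind could be sketched roughly as below. This is only a sketch under assumptions: the helper names `split_intervals` and `print_import_commands`, the `chunk_NNNN.list` file naming, and the idea of giving each chunk its own new workspace directory are all mine and not something the GATK documentation has confirmed for this case. It also assumes the interval file has one interval per line. The second helper only prints the commands, reusing the options from my command above, so they can be inspected before anything is run.

```shell
#!/usr/bin/env bash

# Hypothetical helper: split an interval list (one interval per line) into
# numbered chunk files holding at most <chunk-size> intervals each.
# Usage: split_intervals <interval-file> <chunk-size> <output-dir>
split_intervals() {
    local intervals=$1 chunk_size=$2 outdir=$3
    mkdir -p "$outdir"
    awk -v n="$chunk_size" -v dir="$outdir" '
        NF == 0 { next }                 # skip blank lines
        count % n == 0 {                 # time to start a new chunk file
            if (out != "") close(out)
            out = sprintf("%s/chunk_%04d.list", dir, count / n)
        }
        { print > out; count++ }
    ' "$intervals"
}

# Hypothetical helper: print (do not run) one GenomicsDBImport command per
# chunk file. Assumption: every chunk writes to its own new workspace.
# Usage: print_import_commands <chunk-dir> <workspace-prefix> <sample-map> <tmp-dir>
print_import_commands() {
    local chunk_dir=$1 prefix=$2 sample_map=$3 tmp_dir=$4 chunk name
    for chunk in "$chunk_dir"/chunk_*.list; do
        name=$(basename "$chunk" .list)
        printf '%s\n' "${GATKLoc:-gatk} --java-options \"-Xmx80g -Xms8g\" GenomicsDBImport --genomicsdb-workspace-path ${prefix}_${name} --consolidate --batch-size 88 -L ${chunk} --sample-name-map ${sample_map} --tmp-dir=${tmp_dir}"
    done
}
```

For example, `split_intervals "$INTERVAL" 500 chunks` followed by `print_import_commands chunks "${SAMPLES}_database" "${SAMPLES}_list.sample_map" "$TMPdir"` would print 19 commands for 9,500 intervals. The chunk size of 500 is arbitrary.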

So my questions are: Can I interrupt the run now and restart the unfinished intervals, working in batches this time? How can I find out which intervals are already finished? Should I put the new runs in the same database folder?

The screen output is not informative, it only shows a huge list of these lines:

11:10:44.477 INFO GenomicsDBImport - Importing batch 3 with 88 samples
11:11:14.452 INFO GenomicsDBImport - Importing batch 3 with 88 samples
11:11:42.312 INFO GenomicsDBImport - Importing batch 3 with 88 samples
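One thing that might make this output more useful is tallying how many of these lines have been logged for each batch number. The sketch below does only that. The function name `batch_line_counts` is made up, and it assumes the screen output has been saved to a file. How to read the counts is also an assumption: if one line is written per interval per batch (which I have not confirmed), the count for the current batch compared with the 9,500 intervals would give a rough idea of progress.

```shell
#!/usr/bin/env bash

# Hypothetical helper: count the "Importing batch N" log lines per batch number.
# Prints "<batch-number> <line-count>", sorted by batch number.
# Usage: batch_line_counts <log-file>
batch_line_counts() {
    awk '/GenomicsDBImport - Importing batch/ {
             # the batch number is the field right after the word "batch"
             for (i = 1; i < NF; i++) if ($i == "batch") seen[$(i + 1)]++
         }
         END { for (b in seen) print b, seen[b] }' "$1" | sort -n
}
```

With a batch size of 88 and 500 samples there should be 6 batches in total (5 full batches of 88 plus one of 60), so the highest batch number in the tally also shows how far along the sample batches are.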

Also, ALL intervals already have their own folder inside the database folder, but I don't know how to tell which ones are finished and which are not. For example, this is the listing for the first interval:

ls f2_samples_database/Contig2\$1\$7033/
__90d72b7d-46fb-40a5-9734-0fb7ee64cbb7140370416006912_1541705482603 __bb96df53-3662-4fb7-9d6d-59b906d673a0140370416006912_1541429490911 genomicsdb_meta_dir
__array_schema.tdb __c9eae372-1963-4631-abe7-222589fb2a88140370416006912_1541149600860

And this is the output for the last interval:

ls f2_samples_database/Contig388217\$1\$15477/
__50096ca9-52c9-4606-82a7-514133a73dd9140370416006912_1541429461943 __a3423e6f-4321-4c62-a487-085b8d903774140370416006912_1541705455158 __array_schema.tdb genomicsdb_meta_dir
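To compare all interval folders at once rather than listing them one by one, something like the sketch below could count the `__<id>_<timestamp>` entries in each folder (3 in the first listing above, 2 in the last). The function name `fragment_counts` is made up, and the code only counts entries. Whether that count corresponds to the number of batches already imported for an interval is a guess I have not verified, especially with `--consolidate` switched on.

```shell
#!/usr/bin/env bash

# Hypothetical helper: for every interval folder in a workspace, print the
# folder name and how many entries start with "__", not counting
# __array_schema.tdb. Output is "<folder-name><TAB><count>".
# Usage: fragment_counts <workspace-dir>
fragment_counts() {
    local workspace=$1 dir entry count
    for dir in "$workspace"/*/; do
        # keep only folders that look like the interval folders listed above
        [ -d "${dir}genomicsdb_meta_dir" ] || continue
        count=0
        for entry in "$dir"__*; do
            [ -e "$entry" ] || continue      # the glob matched nothing
            [ "$(basename "$entry")" = "__array_schema.tdb" ] && continue
            count=$((count + 1))
        done
        printf '%s\t%s\n' "$(basename "$dir")" "$count"
    done
}
```

Running `fragment_counts f2_samples_database | cut -f2 | sort -n | uniq -c` would then summarise how many folders have each count.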

Thanks!

H
