Meaning of -nt and -nct

hrbigelow (San Francisco; Posts: 6; Member)
edited September 2013 in Ask the GATK team

Hi,

According to http://www.broadinstitute.org/gatk/guide/article?id=1975:

There are two options for multi-threading with the GATK, controlled by the arguments -nt and -nct, respectively, which can be combined:

-nt / --num_threads controls the number of data threads sent to the processor
-nct / --num_cpu_threads_per_data_thread controls the number of CPU threads allocated to each data thread

Setup:
RHEL5, 144 GB memory, 12 cores (Intel 2.8 GHz)

~/src/jre1.7.0_40/bin/java -Xmx64g -Xms32g -d64 -jar /apps/gau/GATK_versions/GATKLite-2.1/GenomeAnalysisTKLite.jar -nt 8 -nct 6 -L chr8:90000001-120000000 -rbs 10000000 -T UnifiedGenotyper -rf BadCigar -R /dev/shm/CEUref.hg19.fasta -glm BOTH -D /dev/shm/dbsnp_135.hg19.reordered.vcf -metrics test.metrics.txt -stand_call_conf 30.0 -stand_emit_conf 10.0 -dcov 1000 --max_alternate_alleles 10 -A AlleleBalance -A AlleleBalanceBySample -A BaseCounts -A BaseQualityRankSumTest -A DepthOfCoverage -A DepthPerAlleleBySample -A FisherStrand -A HaplotypeScore -A HardyWeinberg -A IndelType -A LowMQ -A MappingQualityRankSumTest -A MappingQualityZero -A MappingQualityZeroBySample -A MappingQualityZeroFraction -A QualByDepth -A ReadPosRankSumTest -A RMSMappingQuality -A SampleList -o chunk55.vcf -I ./AC2181ACXX_DS-124072_GAGTGG_L006_001.markdup.fixed.left.recal.rehead.bam -I ./AC2181ACXX_DS-124113_GTCCGC_L001_001.markdup.fixed.left.recal.rehead.bam -I ./AC2181ACXX_DS-124080_ATTCCT_L008_001.markdup.fixed.left.recal.rehead.bam -I ./AD23GUACXX_DS-124122_AGTCAA_L005_001.markdup.fixed.left.recal.rehead.bam...

(with 116 bam files)

/usr/sbin/lsof -p <java process ID> | wc -l
returns 728 open files. (This turns out to be the 116 BAM files, each opened 8 times.)
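
A quick way to check how many times each BAM file is held open is to tally the NAME column of the lsof output. The following is only a sketch: the sample text, PID, user name and paths are made up for illustration, and it assumes the default lsof column layout with the file path as the last whitespace-separated field.

```python
from collections import Counter


def count_bam_handles(lsof_text):
    """Count how often each .bam path appears in `lsof -p <pid>` output."""
    counts = Counter()
    for line in lsof_text.splitlines():
        fields = line.split()
        # NAME is the last column; paths containing spaces are not handled.
        if fields and fields[-1].endswith(".bam"):
            counts[fields[-1]] += 1
    return counts


# Hypothetical lsof output: two BAM files opened twice each, plus a reference.
sample = """\
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 4242 user 10r REG 8,1 1024 111 /data/a.bam
java 4242 user 11r REG 8,1 1024 111 /data/a.bam
java 4242 user 12r REG 8,1 2048 112 /data/b.bam
java 4242 user 13r REG 8,1 2048 112 /data/b.bam
java 4242 user 14r REG 8,1 512 113 /data/ref.fasta
"""

handles = count_bam_handles(sample)
print(len(handles), "distinct BAM files,", sum(handles.values()), "open handles")
```

On a real run, the text would come from the lsof command above rather than from the sample string.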

When run with the following settings, I see some strange messages:

-nt 12 -nct 1
INFO 16:57:41,908 SAMDataSource - Running in asynchronous I/O mode; number of threads = 11

-nt 12 -nct 2
INFO 16:58:46,246 SAMDataSource - Running in asynchronous I/O mode; number of threads = 10
INFO 16:58:46,763 MicroScheduler - Running the GATK in parallel mode with 2 concurrent threads

and so on, until:

-nt 12 -nct 11
INFO 17:00:07,673 SAMDataSource - Running in asynchronous I/O mode; number of threads = 1
INFO 17:00:08,288 MicroScheduler - Running the GATK in parallel mode with 11 concurrent threads

-nt 12 -nct 13
ERROR MESSAGE: Invalid thread allocation. User requested 12 threads in total, but the count of cpu threads (13) is higher than the total threads

If -nt is the 'number of data threads', then why does SAMDataSource report <nt> - <nct> as the number of 'threads'?

If -nct is the 'number of CPU threads per data thread', then the total number of CPU threads running should really be <nt> * <nct>. Instead, it seems to be <nt> - <nct>, which makes no sense according to the definitions.
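
To make the mismatch concrete, here is a small sketch that models only what the log lines above report for GATKLite 2.1 and compares it with what the documented definitions would imply. The function names are my own, and this is not a description of GATK's actual scheduler; the case nct == nt is not covered by the logs shown, so the model's value there is an extrapolation.

```python
def observed_threads(nt, nct):
    """Thread counts as reported in the quoted log lines (an empirical model)."""
    if nct > nt:
        # Mirrors the "Invalid thread allocation" error seen with -nt 12 -nct 13.
        raise ValueError(
            f"requested {nt} threads in total, but cpu threads ({nct}) is higher"
        )
    return {
        "io_threads": nt - nct,  # SAMDataSource "number of threads"
        "cpu_threads": nct,      # MicroScheduler "concurrent threads"
    }


def documented_total(nt, nct):
    """Total CPU threads if -nct really meant 'CPU threads per data thread'."""
    return nt * nct


for nct in (1, 2, 11):
    seen = observed_threads(12, nct)
    print(f"-nt 12 -nct {nct}: observed {seen}, "
          f"documented total {documented_total(12, nct)}")
```

In this model -nt behaves like a cap on the total thread count that -nct is subtracted from, rather than a multiplier.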

http://www.broadinstitute.org/gatk/guide/article?id=1975 states:

Memory considerations for multi-threading
Each data thread needs to be given the full amount of memory you’d normally give a single run. So if you’re running a tool that normally requires 2 Gb of memory to run, if you use -nt 4, the multithreaded run will use 8 Gb of memory
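
The arithmetic in that passage can be sketched as follows. This only restates the rule quoted from the documentation; the function names are hypothetical, and the per-thread figure for my own run is an inference from that rule, not something GATK reports.

```python
def multithreaded_memory_gb(single_run_gb, nt):
    """Total memory implied by the quoted rule: one full allocation per data thread."""
    return single_run_gb * nt


def per_data_thread_gb(heap_gb, nt):
    """Share of a fixed heap available to each data thread under the same rule."""
    return heap_gb / nt


print(multithreaded_memory_gb(2, 4))  # the documentation's example: 8
print(per_data_thread_gb(64, 8))      # -Xmx64g with -nt 8, as in the command above
```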

In any case, all other software I'm familiar with has no notion of a 'data thread', and it seems unnecessary and wasteful -- one simply specifies the inputs, and chooses a number of CPU threads, and the program handles the rest, without reading the same input multiple times.

Thanks,

Henry


Comments

  • Geraldine_VdAuwera (Posts: 6,973; Administrator, GATK Developer)

    Hey Henry, I just noticed from your other post that you are using an older version of GATK. Can you try this with version 2.7 and let me know if you see the same pattern with different settings of -nt and -nct?

    Geraldine Van der Auwera, PhD
