Bug Bulletin: The GenomeLocPArser error in SplitNCigarReads has been fixed; if you encounter it, use the latest nightly build.

Meaning of -nt and -nct

hrbigelowhrbigelow San FranciscoPosts: 6Member
edited September 2013 in Ask the GATK team


According to http://www.broadinstitute.org/gatk/guide/article?id=1975:

There are two options for multi-threading with the GATK, controlled by the arguments -nt and -nct, respectively, which can be combined:

-nt / --num_threads controls the number of data threads sent to the processor
-nct / --num_cpu_threads_per_data_thread controls the number of CPU threads allocated to each data thread

RHEL5, 144 GB memory, 12 cores (Intel 2.8 GHz)

~/src/jre1.7.0_40/bin/java -Xmx64g -Xms32g -d64 -jar /apps/gau/GATK_versions/GATKLite-2.1/GenomeAnalysisTKLite.jar -nt 8 -nct 6 -L chr8:90000001-120000000 -rbs 10000000 -T UnifiedGenotyper -rf BadCigar -R /dev/shm/CEUref.hg19.fasta -glm BOTH -D /dev/shm/dbsnp_135.hg19.reordered.vcf -metrics test.metrics.txt -stand_call_conf 30.0 -stand_emit_conf 10.0 -dcov 1000 --max_alternate_alleles 10 -A AlleleBalance -A AlleleBalanceBySample -A BaseCounts -A BaseQualityRankSumTest -A DepthOfCoverage -A DepthPerAlleleBySample -A FisherStrand -A HaplotypeScore -A HardyWeinberg -A IndelType -A LowMQ -A MappingQualityRankSumTest -A MappingQualityZero -A MappingQualityZeroBySample -A MappingQualityZeroFraction -A QualByDepth -A ReadPosRankSumTest -A RMSMappingQuality -A SampleList -o chunk55.vcf -I ./AC2181ACXX_DS-124072_GAGTGG_L006_001.markdup.fixed.left.recal.rehead.bam -I ./AC2181ACXX_DS-124113_GTCCGC_L001_001.markdup.fixed.left.recal.rehead.bam -I ./AC2181ACXX_DS-124080_ATTCCT_L008_001.markdup.fixed.left.recal.rehead.bam -I ./AD23GUACXX_DS-124122_AGTCAA_L005_001.markdup.fixed.left.recal.rehead.bam...

(with 116 bam files)

/usr/sbin/lsof -p <java process ID> | wc -l
returns 728 open files. (This turns out to be the 116 BAM files opened each of 8 times).

When run with the following settings, I see some strange messages:

-nt 12 -nct 1
INFO 16:57:41,908 SAMDataSource - Running in asynchronous I/O mode; number of threads = 11

-nt 12 -nct 2
INFO 16:58:46,246 SAMDataSource - Running in asynchronous I/O mode; number of threads = 10
INFO 16:58:46,763 MicroScheduler - Running the GATK in parallel mode with 2 concurrent threads

and so on, until:

-nt 12 -nct 11
INFO 17:00:07,673 SAMDataSource - Running in asynchronous I/O mode; number of threads = 1
INFO 17:00:08,288 MicroScheduler - Running the GATK in parallel mode with 11 concurrent threads

-nt 12 -nct 13
ERROR MESSAGE: Invalid thread allocation. User requested 12 threads in total, but the count of cpu threads (13) is higher than the total threads

If -nt is the 'number of data threads', then why does SAMDataSource report <nt> - <nct> as the number of 'threads'?

If -nct is the 'number of CPU threads per data thread', then the total number of CPU threads running should really be <nt> * <nct>. Instead, it seems to be <nt> - <nct>, which makes no sense according to the definitions.

http://www.broadinstitute.org/gatk/guide/article?id=1975 states:

Memory considerations for multi-threading Each data thread needs to be given the full amount of memory you’d normally give a single run. So if you’re running a tool that normally requires 2 Gb of memory to run, if you use -nt 4, the multithreaded run will use 8 Gb of memory

In any case, all other software I'm familiar with has no notion of a 'data thread', and it seems unnecessary and wasteful -- one simply specifies the inputs, and chooses a number of CPU threads, and the program handles the rest, without reading the same input multiple times.



Post edited by hrbigelow on


Sign In or Register to comment.