Large number of pread system calls on the reference sequence FASTA file
Dear GATK Developers,
Thank you for maintaining such a valuable tool.
I noticed a performance bottleneck with newer GATK versions (I am currently on the latest version, v3.5-0-g36282e4, Compiled 2015/11/25 04:03:56): simple analysis types such as SelectVariants or VariantsToTable run for days when they previously completed in less than an hour. I should add that I am using bgzip-compressed and tabix-indexed VCF files throughout the analysis, and that I see the performance problem on our cluster when running several processes in parallel. Most Java processes show a rather low percentage of CPU consumption, while our cluster file system maxes out at around 100,000 read requests per second. Running 'strace' on an active Java thread, I noticed that millions of "pread" system calls of 183 bytes each seem to be issued against the reference sequence *.fasta file.
[ ~]$ strace -p 28395
Process 28395 attached - interrupt to quit
pread(11, "TGGTAAAATGTGGTTGGATGAAGCGTACGCTT"..., 183, 46038724) = 183
pread(11, "TAGACCATTCTATCAAAATGCTCTTTCTACAG"..., 183, 46038907) = 183
...
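For anyone who wants to check whether they see the same pattern, here is a minimal Python sketch that summarizes "pread" lines from captured strace output. The function name `summarize_preads` and the regular expression are my own assumptions for illustration, not part of GATK or strace, and the pattern only covers simple lines like the ones shown above.

```python
import re

# Matches strace lines such as:
#   pread(11, "TGGTAAAA"..., 183, 46038724) = 183
# Assumption: the quoted buffer contains no escaped quotes (true for plain DNA).
PREAD_RE = re.compile(
    r'pread(?:64)?\((\d+), ".*?"(?:\.\.\.)?, (\d+), (\d+)\)\s*=\s*(-?\d+)'
)


def summarize_preads(text):
    """Summarize pread calls found in captured strace output."""
    calls = [
        (int(fd), int(size), int(offset), int(ret))
        for fd, size, offset, ret in PREAD_RE.findall(text)
    ]
    # A call is "back-to-back" when it starts where the previous call on the
    # same file descriptor ended (previous offset + bytes returned).
    contiguous = all(
        prev[2] + prev[3] == cur[2]
        for prev, cur in zip(calls, calls[1:])
        if prev[0] == cur[0]
    )
    return {
        "calls": len(calls),
        "bytes_read": sum(ret for _, _, _, ret in calls if ret > 0),
        "sizes": sorted({size for _, size, _, _ in calls}),
        "contiguous": contiguous,
    }
```

Applied to the two calls above, the offsets are back-to-back (46038724 + 183 = 46038907), which would be consistent with the reference being read sequentially in very small pieces, although two lines are of course not enough to be sure.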
The question is, what am I doing wrong? Alternatively, are you aware of any changes to how the reference sequence FASTA file is read in recent GATK versions, especially in the context of bgzip-compressed and tabix-indexed VCF files? Are you aware that so many read requests are issued? I tried going back to version v3.3-0-g37228af (Compiled 2014/10/24 01:07:22), and although the "pread" requests seem to cover larger regions there, the run time of a single process seems similar. I would have to go back further, but you may be quicker in doing so.
At the suggestion of our system administrator, I managed to work around the performance bottleneck by copying the reference sequence *.fasta, *.fasta.fai and *.dict files into the Linux shared memory tmpfs file system (/dev/shm/). As an example, VariantsToTable run times have fallen from 5 hours to just 31 seconds (13h vs 74s per 1M sites), and SelectVariants from 5 hours to 18 minutes (26m vs 87s per 1M sites). In case I'm not the only one seeing this, would it make sense to read (and keep in memory) larger chunks of the reference sequence?
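In case it helps others, the workaround can be sketched in Python roughly as follows. This is only an illustration under assumptions: the helper name `stage_reference` is hypothetical, and I assume the usual layout of ref.fasta, ref.fasta.fai and ref.dict sitting side by side. Keep in mind that anything in /dev/shm occupies RAM on the node.

```python
import shutil
from pathlib import Path


def stage_reference(fasta, target_dir="/dev/shm"):
    """Copy a FASTA with its .fai index and .dict file into target_dir.

    Returns the path of the staged FASTA.
    """
    fasta = Path(fasta)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    # Assumed layout: ref.fasta, ref.fasta.fai and ref.dict in one directory.
    companions = [
        fasta,
        Path(str(fasta) + ".fai"),
        fasta.with_suffix(".dict"),
    ]
    missing = [p for p in companions if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"missing reference files: {missing}")

    for src in companions:
        # copy2 also copies modification times along with the file contents.
        shutil.copy2(src, target / src.name)
    return target / fasta.name
```

The returned path would then be passed to GATK via -R in place of the copy on the cluster file system, and the staged files should be removed once the jobs on that node have finished.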
Command lines, log files etc. would be available, of course.