
Large number of pread system calls on the reference sequence FASTA file

mkschuster · CeMM, Vienna, Austria · Member

Dear GATK Developers,

Thank you for maintaining such a valuable tool.

I noticed a performance bottleneck with newer GATK versions (currently on the latest, v3.5-0-g36282e4, compiled 2015/11/25 04:03:56) where even simple analysis types such as SelectVariants or VariantsToTable run for days when they previously completed in less than an hour. I should add that I am using bgzip-compressed and tabix-indexed VCF files throughout the analysis, and that I see the problem on our cluster when running several processes in parallel. Most Java processes show a rather low percentage of CPU consumption, while our cluster file system maxes out at around 100,000 read requests per second. Running 'strace' on an active Java thread, I noticed that millions of 183-byte "pread" system calls appear to be issued against the reference sequence *.fasta file.

  [ ~]$ strace -p 28395
  Process 28395 attached - interrupt to quit
  pread(11, "TGGTAAAATGTGGTTGGATGAAGCGTACGCTT"..., 183, 46038724) = 183
  pread(11, "TAGACCATTCTATCAAAATGCTCTTTCTACAG"..., 183, 46038907) = 183
  ...
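
(For anyone reproducing this: file descriptor 11 can be confirmed via /proc, and strace's counting mode summarizes the calls. PID 28395 is taken from the trace above, so substitute your own.)

  [ ~]$ ls -l /proc/28395/fd/11   # symlink should point at the reference *.fasta
  [ ~]$ strace -c -p 28395        # per-syscall counts, printed when you interrupt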

The question is: what am I doing wrong? Alternatively, are you aware of any changes to how the reference sequence FASTA file is read in recent GATK versions, especially in the context of bgzip-compressed and tabix-indexed VCF files? Are you aware that this many read requests are issued? I tried going back to version v3.3-0-g37228af, compiled 2014/10/24 01:07:22, and although the "pread" requests there seem to cover larger regions, the run time of a single process is similar. I would have to go back further, but you may be quicker in doing so.

At the suggestion of our system administrator, I managed to overcome the performance bottleneck by copying the reference sequence *.fasta, *.fasta.fai and *.dict files into the Linux shared-memory tmpfs file system (/dev/shm/). As an example, VariantsToTable run times have fallen from 5 hours to just 31 seconds (13h vs 74s per 1M sites), and SelectVariants from 5 hours to 18 minutes (26m vs 87s per 1M sites). In case I'm not the only one seeing this, would it make sense to read (and keep in memory) larger chunks of the reference sequence?
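
To make the workaround concrete, it is just a copy into /dev/shm/ plus pointing -R at the tmpfs path (file names and -F fields below are placeholders; the node needs enough shared memory to hold the FASTA):

  [ ~]$ cp ref.fasta ref.fasta.fai ref.dict /dev/shm/
  [ ~]$ java -jar GenomeAnalysisTK.jar -T VariantsToTable \
        -R /dev/shm/ref.fasta -V calls.vcf.gz \
        -F CHROM -F POS -F TYPE -o calls.table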

Command lines, log files etc. would be available, of course.

Thanks,
Michael

Issue · GitHub (created by Sheila)

Issue Number: 570
State: closed
Closed By: vdauwera

Answers

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Hi @mkschuster,

    Those are some impressively large runtime differences, indeed. There's nothing I can think of that has changed in GATK itself that would explain what you're seeing. But I can tell you that the functionalities involved reside in the htsjdk library, which does go through considerable changes between versions. I wouldn't be surprised if something was changed in htsjdk that explains this. Have you noticed any similar effects with Picard tools? They also rely heavily on htsjdk.
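
    (One quick way to make that comparison, with placeholder file names: trace a Picard run against the same reference and compare its pread profile to the GATK runs above.)

      [ ~]$ strace -cf java -jar picard.jar CreateSequenceDictionary \
            R=ref.fasta O=/tmp/ref.dict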

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    FYI one of the engineers is going to look into this and see what might have caused the performance regression in htsjdk, if that's indeed what happened.
