Our cluster migrates files to tape on a schedule based on file size, so a file can go offline at any time before or during a GATK run (for example, when it has not been touched during a long run). It seems to me that GATK (UnifiedGenotyper, v1.5-30-g27e7e17) does not check whether a file is only partially available, and instead continues with binary zeroes in place of valid data when it reads from the offline file. (If it were a C program, the thing to look for would be "open" system calls that use either the O_NONBLOCK or the O_NDELAY flag.)
GATK does not seem to throw a warning or error, and the resulting VCF file looks OK at first glance. The only signs of trouble are that the DMF system seems to think the offline file was being treated suspiciously, and that the overall runtime is inflated.
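To illustrate the kind of check I mean, here is a minimal pre-flight sketch in Python that could be run on the input files before launching GATK. It rests on an assumption that is not confirmed for every DMF setup: that an offline file keeps its full logical size (st_size) while its on-disk data blocks (st_blocks) have been released. The function names and the 0.9 threshold are hypothetical, and sparse files or compressing filesystems will trigger false positives. It also cannot catch a file that is migrated after the check, during the run.

```python
import os

BLOCK_UNIT = 512  # Python documents st_blocks as a count of 512-byte blocks


def resident_fraction(size_bytes, blocks_512):
    """Fraction of the logical file size backed by allocated disk blocks."""
    if size_bytes == 0:
        return 1.0  # an empty file has nothing that could be missing
    return min(1.0, blocks_512 * BLOCK_UNIT / size_bytes)


def looks_offline(size_bytes, blocks_512, min_fraction=0.9):
    """Heuristic (assumption): few allocated blocks means the data was migrated."""
    return resident_fraction(size_bytes, blocks_512) < min_fraction


def suspicious_inputs(paths, min_fraction=0.9):
    """Return the paths whose on-disk allocation is well below their size."""
    flagged = []
    for path in paths:
        st = os.stat(path)  # st_blocks is available on Unix, not on Windows
        if looks_offline(st.st_size, st.st_blocks, min_fraction):
            flagged.append(path)
    return flagged
```

A wrapper script could call suspicious_inputs() on the BAM and reference files and refuse to start, or recall the files first, if the returned list is not empty.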
Thanks for your comment.
Answers
Hi there,
That problem is specific to your architecture, so unfortunately it's not something we can devote resources to modifying. You'll need to find a workaround to work with your setup, sorry.
Geraldine Van der Auwera, PhD
While the described problem is architecture-specific, the underlying issue of not checking whether a file is already open before processing it can cause more widespread problems. I understand that resources are limited, but there is already a Java library that governs safe file handling: https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1-win/src/core/org/apache/hadoop/io/nativeio/NativeIO.java
I understand your concerns, but we are simply unable to work on this right now. That said, we would be happy to look at a patch if you want to make one. The source code for the programming framework is available on our GitHub repository (see Downloads page).
Geraldine Van der Auwera, PhD