Reducing GATK/Picard tools Docker image size

dinvlad Member, Broadie, Dev
edited March 2018 in Ask the GATK team

Hi Team,

I'm optimizing calls in a WDL (run on Google Cloud), and while looking through the logs I realized that it takes a full 3 minutes to pull the broadinstitute/gatk image, which is now 3 GB in size. I'm using this image solely for Picard tools right now, as the official broadinstitute/picard image does not play well with Cromwell due to its use of ENTRYPOINT.

Could something be done to reduce the time it takes to pull the image? Ideally, we'd like this to take <1 min, because our computational tasks will be short (~3-5 min), so another 3 min spent pulling the image is a significant increase in overall task time. I'm leaning towards using an (unofficial?) image at https://quay.io/repository/biocontainers/picard, because those are only ~120 MB in size, so pulling them takes seconds. We could also build our own images, but that would add maintenance overhead. Another workaround is to run an openjdk image and wget the Picard JAR from GitHub at run time (though that still feels hacky).

Thanks!
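
To put the numbers in the question side by side, here is a rough back-of-the-envelope sketch in Python. It uses only the figures quoted above (3 GB pulled in 3 minutes, ~120 MB alternative image, 3-5 minute tasks, <1 min target). It assumes, purely for illustration, that pull time scales linearly with image size, which ignores layer extraction and fixed registry latency. The function and constant names are made up for this example.

```python
# Figures quoted in the question; everything else is derived arithmetic.
GATK_IMAGE_MB = 3000          # "3 GB" image, taken as 3000 MB
GATK_PULL_SECONDS = 180       # "a full 3 minutes" to pull it
SMALL_IMAGE_MB = 120          # the ~120 MB biocontainers image
TARGET_PULL_SECONDS = 60      # the "<1 min" goal


def pull_overhead_fraction(pull_seconds, task_seconds):
    """Share of total wall-clock time spent pulling the image."""
    return pull_seconds / (pull_seconds + task_seconds)


def estimated_pull_seconds(image_mb, mb_per_second):
    """Pull time under the (assumed) linear size-to-time model."""
    return image_mb / mb_per_second


# Effective throughput implied by the observed pull.
effective_mb_per_second = GATK_IMAGE_MB / GATK_PULL_SECONDS

# Overhead for the shortest (3 min) and longest (5 min) tasks mentioned.
overhead_short_task = pull_overhead_fraction(GATK_PULL_SECONDS, 3 * 60)
overhead_long_task = pull_overhead_fraction(GATK_PULL_SECONDS, 5 * 60)

# What the same throughput would mean for the small image and the target.
small_image_pull = estimated_pull_seconds(SMALL_IMAGE_MB, effective_mb_per_second)
max_image_mb_for_target = effective_mb_per_second * TARGET_PULL_SECONDS

if __name__ == "__main__":
    print(f"effective throughput: {effective_mb_per_second:.1f} MB/s")
    print(f"overhead on a 3 min task: {overhead_short_task:.1%}")
    print(f"overhead on a 5 min task: {overhead_long_task:.1%}")
    print(f"estimated pull of the small image: {small_image_pull:.1f} s")
    print(f"largest image meeting the target: {max_image_mb_for_target:.0f} MB")
```

Under that linear assumption, the pull accounts for between roughly a third and a half of each task's wall-clock time, and an image would need to be around 1 GB or smaller to pull in under a minute at the same throughput.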

Best Answer

  • LouisB Broad Institute Member, Broadie, Dev ✭✭
    Accepted Answer

    It's weird to me that it takes so long to pull a 3 GB Docker image. We download much larger files much faster than that, so something odd is going on with Docker specifically. Adding 3 minutes to every short job is a serious problem, though.

    As you suspect, the vast majority of that is totally unnecessary for running Picard tools. A few GATK tools require a large number of additional dependencies that the majority of the tools can safely ignore. We've avoided publishing an official GATK image without those dependencies because we're afraid it would cause confusion and a support burden when people try to run the tools that need those dependencies and they fail. Since most GATK jobs are long, a few extra minutes here or there doesn't usually make much difference. High-volume users are encouraged to build their own custom Docker images with the exact builds they want.

    We think we can potentially reduce the Docker image size a little, maybe squeeze it down to 2 GB, but it's unlikely the full thing will ever be tiny. If you just want Picard, I definitely recommend using the official Picard Docker images if possible. I don't understand the issue with Cromwell and ENTRYPOINT, so I'm not sure how to help there. @KateVoss, are you aware of this issue, and can you offer any help?

    If the official Picard releases don't work with Cromwell, we will definitely try to fix them so they do, but in the meantime using an unofficial build from somewhere is probably a fine idea. We can't make any guarantees about what you're getting from unofficial builds; there's a possibility of getting something malicious or broken. That said, they're probably fine and are a good solution to your problem.

    Alternatively, a better solution, but one that requires more effort, would be to build your own custom Picard image (or GATK image). You can clone the repo and edit and build the Dockerfile yourself to your exact specifications.

Answers

  • Sheila Broad Institute Member, Broadie, Moderator admin

    @dinvlad
    Hi,

    I have to check with the team and get back to you.

    -Sheila

  • dinvlad Member, Broadie, Dev

    @LouisB - thank you! Yes, we're going to use either the GATK image or an unofficial image (or build one ourselves) depending on each use case.

    I agree that it seems weird that pulling from Docker Hub takes such a long time, but in my experience that's always been an issue with their hosting. I'm not sure whether pulling from GCR would speed that up, but perhaps that's worth exploring as well.

  • shlee Cambridge Member, Broadie ✭✭✭✭✭

    Hi @dinvlad,
    WDL runs on Google Cloud definitely benefit from using the GCR images.

  • dinvlad Member, Broadie, Dev

    @shlee - thanks, would it perhaps be beneficial to publish official images on GCR in addition to Docker Hub?

  • shlee Cambridge Member, Broadie ✭✭✭✭✭
    edited March 2018

    @dinvlad, it looks like we already do. Please check out https://console.cloud.google.com/gcr/images/broad-gatk/US/gatk?gcrImageListsize=50. You may need to be signed into Google Cloud Console to view.

    You should see a list of GATK GCR images there.

  • dinvlad Member, Broadie, Dev

    Excellent, thank you @shlee !
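
For anyone wanting to try the GCR suggestion from this thread, here is a minimal Python sketch that rewrites a Docker Hub GATK image reference to a GCR one and renders the WDL `runtime` block that would use it. The registry path `us.gcr.io/broad-gatk/gatk` is an assumption inferred from the console URL shared above (project `broad-gatk`, location `US`), so please verify it against the console listing. The helper names and the tag in the usage example are placeholders, not anything the GATK team has published.

```python
DOCKER_HUB_REPO = "broadinstitute/gatk"
# Assumption: inferred from the console URL in this thread, not confirmed here.
GCR_REPO = "us.gcr.io/broad-gatk/gatk"


def to_gcr(image):
    """Swap the Docker Hub GATK repo for the assumed GCR repo, keeping the tag.

    Anything that is not exactly the Docker Hub GATK repo (other images,
    digest-pinned references) is returned unchanged.
    """
    repo, sep, tag = image.partition(":")
    if repo != DOCKER_HUB_REPO:
        return image
    return GCR_REPO + sep + tag


def wdl_runtime_block(image):
    """Render a WDL runtime section that points the task at the given image."""
    return 'runtime {\n  docker: "%s"\n}' % image


if __name__ == "__main__":
    # "4.0.2.1" is only an example tag; use whichever version your workflow pins.
    print(wdl_runtime_block(to_gcr("broadinstitute/gatk:4.0.2.1")))
```

Whether this actually shortens the pull would still need to be measured by comparing the Cromwell logs for the same task against both registries.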