(How to) Install and use Conda for GATK4

GATK_Team
edited August 13 in Tutorials

Some tools in GATK4, such as the gCNV pipeline and the new deep learning variant filtering tools, require extensive Python dependencies. To avoid managing these dependencies yourself, we recommend using the GATK4 Docker container, which comes with everything pre-installed, as explained here. If you are running GATK4 on a server or cannot use the Docker image, we recommend the Conda package manager as a backup solution. The GATK Conda environment installs all the Python dependencies you need, so you do not have to install each one separately. Conda and Docker are intended to solve the same problem, but one major benefit of Conda is that you can use it without root access. Conda should be easy to install if you follow these steps.

1) Refer to the installation instructions from Conda and choose the installer that matches your computer and operating system. You can download either Anaconda or Miniconda; Conda provides documentation about the difference between the two. We chose Miniconda for this tutorial because we only wanted the GATK Conda environment and did not want to take up too much space on our computer. If you are not going to use Conda for anything other than GATK4, you might consider doing the same. If you choose to install Anaconda, you may have access to other bioinformatics packages that are helpful to you, and you won’t have to install each package you need. Follow the prompts to install the .pkg file. Make sure you choose the installer for the version of Python you are using. For example, if you have Python 2.7 on your computer, choose the version specific to it.

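2) Go to the directory where you have stored the GATK4 jars and the gatk wrapper script, and make sure gatkcondaenv.yml is present. Then create and activate the environment:
conda env create -n gatk -f gatkcondaenv.yml

source activate gatk

With Conda 4.4 or later, conda activate gatk is the preferred form of the second command.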
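
As a minimal sketch of Step 2, the function below creates the environment only when no environment named gatk exists yet, which sidesteps the "prefix already exists" error described in Step 5. The function name create_gatk_env is hypothetical and not part of GATK; the sketch assumes conda is on your PATH and that you call it from the GATK4 directory.

```shell
# Hypothetical helper (not part of GATK): create the "gatk" Conda environment
# from gatkcondaenv.yml unless an environment with that name already exists.
create_gatk_env() {
    yml="${1:-gatkcondaenv.yml}"
    if [ ! -f "$yml" ]; then
        echo "Cannot find $yml; run this from the GATK4 directory." >&2
        return 1
    fi
    # "conda env list" prints one environment per line, name in the first column.
    if conda env list | awk '{print $1}' | grep -qx "gatk"; then
        echo "Environment 'gatk' already exists; skipping creation."
        return 0
    fi
    conda env create -n gatk -f "$yml"
}
```

After defining the function, run create_gatk_env and then activate the environment as shown above.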

3) To check that your Conda environment is set up properly, type conda list. You should see a list of installed packages, and gatkpythonpackages should be one of them.

4) You can also test whether the new variant filtering tool (CNNScoreVariants) will run properly. If you run
python -c "import vqsr_cnn"
the output should look like Using TensorFlow backend. If the Conda environment is not configured correctly, you will immediately get an error saying ImportError: No module named vqsr_cnn.
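
The checks in Steps 3 and 4 can be combined into one function, sketched below. The name check_gatk_env is hypothetical; the sketch assumes the gatk environment is already active, so that conda list and python both refer to it.

```shell
# Hypothetical helper: confirm that the active environment contains the GATK
# Python package and that the CNN module can be imported.
check_gatk_env() {
    # Step 3: gatkpythonpackages should appear in the first column of "conda list".
    if ! conda list | awk '{print $1}' | grep -qx "gatkpythonpackages"; then
        echo "gatkpythonpackages not found; is the gatk environment active?" >&2
        return 1
    fi
    # Step 4: the import fails if the environment is not configured correctly.
    if ! python -c "import vqsr_cnn" >/dev/null 2>&1; then
        echo "Could not import vqsr_cnn." >&2
        return 1
    fi
    echo "GATK Conda environment looks OK."
}
```

The function returns a non-zero status on the first failed check, so it can also be used in scripts before launching a tool that needs the Python dependencies.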

5) If you later upgrade to a new version of GATK4, you will need to recreate the Conda environment from the new GATK4 folder. If you simply overwrite the old GATK with the new one and repeat Step 2, you will get an error message saying “CondaValueError: prefix already exists: /anaconda2/envs/gatk”. For example, when I upgraded from GATK 4.0.1.2 to GATK 4.0.2.0, I first ran (in my 4.0.2.0 folder)
source deactivate
conda env remove -n gatk
Then follow Steps 2-4 again to reinstall the environment.
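
The upgrade procedure can be sketched as a single function. The name rebuild_gatk_env is hypothetical; the sketch assumes you run it from the new GATK4 folder, and it refuses to proceed while the gatk environment is still active (Conda records the active environment name in the CONDA_DEFAULT_ENV variable).

```shell
# Hypothetical helper: remove and recreate the "gatk" environment after
# upgrading GATK4. Run it from the new GATK4 folder.
rebuild_gatk_env() {
    yml="${1:-gatkcondaenv.yml}"
    if [ "${CONDA_DEFAULT_ENV:-}" = "gatk" ]; then
        echo "Deactivate the gatk environment first." >&2
        return 1
    fi
    if [ ! -f "$yml" ]; then
        echo "Cannot find $yml in the current directory." >&2
        return 1
    fi
    # Only create the new environment if the old one was removed successfully.
    conda env remove -n gatk &&
    conda env create -n gatk -f "$yml"
}
```

Depending on your Conda version, conda env remove may ask for confirmation before deleting the environment.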

Important
Do not confuse the GATK Conda environment setup described above with this bioconda gatk installation. The current version of the bioconda installation of GATK does not set up the Conda environment used by the GATK Python tools, so that environment must still be set up manually.

Post edited by bhanuGandham on

Comments

  • lakhujanivijay India Member

    Thank you for the article. :) It would be great if you could add hyperlinks to the following:

    1. GATK4 jars
    2. the gatk wrapper script

    I am having difficulty locating them :# . Could you please help?

  • lakhujanivijay India Member
    edited March 5

    Additionally, I followed the steps,

    conda env create -n gatk -f gatkcondaenv.yml
    

    It gave the output

    Collecting package metadata: done
    Solving environment: done
    
    Downloading and Extracting Packages
    intel-openmp-2018.0. | 620 KB    | ############################################################################################################################################# | 100% 
    pip-9.0.1            | 1.7 MB    | ############################################################################################################################################# | 100% 
    zlib-1.2.11          | 109 KB    | ############################################################################################################################################# | 100% 
    readline-6.2         | 606 KB    | ############################################################################################################################################# | 100% 
    openssl-1.0.2l       | 3.2 MB    | ############################################################################################################################################# | 100% 
    tk-8.5.18            | 1.9 MB    | ############################################################################################################################################# | 100% 
    certifi-2016.2.28    | 216 KB    | ############################################################################################################################################# | 100% 
    xz-5.2.3             | 667 KB    | ############################################################################################################################################# | 100% 
    python-3.6.2         | 16.5 MB   | ############################################################################################################################################# | 100% 
    sqlite-3.13.0        | 4.0 MB    | ############################################################################################################################################# | 100% 
    setuptools-36.4.0    | 563 KB    | ############################################################################################################################################# | 100% 
    mkl-2018.0.1         | 184.7 MB  | ############################################################################################################################################# | 100% 
    wheel-0.29.0         | 88 KB     | ############################################################################################################################################# | 100% 
    mkl-service-1.1.2    | 11 KB     | ############################################################################################################################################# | 100% 
    Preparing transaction: done
    Verifying transaction: done
    Executing transaction: done
    #
    # To activate this environment, use
    #
    #     $ conda activate gatk
    #
    # To deactivate an active environment, use
    #
    #     $ conda deactivate
    

    Then I activated the gatk environment:

    conda activate gatk
    

    Then I ran the following command, which throws an error:

    (gatk) bioinfo$ gatk NeuralNetInference -R reference.fasta -V NA12878.vcf -O NeuralNetInferenceFiltered.vcf -a cnn_1d_annotations.hd5
    No command 'gatk' found, did you mean:
     Command 'gitk' from package 'gitk' (main)
     Command 'gak' from package 'gui-apt-key' (universe)
     Command 'gawk' from package 'gawk' (main)
    gatk: command not found
    

    Can you please help?

  • SkyWarrior Turkey Member ✭✭✭

    You need to add GATK to your PATH.
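
A minimal sketch of adding GATK to PATH for the current shell session follows. The function name add_gatk_to_path is hypothetical, and the directory you pass in is whatever folder holds your gatk wrapper script.

```shell
# Hypothetical helper: prepend a GATK directory to PATH for this shell session.
# The directory must contain the executable gatk wrapper script.
add_gatk_to_path() {
    gatk_dir="$1"
    if [ ! -x "$gatk_dir/gatk" ]; then
        echo "No executable gatk wrapper in $gatk_dir" >&2
        return 1
    fi
    case ":$PATH:" in
        *":$gatk_dir:"*) ;;                         # already on PATH, nothing to do
        *) PATH="$gatk_dir:$PATH"; export PATH ;;
    esac
}
```

For example, add_gatk_to_path /path/to/your/gatk-folder (a placeholder path). To make the change permanent, the same call or an equivalent export line can go in your shell startup file.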

  • lakhujanivijay India Member
    edited March 5

    Thanks SkyWarrior. That helped. However, now I am able to launch GATK without activating the gatk environment.

    $ conda activate gatk
    (gatk) $ gatk
    
     Usage template for all tools (uses --spark-runner LOCAL when used with a Spark tool)
        gatk AnyTool toolArgs
    
     Usage template for Spark tools (will NOT work on non-Spark tools)
        gatk SparkTool toolArgs  [ -- --spark-runner <LOCAL | SPARK | GCS> sparkArgs ]
    
     Getting help
        gatk --list       Print the list of available tools
    
        gatk Tool --help  Print help on a particular tool
    
     Configuration File Specification
         --gatk-config-file                PATH/TO/GATK/PROPERTIES/FILE
    
     gatk forwards commands to GATK and adds some sugar for submitting spark jobs
    
       --spark-runner <target>    controls how spark tools are run
         valid targets are:
         LOCAL:      run using the in-memory spark runner
         SPARK:      run using spark-submit on an existing cluster 
                     --spark-master must be specified
                     --spark-submit-command may be specified to control the Spark submit command
                     arguments to spark-submit may optionally be specified after -- 
         GCS:        run using Google cloud dataproc
                     commands after the -- will be passed to dataproc
                     --cluster <your-cluster> must be specified after the --
                     spark properties and some common spark-submit parameters will be translated 
                     to dataproc equivalents
    
       --dry-run      may be specified to output the generated command line without running it
       --java-options 'OPTION1[ OPTION2=Y ... ]'   optional - pass the given string of options to the 
                     java JVM at runtime.  
                     Java options MUST be passed inside a single string with space-separated values.
    

    Now, I deactivate conda

    (gatk) $ conda deactivate

    and launch GATK

    $ gatk

    It still launches, so I wonder whether this is expected.

  • SkyWarrior Turkey Member ✭✭✭
    edited March 5

    The environment and PATH are separate things, so this behavior is expected. Launching gatk without the environment is a problem for CNV, CNN and some other tools. The environment must be active for those tasks.
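
Because the two settings are independent, it can help to check them separately. The sketch below does that; the function name gatk_status is hypothetical, and it relies on Conda setting the CONDA_DEFAULT_ENV variable to the name of the active environment.

```shell
# Hypothetical helper: report separately whether the gatk command can be found
# and whether the gatk Conda environment is active.
gatk_status() {
    if command -v gatk >/dev/null 2>&1; then
        echo "gatk command found: yes"
    else
        echo "gatk command found: no"
    fi
    # Conda sets CONDA_DEFAULT_ENV to the name of the active environment.
    if [ "${CONDA_DEFAULT_ENV:-}" = "gatk" ]; then
        echo "gatk environment active: yes"
    else
        echo "gatk environment active: no"
    fi
}
```

Tools that need the Python dependencies require both lines to report yes.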

  • lakhujanivijay India Member

    Hi SkyWarrior

    That really helped. Thanks!

  • tiaojon Member
    When I run conda env create -n gatk -f gatkcondaenv.yml, I get this error:

    Collecting package metadata: done
    Solving environment: failed

    ResolvePackageNotFound:
    - anaconda::tensorflow==1.12.0=mkl_py36h69b6ba0_0

    It seems like I can't download the 1.12.0 version of tensorflow anymore, because when I check the anaconda site, I can only find version 1.13.1. Is there some way to force the right version? Should I change the .yml file? I wasn't sure whether 1.13.1 is backwards compatible with 1.12.0.
  • annaship Member
    I have the same problem. I saw that tensorflow was not listed as a package in my miniconda so I installed it, v 1.13.1. But when I try again to create the gatk environment, I still get:

    Collecting package metadata: done
    Solving environment: failed

    ResolvePackageNotFound:
    - anaconda::tensorflow==1.12.0=mkl_py36h69b6ba0_0

    Any assistance very welcome!
  • sohta Japan Member
    I ran into the same issue as tiaojon and annaship did, but resolved it by rewriting the anaconda::tensorflow line so that it looks like the following (which is probably the latest release compatible with 1.12.0 at the moment):

    - anaconda::tensorflow=1.12.0=mkl_py36h2b2bbaf_0

    cf. https://anaconda.org/anaconda/tensorflow/files?version=1.12.0
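
The edit described in the comment above can be sketched as a small function that rewrites the tensorflow line and keeps a backup of the original file. The function name patch_tensorflow_pin is hypothetical, and the sketch assumes the line in your gatkcondaenv.yml matches the one shown in the error message.

```shell
# Hypothetical helper: replace the unavailable tensorflow pin in gatkcondaenv.yml
# with the line reported above, keeping the original file as a .bak copy.
patch_tensorflow_pin() {
    yml="${1:-gatkcondaenv.yml}"
    cp "$yml" "$yml.bak" || return 1
    # Only the matched text is replaced, so the line's indentation is preserved.
    sed 's/anaconda::tensorflow==1\.12\.0=mkl_py36h69b6ba0_0/anaconda::tensorflow=1.12.0=mkl_py36h2b2bbaf_0/' \
        "$yml.bak" > "$yml"
}
```

After patching, rerun conda env create -n gatk -f gatkcondaenv.yml. If that build has also been withdrawn, the file listing linked above shows which builds of 1.12.0 are currently available.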