(How to) Install and use Conda for GATK4

GATK_Team
edited March 5 in Tutorials

Some GATK4 tools, such as the gCNV pipeline and the new deep learning variant filtering tools, require extensive Python dependencies. To avoid managing these dependencies yourself, we recommend using the GATK4 Docker container, which comes with everything pre-installed, as explained here. If you are running GATK4 on a server and/or cannot use the Docker image, we recommend using the Conda package manager as a backup solution. The Conda environment shipped with GATK4 installs all the dependencies you need, so you do not have to install each one separately. Conda and Docker are intended to solve the same problem, but one major benefit of Conda is that you can use it without root access. Conda should be easy to install if you follow these steps.

1) Refer to the installation instructions from Conda and choose the installer that matches your operating system. You will have the option of downloading Anaconda or Miniconda; Conda provides documentation about the difference between the two. We chose Miniconda for this tutorial because we only wanted the GATK package and did not want to take up too much space on our computer. If you are not going to use Conda for anything other than GATK4, you might consider doing the same. If you choose Anaconda instead, you may get other bioinformatics packages that are helpful to you without having to install each one individually. Follow the prompts to install the .pkg file. Make sure you choose the package that matches the version of Python you are using. For example, if you have Python 2.7 on your computer, choose the installer specific to it.

2) Go to the directory where you have stored the GATK4 jars and the gatk wrapper script, and make sure gatkcondaenv.yml is present. Run
conda env create -n gatk -f gatkcondaenv.yml

source activate gatk

On newer Conda releases, the equivalent activation command is conda activate gatk.

3) To check that your Conda environment is set up properly, type conda list. You should see a list of installed packages, and gatkpythonpackages should be one of them.
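
The check in Step 3 can also be scripted. Below is a minimal sketch, assuming you have captured the text printed by conda list; the function name installed_packages and the sample output (including its version numbers and path) are illustrative and not taken from a real GATK installation.

```python
def installed_packages(conda_list_output):
    """Parse the text printed by `conda list` into a {name: version} dict."""
    packages = {}
    for line in conda_list_output.splitlines():
        line = line.strip()
        # Skip blank lines and the commented header lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 2:
            packages[fields[0]] = fields[1]
    return packages


# Hypothetical sample output; real paths and version numbers will differ.
sample = """\
# packages in environment at /home/user/miniconda3/envs/gatk:
#
# Name                    Version                   Build  Channel
gatkpythonpackages        0.1                       <pip>
python                    3.6.2                     0
"""

packages = installed_packages(sample)
print("gatkpythonpackages" in packages)
```

To run this against a real environment, you could pass in the standard output of conda list -n gatk instead of the sample string.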

4) You can also test whether the new variant filtering tool (CNNScoreVariants) runs properly. Run
python -c "import vqsr_cnn"
The output should look like: Using TensorFlow backend. If the Conda environment is not configured correctly, you will immediately get an error saying ImportError: No module named vqsr_cnn.
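
If you want a lighter check that does not actually load TensorFlow, here is a small sketch using the standard library. It only tests whether the module can be found by the current interpreter, so it is a weaker test than the import in Step 4. The helper names module_available and check_gatk_python are made up for this example.

```python
import importlib.util


def module_available(name):
    """Return True if the current Python interpreter can locate `name`."""
    return importlib.util.find_spec(name) is not None


def check_gatk_python(module="vqsr_cnn"):
    """Return a short status message for the given module."""
    if module_available(module):
        return f"OK: {module} is importable"
    return f"Missing: {module}; is the gatk Conda environment active?"


print(check_gatk_python())
```

Run it with the python from the activated gatk environment; run from any other interpreter, it will most likely report the module as missing.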

5) If you later upgrade to a new version of GATK4, you will need to recreate the Conda environment from the new GATK4 folder. If you simply overwrite the old GATK with the new one and try to create the environment again, you will get an error message saying "CondaValueError: prefix already exists: /anaconda2/envs/gatk". For example, when I upgraded from GATK 4.0.1.2 to GATK 4.0.2.0, I simply ran (in my 4.0.2.0 folder)
source deactivate
conda env remove -n gatk
Then, follow Steps 2-4 again to re-install it.
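
The upgrade logic in Step 5 can be sketched as a small planning function: if an environment named gatk already exists, remove it before creating it again. This is only an illustration; the function names and the sample conda env list output are assumptions. The commands are returned rather than executed, and deactivating the environment still has to be done in your own shell first.

```python
def existing_env_names(conda_env_list_output):
    """Extract environment names from the text printed by `conda env list`."""
    names = set()
    for line in conda_env_list_output.splitlines():
        line = line.strip()
        # Skip blank lines and the commented header lines.
        if not line or line.startswith("#"):
            continue
        names.add(line.split()[0])
    return names


def plan_env_commands(conda_env_list_output, env_name="gatk",
                      yml="gatkcondaenv.yml"):
    """Return the conda commands needed to (re)create the environment."""
    commands = []
    if env_name in existing_env_names(conda_env_list_output):
        # Removing first avoids "CondaValueError: prefix already exists".
        commands.append(["conda", "env", "remove", "-n", env_name])
    commands.append(["conda", "env", "create", "-n", env_name, "-f", yml])
    return commands


# Hypothetical sample output; real paths will differ.
sample = """\
# conda environments:
#
base                  *  /home/user/miniconda3
gatk                     /home/user/miniconda3/envs/gatk
"""

for cmd in plan_env_commands(sample):
    print(" ".join(cmd))
```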


Comments

  • lakhujanivijay India Member

    Thank you for the article. :) It would be great if you could add hyperlinks to the following:

    1. GATK4 jars
    2. the gatk wrapper script

    I am having difficulty locating them. Could you please help?

  • lakhujanivijay India Member
    edited March 5

    Additionally, I followed the steps,

    conda env create -n gatk -f gatkcondaenv.yml
    

    It gave the output

    Collecting package metadata: done
    Solving environment: done
    
    Downloading and Extracting Packages
    intel-openmp-2018.0. | 620 KB    | ########## | 100%
    pip-9.0.1            | 1.7 MB    | ########## | 100%
    zlib-1.2.11          | 109 KB    | ########## | 100%
    readline-6.2         | 606 KB    | ########## | 100%
    openssl-1.0.2l       | 3.2 MB    | ########## | 100%
    tk-8.5.18            | 1.9 MB    | ########## | 100%
    certifi-2016.2.28    | 216 KB    | ########## | 100%
    xz-5.2.3             | 667 KB    | ########## | 100%
    python-3.6.2         | 16.5 MB   | ########## | 100%
    sqlite-3.13.0        | 4.0 MB    | ########## | 100%
    setuptools-36.4.0    | 563 KB    | ########## | 100%
    mkl-2018.0.1         | 184.7 MB  | ########## | 100%
    wheel-0.29.0         | 88 KB     | ########## | 100%
    mkl-service-1.1.2    | 11 KB     | ########## | 100%
    Preparing transaction: done
    Verifying transaction: done
    Executing transaction: done
    #
    # To activate this environment, use
    #
    #     $ conda activate gatk
    #
    # To deactivate an active environment, use
    #
    #     $ conda deactivate
    

    Then I activated the gatk environment:

    conda activate gatk
    

    Then I ran the following command, which throws errors:

    (gatk) bioinfo$ gatk NeuralNetInference -R reference.fasta -V NA12878.vcf -O NeuralNetInferenceFiltered.vcf -a cnn_1d_annotations.hd5
    No command 'gatk' found, did you mean:
     Command 'gitk' from package 'gitk' (main)
     Command 'gak' from package 'gui-apt-key' (universe)
     Command 'gawk' from package 'gawk' (main)
    gatk: command not found
    

    Can you please help?

  • SkyWarrior Turkey Member ✭✭✭

    You need to add GATK to your PATH.

  • lakhujanivijay India Member
    edited March 5

    Thanks, SkyWarrior. That helped. However, I am now able to launch GATK without activating the gatk environment.

    [email protected]$ conda activate gatk
    (gatk) [email protected]$ gatk
    
     Usage template for all tools (uses --spark-runner LOCAL when used with a Spark tool)
        gatk AnyTool toolArgs
    
     Usage template for Spark tools (will NOT work on non-Spark tools)
        gatk SparkTool toolArgs  [ -- --spark-runner <LOCAL | SPARK | GCS> sparkArgs ]
    
     Getting help
        gatk --list       Print the list of available tools
    
        gatk Tool --help  Print help on a particular tool
    
     Configuration File Specification
         --gatk-config-file                PATH/TO/GATK/PROPERTIES/FILE
    
     gatk forwards commands to GATK and adds some sugar for submitting spark jobs
    
       --spark-runner <target>    controls how spark tools are run
         valid targets are:
         LOCAL:      run using the in-memory spark runner
         SPARK:      run using spark-submit on an existing cluster 
                     --spark-master must be specified
                     --spark-submit-command may be specified to control the Spark submit command
                     arguments to spark-submit may optionally be specified after -- 
         GCS:        run using Google cloud dataproc
                     commands after the -- will be passed to dataproc
                     --cluster <your-cluster> must be specified after the --
                     spark properties and some common spark-submit parameters will be translated 
                     to dataproc equivalents
    
       --dry-run      may be specified to output the generated command line without running it
       --java-options 'OPTION1[ OPTION2=Y ... ]'   optional - pass the given string of options to the 
                     java JVM at runtime.  
                     Java options MUST be passed inside a single string with space-separated values.
    

    Now I deactivate conda

    (gatk) [email protected]$ conda deactivate

    and launch GATK

    [email protected]$ gatk

    It still launches, so I wonder whether this is expected.

  • SkyWarrior Turkey Member ✭✭✭
    edited March 5

    The environment and PATH are separate things, so this behavior is expected. Launching gatk without the environment is a problem for CNV, CNN, and some other tools. The environment must be active for those tasks.

  • lakhujanivijay India Member

    Hi SkyWarrior

    That really helped. Thanks!
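
The distinction raised in this thread (gatk being on PATH versus the gatk Conda environment being active) can be checked separately. Here is a rough sketch, assuming that Conda sets the CONDA_DEFAULT_ENV variable when an environment is activated; the function name gatk_status and the shape of its return value are invented for this example.

```python
import os
import shutil


def gatk_status(environ=None, env_name="gatk"):
    """Report whether gatk is on PATH and whether the Conda env is active."""
    environ = os.environ if environ is None else environ
    # PATH decides whether the shell can find the gatk wrapper script.
    gatk_path = shutil.which("gatk", path=environ.get("PATH", ""))
    # CONDA_DEFAULT_ENV holds the name of the active Conda environment.
    env_active = environ.get("CONDA_DEFAULT_ENV") == env_name
    return {"gatk_on_path": gatk_path, "env_active": env_active}


print(gatk_status())
```

If gatk_on_path is set but env_active is False, the wrapper will still launch, but tools that need the Python dependencies are likely to fail, which matches the behavior described above.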
