Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Haplotypecaller running in parallel for each chromosome

thkitapcithkitapci Los AngelesMember

Hi,
I would like to run haplotypercaller with -ERC GVCF on each chromosome in parallel. Then run GenotypeGVCFs.

Do I need to first concatenate the results from each chromosome in sample1.chr_merged.g.vcf ... files then run GenotypeGVCFs with sample1.chr_merged.g.vcf , sample2.chr_merged.g.vcf .... or is it possible to run GenotypeGVCFs directly?

Best Answer

Answers

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi @thkitapci

    1) Keep in mind that GenotypeGVCFs can only accept one input.
    2) We recommend using the intervals feature in Haplotypecaller to parallelize execution of an analysis across genomic regions. Take a look at this doc: https://software.broadinstitute.org/gatk/documentation/article?id=11009
    Do I need to first concatenate the results from each chromosome in sample1.chr_merged.g.vcf ... files then run GenotypeGVCFs with sample1.chr_merged.g.vcf , sample2.chr_merged.g.vcf .... or is it possible to run GenotypeGVCFs directly?
    We recommend you combine all gvcfs and then run GenotypeGVCFs on it.

  • thkitapcithkitapci Los AngelesMember

    Hi @bhanuGandham,
    Thanks for your reply. Intervals feature sounds good but in my case I am running my analysis on a cluster it might be better for me to submit indipendent jobs for non-overlapping regions of the genome using the -L argument with Haplotypecaller as

    gatk HaplotypeCaller -R $ref_fasta -I input.bam -O out_chr1.g.vcf -ERC GVCF -L chr1
    gatk HaplotypeCaller -R $ref_fasta -I input.bam -O out_chr2.g.vcf -ERC GVCF -L chr2
    gatk HaplotypeCaller -R $ref_fasta -I input.bam -O out_chr3.g.vcf -ERC GVCF -L chr3

    Then I think I can use the CatVariants (https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_CatVariants.php)

    to merge the output files out_chr1.g.vcf out_chr2.g.vcf out_chr3.g.vcf

    I installed gatk with anaconda (conda install -c bioconda gatk)

    I don't understand where CatVariants is located. For all the other commands I am using the python script gatk but it seems CatVariants cannot be run from the python script.

    Thanks

  • thkitapcithkitapci Los AngelesMember

    Hi @bhanuGandham,
    Thanks for pointing out the GatherVCFs tools. I tried on a small dataset and it worked fine. Below is a small example that I run I am going to share here I think it might be helpful if anybody else gets to this thread

    I run these 3 commands in parallel

    gatk HaplotypeCaller -R $ref_fasta -I $input_bam -O out_chr1_I1.g.vcf -ERC GVCF -L chr1:3070145-3661559
    gatk HaplotypeCaller -R $ref_fasta -I $input_bam -O out_chr1_I2.g.vcf -ERC GVCF -L chr1:3884163-3935397
    gatk HaplotypeCaller -R $ref_fasta -I $input_bam -O out_chr1_I3.g.vcf -ERC GVCF -L chr1:3941660-4190628

    Then put the 3 output file names in a file called vcf_list_to_merge.txt

    rm vcf_list_to_merge.txt
    for i in out_chr1_I1.g.vcf out_chr1_I2.g.vcf out_chr1_I3.g.vcf
    do
    echo $i >>vcf_list_to_merge.txt
    done

    Use GatherVcfs to merge the VCFs

    gatk GatherVcfs -I vcf_list_to_merge.txt -O out_chr1_merged_Intervals.g.vcf

    Do a manual check to make sure merging is correct. GatherVCFs should be equivalent to "cat" the VCFs without headers

    Remove headers from the partial vcf files and then cat

    grep -v "#" out_chr1_I1.g.vcf >noheader_1.txt
    grep -v "#" out_chr1_I2.g.vcf >noheader_2.txt
    grep -v "#" out_chr1_I3.g.vcf >noheader_3.txt
    cat noheader_1.txt noheader_2.txt noheader_3.txt >noheader_cat_out.txt

    Remove header from the GatherVCFs output

    grep -v "#" out_chr1_merged_Intervals.g.vcf >noheader_GatherVcfs_out.txt

    Check to make sure these files are identical

    md5sum noheader_GatherVcfs_out.txt noheader_cat_out.txt

    Thanks
    Best Regards
    Hamdi

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
Sign In or Register to comment.