Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Attention:
We will be out of the office on November 11th and 13th 2019, due to the U.S. holiday(Veteran's day) and due to a team event(Nov 13th). We will return to monitoring the GATK forum on November 12th and 14th respectively. Thank you for your patience.

Haplotype caller active regions parameters not available in GATK4

dhwanidhwani indiaMember
edited May 3 in Ask the GATK team

I would like to know like GATK 3.6 there was an option to define active regions, is there any option in GATK v4.2.1

The Active region parameters can be seen from this link:-
https://software.broadinstitute.org/gatk/documentation/tooldocs/3.5- /org_broadinstitute_gatk_tools_walkers_diagnostics_FindCoveredIntervals.php.

  • advanced input which has a parameter name --activeRegionIn,

If i correctly understand, this is the list of interval (can be in bed file) to define active regions which would be further used for phasing.

1) What is the similar parameter in the new GATK release?

2) Using new version of GATK, VCF files has only 2 rows out of ** 228 rows** that have PID information. How do i go in such a case. Have i done any mistake in my analysis .

My commands are as follows:-
/opt/apps/gatk/4.2.1/gatk Haome_analysis/reference/Homo_sapiens.GRCh37.dna.chromosome.6.fa -I base_recalib/abc_aligned_sorted_dupmarked_realigned_recalibrated.bam -O haplotype_caller/abc_haplotyper.g.vcf --emit-ref-confidence GVCF -L /home/dhwani.dholakia/archive/files_required_for_exome_analysis/coord.bed --max-assembly-region-size 1000 -mbq 25 -stand-call-conf 40

Answers

  • bshifawbshifaw Member, Broadie, Moderator admin
    edited May 3

    Hi @dhwani

    HaplotypeCaller's method of identifying active regions is mentioned here : HC overview: How the HaplotypeCaller works.
    For more details on how the activity profile is computed and processed, as well as what options are available to modify the active region parameters, please see HC step 1: Defining ActiveRegions by measuring data entropy. You'll see a subheading called "Setting the ActiveRegion thresholds and intervals".

    Mind elaborating on the second question and uploading a snippet of the results?

  • dhwanidhwani indiaMember
    edited May 3

    @bshifaw

    Thanks for your reply.

    The recent GATK documentation link and the link that you shared link have different arguments.
    In the link that you shared "Under sub heading "HC step 1: Defining ActiveRegions by measuring data entropy", the parameter to define active size ("-activeRegionMaxSize") is not there in the new documentation

    Could you please clear this ambiguity?

  • dhwanidhwani indiaMember

    @bshifaw

    Regarding second question:-

    When i used the parameter --max-assembly-region-size 1000 , i expect that the haplotypes should be phased for a large region around 1000 bp. However in the result of VCF file i got only two near by variants that were phased. The commands that i had used for the same were

    /opt/apps/gatk/4.2.1/gatk HaplotypeCaller --dbsnp /home/dhwani.dholakia/archive/files_required_for_exome_analysis/dbsnp/GRCH37.p17_refseq.vcf -R /home/dhwani.dholakia/archive/files_required_for_exome_analysis/reference/Homo_sapiens.GRCh37.dna.chromosome.6.fa -I base_recalib/DRR015487_aligned_sorted_dupmarked_realigned_recalibrated.bam -O haplotype_caller/haplotyper.g.vcf --emit-ref-confidence GVCF -L /home/dhwani.dholakia/archive/files_required_for_exome_analysis/coord.bed --max-assembly-region-size 1000 -mbq 25 -stand-call-conf 40

    Have i missed any important arguments that's why i am not getting the genotypes phased?

    Attaching my VCF file for further information
    I can only find two variants that are phased and not far from each other also. Why is this happening even when active region stretch was defined for a size of 1000 bp.

    PS. changed the extension of VCF file to .txt because .vcf was not supported for uploading

  • bshifawbshifaw Member, Broadie, Moderator admin

    Ahh, looks like the activeRegionMaxSize argument has been removed from Haplotypecaller in GATK4. The error in the documentation has been noted and will be corrected.

    I'll have to check with the dev team with your second question.

  • bshifawbshifaw Member, Broadie, Moderator admin

    It might also be helpful be follow the instructions in the following article until then.

  • bshifawbshifaw Member, Broadie, Moderator admin

    @dhwani

    Here is a response from the team

    This is a known issue, and the phasing in VCF given by user does not look unusual. Basically, phasing output by HaplotypeCaller is very conservative and phases reads it's 100% sure about, which leads to low sensitivity.

    Look at issue 3368 and proposed fix in issue 5828

Sign In or Register to comment.