Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

CombineGVCFs fails to combine calls from whole genome GVCFs if an interval file is provided

Hi,

CombineGVCFs is failing to combine these three files if we provide an interval file.

a.vcf:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A 1 2337033 . G <NON_REF> . . END=2337368 GT:DP:GQ:MIN_DP:PL 0/0:216:99:42:0,99,1485

b.vcf:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT B 1 2337033 . G <NON_REF> . . END=2337297 GT:DP:GQ:MIN_DP:PL 0/0:114:99:48:0,120,1800

c.vcf:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT C 1 2337033 . G <NON_REF> . . END=2337336 GT:DP:GQ:MIN_DP:PL 0/0:134:99:35:0,105,950

intervals.bed

#Chrom Start End Gene 0 Strand 1 2337194 2337283 PEX10 0 -

If I run CombineGVCFs without -L I get what you would expect:

java -Xmx1G -jar GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T CombineGVCFs -R human_g1k_v37.fasta -V a.vcf -V b.vcf -V c.vcf -o combined.vcf

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A B C 1 2337033 . G <NON_REF> . . END=2337297 GT:DP:GQ:MIN_DP:PL ./.:216:99:42:0,99,1485 ./.:114:99:48:0,120,1800 ./.:134:99:35:0,105,950 1 2337298 . C <NON_REF> . . END=2337336 GT:DP:GQ:MIN_DP:PL ./.:216:99:42:0,99,1485 ./. ./.:134:99:35:0,105,950 1 2337337 . C <NON_REF> . . END=2337368 GT:DP:GQ:MIN_DP:PL ./.:216:99:42:0,99,1485 ./. ./.

However if I pass -L, I get no combined data at all:

java -Xmx1G -jar GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T CombineGVCFs -R human_g1k_v37.fasta -V a.vcf -V b.vcf -V c.vcf -o combined.vcf -L intervals.bed

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A B C

Notice that this interval file specifies an interval right in the middle of the GVCF band all three files have. However, CombineGVCFs is completely ignoring the data in those bands.

Best regards,
Carlos

Tagged:

Comments

  • SheilaSheila Broad InstituteMember, Broadie admin

    @CarlosBorroto‌

    Hi Carlos,

    This is a known limitation of the GATK. If an interval starts in the middle of a GVCF block, it is possible it may get dropped. The best thing to do is have the interval start at the same position that the block begins.

    Ideally, you should run all your analyses in GATK with the same interval list so all the intervals you are interested in are always included.

    You can also try running this with -ERC BP_RESOLUTION instead of -ERC GVCF.

    -Sheila

  • Hi @Sheila,

    Thanks for the quick answer. I see how using -ERC BP_RESOLUTION when running HaplotyCaller would workaround this issue. Not ideal, but doable.

    May I ask why allow providing an intervals file to CombineGVCFs when the input is in GVCF? At a minimum I feel GATK should give a big all capitals warning for this case.

    Thanks again,

    Carlos

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @CarlosBorroto‌

    You could potentially want to run the tool in parallel per chromosome, to go faster. That is in fact what we use internally at Broad. So using the -L argument with this tool is perfectly legitimate, and adding a warning is not something we can do.

    What we can and will do is improve our documentation to clarify how to use interval files successfully depending on use case.

Sign In or Register to comment.