We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

CombineGVCFs fails to combine calls from whole genome GVCFs if an interval file is provided

Hi,

CombineGVCFs is failing to combine these three files if we provide an interval file.

a.vcf:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A 1 2337033 . G <NON_REF> . . END=2337368 GT:DP:GQ:MIN_DP:PL 0/0:216:99:42:0,99,1485

b.vcf:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT B 1 2337033 . G <NON_REF> . . END=2337297 GT:DP:GQ:MIN_DP:PL 0/0:114:99:48:0,120,1800

c.vcf:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT C 1 2337033 . G <NON_REF> . . END=2337336 GT:DP:GQ:MIN_DP:PL 0/0:134:99:35:0,105,950

intervals.bed

#Chrom Start End Gene 0 Strand 1 2337194 2337283 PEX10 0 -

If I run CombineGVCFs without -L I get what you would expect:

java -Xmx1G -jar GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T CombineGVCFs -R human_g1k_v37.fasta -V a.vcf -V b.vcf -V c.vcf -o combined.vcf

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A B C 1 2337033 . G <NON_REF> . . END=2337297 GT:DP:GQ:MIN_DP:PL ./.:216:99:42:0,99,1485 ./.:114:99:48:0,120,1800 ./.:134:99:35:0,105,950 1 2337298 . C <NON_REF> . . END=2337336 GT:DP:GQ:MIN_DP:PL ./.:216:99:42:0,99,1485 ./. ./.:134:99:35:0,105,950 1 2337337 . C <NON_REF> . . END=2337368 GT:DP:GQ:MIN_DP:PL ./.:216:99:42:0,99,1485 ./. ./.

However if I pass -L, I get no combined data at all:

java -Xmx1G -jar GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar -T CombineGVCFs -R human_g1k_v37.fasta -V a.vcf -V b.vcf -V c.vcf -o combined.vcf -L intervals.bed

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A B C

Notice that this interval file specifies an interval right in the middle of the GVCF band all three files have. However, CombineGVCFs is completely ignoring the data in those bands.

Best regards,
Carlos

Tagged:

Comments

  • SheilaSheila Broad InstituteMember, Broadie ✭✭✭✭✭

    @CarlosBorroto‌

    Hi Carlos,

    This is a known limitation of the GATK. If an interval starts in the middle of a GVCF block, it is possible it may get dropped. The best thing to do is have the interval start at the same position that the block begins.

    Ideally, you should run all your analyses in GATK with the same interval list so all the intervals you are interested in are always included.

    You can also try running this with -ERC BP_RESOLUTION instead of -ERC GVCF.

    -Sheila

  • Hi @Sheila,

    Thanks for the quick answer. I see how using -ERC BP_RESOLUTION when running HaplotyCaller would workaround this issue. Not ideal, but doable.

    May I ask why allow providing an intervals file to CombineGVCFs when the input is in GVCF? At a minimum I feel GATK should give a big all capitals warning for this case.

    Thanks again,

    Carlos

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @CarlosBorroto‌

    You could potentially want to run the tool in parallel per chromosome, to go faster. That is in fact what we use internally at Broad. So using the -L argument with this tool is perfectly legitimate, and adding a warning is not something we can do.

    What we can and will do is improve our documentation to clarify how to use interval files successfully depending on use case.

Sign In or Register to comment.