We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

CombineGVCFs with -L and --breakBandsAtMultiplesOf

CombineGVCFs doesn't seem to work properly when using both -L to specify intervals and also --breakBandsAtMultiplesOf to specify that reference bands should be broken.

I've run a simplified test case with just two samples and a single reference band variant.

First of all, using just --breakBandsAtMultiplesOf (without -L) works fine:

$ java -jar GenomeAnalysisTK.jar -T CombineGVCFs -R hs38DH.fa --variant test-1.g.vcf --variant test-2.g.vcf --out test-out-breakbands.g.vcf --breakBandsAtMultiplesOf 10000
$ tail -n 18 test-out-breakbands.g.vcf
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  S1      S2
chr3_KI270935v1_alt     1       .       G       <NON_REF>       .       .       END=9999        GT:DP:GQ:MIN_DP:PL      ./.:94:99:82:0,120,1800 ./.:94:99:82:0,120,1800
chr3_KI270935v1_alt     10000   .       C       <NON_REF>       .       .       END=19999       GT:DP:GQ:MIN_DP:PL      ./.:94:99:82:0,120,1800 ./.:94:99:82:0,120,1800
chr3_KI270935v1_alt     20000   .       C       <NON_REF>       .       .       END=29999       GT:DP:GQ:MIN_DP:PL      ./.:94:99:82:0,120,1800 ./.:94:99:82:0,120,1800
chr3_KI270935v1_alt     30000   .       T       <NON_REF>       .       .       END=39999       GT:DP:GQ:MIN_DP:PL      ./.:94:99:82:0,120,1800 ./.:94:99:82:0,120,1800
chr3_KI270935v1_alt     40000   .       C       <NON_REF>       .       .       END=49999       GT:DP:GQ:MIN_DP:PL      ./.:94:99:82:0,120,1800 ./.:94:99:82:0,120,1800
chr3_KI270935v1_alt     50000   .       A       <NON_REF>       .       .       END=59999       GT:DP:GQ:MIN_DP:PL      ./.:94:99:82:0,120,1800 ./.:94:99:82:0,120,1800
chr3_KI270935v1_alt     60000   .       T       <NON_REF>       .       .       END=69999       GT:DP:GQ:MIN_DP:PL      ./.:94:99:82:0,120,1800 ./.:94:99:82:0,120,1800
chr3_KI270935v1_alt     70000   .       T       <NON_REF>       .       .       END=79999       GT:DP:GQ:MIN_DP:PL      ./.:94:99:82:0,120,1800 ./.:94:99:82:0,120,1800
chr3_KI270935v1_alt     80000   .       G       <NON_REF>       .       .       END=89999       GT:DP:GQ:MIN_DP:PL      ./.:94:99:82:0,120,1800 ./.:94:99:82:0,120,1800
chr3_KI270935v1_alt     90000   .       A       <NON_REF>       .       .       END=99999       GT:DP:GQ:MIN_DP:PL      ./.:94:99:82:0,120,1800 ./.:94:99:82:0,120,1800
chr3_KI270935v1_alt     100000  .       A       <NON_REF>       .       .       END=109999      GT:DP:GQ:MIN_DP:PL      ./.:94:99:82:0,120,1800 ./.:94:99:82:0,120,1800
chr3_KI270935v1_alt     110000  .       T       <NON_REF>       .       .       END=119999      GT:DP:GQ:MIN_DP:PL      ./.:94:99:82:0,120,1800 ./.:94:99:82:0,120,1800
chr3_KI270935v1_alt     120000  .       A       <NON_REF>       .       .       END=129999      GT:DP:GQ:MIN_DP:PL      ./.:94:99:82:0,120,1800 ./.:94:99:82:0,120,1800
chr3_KI270935v1_alt     130000  .       C       <NON_REF>       .       .       END=139999      GT:DP:GQ:MIN_DP:PL      ./.:94:99:82:0,120,1800 ./.:94:99:82:0,120,1800
chr3_KI270935v1_alt     140000  .       C       <NON_REF>       .       .       END=149999      GT:DP:GQ:MIN_DP:PL      ./.:94:99:82:0,120,1800 ./.:94:99:82:0,120,1800
chr3_KI270935v1_alt     150000  .       G       <NON_REF>       .       .       END=159999      GT:DP:GQ:MIN_DP:PL      ./.:94:99:82:0,120,1800 ./.:94:99:82:0,120,1800
chr3_KI270935v1_alt     160000  .       A       <NON_REF>       .       .       END=165607      GT:DP:GQ:MIN_DP:PL      ./.:94:99:82:0,120,1800 ./.:94:99:82:0,120,1800

Second, as expected, using -L without --breakBandsAtMultiplesOf does not output any variants:

$ java -jar GenomeAnalysisTK.jar -T CombineGVCFs -R hs38DH.fa --variant test-1.g.vcf --variant test-2.g.vcf --out test-out-intervals.g.vcf -L chr3_KI270935v1_alt:1-49999 -L chr3_KI270935v1_alt:100000-149999
$ tail -n 1 test-out-intervals.g.vcf
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  S1      S2

Finally, combining -L and --breakBandsAtMultiplesOf (with compatible breakpoints) unexpectedly results in one variant that spans across the gap in specified intervals:

$ java -jar GenomeAnalysisTK.jar -T CombineGVCFs -R hs38DH.fa --variant test-1.g.vcf --variant test-2.g.vcf --out test-out-intervals-breakbands.g.vcf -L chr3_KI270935v1_alt:1-49999 -L chr3_KI270935v1_alt:100000-149999 --breakBandsAtMultiplesOf 10000
$ tail -n 11 test-out-intervals-breakbands.g.vcf
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  S1      S2
chr3_KI270935v1_alt     1       .       G       <NON_REF>       .       .       END=9999        GT:DP:GQ:MIN_DP:PL      ./.:94:99:82:0,120,1800 ./.:94:99:82:0,120,1800
chr3_KI270935v1_alt     10000   .       C       <NON_REF>       .       .       END=19999       GT:DP:GQ:MIN_DP:PL      ./.:94:99:82:0,120,1800 ./.:94:99:82:0,120,1800
chr3_KI270935v1_alt     20000   .       C       <NON_REF>       .       .       END=29999       GT:DP:GQ:MIN_DP:PL      ./.:94:99:82:0,120,1800 ./.:94:99:82:0,120,1800
chr3_KI270935v1_alt     30000   .       T       <NON_REF>       .       .       END=39999       GT:DP:GQ:MIN_DP:PL      ./.:94:99:82:0,120,1800 ./.:94:99:82:0,120,1800
chr3_KI270935v1_alt     40000   .       C       <NON_REF>       .       .       END=49999       GT:DP:GQ:MIN_DP:PL      ./.:94:99:82:0,120,1800 ./.:94:99:82:0,120,1800
chr3_KI270935v1_alt     50000   .       A       <NON_REF>       .       .       END=109999      GT:DP:GQ:MIN_DP:PL      ./.:94:99:82:0,120,1800 ./.:94:99:82:0,120,1800
chr3_KI270935v1_alt     110000  .       T       <NON_REF>       .       .       END=119999      GT:DP:GQ:MIN_DP:PL      ./.:94:99:82:0,120,1800 ./.:94:99:82:0,120,1800
chr3_KI270935v1_alt     120000  .       A       <NON_REF>       .       .       END=129999      GT:DP:GQ:MIN_DP:PL      ./.:94:99:82:0,120,1800 ./.:94:99:82:0,120,1800
chr3_KI270935v1_alt     130000  .       C       <NON_REF>       .       .       END=139999      GT:DP:GQ:MIN_DP:PL      ./.:94:99:82:0,120,1800 ./.:94:99:82:0,120,1800
chr3_KI270935v1_alt     140000  .       C       <NON_REF>       .       .       END=149999      GT:DP:GQ:MIN_DP:PL      ./.:94:99:82:0,120,1800 ./.:94:99:82:0,120,1800

The problem is the variant line with the reference band that spans from 50000-109999. My expectation would be that it would be a variant that spans from 100000-109999 since that was the 10k multiple that falls within the intervals I specified, but it seems that the code that handles the band breaking does not interact well with the code that handles the intervals.

Is this indeed a bug in CombineGVCFs or am I misunderstanding how it is meant to work?

Cheers,

Josh.

Tagged:

Best Answer

Answers

  • SheilaSheila Broad InstituteMember, Broadie admin

    @jrandall
    Hi Josh,

    Why are there no sites present in the output using only -L? There should be sites within the intervals you specified.

    I think I see what is going on for you issue of -L and -breakBandsAtMultiplesOf. Notice you specified -L
    chr3_KI270935v1_alt:100000-149999. Position 100,000 is included in the band from 50,000-109,999 that is why it shows up in your output. But, since you did not specify -L 50,000-99,999 the band is not broken is not broken into bands of size 10,000. I hope this makes sense.

    -Sheila

  • jrandalljrandall Member

    @Sheila yes, that is my understanding of the problem. I consider this to be a bug - if I say I want to break bands at every 10000 bases, then bands should be broken at every 10000 bases, not only when the start of the band happens to lie within the -L region. The code that iterates through bases is correctly identifying that the 50000-109999 band needs to be processed when handling -L 100000-149999, but the code that breaks bands up does not get triggered because the start of the band was outside the processing region.

    I think the fix could possibly involve recognising when a band is entered for the first time (rather than simply when the position is equal to to the beginning of the band) and invoking the band-breaking code then. I don't know enough about the walkers to figure out how to hook into that and fix it myself.

  • jrandalljrandall Member

    @Sheila to answer your first question, this is a contrived example in which there is only a single "variant" line in the file, which is the band that spans from 50000-149999. If using -L without `-breakBandsAtMultiplesOf``, there is no output because without breaking the band, the band from 50000-149999 is not contained within the interval from 100000-149999.

Sign In or Register to comment.