Attention:
The frontline support team will be unavailable to answer questions on April 15th and 17th 2019. We will be back soon after. Thank you for your patience and we apologize for any inconvenience!

Using CombineVariants on gvcf files.

srynearson1srynearson1 Member
edited February 2015 in Ask the GATK team

Currently I am following GATK best practice for using HC 3.0+, however I'm splitting my calls to chromosomal regions (-L). Next are the following step I perform working up to GenotypeGVCF and my question.

1 - I use CatVariants (following HC) to merge all 25 chromosome gvcf files into a single gvcf file per individual.
2 - I use CombineGVCF to merge 2 .. n number of individuals together. This is done because some analysis have 300+ individuals.
3- I then use CombineGVCF again to merge all the file from step 2 into one large gvcf file for one large joint GenotypeGVCF step.
4 - GenotypeGVCF is done again based on chromosomal regions (-L), which is followed by a additional CatVariants before VQSR.

The question I have this this:
Given the size of the analysis I have noticed that my CombineGVCF done in step 3 can take anywhere from 4-8 hours. I was wondering if I could change this step to use CombineVariants and have the result be the same (unlost data). The main reason for this would be because GATK currently allow CombineVariants to use the -nt option.

Thanks for you time and work.

--Shawn

Best Answers

Answers

  • srynearson1srynearson1 Member
    edited March 2015

    Strangely, I took your advice and ran the following steps:
    1 - I use CatVariants (following HC) to merge all 25 chromosome gvcf files into a single gvcf file per individual.
    2 - You can just pass a list of GVCFs to GenotypeGVCFs and it will do the big joint genotyping step as if they were all in one file

    Result

    I used the 17 Platinum genomes in a split run across each chromosome (-L). Each run would increase time needed to complete (up to about 50+h) before it eventually died with a call to increase Xmx argument (was running with -Xmx100g).

    Any thoughts would be appreciated.

  • srynearson1srynearson1 Member

    I thought I would add a log file using the above parameters.

  • srynearson1srynearson1 Member

    That was the trick. Reduced the whole run time, start to finish to 2 hours.

    Thanks a bunch.

    --Shawn

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
  • robertbrobertb torontoMember

    I believe that this discussion answers my question. For some samples I have a collection of gvcfs for each chromosome while for others I just have a single gvcf. For joint genotyping I was going to pass all these gvcfs to GenotypeGVCfs together as one long list ( more than 200 gvcfs but only about 50 samples overall ). I haven't read anything that would make me think that this would be a problem. There's the fact that I've got over 200 gvcf's but most of these files are small. Am I still going to see the run time issues?

    Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @robertb The bottleneck for this operation is that the more files you have that need to be open at the same time, the more memory you'll need. You might get away with it, depending on what kind of infrastructure you're working on, but I would still recommend combining GVCFs per-sample before running GenotypeGVCFs. You can do this fairly quickly with CatVariants. If you go with that, be sure to use the parameters I indicated above.

  • robertbrobertb torontoMember

    Yes, Thanks. As with more recent versions of TK I specified .g.vcf for output so there's no need for the variant_index settings

  • robertbrobertb torontoMember

    In the course of doing CatVariants for the chromosomes for each sample I noticed that for 3 samples some chromosomes were not represented meaning there were no gvcfs for these chromosomes. Seems strange to me. Is this normal? Could it suggest a problem with the BAM files given to HaplotypeCaller?

    Thanks

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    To be clear you need to give the indexing parameters to CatVariants otherwise it won't know to use the GVCF settings. That tool doesn't have the extension recognition logic (an unfortunate oversight).

    Regarding your second question, can you clarify what is missing? From what step?

  • robertbrobertb torontoMember

    That's curious because I had initially followed your instructions and got a warning from the program stating that I didn't need to use them when specifying g.vcf for -out.

    When I ran HC I used the -L option for each chromosome ( chr1-chr22 + chrother ). When I used CatVariants to merge the chromosome gvcfs for each sample I noticed that for three of my samples anywhere from 3-5 of the chromosomes were missing ( there were no gvcf for these chromosomes ). For example for sample A I was missing chromosomes 2, 4, 6, 9, and 17. I'd assumed that all chromosomes would be present following HC.

  • robertbrobertb torontoMember

    Here's the relevant section of one of my log files in case your interested. Note the lats line.

    INFO 23:34:36,524 HelpFormatter - -------------------------------------------------------
    INFO 23:34:36,532 HelpFormatter - Program Name: org.broadinstitute.gatk.tools.CatVariants
    INFO 23:34:36,538 HelpFormatter - Program Args: -R /work/lianglab/bin/GATK/resources/hg19/ucsc.hg19.fasta --variant_index_type LINEAR --variant_index_parameter 128000 -V /scratch/rb14sp/Control/VCF/01C08034.chr1.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr2.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr3.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr4.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr5.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr6.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr7.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr8.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr9.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr10.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr11.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr12.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr13.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr14.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr15.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr16.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr17.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr18.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr19.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr20.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr21.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chr22.g.vcf -V /scratch/rb14sp/Control/VCF/01C08034.chrother.g.vcf -out /scratch/rb14sp/Control/VCF/01C08034.g.vcf
    INFO 23:34:36,568 HelpFormatter - Executing as [email protected] on Linux 2.6.32-358.14.1.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_25-mockbuild_2013_07_01_09_31-b00.
    INFO 23:34:36,569 HelpFormatter - Date/Time: 2016/02/13 23:34:36
    INFO 23:34:36,569 HelpFormatter - -------------------------------------------------------
    INFO 23:34:36,569 HelpFormatter - -------------------------------------------------------
    WARN 23:34:46,099 GATKVCFUtils - Naming your output file using the .g.vcf extension will automatically set the appropriate values for --variant_index_type and --variant_index_parameter
    9575 [main] WARN org.broadinstitute.gatk.engine.GATKVCFUtils - Naming your output file using the .g.vcf extension will automatically set the appropriate values for --variant_index_type and --variant_index_parameter
    .

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Ah perhaps the logic was put in and I didn't realize -- sounds like that's working ok then.

    Do you have logs for the runs that were supposed to produce the missing files? What kind of pipelining system do you use?

  • robertbrobertb torontoMember

    Unfortunately I inherited these bams from someone so I'm not clear on the details and I'm a little suspicious about what they did exactly. What I do know is the following:

    1) mapped the reads with bwa 0.7.9 mem ( human high seq reads 30X coverage )
    2) Some combination of sort, mark dups, mergebams, relaign and bqsr using GATK version 3.3 or 3.4

    I inherited the recalibrated bams although no record of bqsr appears in the headers. They say that they ran it and I believe them but the absence of a record concerns me because I didn't get the same result with my bam files.

    One thing that I did do however was use GATK 3.5 for the HaplotypeCaller step. I believe that they were using 3.3 or 3.4 up to that point.

    Thanks for all your help, it's really appreciated especially at this hour!

    Here's my HC script:

    if [ ! ${1} ]; then echo "$0 basenameOfBQSRbam"; exit; fi

    echo ${1} && \
    \

    HaplotypeCaller

    for i in {1..22}; do echo -ne "$i "; \
    sqsub -r 36h -o /work/rb14sp/CONlog/${1}.chr${i}.HC.log -f serial --mpp 8G \
    java -Xmx6g -jar /work/rb14sp/GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R /work/lianglab/bin/GATK/resources/hg19/ucsc.hg19.fasta \
    --dbsnp /work/lianglab/bin/GATK/resources/hg19/dbsnp_138.hg19.vcf \
    -L chr${i} \
    -I /scratch/rb14sp/Control/BAM/${1}.bam \
    -o /scratch/rb14sp/Control/VCF/${1}.chr${i}.g.vcf \
    --emitRefConfidence GVCF \
    2>/work/rb14sp/CONlog/$1.vcf.log; done && \
    \
    echo -ne "chrother " && \
    sqsub -r 20h -o /work/rb14sp/CONlog/${1}.chr${i}.HC.log -f serial --mpp 8G \
    java -Xmx6g -jar /work/rb14sp/GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -R /work/lianglab/bin/GATK/resources/hg19/ucsc.hg19.fasta \
    --dbsnp /work/lianglab/bin/GATK/resources/hg19/dbsnp_138.hg19.vcf \
    -XL chr1 -XL chr2 -XL chr3 -XL chr4 -XL chr5 -XL chr6 -XL chr7 -XL chr8 -XL chr9 -XL chr10 -XL chr11 -XL chr12 \
    -XL chr13 -XL chr14 -XL chr15 -XL chr16 -XL chr17 -XL chr18 -XL chr19 -XL chr20 -XL chr21 -XL chr22 \
    -I /scratch/rb14sp/Control/BAM/${1}.bam \
    -o /scratch/rb14sp/Control/VCF/${1}.chrother.g.vcf \
    --emitRefConfidence GVCF; exit

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @robertb, if you have files missing when running through a script like this you need to look at the jobscheduler logs to see what happened. Probably some of the submissions failed.

Sign In or Register to comment.